Infrastructure
Deployment architecture, CI/CD, environments, IaC, monitoring, disaster recovery
- Deployment Architecture
- Environment Configuration
- Infrastructure as Code
- Monitoring & Observability
- Disaster Recovery Plan
- CI/CD Pipeline
- ALAI Static Hosting Blueprint (2026-04-20)
- Cloud Migration 2026
- Master Plan — Cloud Migration
- Phase 1 — Bitwarden Cloud Migration
- Phase 2 — MC + HiveMind API
- Current State vs Target State
- ANVIL SPOF Elimination Plan (2026-04-20)
- MC Claim Protocol
- Agent Team Topology ADR-024
- Phase A — Hook Enforcement for Hard Constraint #2 (2026-05-11)
- ZAKON #18B — Blueprint Liveness Enforcement
- ZAKON Registry — Creation Requires Approval Gate
- LightRAG Tuning — 2026-05
- Email-Reactor — Strategic-Inbox Auto-Triage Daemon
Deployment Architecture
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Overview
System: {{PROJECT_NAME}}
Cloud Provider: {{CLOUD_PROVIDER}}
Provider Rationale: {{RATIONALE}}
Architecture Pattern: {{PATTERN}}
2. Infrastructure Topology
graph TB
subgraph Internet
USER[End Users]
CDN[CDN / CloudFront]
end
subgraph Public Subnet
ALB[Application Load Balancer]
BASTION[Bastion Host]
end
subgraph Private Subnet - App
APP1[App Server 1]
APP2[App Server 2]
end
subgraph Private Subnet - Data
DB_PRIMARY[(Primary DB)]
DB_REPLICA[(Read Replica)]
CACHE[Redis Cache]
end
subgraph Isolated Subnet
SECRETS[Secrets Manager]
BACKUP[Backup Storage]
end
USER --> CDN
CDN --> ALB
ALB --> APP1
ALB --> APP2
APP1 --> DB_PRIMARY
APP2 --> DB_PRIMARY
APP1 --> CACHE
DB_PRIMARY --> DB_REPLICA
APP1 --> SECRETS
3. Networking Architecture
3.1 VPC / VNET Design
| Network | CIDR | Purpose |
|---|---|---|
| VPC / VNET | {{CIDR_VPC}} | Main network boundary |
| Public Subnet A | {{CIDR_PUB_A}} | Load balancers, NAT gateways |
| Public Subnet B | {{CIDR_PUB_B}} | Load balancers, NAT gateways (AZ-B) |
| Private Subnet A | {{CIDR_PRIV_A}} | Application servers |
| Private Subnet B | {{CIDR_PRIV_B}} | Application servers (AZ-B) |
| Isolated Subnet A | {{CIDR_ISO_A}} | Databases, secrets |
| Isolated Subnet B | {{CIDR_ISO_B}} | Databases, secrets (AZ-B) |
3.2 Load Balancer Configuration
| Parameter | Value |
|---|---|
| Type | {{LB_TYPE}} |
| Protocol | HTTPS (TLS 1.2+) |
| SSL Termination | At load balancer |
| Health Check Path | {{HEALTH_CHECK_PATH}} |
| Health Check Interval | {{INTERVAL}}s |
| Unhealthy Threshold | {{THRESHOLD}} consecutive failures |
| Idle Timeout | {{TIMEOUT}}s |
| Stickiness | {{STICKINESS}} |
3.3 DNS Architecture
| Record | Type | Value | TTL |
|---|---|---|---|
| {{DOMAIN}} | A / ALIAS | Load Balancer | {{TTL}} |
| api.{{DOMAIN}} | CNAME | API Load Balancer | {{TTL}} |
| cdn.{{DOMAIN}} | CNAME | CDN Distribution | {{TTL}} |
DNS Provider: {{DNS_PROVIDER}}
Failover Strategy: {{FAILOVER_STRATEGY}}
3.4 CDN Configuration
| Parameter | Value |
|---|---|
| Provider | {{CDN_PROVIDER}} |
| Origin | {{CDN_ORIGIN}} |
| Cache Behaviors | Static assets: 1yr, API: no-cache, HTML: 5min |
| HTTPS Only | Yes |
| WAF Integration | {{WAF_INTEGRATION}} |
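The three cache behaviors in the table can be sketched as a small path-to-header mapping. This is an illustration only: the `/api/` prefix, the list of static extensions, and the function name `cache_control_for` are assumptions, not part of this template.

```python
from pathlib import PurePosixPath

# Assumed static asset extensions; adjust to the project's build output.
STATIC_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".svg", ".woff2"}

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # "Static assets: 1yr"
FIVE_MINUTES_SECONDS = 5 * 60          # "HTML: 5min"


def cache_control_for(path: str) -> str:
    """Return a Cache-Control value following the table's three behaviors."""
    if path.startswith("/api/"):  # assumed API prefix
        return "no-cache"
    if PurePosixPath(path).suffix.lower() in STATIC_EXTENSIONS:
        return f"public, max-age={ONE_YEAR_SECONDS}"
    # Everything else is treated as HTML in this sketch.
    return f"public, max-age={FIVE_MINUTES_SECONDS}"
```

A one-year lifetime on static assets is generally only safe when file names are fingerprinted per build, so that a new release changes the URL.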
4. Compute
4.1 Container Orchestration
Platform: {{ORCHESTRATION}}
| Component | Configuration | Notes |
|---|---|---|
| Cluster | {{CLUSTER_SPEC}} | |
| Node Groups | {{NODE_GROUPS}} | |
| Min Nodes | {{MIN_NODES}} | |
| Max Nodes | {{MAX_NODES}} | |
| Node Size | {{NODE_SIZE}} | |
| Container Registry | {{REGISTRY}} | |
4.2 Serverless Functions
| Function | Trigger | Memory | Timeout | Purpose |
|---|---|---|---|---|
| {{FUNCTION_1}} | {{TRIGGER}} | {{MEMORY}}MB | {{TIMEOUT}}s | {{PURPOSE}} |
4.3 Instance Sizing & Auto-Scaling
| Service | Instance Type | Min | Max | Scale Trigger |
|---|---|---|---|---|
| {{SERVICE}} | {{INSTANCE}} | {{MIN}} | {{MAX}} | CPU > {{CPU}}% for {{DURATION}}min |
Scale-Out Policy: {{SCALE_OUT}}
Scale-In Policy: {{SCALE_IN}}
Scale-In Cooldown: {{COOLDOWN}}min
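To make the scale trigger and cooldown concrete, the following sketch evaluates one scaling decision from per-minute CPU samples. It is not any provider's autoscaler: the `ScalingPolicy` type, the one-instance step size, and the `scale_in_below_pct` threshold are assumptions added for illustration, since this template only defines the scale-out trigger and the scale-in cooldown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScalingPolicy:
    min_instances: int         # {{MIN}}
    max_instances: int         # {{MAX}}
    cpu_threshold_pct: float   # {{CPU}}
    duration_min: int          # {{DURATION}}, assumed to be >= 1
    cooldown_min: int          # {{COOLDOWN}}
    scale_in_below_pct: float  # assumption: not defined in the table above


def desired_count(policy: ScalingPolicy, current: int,
                  cpu_by_minute: Sequence[float],
                  minutes_since_last_scaling: int) -> int:
    """Return the instance count after evaluating one scaling decision."""
    window = cpu_by_minute[-policy.duration_min:]
    if len(window) < policy.duration_min:
        return current  # not enough samples to decide yet
    # Scale out: CPU above the threshold for the whole window.
    if all(sample > policy.cpu_threshold_pct for sample in window):
        return min(current + 1, policy.max_instances)
    # Scale in: sustained low CPU, and only once the cooldown has elapsed.
    if (all(sample < policy.scale_in_below_pct for sample in window)
            and minutes_since_last_scaling >= policy.cooldown_min):
        return max(current - 1, policy.min_instances)
    return current
```

Applying the cooldown only to scale-in keeps the system quick to add capacity and slow to remove it, which is the usual bias for user-facing services.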
5. Storage
5.1 Database Hosting
| Database | Engine | Version | Hosting | Instance | Storage | HA |
|---|---|---|---|---|---|---|
| {{DB_NAME}} | {{ENGINE}} | {{VERSION}} | {{HOSTING}} | {{INSTANCE}} | {{STORAGE}}GB | {{HA}} |
Connection Pooling: {{POOL_TOOL}}
Max Connections: {{MAX_CONN}}
Connection String: Stored in {{SECRET_LOCATION}} (never hardcoded)
5.2 Object Storage
| Bucket / Container | Purpose | Access | Lifecycle | Encryption |
|---|---|---|---|---|
| {{BUCKET_NAME}} | {{PURPOSE}} | {{ACCESS}} | {{LIFECYCLE}} | AES-256 |
5.3 File Storage
| Storage | Type | Mount Point | Purpose | Size |
|---|---|---|---|---|
| {{STORAGE_NAME}} | {{TYPE}} | {{MOUNT}} | {{PURPOSE}} | {{SIZE}}GB |
6. Security
6.1 Network Security Groups / Firewall Rules
| Security Group | Direction | Port | Protocol | Source / Destination | Purpose |
|---|---|---|---|---|---|
| sg-alb | Inbound | 443 | TCP | 0.0.0.0/0 | HTTPS from internet |
| sg-alb | Outbound | {{APP_PORT}} | TCP | sg-app | Forward to app |
| sg-app | Inbound | {{APP_PORT}} | TCP | sg-alb | From load balancer |
| sg-app | Outbound | {{DB_PORT}} | TCP | sg-db | Database access |
| sg-db | Inbound | {{DB_PORT}} | TCP | sg-app | From application only |
6.2 WAF Configuration
WAF Provider: {{WAF_PROVIDER}}
| Rule Group | Purpose | Action |
|---|---|---|
| AWSManagedRulesCommonRuleSet | OWASP Top 10 | Block |
| AWSManagedRulesSQLiRuleSet | SQL injection | Block |
| AWSManagedRulesKnownBadInputsRuleSet | Known bad inputs | Block |
| Rate limiting | {{RATE_LIMIT}} req/5min per IP | Count → Block |
6.3 Secrets Management
Secret Store: {{SECRET_STORE}}
| Secret | Rotation Schedule | Access |
|---|---|---|
| Database credentials | 90 days | App role only |
| API keys (third-party) | On compromise | App role only |
| TLS certificates | 60 days before expiry | Deploy role only |
| JWT signing key | 365 days | Auth service only |
6.4 IAM Roles & Policies
| Role | Trusted By | Key Permissions | Purpose |
|---|---|---|---|
| {{APP_ROLE}} | EC2 / ECS Task | secretsmanager:GetSecretValue, s3:GetObject | Application runtime |
| {{DEPLOY_ROLE}} | CI/CD | ecr:PutImage, ecs:UpdateService | Deployments |
| {{BACKUP_ROLE}} | Lambda / Cron | rds:CreateDBSnapshot, s3:PutObject | Backups |
7. Cost Estimation
| Component | Service | Spec | Est. Monthly Cost |
|---|---|---|---|
| Compute | {{SERVICE}} | {{SPEC}} | ${{COST}} |
| Database | {{SERVICE}} | {{SPEC}} | ${{COST}} |
| Load Balancer | {{SERVICE}} | {{SPEC}} | ${{COST}} |
| CDN | {{SERVICE}} | {{TRAFFIC}}GB transfer | ${{COST}} |
| Storage | {{SERVICE}} | {{CAPACITY}}GB | ${{COST}} |
| Monitoring | {{SERVICE}} | {{METRICS}} metrics | ${{COST}} |
| Total | | | ${{TOTAL}} |
Cost Optimization Notes:
8. High Availability Design
| Component | HA Strategy | Failover Time | Notes |
|---|---|---|---|
| Application | Multi-AZ, N+1 instances | Immediate (ELB health check) | |
| Database | Multi-AZ with auto-failover | 60-120 seconds | DNS propagation |
| Cache | Cluster mode / Replication | 30 seconds | Redis Sentinel |
| CDN | Global edge network | Transparent | Provider HA |
RTO Target: {{RTO}} minutes
RPO Target: {{RPO}} minutes
9. Multi-Region Considerations
Current: {{REGION_STRATEGY}}
Primary Region: {{PRIMARY_REGION}}
Secondary Region: {{SECONDARY_REGION}}
Rationale: {{MULTI_REGION_RATIONALE}}
Data Replication: {{REPLICATION_STRATEGY}}
Failover Procedure: See disaster-recovery-plan.md
10. Related Documents
- CI/CD Pipeline
- Environment Configuration
- Infrastructure as Code
- Monitoring & Observability
- Disaster Recovery Plan
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
Environment Configuration
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Environment Overview
| Environment | Purpose | URL | Access | Managed By |
|---|---|---|---|---|
| Local | Developer workstation | `localhost` | Developer | Individual |
| Dev | Integration, daily builds | `dev.{{DOMAIN}}` | Team + CI | Platform team |
| Staging | Pre-production validation | `staging.{{DOMAIN}}` | Team + QA + PM | Platform team |
| Production | Live system | `{{DOMAIN}}` | Ops only | Platform team |
| Preview | Feature branch review | `{{BRANCH}}.preview.{{DOMAIN}}` | Team + Stakeholders | CI/CD |
2. Per-Environment Configuration
2.1 Development Environment
| Parameter | Value | Notes |
|---|---|---|
| Log level | `DEBUG` | Verbose logging for development |
| Database | `dev-db.{{INTERNAL_DOMAIN}}` | Shared dev DB, refreshed weekly |
| Cache | `dev-redis.{{INTERNAL_DOMAIN}}` | Shared Redis, no persistence |
| Email | Mailtrap / fake SMTP | Emails not delivered to real recipients |
| Payments | Sandbox / test mode | No real transactions |
| Feature flags | All enabled | Developers can test unreleased features |
| Debug tools | Enabled | Profiler, debug toolbar, etc. |
| Rate limiting | Disabled | Developer convenience |
| Auto-migrations | Enabled | Runs on startup |
2.2 Staging Environment
| Parameter | Value | Notes |
|---|---|---|
| Log level | `INFO` | Same as production |
| Database | `staging-db.{{INTERNAL_DOMAIN}}` | Isolated staging DB, production-scale |
| Cache | `staging-redis.{{INTERNAL_DOMAIN}}` | Dedicated Redis |
| Email | `staging@{{DOMAIN}}` | Sends to internal test inboxes only |
| Payments | Sandbox / test mode | No real transactions |
| Feature flags | Mirrors production + staged features | |
| Debug tools | Disabled | Must match production behavior |
| Rate limiting | Enabled | Same limits as production |
| Data refresh | Weekly from production (anonymized) | See data refresh runbook |
Intentional staging/production differences:
- Email delivery: internal only (not real users)
- Payment: sandbox (not real transactions)
- Data: anonymized copies (not real PII)
2.3 Production Environment
| Parameter | Value | Notes |
|---|---|---|
| Log level | `INFO` | INFO and above; DEBUG and TRACE disabled |
| Database | `{{PROD_DB_HOST}}` | See secrets manager |
| Cache | `{{PROD_REDIS_HOST}}` | Clustered Redis |
| Email | `{{EMAIL_PROVIDER}}` | Real delivery via SES/SendGrid/etc. |
| Payments | Live mode | Real transactions |
| Feature flags | Conservative: tested features only | New features behind flags |
| Debug tools | Disabled | Security requirement |
| Rate limiting | Enabled | See rate limit table |
| HSTS | Enabled (1 year, includeSubDomains) | |
| CSP | Strict | See security headers config |
2.4 Preview / Feature Environments
Trigger: Pull request opened against main / develop
Lifetime: Active while PR is open; destroyed on PR close
URL Pattern: {{BRANCH_SLUG}}.preview.{{DOMAIN}}
Database: Ephemeral copy (seeded from fixture data, not production)
Teardown: Automated — triggered by PR close webhook
| Parameter | Value |
|---|---|
| Log level | DEBUG |
| Email | Fake SMTP / preview inbox |
| Payments | Sandbox |
| Feature flags | Branch-specific flags enabled |
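The URL pattern above needs a branch slug that is a valid DNS label. A minimal sketch of one way to derive it is shown below; the function names `branch_slug` and `preview_host` are hypothetical, and the exact slug rules (lowercase, non-alphanumerics collapsed to hyphens, 63-character limit) are an assumption rather than something this template prescribes.

```python
import re

MAX_DNS_LABEL_LENGTH = 63  # length limit for a single DNS label


def branch_slug(branch: str) -> str:
    """Turn a Git branch name into a lowercase, hyphen-separated DNS label."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # Truncate, then strip again in case the cut landed on a hyphen.
    return slug[:MAX_DNS_LABEL_LENGTH].rstrip("-")


def preview_host(branch: str, domain: str) -> str:
    """Build the preview hostname following {{BRANCH_SLUG}}.preview.{{DOMAIN}}."""
    return f"{branch_slug(branch)}.preview.{domain}"
```

Two long branch names that share their first 63 characters would collide under this scheme; appending a short hash of the full branch name is one possible mitigation.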
3. Environment Variables Reference
| Variable | Description | Required | Default | Sensitive | Environments |
|---|---|---|---|---|---|
| `NODE_ENV` | Runtime environment | Yes | `development` | No | All |
| `PORT` | HTTP server port | Yes | `3000` | No | All |
| `DATABASE_URL` | PostgreSQL connection string | Yes | - | Yes | All |
| `REDIS_URL` | Redis connection string | Yes | `redis://localhost:6379` | Yes | All |
| `JWT_SECRET` | JWT signing key | Yes | - | Yes | All |
| `JWT_EXPIRY` | Token expiry duration | Yes | `1h` | No | All |
| `SMTP_HOST` | SMTP server hostname | Yes | - | No | All |
| `SMTP_USER` | SMTP username | Yes | - | Yes | All |
| `SMTP_PASS` | SMTP password | Yes | - | Yes | All |
| `S3_BUCKET` | Object storage bucket name | Yes | - | No | All |
| `AWS_REGION` | Cloud region | Yes | `eu-west-1` | No | All |
| `SENTRY_DSN` | Error tracking DSN | No | - | Yes | Staging, Prod |
| `STRIPE_KEY` | Payment API key | Yes (if payments) | - | Yes | All |
| `LOG_LEVEL` | Logging verbosity | No | `info` | No | All |
| `RATE_LIMIT_WINDOW` | Rate limit window (ms) | No | `60000` | No | All |
| `RATE_LIMIT_MAX` | Max requests per window | No | `100` | No | All |
| `FEATURE_FLAG_KEY` | Feature flag SDK key | No | - | Yes | All |
Rules:
- Sensitive variables MUST be sourced from {{SECRET_STORE}} in staging and production
- Never commit sensitive values to source control
- Use `.env.example` with placeholder values for developer onboarding
- Rotate all secrets on team member offboarding
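As a sketch of how the variable reference might be enforced at startup, the loader below fails fast when a required variable without a default is missing. It covers only a subset of the table, and the names `REQUIRED_VARS`, `OPTIONAL_VARS`, and `load_config` are illustrative assumptions.

```python
from typing import Mapping, Optional

# Subset of the table: name -> default (None means "required, no default").
REQUIRED_VARS: Mapping[str, Optional[str]] = {
    "NODE_ENV": "development",
    "PORT": "3000",
    "DATABASE_URL": None,
    "REDIS_URL": "redis://localhost:6379",
    "JWT_SECRET": None,
    "JWT_EXPIRY": "1h",
}
OPTIONAL_VARS: Mapping[str, str] = {
    "LOG_LEVEL": "info",
    "RATE_LIMIT_WINDOW": "60000",
    "RATE_LIMIT_MAX": "100",
}


def load_config(env: Mapping[str, str]) -> dict:
    """Resolve configuration from `env`, raising if a required value is absent."""
    config = {}
    missing = []
    for name, default in REQUIRED_VARS.items():
        value = env.get(name, default)
        if value is None:
            missing.append(name)
        else:
            config[name] = value
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    for name, default in OPTIONAL_VARS.items():
        config[name] = env.get(name, default)
    return config
```

In an application this would typically be called once as `load_config(os.environ)`; taking the mapping as a parameter keeps the function easy to test.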
4. Secrets Management
4.1 Secret Storage Solution
Solution: {{SECRET_TOOL}}
| Environment | Secret Store | Access Method |
|---|---|---|
| Local | `.env` file (never committed) | Developer managed |
| Dev | {{DEV_SECRET_STORE}} | CI/CD service account |
| Staging | {{STG_SECRET_STORE}} | IAM role / service account |
| Production | {{PROD_SECRET_STORE}} | IAM role / service account |
4.2 Secret Rotation Schedule
| Secret Type | Rotation Schedule | Automated | Owner |
|---|---|---|---|
| Database passwords | 90 days | {{AUTOMATED}} | Platform team |
| API keys (internal) | 365 days | No | Service owner |
| API keys (third-party) | On compromise | No | Dev lead |
| JWT signing keys | 365 days | No | Platform team |
| TLS certificates | 60 days before expiry | {{AUTOMATED}} | Platform team |
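The schedule above reduces to simple date arithmetic. The sketch below shows one way to check whether a rotation is due; the dictionary keys and function names are hypothetical, and the "On compromise" rows are left out because they are event-driven rather than scheduled.

```python
from datetime import date, timedelta

# Rotation intervals in days, taken from the table above.
ROTATION_DAYS = {
    "database_password": 90,
    "internal_api_key": 365,
    "jwt_signing_key": 365,
}
TLS_RENEWAL_LEAD_DAYS = 60  # "60 days before expiry"


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """True once the rotation interval for this secret type has elapsed."""
    return today >= last_rotated + timedelta(days=ROTATION_DAYS[secret_type])


def tls_renewal_due(expires_on: date, today: date) -> bool:
    """True once the certificate is within the renewal lead time of expiry."""
    return today >= expires_on - timedelta(days=TLS_RENEWAL_LEAD_DAYS)
```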
4.3 Access Controls
| Role | Dev Secrets | Staging Secrets | Production Secrets |
|---|---|---|---|
| Developer | Read/Write | Read | No access |
| DevOps | Read/Write | Read/Write | Read/Write |
| CI/CD (build) | Read | Read | No access |
| CI/CD (deploy) | No access | Read | Read |
| Application runtime | Read (scoped) | Read (scoped) | Read (scoped) |
5. Feature Flags Per Environment
Tool: {{FF_TOOL}}
| Flag | Dev | Staging | Production | Notes |
|---|---|---|---|---|
| `feature-new-checkout` | On | On | Off | Waiting for QA sign-off |
| `feature-dark-mode` | On | On | Off | Rollout planned {{DATE}} |
| `kill-switch-payments` | Off | Off | Off | Emergency disable only |
| `maintenance-mode` | Off | Off | Off | Emergency only |
6. Database Configuration Per Environment
| Parameter | Local | Dev | Staging | Production |
|---|---|---|---|---|
| Host | `localhost` | `{{DEV_DB}}` | `{{STG_DB}}` | `{{PROD_DB}}` |
| Port | `5432` | `5432` | `5432` | `5432` |
| Database name | `{{APP}}_dev` | `{{APP}}_dev` | `{{APP}}_staging` | `{{APP}}_prod` |
| Max connections | `10` | `25` | `50` | `{{PROD_CONNS}}` |
| SSL required | No | No | Yes | Yes |
| Connection pool | No | No | Yes ({{POOL}}) | Yes ({{POOL}}) |
| Read replica | No | No | No | Yes |
| Backup | No | Daily | Daily | {{BACKUP_FREQ}} |
7. External Service Configuration Per Environment
| Service | Dev | Staging | Production | Notes |
|---|---|---|---|---|
| Email (SMTP) | Mailtrap | Mailtrap | SendGrid / SES | |
| Payments | Stripe test | Stripe test | Stripe live | Different API keys |
| SMS | Twilio test | Twilio test | Twilio live | |
| Analytics | Disabled | Staging property | Production property | |
| Error tracking | Disabled | Sentry dev project | Sentry prod project | |
| Maps | No key / free tier | Paid key | Paid key | |
8. Environment Provisioning Process
- Infrastructure provisioning: `terraform apply -var-file=envs/{{ENV}}.tfvars`
- Secret provisioning: `bash scripts/provision-secrets.sh {{ENV}}`
- Database provisioning: `bash scripts/create-db.sh {{ENV}}`
- DNS configuration: Update DNS records per deployment-architecture.md
- TLS certificates: Auto-provisioned via {{CERT_TOOL}}
- Initial deployment: Trigger CI/CD for `{{ENV}}` target
- Verification: Run smoke tests against new environment
Estimated time: {{PROVISION_TIME}} minutes
Runbook: {{PROVISION_RUNBOOK_LINK}}
9. Environment Teardown Process
- Verify no active users or critical processes
- Export any required data / logs
- Remove DNS records
- Revoke TLS certificates
- `terraform destroy -var-file=envs/{{ENV}}.tfvars`
- Purge secrets from secret store
- Archive environment configuration to {{ARCHIVE_LOCATION}}
- Update this document to remove the environment entry
10. Parity Policy (Staging ↔ Production Drift)
Goal: Staging should be functionally identical to production at all times.
| Area | Policy |
|---|---|
| Application version | Staging is always ahead by ≤ 1 release |
| Infrastructure spec | Same instance types and topology |
| Database engine & version | Must match exactly |
| OS & runtime versions | Must match exactly |
| Third-party dependencies | Same versions (except external service mode) |
| Network topology | Same (except size) |
| Security controls | Same |
Drift detection: {{DRIFT_DETECTION}}
Drift resolution owner: Platform team
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
Infrastructure as Code
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Overview
IaC Tool: {{IAC_TOOL}}
Tool Version: {{IAC_VERSION}}
Provider: {{CLOUD_PROVIDER}}
Provider Version: {{PROVIDER_VERSION}}
Rationale for tool choice:
{{IAC_RATIONALE}}
Core Principles:
- All infrastructure changes go through code (no manual console changes in staging/prod)
- IaC reviewed like application code (PR, review, merge)
- State is the single source of truth
- Modules are versioned and reusable
2. Repository Structure
{{IaC_REPO}}/
├── modules/                 # Reusable modules
│   ├── networking/          # VPC, subnets, security groups
│   ├── compute/             # EC2, ECS, Lambda
│   ├── database/            # RDS, ElastiCache
│   ├── storage/             # S3, EFS
│   └── monitoring/          # CloudWatch, alerts
├── environments/            # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── shared/                  # Shared resources (DNS, accounts)
├── scripts/                 # Helper scripts
│   ├── bootstrap.sh         # Initialize state backend
│   └── validate.sh          # Pre-apply validation
├── .terraform-version       # Pin tool version (tfenv)
├── .tflint.hcl              # Linting config
└── README.md
2.1 Module Organization
| Module | Purpose | Inputs | Outputs |
|---|---|---|---|
| `modules/networking` | VPC, subnets, routing | region, cidr_block, az_count | vpc_id, subnet_ids, sg_ids |
| `modules/compute` | ECS cluster, task definitions | cluster_name, instance_type | cluster_arn, task_role_arn |
| `modules/database` | RDS instance, parameter groups | engine, instance_class | db_endpoint, db_secret_arn |
| `modules/storage` | S3 buckets with policies | bucket_name, purpose | bucket_arn, bucket_name |
| `modules/monitoring` | CloudWatch dashboards, alarms | service_name, thresholds | alarm_arns, dashboard_url |
2.2 Environment Separation
- Each environment directory is independently deployable
- Environments call the same modules with different variable values
- No cross-environment dependencies (except shared DNS zone)
- Production has stricter apply controls (see Section 6)
2.3 Shared Modules
| Module | Source | Version | Used By |
|---|---|---|---|
| `networking` | `{{REGISTRY}}/networking` | `~> 2.0` | All environments |
| `database` | `{{REGISTRY}}/database` | `~> 1.5` | Staging, Production |
| `monitoring` | `{{REGISTRY}}/monitoring` | `~> 1.2` | All environments |
3. State Management
3.1 Remote State Backend
Backend: {{STATE_BACKEND}}
| Environment | State Location | Access |
|---|---|---|
| Dev | `{{STATE_BUCKET}}/dev/terraform.tfstate` | DevOps team |
| Staging | `{{STATE_BUCKET}}/staging/terraform.tfstate` | DevOps team |
| Production | `{{STATE_BUCKET}}/production/terraform.tfstate` | Senior DevOps + CI only |
Bootstrap (first-time setup):
bash scripts/bootstrap.sh {{ENVIRONMENT}}
3.2 State Locking
Locking Mechanism: {{LOCK_MECHANISM}}
Lock timeout: {{LOCK_TIMEOUT}}s
Force unlock: Only by senior DevOps after verifying no active apply
Lock table (if DynamoDB):
- Table: `{{LOCK_TABLE}}`
- Key: `LockID`
- Billing: On-demand
3.3 State File Organization
Splitting strategy: {{SPLIT_STRATEGY}}
| State File | Contains | Reason for split |
|---|---|---|
| `base/terraform.tfstate` | Networking, IAM | Infrequently changed |
| `app/terraform.tfstate` | Compute, app services | Frequently changed |
| `data/terraform.tfstate` | Databases, caches | High risk, separate lifecycle |
4. Module Design
4.1 Naming Conventions
Resource naming pattern: {{PROJECT}}-{{ENVIRONMENT}}-{{COMPONENT}}-{{SUFFIX}}
| Resource | Example |
|---|---|
| VPC | myapp-prod-vpc |
| ECS Cluster | myapp-prod-cluster |
| RDS Instance | myapp-prod-db-primary |
| S3 Bucket | myapp-prod-assets-{{ACCOUNT_ID}} |
| Security Group | myapp-prod-app-sg |
| IAM Role | myapp-prod-app-task-role |
4.2 Input / Output Variables
Required variable fields:
variable "environment" {
  description = "Deployment environment (dev/staging/production)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
Required output fields:
output "database_endpoint" {
  description = "Connection endpoint of the database, in address:port format"
  value       = aws_db_instance.main.endpoint
  sensitive   = false
}
4.3 Versioning Strategy
Module versioning: Semantic versioning (MAJOR.MINOR.PATCH)
Pin strategy: ~> MAJOR.MINOR (allows minor and patch updates within the same major version; use ~> MAJOR.MINOR.PATCH to allow patch updates only)
Upgrade policy: Review and test before upgrading minor/major versions
Changelog: Every module version bump requires a CHANGELOG entry
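The behavior of the `~>` (pessimistic) constraint is easy to misread, so here is a small sketch of its semantics: only the rightmost component given in the constraint may increase. This is an illustration in Python, not Terraform's own implementation; the function names are hypothetical and pre-release suffixes are not handled.

```python
def parse_version(text: str) -> tuple:
    """Parse 'X.Y' or 'X.Y.Z' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """True if `version` satisfies a '~> X.Y' or '~> X.Y.Z' constraint."""
    base = parse_version(constraint.replace("~>", "", 1))
    if len(base) < 2:
        raise ValueError("constraint needs at least MAJOR.MINOR")
    candidate = parse_version(version)
    # Upper bound: bump the second-to-last component of the constraint.
    upper = base[:-2] + (base[-2] + 1,)
    return candidate >= base and candidate[:len(upper)] < upper
```

Under these rules `~> 2.0` accepts `2.3.1` but rejects `3.0.0`, while `~> 1.5.0` accepts `1.5.9` but rejects `1.6.0`.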
5. Workflow
5.1 Standard Change Process
flowchart LR
BRANCH[Create branch] --> CODE[Write/modify IaC]
CODE --> VALIDATE[terraform validate + tflint]
VALIDATE --> PLAN[terraform plan]
PLAN --> PR[Open PR with plan output]
PR --> REVIEW[Peer review]
REVIEW --> APPROVE[Approval]
APPROVE --> APPLY[terraform apply in CI]
APPLY --> VERIFY[Verify resources]
Steps:
- Create feature branch: `infra/{{TICKET}}-description`
- Make changes, run `terraform validate && terraform fmt`
- Run `terraform plan` and attach the output to the PR
- Open PR for review (at least 1 reviewer required for dev/staging, 2 for production)
- CI runs `terraform plan` automatically on PR open
- Merge triggers `terraform apply` in CI (dev/staging)
- Production apply requires manual trigger after PR merge
5.2 PR-Based Infrastructure Changes
PR Requirements:
- Title: `[IaC] {{ENVIRONMENT}}: description of change`
- Must include `terraform plan` output in PR description or CI artifact
- Must include justification for the change
- Must reference the related application ticket (if applicable)
- Must have passing CI validation (fmt, validate, tflint, plan)
5.3 Automated Drift Detection
Schedule: {{DRIFT_SCHEDULE}}
Tool: {{DRIFT_TOOL}}
Alert Channel: {{DRIFT_ALERT_CHANNEL}}
Action on drift:
- Investigate cause (manual change, provider issue, external system)
- Either fix drift (apply IaC) or update IaC to reflect intentional change
- Never leave drift unresolved for > {{DRIFT_SLA}}
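If {{DRIFT_TOOL}} is plain Terraform, one possible approach to scheduled drift detection is to run `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The wrapper below is a sketch under that assumption; the function names and the `"in_sync"` / `"drift"` / `"error"` labels are hypothetical.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded, no changes
    if code == 2:
        return "drift"    # plan succeeded, changes present
    return "error"        # plan failed (exit code 1 or anything unexpected)


def check_drift(env_dir: str) -> str:
    """Run a plan in an environment directory and classify the result.

    Assumes the directory has already been initialised with `terraform init`.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=env_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduler could call `check_drift` per environment directory and post to {{DRIFT_ALERT_CHANNEL}} whenever the result is not `"in_sync"`.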
6. Security
6.1 Least Privilege for IaC Service Account
| Environment | Service Account | Permissions |
|---|---|---|
| Dev | `ci-iac-dev@{{PROJECT}}` | Full write within dev resources |
| Staging | `ci-iac-staging@{{PROJECT}}` | Full write within staging resources |
| Production | `ci-iac-prod@{{PROJECT}}` | Restricted write, requires MFA session |
6.2 Secret Injection (Not in State)
Rule: Never pass passwords, API keys, or secrets as Terraform variables.
Pattern: Reference the secrets manager in the resource configuration:
# WRONG: secret ends up in state
resource "aws_db_instance" "main" {
  password = var.db_password # This will be in state in plaintext!
}
# RIGHT: secret from Secrets Manager
resource "aws_db_instance" "main" {
  manage_master_user_password = true # AWS manages the password in Secrets Manager
}
6.3 Policy as Code
Tool: {{POLICY_TOOL}}
| Policy | Enforcement |
|---|---|
| No public S3 buckets | Block |
| All resources must have environment tag | Warn |
| RDS must be in private subnet | Block |
| Security groups must not allow `0.0.0.0/0` on sensitive ports | Block |
| Encryption at rest required for data resources | Block |
7. Tagging Strategy
| Tag | Value | Purpose |
|---|---|---|
| `Project` | `{{PROJECT_NAME}}` | Cost attribution |
| `Environment` | `dev` / `staging` / `production` | Environment filter |
| `ManagedBy` | `terraform` | Identifies IaC-managed resources |
| `Team` | `{{TEAM}}` | Ownership |
| `CostCenter` | `{{COST_CENTER}}` | Finance attribution |
| Tag | Value | Purpose |
|---|---|---|
| `Service` | `{{SERVICE_NAME}}` | Service-level grouping |
| `Ticket` | `{{TICKET_ID}}` | Change tracking |
| `ExpiresAt` | `{{DATE}}` | Ephemeral resource cleanup |
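A tagging strategy is only useful if it is checked. The sketch below validates a resource's tags, assuming the first table (`Project`, `Environment`, `ManagedBy`, `Team`, `CostCenter`) is the mandatory set and the second table is optional; that split, like the function name `tag_violations`, is an assumption.

```python
# Assumed mandatory tags (first table above).
REQUIRED_TAGS = {"Project", "Environment", "ManagedBy", "Team", "CostCenter"}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}


def tag_violations(tags: dict) -> list:
    """Return human-readable problems found in a resource's tags."""
    missing = sorted(REQUIRED_TAGS - set(tags))
    problems = [f"missing tag: {name}" for name in missing]
    environment = tags.get("Environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems
```

An empty list means the resource passes; in practice a check like this usually lives in the policy-as-code tool rather than in a standalone script.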
8. Cost Management
Budget alerts:
- Dev: Alert at ${{DEV_BUDGET}} / month
- Staging: Alert at ${{STG_BUDGET}} / month
- Production: Alert at ${{PROD_BUDGET}} / month
Cost optimization built into IaC:
- Dev/staging auto-shutdown: {{AUTO_SHUTDOWN_SCHEDULE}}
- Right-sizing: Instance types reviewed quarterly
- Reserved instances / savings plans: Applied to production
9. Disaster Recovery for IaC State
State backup: {{STATE_BACKUP}}
Recovery procedure:
- Restore from most recent backup
- Run `terraform plan` and verify there are no unexpected changes
- If state is unrecoverable: `terraform import` for each managed resource (refer to resource inventory)
Prevention:
- S3 versioning enabled on state bucket
- MFA delete required for state bucket
- State bucket access logged to CloudTrail
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
Monitoring & Observability
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Observability Strategy
Observability Platform: {{OBS_PLATFORM}}
Strategy: Instrument everything, alert on symptoms (not causes), correlate across pillars
Core Questions We Must Be Able to Answer:
- Is the system up and serving users correctly?
- How fast is it responding?
- What errors are occurring and why?
- Where is the bottleneck?
- What changed before this problem started?
2. Three Pillars
2.1 Metrics
Infrastructure Metrics
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| CPU utilization | Node exporter / CloudWatch | > {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical) | Warning / Critical |
| Memory utilization | Node exporter / CloudWatch | > {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical) | Warning / Critical |
| Disk utilization | Node exporter / CloudWatch | > {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical) | Warning / Critical |
| Network in/out | Node exporter / CloudWatch | > {{NET_LIMIT}}Mbps sustained | Warning |
| Container restarts | Kubernetes / ECS | > {{RESTART_LIMIT}} in 5min | Critical |
| Node not ready | Kubernetes | Any | Critical |
Application Metrics (RED Method)
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Request rate | Requests per second per service | Baseline ± 20% | 50% deviation |
| Error rate | % requests returning 5xx | < {{ERROR_RATE}}% | > {{ERROR_ALERT}}% |
| P50 latency | Median response time | < {{P50}}ms | > {{P50_ALERT}}ms |
| P95 latency | 95th percentile response time | < {{P95}}ms | > {{P95_ALERT}}ms |
| P99 latency | 99th percentile response time | < {{P99}}ms | > {{P99_ALERT}}ms |
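For readers unfamiliar with how the error rate and latency percentiles in this table are derived, the sketch below computes them from raw samples. It uses the nearest-rank percentile definition, which is one of several in use; monitoring backends typically estimate percentiles from histograms instead, so treat this as an illustration of the quantities rather than of any tool's method.

```python
from typing import Sequence


def error_rate_pct(status_codes: Sequence[int]) -> float:
    """Percentage of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * server_errors / len(status_codes)


def percentile(latencies_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]
```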
Business Metrics
| Metric | Description | Collection Method | Dashboard |
|---|---|---|---|
| Active users (DAU/MAU) | Daily/monthly active users | Frontend instrumentation | Business dashboard |
| {{CONVERSION_METRIC}} | {{CONVERSION_DESC}} | Backend event | Business dashboard |
| {{REVENUE_METRIC}} | {{REVENUE_DESC}} | Payment events | Finance dashboard |
| Feature usage | Feature-level engagement | Feature flag SDK | Product dashboard |
Custom Metrics Definition
| Metric Name | Type | Labels | Description | Unit |
|---|---|---|---|---|
| `{{APP}}_job_queue_depth` | Gauge | `queue_name` | Number of pending jobs | count |
| `{{APP}}_job_processing_duration` | Histogram | `queue_name`, `status` | Job processing time | seconds |
| `{{APP}}_external_api_calls_total` | Counter | `service`, `status` | External API call count | count |
| `{{APP}}_cache_hit_ratio` | Gauge | `cache_type` | Cache hit ratio (0 to 1) | ratio |
2.2 Logs
Log Levels & Usage Guide
| Level | When to Use | Examples |
|---|---|---|
| `ERROR` | Unexpected failure requiring attention | Database connection failure, unhandled exception |
| `WARN` | Unexpected but handled situation | Deprecated API called, retry succeeded |
| `INFO` | Normal business events | User logged in, order created, job completed |
| `DEBUG` | Diagnostic detail (dev/staging only) | Function parameters, internal state |
| `TRACE` | Extremely verbose (local dev only) | SQL queries, HTTP request/response bodies |
Production log level: INFO and above
Structured Logging Format
{
  "timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "service": "{{SERVICE_NAME}}",
  "version": "{{VERSION}}",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "user_id": "{{HASHED_OR_OMIT}}",
  "request_id": "req-uuid-here",
  "message": "Order created successfully",
  "order_id": "ord-123",
  "duration_ms": 45
}
Required fields: timestamp, level, service, message, trace_id
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)
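A minimal sketch of emitting the required fields with Python's standard `logging` module follows. The class name `JsonFormatter` and the convention of passing `trace_id` through `extra=` are assumptions; {{LOG_LIB}} may offer this out of the box. Note that Python names the level `WARNING` where the table above uses `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is expected to be supplied via `extra={"trace_id": ...}`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)
```

Attached to a `logging.StreamHandler` writing to stdout, this matches the "structured JSON to stdout" stage of the pipeline described in this section.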
Log Aggregation Pipeline
flowchart LR
APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]
| Stage | Tool | Configuration |
|---|---|---|
| Application logging | {{LOG_LIB}} | Structured JSON to stdout |
| Log agent | {{LOG_AGENT}} | Deployed as sidecar / DaemonSet |
| Transport | {{LOG_TRANSPORT}} | TLS encrypted |
| Storage | {{LOG_STORE}} | Indexed, compressed |
| Query | {{LOG_QUERY}} | Access via dashboard |
Log Retention Policy
| Environment | Retention | Storage Tier |
|---|---|---|
| Dev | 7 days | Hot |
| Staging | 30 days | Hot |
| Production | {{PROD_LOG_RETENTION}} days | Hot (30d) → Cold archive |
| Audit logs | 1 year (regulatory) | Hot (90d) → Cold archive |
PII in Logs — Masking Strategy
| Data Type | Strategy | Example |
|---|---|---|
| Email address | Hash + truncate | user:sha256(email)[:8] |
| Phone number | Redact | [PHONE_REDACTED] |
| IP address | Anonymize last octet | 192.168.1.xxx |
| Payment data | Never log | Use [PAYMENT_DATA_OMITTED] |
| Auth tokens | Never log | Use [TOKEN_OMITTED] |
| Names | Omit or pseudonymize | Reference by ID only |
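The masking strategies in the table can be sketched as small helper functions. The function names are hypothetical, and normalizing the email (trimming and lowercasing) before hashing is an assumption added so the same address always maps to the same token.

```python
import hashlib
import ipaddress


def mask_email(email: str) -> str:
    """Hash + truncate: user:sha256(email)[:8]."""
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"user:{digest[:8]}"


def mask_ipv4(address: str) -> str:
    """Anonymize the last octet of an IPv4 address."""
    ipaddress.IPv4Address(address)  # raises ValueError on malformed input
    return address.rsplit(".", 1)[0] + ".xxx"


def mask_phone(_phone: str) -> str:
    """Phone numbers are redacted entirely."""
    return "[PHONE_REDACTED]"
```

An unsalted, truncated hash is pseudonymization rather than anonymization: anyone holding a list of candidate addresses can recompute the tokens, so it may be worth adding a secret key (for example an HMAC) if that matters for the project.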
2.3 Traces
Distributed Tracing Setup
Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}
| Service | Instrumented | Framework | Notes |
|---|---|---|---|
| {{SERVICE_1}} | Yes | OpenTelemetry | HTTP, DB, Redis |
| {{SERVICE_2}} | Yes | OpenTelemetry | HTTP, external calls |
Trace Sampling Strategy
| Environment | Strategy | Rate | Notes |
|---|---|---|---|
| Dev | Always-on | 100% | Full visibility |
| Staging | Always-on | 100% | Full visibility |
| Production | Tail-based | {{SAMPLE_RATE}}% + errors | Error traces always kept |
Tail-based sampling rules:
- Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms
- Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
- Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
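The tail-based rules above amount to a decision made once a trace is complete. The following sketch expresses that decision as a pure function; it is not a collector configuration, and the name `keep_trace` and the injectable `rng` parameter are assumptions made for testability.

```python
import random
from typing import Callable


def keep_trace(has_error: bool, duration_ms: float,
               slow_threshold_ms: float, sample_rate_pct: float,
               rng: Callable[[], float] = random.random) -> bool:
    """Decide whether a completed trace is kept.

    Error traces and slow traces are always kept; the remainder is sampled
    at `sample_rate_pct` percent.
    """
    if has_error or duration_ms > slow_threshold_ms:
        return True
    return rng() * 100 < sample_rate_pct
```

Because the decision needs the whole trace, all spans must reach the same collector instance before sampling, which is why a head-based fallback rate is listed for when that collector is unavailable.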
Span Naming Conventions
| Operation Type | Naming Pattern | Example |
|---|---|---|
| HTTP handler | `HTTP {{METHOD}} {{ROUTE}}` | `HTTP POST /api/orders` |
| DB query | `db.{{operation}} {{table}}` | `db.select orders` |
| Cache | `cache.{{operation}} {{key_pattern}}` | `cache.get user:*` |
| Queue | `queue.{{operation}} {{queue_name}}` | `queue.publish order-events` |
| External HTTP | `{{service}} {{METHOD}} {{path}}` | `stripe POST /charges` |
Context Propagation
Standard: W3C TraceContext (traceparent header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
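For the async case, the traceparent value is carried in message headers or job metadata and parsed again by the consumer. The sketch below shows the version-00 header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The helper names are hypothetical, and in practice an OpenTelemetry propagator would normally do this work.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def child_traceparent(ctx: dict) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


def inject_into_message(headers: dict, ctx: dict) -> dict:
    """Copy the message headers and add the propagated trace context."""
    out = dict(headers)
    out["traceparent"] = child_traceparent(ctx)
    return out
```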
3. Alerting
3.1 Alert Rules
| Alert Name | Condition | Duration | Severity | Channel | Runbook |
|---|---|---|---|---|---|
| HighErrorRate | error_rate > {{ERROR_ALERT}}% | 2 min | Critical | PagerDuty | [link] |
| SlowP99 | p99_latency > {{P99_ALERT}}ms | 5 min | Warning | Slack #alerts | [link] |
| ServiceDown | health_check failing | 1 min | Critical | PagerDuty | [link] |
| HighCPU | cpu > {{CPU_CRIT}}% | 10 min | Warning | Slack #alerts | [link] |
| DiskAlmostFull | disk > {{DISK_CRIT}}% | 5 min | Critical | PagerDuty | [link] |
| DeploymentFailed | deployment status = failed | Immediate | Critical | Slack #deployments | [link] |
| CertificateExpiringSoon | cert_expiry < 30 days | n/a | Warning | Slack #ops | [link] |
| BackupFailed | backup job = failed | n/a | Critical | PagerDuty | [link] |
| SLOBudgetBurning | error_budget < 10% remaining | n/a | Critical | PagerDuty | [link] |
3.2 Alert Routing & Escalation
flowchart TD
ALERT[Alert fires] --> SEVERITY{Severity?}
SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]
| Severity | Response SLA | Channel | Escalation |
|---|---|---|---|
| Critical (P1) | Acknowledge in 5 min, resolve in 1h | PagerDuty + call | Escalate at 5 min |
| High (P2) | Acknowledge in 30 min, resolve in 4h | PagerDuty | Escalate at 30 min |
| Warning (P3) | Review within 1 business day | Slack | Manual |
| Info | No response required | Slack | None |
3.3 On-Call Rotation
Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)
3.4 Alert Fatigue Prevention
- Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
- Minimum alert duration: 2+ minutes (no single-spike alerts)
- Deduplication window: {{DEDUP_WINDOW}} minutes
- Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
- Post-mortem requirement: Every Critical alert reviewed after incident
4. Dashboards
4.1 Dashboard Inventory
| Dashboard | Purpose | Link | Audience |
|---|---|---|---|
| System Overview | High-level health of all services | {{LINK}} | Everyone |
| {{SERVICE_1}} | Service-level detail | {{LINK}} | Dev team |
| Infrastructure | Host/container metrics | {{LINK}} | DevOps |
| Business Metrics | KPIs and conversions | {{LINK}} | Leadership, PM |
| SLO Tracker | Error budget tracking | {{LINK}} | Engineering lead |
| On-Call | Current incidents, top errors | {{LINK}} | On-call engineer |
4.2 Key Dashboard Specs — System Overview
Required panels:
- Service health matrix (all services, green/red/yellow)
- Request rate (all services, last 1h)
- Error rate (all services, last 1h)
- P99 latency (all services, last 1h)
- Active incidents count
- Error budget remaining (all SLOs)
- Last deployment (service, version, time)
- Infrastructure health (CPU, memory, disk — aggregate)
5. SLOs / SLIs
5.1 SLI Definitions
| SLI | Definition | Measurement Method |
|---|---|---|
| Availability | % requests returning non-5xx | (total_requests - 5xx_requests) / total_requests |
| Latency | % requests completing within threshold | histogram_quantile(0.95, ...) < {{LATENCY_SLI}}ms |
| Error rate | % requests not returning errors | (total_requests - error_requests) / total_requests |
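As a worked example of the definitions above, the sketch below computes the availability SLI from request counters and the latency SLI as the share of requests at or below a threshold, using cumulative histogram buckets in the Prometheus style. The function names and the choice to treat zero traffic as 100% are assumptions. The latency function also assumes the threshold coincides with a bucket boundary.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """(total_requests - 5xx_requests) / total_requests."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def latency_sli(cumulative_buckets: dict, threshold_ms: float) -> float:
    """Share of requests that completed within threshold_ms.

    cumulative_buckets maps an upper bound in ms to the cumulative request
    count, with float('inf') holding the total.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0
    return cumulative_buckets[threshold_ms] / total
```

Expressing latency as a ratio of good requests, rather than as a single quantile reading, lets it share the same error budget arithmetic as availability.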
5.2 SLO Targets
| Service | SLI | Target | Window | Error Budget |
|---|---|---|---|---|
| {{SERVICE}} | Availability | {{AVAIL_TARGET}}% | 30 days | {{BUDGET_MINUTES}} min/month |
| {{SERVICE}} | Latency (P95 < {{P95}}ms) | {{LATENCY_TARGET}}% | 30 days | {{LATENCY_BUDGET_MINUTES}} min/month |
5.3 Error Budget Tracking
| Service | Monthly Budget | Burned This Month | Remaining | Burn Rate (24h) |
|---|---|---|---|---|
| {{SERVICE}} | {{BUDGET}}min | TBD | TBD | TBD |
Error budget policy:
- Budget > 50% remaining: Move fast, deploy freely
- Budget 10-50% remaining: Slow down, prioritize reliability work
- Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
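The budget arithmetic and the policy thresholds above can be made concrete with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the budget is expressed as minutes of full unavailability, and the 10% and 50% boundaries are treated as belonging to the middle band.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def budget_policy(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to the policy band."""
    if remaining_fraction > 0.5:
        return "move fast, deploy freely"
    if remaining_fraction >= 0.1:
        return "slow down, prioritize reliability work"
    return "freeze non-critical deploys"
```

For example, a 99.9% target over 30 days gives 43,200 min × 0.001 ≈ 43.2 minutes of budget. After 40 minutes of downtime, about 7% remains, which falls in the freeze band.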
6. Tooling
| Tool | Version | Purpose | Hosted |
|---|---|---|---|
| {{METRICS_TOOL}} | {{VERSION}} | Metrics collection & storage | {{HOSTING}} |
| {{LOG_TOOL}} | {{VERSION}} | Log aggregation | {{HOSTING}} |
| {{TRACE_TOOL}} | {{VERSION}} | Distributed tracing | {{HOSTING}} |
| {{DASHBOARD_TOOL}} | {{VERSION}} | Visualization | {{HOSTING}} |
| {{ALERT_TOOL}} | {{VERSION}} | Alert routing & on-call | {{HOSTING}} |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
Disaster Recovery Plan
Disaster Recovery Plan
Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Business Continuity Overview
This plan documents the procedures to recover {{PROJECT_NAME}} services following a disaster event (data center failure, data corruption, security breach, or catastrophic failure).
Plan Owner: {{DR_OWNER}} Plan Reviewer: {{DR_REVIEWER}} Last Tested: {{LAST_TEST_DATE}} Next Scheduled Test: {{NEXT_TEST_DATE}}
Disaster types covered:
- Infrastructure failure (AZ/region outage)
- Data corruption or accidental deletion
- Security incident (ransomware, data breach)
- Vendor/provider outage
- Catastrophic application failure
2. RPO / RTO Targets Per Service Tier
| Tier | Description | RPO | RTO | Examples |
|---|---|---|---|---|
| Tier 1 — Critical | Core user-facing services; downtime has direct revenue impact | 0 (real-time replication) | < 15 min | Auth, checkout, core API |
| Tier 2 — Important | Supporting services; degraded experience without them | < 1 hour | < 4 hours | Notifications, reports |
| Tier 3 — Standard | Background/admin services; business can operate without temporarily | < 24 hours | < 24 hours | Analytics, admin panel |
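When recording drill results, the tier targets above can be checked mechanically. The sketch below encodes the table as data and compares achieved values against it. The names are hypothetical, and it assumes the "<" limits are strict and that a zero RPO target is only met by zero data loss.

```python
from datetime import timedelta

# Targets copied from the tier table; an RPO of zero means no data loss is tolerated.
TIER_TARGETS = {
    1: {"rpo": timedelta(0), "rto": timedelta(minutes=15)},
    2: {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    3: {"rpo": timedelta(hours=24), "rto": timedelta(hours=24)},
}


def meets_targets(tier: int, rpo_achieved: timedelta, rto_achieved: timedelta) -> bool:
    """Return True if a drill result satisfies both targets for the tier."""
    target = TIER_TARGETS[tier]
    if target["rpo"] == timedelta(0):
        rpo_ok = rpo_achieved == timedelta(0)
    else:
        rpo_ok = rpo_achieved < target["rpo"]
    rto_ok = rto_achieved < target["rto"]
    return rpo_ok and rto_ok
```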
3. Service Tier Classification
| Service | Tier | Owner | Rationale |
|---|---|---|---|
| {{SERVICE_1}} | Tier 1 | {{OWNER}} | Core user journey |
| {{SERVICE_2}} | Tier 1 | {{OWNER}} | Authentication |
| {{SERVICE_3}} | Tier 2 | {{OWNER}} | Supporting |
| {{SERVICE_4}} | Tier 3 | {{OWNER}} | Admin only |
| Database — Primary | Tier 1 | Platform | All services depend on it |
| Object Storage | Tier 2 | Platform | User uploads |
4. Backup Strategy
4.1 Database Backups
| Database | Backup Type | Frequency | Retention | Location | Verified |
|---|---|---|---|---|---|
| {{DB_PRIMARY}} | Automated snapshot | Daily | 30 days | {{BACKUP_LOCATION}} | Monthly |
| {{DB_PRIMARY}} | Point-in-time recovery | Continuous | 7 days | {{BACKUP_LOCATION}} | Monthly |
| {{DB_READ_REPLICA}} | Not backed up separately | — | — | Rebuilt from primary | — |
Automated backup tool: {{BACKUP_TOOL}} Backup encryption: AES-256, key managed in {{KMS_TOOL}} Cross-region copy: {{CROSS_REGION}}
4.2 File / Object Storage Backups
| Storage | Backup Method | Frequency | Retention | DR Copy |
|---|---|---|---|---|
| {{S3_BUCKET}} | S3 versioning + replication | Continuous | {{RETENTION}} | {{DR_BUCKET}} |
| {{FILE_STORE}} | Snapshot | Daily | 30 days | Cross-region |
4.3 Configuration Backups
| Config | Backup Method | Location | Frequency |
|---|---|---|---|
| IaC (Terraform) | Git repository | {{GIT_REPO}} | On change |
| Application config | Git repository | {{GIT_REPO}} | On change |
| Secrets | Secrets manager replication | {{SECRETS_BACKUP}} | Real-time |
| DNS records | Export to Git | {{GIT_REPO}} | Weekly |
| TLS certificates | Secrets manager | {{CERTS_BACKUP}} | On renewal |
4.4 Backup Testing Schedule
| Backup Type | Test Frequency | Last Test | Result | Tester |
|---|---|---|---|---|
| Database full restore | Monthly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Point-in-time restore | Quarterly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Object storage restore | Quarterly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Full DR failover drill | Bi-annually | {{DATE}} | {{RESULT}} | {{TESTER}} |
5. Failover Procedures
5.1 Automated Failover
| Component | Automatic Failover | Mechanism | Failover Time |
|---|---|---|---|
| Database (Multi-AZ) | Yes | RDS automatic failover | 60-120 seconds |
| Load balancer | Yes | Health check → route to healthy targets | < 30 seconds |
| CDN | Yes | Origin health checks | < 60 seconds |
| Redis (if clustered) | Yes | Redis Sentinel / ElastiCache | < 30 seconds |
Monitoring automatic failover:
- Alert fires: MultiAZFailover CloudWatch event or equivalent
- On-call notified immediately
- No manual action required, but on-call must confirm recovery
5.2 Manual Failover Steps
Prerequisite: Automatic failover has NOT occurred or has failed.
Database Manual Failover (Tier 1)
- Confirm primary is unavailable: ping {{DB_PRIMARY_HOST}} (should time out)
- Connect to standby: psql {{STANDBY_HOST}}
- Promote standby to primary: SELECT pg_promote();
- Update DNS record db.{{INTERNAL_DOMAIN}} → {{STANDBY_HOST}}
- DNS TTL: Ensure TTL was set to 60s pre-incident (if not, wait {{DNS_TTL}} seconds)
- Verify applications are reconnecting: Check application logs for successful DB connections
- Page on-call to verify all services healthy
Regional Failover (Catastrophic)
- Declare DR event (approval from {{DR_AUTHORITY}})
- Confirm primary region {{PRIMARY_REGION}} is unreachable
- Activate standby in {{DR_REGION}}: terraform apply -var-file=envs/dr.tfvars
- Restore database from latest cross-region snapshot
- Update Route 53 / DNS to point to {{DR_REGION}} endpoints
- Run smoke tests: bash scripts/smoke-tests.sh {{DR_REGION}}
- Notify stakeholders (see Communication Plan)
- Monitor enhanced metrics for {{MONITOR_PERIOD}}h
6. Recovery Procedures Per Service
Tier 1 Services
| Service | Recovery Procedure | Recovery Script | Est. Time |
|---|---|---|---|
| {{SERVICE_1}} | 1. Restore from snapshot 2. Verify config 3. Run smoke tests | scripts/restore-{{SERVICE_1}}.sh | {{TIME}}min |
| Authentication | 1. Deploy from last known good image 2. Verify JWT keys 3. Test login flow | scripts/restore-auth.sh | {{TIME}}min |
Tier 2 Services
Tier 3 Services
7. DR Drill Schedule & Scenarios
| Drill Type | Frequency | Participants | Last Executed | Next Scheduled |
|---|---|---|---|---|
| Tabletop exercise | Quarterly | On-call team + engineering lead | {{DATE}} | {{DATE}} |
| Database failover test | Quarterly | DevOps + one developer | {{DATE}} | {{DATE}} |
| Full DR failover | Bi-annually | Entire engineering team | {{DATE}} | {{DATE}} |
| Backup restore test | Monthly | DevOps | {{DATE}} | {{DATE}} |
Drill Scenarios to Cover:
- Database primary failure (automatic failover test)
- Accidental data deletion (point-in-time restore)
- Single AZ outage (multi-AZ failover)
- Full region failure (cross-region DR)
- Ransomware/data corruption (restore from offline backup)
- CDN outage (origin fallback)
- Secret store unavailable (cached credentials)
8. Communication Plan During DR Event
Internal Communications
| Audience | Channel | Frequency | Owner |
|---|---|---|---|
| Engineering team | Slack #incidents + war room call | Real-time | Incident commander |
| Engineering management | Direct message | At declaration + hourly | Incident commander |
| Product/Business leadership | Email + Slack | At declaration + hourly | Incident commander |
| Customer support | Dedicated Slack channel | At declaration + 30 min | Support lead |
External Communications
| Audience | Channel | Trigger | Message |
|---|---|---|---|
| Customers | Status page ({{STATUS_PAGE}}) | Within 15 min of confirmed incident | "We are investigating an issue" |
| Customers | Status page update | Every 30 min | Progress update |
| Customers | Email | If impact > {{EMAIL_THRESHOLD}}h | Direct notification |
| SLA customers | Direct contact | Per SLA contract | As contractually required |
Communication templates: See go-live-runbook.md communication section
9. War Room Setup
War Room: {{WAR_ROOM_LINK}} Bridge Line: {{BRIDGE_NUMBER}} Document: Live incident doc created at: {{INCIDENT_DOC_TEMPLATE}}
Roles during DR event:
| Role | Responsibility | Primary | Backup |
|---|---|---|---|
| Incident Commander | Coordinates response, final decisions | {{IC}} | {{IC_BACKUP}} |
| Technical Lead | Leads technical recovery | {{TECH_LEAD}} | {{TECH_BACKUP}} |
| Communications Lead | Internal/external updates | {{COMMS_LEAD}} | {{COMMS_BACKUP}} |
| Scribe | Documents timeline, actions taken | {{SCRIBE}} | Rotate |
10. Post-Recovery Verification Checklist
- All Tier 1 services healthy (health checks passing)
- Error rate back to baseline (< {{ERROR_BASELINE}}%)
- P99 latency back to baseline (< {{P99_BASELINE}}ms)
- Database connections stable
- Replication lag < {{REPLICATION_LAG}}s (if applicable)
- Backup jobs resumed and completed successfully
- Monitoring and alerting functional
- No data loss confirmed (or data loss quantified and documented)
- All Tier 2 services healthy
- Stakeholders notified of recovery
- Status page updated to "Resolved"
- Incident timeline documented
- Post-mortem scheduled (within {{POSTMORTEM_SLA}}h)
11. DR Test Results Log
| Date | Test Type | Scenario | RTO Achieved | RPO Achieved | Issues Found | Resolved By |
|---|---|---|---|---|---|---|
| {{DATE}} | {{TYPE}} | {{SCENARIO}} | {{RTO}} | {{RPO}} | {{ISSUES}} | {{RESOLVED}} |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
CI/CD Pipeline
CI/CD Pipeline
Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Overview
CI/CD Platform: {{PLATFORM}} Container Registry: {{REGISTRY}} Deployment Target: {{DEPLOY_TARGET}} Strategy: {{STRATEGY}}
2. Pipeline Overview
flowchart LR
subgraph Source
PR[Pull Request]
MERGE[Merge to main]
end
subgraph CI["CI — runs on every PR"]
LINT[Lint & Format]
TEST_UNIT[Unit Tests]
TEST_INT[Integration Tests]
SAST[SAST Scan]
SCA[Dependency Scan]
BUILD[Build Artifact]
end
subgraph CD_DEV["CD — Dev Auto-Deploy"]
DEPLOY_DEV[Deploy to Dev]
SMOKE_DEV[Smoke Tests]
end
subgraph CD_STAGING["CD — Staging (auto on main)"]
DEPLOY_STG[Deploy to Staging]
TEST_E2E[E2E Tests]
PERF[Performance Tests]
end
subgraph CD_PROD["CD — Production (manual gate)"]
APPROVAL[Manual Approval]
DEPLOY_PROD[Deploy to Production]
SMOKE_PROD[Smoke Tests]
MONITOR[Verify Monitoring]
end
PR --> LINT
LINT --> TEST_UNIT
TEST_UNIT --> TEST_INT
TEST_INT --> SAST
SAST --> SCA
SCA --> BUILD
MERGE --> CD_DEV
BUILD --> DEPLOY_DEV
DEPLOY_DEV --> SMOKE_DEV
SMOKE_DEV --> DEPLOY_STG
DEPLOY_STG --> TEST_E2E
TEST_E2E --> PERF
PERF --> APPROVAL
APPROVAL --> DEPLOY_PROD
DEPLOY_PROD --> SMOKE_PROD
SMOKE_PROD --> MONITOR
3. Source Control Configuration
3.1 Branching Strategy
Strategy: {{BRANCH_STRATEGY}}
| Branch | Purpose | Naming Convention | Lifetime |
|---|---|---|---|
| main | Production-ready code | fixed | Permanent |
| develop | Integration branch | fixed | Permanent |
| feature/* | New features | feature/{{TICKET}}-description | Until merged |
| fix/* | Bug fixes | fix/{{TICKET}}-description | Until merged |
| hotfix/* | Production hotfixes | hotfix/{{TICKET}}-description | Until merged |
| release/* | Release preparation | release/v{{VERSION}} | Until merged |
3.2 Branch Protection Rules
Protected Branches: main, develop
| Rule | main | develop |
|---|---|---|
| Require PR | Yes | Yes |
| Required approvals | {{APPROVALS}} | 1 |
| Dismiss stale reviews | Yes | Yes |
| Require status checks | Yes | Yes |
| Required checks | lint, unit-tests, integration-tests, sast | lint, unit-tests |
| Require up-to-date | Yes | No |
| Allow force push | No | No |
| Allow deletions | No | No |
3.3 Code Review Requirements
- Minimum {{APPROVALS}} approval(s) required before merge
- At least one approval from a code owner (see CODEOWNERS)
- All review comments must be resolved before merge
- Review turnaround SLA: {{REVIEW_SLA}} business hours
- Auto-assign reviewers via: {{ASSIGN_MECHANISM}}
4. Build Stage
4.1 Build Tool & Configuration
| Parameter | Value |
|---|---|
| Build Tool | {{BUILD_TOOL}} |
| Build Command | {{BUILD_CMD}} |
| Artifact Type | {{ARTIFACT}} |
| Artifact Naming | {{REGISTRY}}/{{IMAGE_NAME}}:{{TAG_STRATEGY}} |
| Tag Strategy | git-sha for PRs, semver for releases |
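The tag strategy above (git SHA for PR builds, semver for releases) can be expressed as a small function over the Git ref. This is an illustrative sketch: the function names, the 12-character short SHA, and stripping the leading "v" from release tags are assumptions, not requirements of the template.

```python
import re

# Release tags are assumed to look like v1.4.2
SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def image_tag(git_sha: str, git_ref: str) -> str:
    """Semver for release tags, short git SHA for everything else."""
    prefix = "refs/tags/"
    if git_ref.startswith(prefix):
        match = SEMVER_TAG.match(git_ref[len(prefix):])
        if match:
            return ".".join(match.groups())
    return git_sha[:12]


def image_ref(registry: str, image_name: str, tag: str) -> str:
    """Full artifact name in the registry/image:tag form."""
    return f"{registry}/{image_name}:{tag}"
```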
4.2 Dependency Caching
| Cache | Key | Restore Keys |
|---|---|---|
| Node modules | node-modules-{{OS}}-{{LOCKFILE_HASH}} | node-modules-{{OS}}- |
| Docker layers | buildx-{{DOCKERFILE_HASH}} | buildx- |
| Test results | test-results-{{COMMIT_SHA}} | N/A |
4.3 Artifact Generation
| Artifact | Storage | Retention | Signed |
|---|---|---|---|
| Docker image | {{REGISTRY}} | 90 days (non-prod), Forever (prod tags) | {{SIGNING}} |
| Test reports | CI artifact storage | 30 days | No |
| SBOM | {{SBOM_STORAGE}} | 1 year | Yes |
| Coverage report | {{COVERAGE_STORAGE}} | 30 days | No |
5. Test Stages
5.1 Unit Tests
| Parameter | Value |
|---|---|
| Framework | {{UNIT_FRAMEWORK}} |
| Command | {{UNIT_CMD}} |
| Coverage Tool | {{COVERAGE_TOOL}} |
| Coverage Gate | ≥ {{COVERAGE_GATE}}% lines, ≥ {{BRANCH_GATE}}% branches |
| Failure Action | Block PR merge |
5.2 Integration Tests
| Parameter | Value |
|---|---|
| Framework | {{INT_FRAMEWORK}} |
| Command | {{INT_CMD}} |
| Dependencies | {{INT_DEPS}} |
| Failure Action | Block PR merge |
5.3 E2E Tests
| Parameter | Value |
|---|---|
| Framework | {{E2E_FRAMEWORK}} |
| Command | {{E2E_CMD}} |
| Environment | Staging |
| Parallelization | {{E2E_SHARDS}} shards |
| Failure Action | Block staging promotion |
5.4 Security Scanning
| Scan Type | Tool | Command | Gate |
|---|---|---|---|
| SAST | {{SAST_TOOL}} | {{SAST_CMD}} | Block on HIGH/CRITICAL |
| SCA (dependencies) | {{SCA_TOOL}} | {{SCA_CMD}} | Block on CRITICAL |
| Container scan | {{CONTAINER_SCAN}} | {{CONTAINER_SCAN_CMD}} | Block on CRITICAL |
| Secret scanning | {{SECRET_SCAN}} | {{SECRET_SCAN_CMD}} | Block on any finding |
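The gate column above amounts to a minimum blocking severity per scan type. A minimal sketch of that evaluation follows. The scan-type keys, the four-level severity scale, and the function name are assumptions for illustration, since real scanners report severities in their own formats.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Minimum severity that blocks the pipeline, per scan type (from the gate column).
# None means any finding blocks, regardless of severity.
BLOCKING_THRESHOLD = {
    "sast": "HIGH",
    "sca": "CRITICAL",
    "container": "CRITICAL",
    "secret": None,
}


def should_block(scan_type: str, finding_severities: list) -> bool:
    """Return True if the findings of one scan should fail the pipeline."""
    threshold = BLOCKING_THRESHOLD[scan_type]
    if threshold is None:
        return len(finding_severities) > 0
    floor = SEVERITY_ORDER.index(threshold)
    return any(SEVERITY_ORDER.index(s.upper()) >= floor for s in finding_severities)
```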
5.5 Linting & Formatting
| Tool | Purpose | Command | Auto-fix |
|---|---|---|---|
| {{LINTER}} | Code linting | {{LINT_CMD}} | PR comment |
| {{FORMATTER}} | Code formatting | {{FMT_CMD}} | Auto-commit or fail |
| {{TYPE_CHECK}} | Type checking | {{TYPE_CMD}} | No |
6. Deploy Stages
6.1 Deployment Strategy
Strategy: {{DEPLOY_STRATEGY}}
Rolling Deployment:
- Batch size: {{BATCH_SIZE}}% of instances
- Pause between batches: {{PAUSE}}min
- Health check wait: {{HEALTH_WAIT}}s
- Rollback trigger: health check failure
Canary Deployment (if used):
- Initial canary weight: {{CANARY_INITIAL}}%
- Increment: {{CANARY_INCREMENT}}% every {{CANARY_INTERVAL}}min
- Promotion criteria: error rate < {{ERROR_THRESHOLD}}%, p99 < {{LATENCY_THRESHOLD}}ms
- Rollback trigger: automatic on threshold breach
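The canary parameters above define a weight schedule and a promote-or-rollback check at each step. The sketch below shows both under simple assumptions: the function names are hypothetical, weights are percentages, and the final step is always 100%.

```python
def canary_weights(initial_pct: int, increment_pct: int) -> list:
    """Traffic weights for each canary step, ending at full rollout."""
    if increment_pct <= 0:
        raise ValueError("increment_pct must be positive")
    weights = []
    weight = initial_pct
    while weight < 100:
        weights.append(weight)
        weight += increment_pct
    weights.append(100)
    return weights


def canary_decision(error_rate_pct: float, p99_ms: float,
                    error_threshold_pct: float, latency_threshold_ms: float) -> str:
    """Promote only while both promotion criteria hold; otherwise roll back."""
    if error_rate_pct < error_threshold_pct and p99_ms < latency_threshold_ms:
        return "promote"
    return "rollback"
```

With an initial weight of 5% and a 25% increment, the schedule is 5, 30, 55, 80, 100, with the decision evaluated before each step up.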
6.2 Environment Promotion
PR Branch → Dev (auto) → Staging (auto on main merge) → Production (manual approval)
| Promotion | Trigger | Gate | Approver |
|---|---|---|---|
| → Dev | Merge to develop / PR | All CI checks pass | Automatic |
| → Staging | Merge to main | All CI + Dev smoke tests | Automatic |
| → Production | Tag v*.*.* | All tests + manual approval | {{PROD_APPROVER}} |
6.3 Approval Gates
Production Approval Required: Yes
Approvers: {{PROD_APPROVERS}} (at least {{APPROVAL_COUNT}} required)
Approval Window: {{APPROVAL_WINDOW}}h (pipeline cancels after timeout)
Emergency Override: {{EMERGENCY_OVERRIDE}}
6.4 Feature Flags Integration
Feature Flag Tool: {{FF_TOOL}}
Flag Validation: Feature flags validated in staging before production deploy
Kill Switch: All new features behind flags for first {{FF_PERIOD}} days
7. Post-Deploy
7.1 Smoke Tests
| Check | Expected | Timeout |
|---|---|---|
| Health endpoint GET /health | HTTP 200 | 10s |
| Auth endpoint reachable | HTTP 401 | 10s |
| Database connection | Healthy | 15s |
| Cache connection | Healthy | 10s |
| Critical user journey | Success | 60s |
Smoke test timeout: {{SMOKE_TIMEOUT}}min total
On failure: Auto-rollback triggered
7.2 Monitoring Verification
| Metric | Threshold | Check Duration |
|---|---|---|
| Error rate | < {{ERROR_RATE}}% | 5 min |
| P99 latency | < {{P99}}ms | 5 min |
| CPU utilization | < {{CPU}}% | 5 min |
| Memory utilization | < {{MEM}}% | 5 min |
7.3 Rollback Triggers
Automatic rollback triggers:
- Smoke test failure
- Error rate > {{AUTO_ROLLBACK_ERROR}}% for {{AUTO_ROLLBACK_DURATION}}min post-deploy
- Health check failure on {{HEALTH_FAIL_THRESHOLD}}% of instances
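The three automatic triggers can be evaluated together after each deploy. The following is a sketch only: the function and parameter names are hypothetical, error rate is assumed to be sampled once per minute, and "failure on X% of instances" is read as "at least X%".

```python
def rollback_reasons(smoke_passed: bool, error_rate_samples_pct: list,
                     unhealthy_instances: int, total_instances: int, *,
                     max_error_rate_pct: float, error_duration_min: int,
                     health_fail_threshold_pct: float) -> list:
    """Return the list of triggers that fired; an empty list means no rollback."""
    reasons = []
    if not smoke_passed:
        reasons.append("smoke test failure")
    # Sustained breach: every one of the last N per-minute samples is above the limit.
    if error_duration_min > 0:
        recent = error_rate_samples_pct[-error_duration_min:]
        if len(recent) == error_duration_min and all(s > max_error_rate_pct for s in recent):
            reasons.append("sustained error rate")
    if total_instances > 0:
        unhealthy_pct = 100.0 * unhealthy_instances / total_instances
        if unhealthy_pct >= health_fail_threshold_pct:
            reasons.append("health check failures")
    return reasons
```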
Manual rollback: See rollback-plan.md
8. Pipeline Configuration Reference
Config File Location: {{CONFIG_PATH}}
Key environment variables injected by CI:
| Variable | Source | Purpose |
|---|---|---|
| REGISTRY_TOKEN | {{SECRET_STORE}} | Container registry auth |
| DEPLOY_KEY | {{SECRET_STORE}} | Deployment credentials |
| SENTRY_DSN | {{SECRET_STORE}} | Error tracking |
| SLACK_WEBHOOK | {{SECRET_STORE}} | Notifications |
9. Secret Injection Strategy
Strategy: {{SECRET_STRATEGY}}
| Secret Type | Storage | Injection Method | Rotation |
|---|---|---|---|
| Registry credentials | {{STORAGE}} | {{METHOD}} | {{ROTATION}} |
| Cloud credentials | {{STORAGE}} | OIDC / Workload Identity | Per-job |
| App secrets | {{STORAGE}} | {{METHOD}} | {{ROTATION}} |
OIDC Preferred: Cloud credentials injected via OIDC — no long-lived keys stored in CI
10. Pipeline Metrics
| Metric | Target | Current |
|---|---|---|
| Build duration (P50) | < {{BUILD_TARGET}}min | TBD |
| Test duration (P50) | < {{TEST_TARGET}}min | TBD |
| Total pipeline duration | < {{TOTAL_TARGET}}min | TBD |
| Deploy frequency | {{DEPLOY_FREQ}} | TBD |
| Lead time for changes | < {{LEAD_TIME}} | TBD |
| Change failure rate | < {{FAILURE_RATE}}% | TBD |
| MTTR | < {{MTTR}} | TBD |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
ALAI Static Hosting Blueprint (2026-04-20)
ALAI Static Hosting Blueprint
1. Platform Decision
Winner: Cloudflare Pages
ALAI already runs alai.no on Cloudflare Pages and has Cloudflare as DNS provider for 6 of 12 domains. The migration path is lowest-friction of any option: git push triggers build, custom domains are free, SSL is automatic, and Cloudflare Access (already deployed for internal tools) works natively. The free tier covers unlimited sites, 500 builds/month, and unlimited bandwidth — all 12 static sites fit without spending a euro. Critically, ALAI does not need object-storage complexity (GCS/S3) or a separate CDN layer for static marketing/demo sites. Cloudflare Pages is the right tool at this scale.
The call on vendor lock-in: ALAI is already locked to Cloudflare for DNS. Extending that to hosting is concentration risk, but the blast radius is recoverable — all sites are git-backed, migrating to any other platform is a 30-minute operation per site. The cost and operational savings outweigh the risk.
Platform Comparison (12 sites, 1 GB each, 100 GB egress/month)
| Criterion | Cloudflare Pages | GCP Cloud Storage + CDN | AWS S3 + CloudFront | Azure Static Web Apps |
|---|---|---|---|---|
| Monthly cost (12 sites) | €0 (free tier) | ~€12 (storage €1.20 + CDN egress ~€10) | ~€14 (S3 €0.25 + CF egress ~€8 + requests ~€6) | €0 Free / €9 Standard (2 sites free, rest €4.50/mo each) |
| Build minutes | 500/month free | N/A (no built-in CI) | N/A (no built-in CI) | 60 min/month free, then €0.009/min |
| DX (git push to live) | Native (GitHub/GitLab direct) | Requires Cloud Build + gsutil | Requires CodePipeline or GitHub Action + aws CLI | Native (GitHub Actions integrated) |
| Custom domains | Unlimited | Per load balancer config | Per distribution ($0.0075/10k requests) | 5 per plan |
| SSL | Automatic, free | Managed certificate, manual setup | ACM free but requires distribution config | Automatic, free |
| Preview URLs per PR | Yes (automatic) | No (requires custom setup) | No (requires custom Lambda@Edge) | Yes (staging environments) |
| DDoS/WAF | Included free (Cloudflare network) | Cloud Armor (add-on, ~€5+/mo) | AWS Shield Standard free, WAF extra | Azure DDoS Basic free, WAF add-on |
| Vendor lock-in | Medium (proprietary build env, but output is static) | Low (standard GCS) | Low (standard S3) | Medium (Azure-specific config) |
Decision: Cloudflare Pages wins on cost (€0 vs €12-14/mo), DX (native git integration), DDoS/WAF included, and operational alignment with existing CF infrastructure.
2. Deploy Blueprint
Repo Convention
Every static site lives in its own repo or a dedicated directory in a monorepo. Naming convention: alai-<product>-web for ALAI properties, client-<slug>-web for client sites. The Cloudflare Pages project name matches the repo name exactly.
Build output must be in one of: dist/, out/ (Next.js static export), public/, or .next/ (Next.js app router build). For plain HTML sites, the root directory is the publish directory.
Step 1: Create Cloudflare Pages Project (one-time per site)
# Via Cloudflare dashboard or wrangler CLI
npx wrangler pages project create <project-name> \
--production-branch main
Connect GitHub repo in the Pages dashboard. Set build command and output directory per framework:
| Framework | Build command | Output dir |
|---|---|---|
| Static HTML | (none) | / |
| Next.js (static export) | next build | out |
| Next.js (app router) | next build | .next |
| Astro | astro build | dist |
Step 2: GitHub Actions CI (copy-paste ready)
Save as .github/workflows/deploy.yml in every site repo:
name: Deploy to Cloudflare Pages
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      deployments: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
        env:
          NODE_ENV: production
      - name: Deploy to Cloudflare Pages
        id: deploy
        uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          # Account ID is stored as a repository variable (see Secrets Management)
          accountId: ${{ vars.CLOUDFLARE_ACCOUNT_ID }}
          # On pull_request events github.ref_name is "<number>/merge", so prefer the PR head branch
          command: pages deploy ./out --project-name=${{ vars.CF_PROJECT_NAME }} --branch=${{ github.head_ref || github.ref_name }}
      - name: Comment preview URL on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # deployment-url is an output of the wrangler-action deploy step
          DEPLOYMENT_URL: ${{ steps.deploy.outputs.deployment-url }}
        with:
          script: |
            if (process.env.DEPLOYMENT_URL) {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.payload.pull_request.number,
                body: `Preview deployed: ${process.env.DEPLOYMENT_URL}`
              });
            }
For plain HTML sites with no build step, remove the Install dependencies and Build steps, and change the deploy path to ./ instead of ./out.
Step 3: Custom Domain (one-time per site)
# In Cloudflare dashboard: Pages > Project > Custom Domains > Add custom domain
# Or via API:
curl -X POST "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/pages/projects/$PROJECT_NAME/domains" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"name":"example.alai.no"}'
Because ALAI uses Cloudflare DNS, the CNAME/alias record is created automatically when adding the custom domain inside Cloudflare Pages.
Preview URL Per PR
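Cloudflare Pages creates a preview URL for every deployment to a non-production branch, so each PR push gets one. Format: https://<deployment-hash>.<project-name>.pages.dev, where the hash is generated by Cloudflare per deployment. A branch alias at https://<branch>.<project-name>.pages.dev tracks the latest deployment of that branch. No configuration needed. Preview environments are isolated and do not affect production traffic.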
Phantom Domain Removal Protocol
ZAKON: Before running vercel domains rm <phantom>, verify that the real domain is not implicitly routing through the phantom.
Safe sequence for phantom removal:
- vercel domains inspect <real-domain>: confirm direct attachment to authoritative project
- If real domain does NOT show direct attachment → vercel domains add <real> --project <authoritative> FIRST
- curl -sI https://<real>: confirm HTTP 200 with new attachment
- ONLY THEN: vercel domains rm <phantom> --yes
- Re-verify: curl -sI https://<real> returns HTTP 200
Forbidden: Removing the phantom without first explicitly attaching the real domain risks breaking implicit routing.
Incident reference: 2026-04-20 kenyhot.pro cleanup, 35s downtime, MC #8526.
Evidence: /Users/makinja/system/evidence/kenyhot-vercel-cleanup/execution-log-*.txt
Rollback (< 60 seconds)
NOTE (wrangler 4.x breaking change):
wrangler pages deployment rollback was removed in wrangler 4.x. The subcommand no longer exists and the /rollback CF API endpoint returns 405 for direct-upload deployments. Do NOT use it. Use the alternatives below. (Reference: wrangler upstream release notes; verified in Proveo pilot on basicconsulting.no, MC #8494.)
Primary — CF API re-deploy (copy-paste ready):
# Required env vars — set once per shell session or in ~/.zshrc
export CF_API_TOKEN="<your-cloudflare-api-token>" # scope: Cloudflare Pages: Edit
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<project-name>"
# 1. List recent deployments and grab the target deployment ID
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
-H "Authorization: Bearer ${CF_API_TOKEN}" | \
python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"
# 2. Re-deploy the target deployment (replace <deployment-id> with ID from step 1)
curl -s -X POST \
"https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
-H "Authorization: Bearer ${CF_API_TOKEN}" \
-H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"
CF reuses content-hash cache — files already on the CDN are not re-uploaded. Measured time: ~11 seconds. No build step required.
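The inline python3 -c parser in step 1 is compact but hard to read or adapt. An equivalent, more readable sketch is shown below. It uses only the JSON fields the one-liner already reads (id, created_on, deployment_trigger.metadata.commit_message). The function name is hypothetical, and the payload in the usage example is invented sample data.

```python
import json


def summarize_deployments(payload: dict, limit: int = 10) -> list:
    """Return (id, created_on, commit_message) rows for the most recent deployments."""
    rows = []
    for deployment in payload.get("result", [])[:limit]:
        metadata = (deployment.get("deployment_trigger") or {}).get("metadata") or {}
        message = (metadata.get("commit_message") or "")[:60]
        # created_on is trimmed to seconds precision, as in the one-liner.
        rows.append((deployment["id"], deployment["created_on"][:19], message))
    return rows


if __name__ == "__main__":
    # Invented sample response, shaped like the fields read above.
    sample = json.loads("""
    {"success": true, "result": [
      {"id": "aaaa1111", "created_on": "2026-04-20T10:15:30.123456Z",
       "deployment_trigger": {"metadata": {"commit_message": "fix: footer links"}}},
      {"id": "bbbb2222", "created_on": "2026-04-19T08:00:00.000000Z",
       "deployment_trigger": {}}
    ]}
    """)
    for row in summarize_deployments(sample):
        print(*row)
```

Saved to a file, the function can be fed the curl output with json.load(sys.stdin) in place of the sample payload.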
Secondary — CF Dashboard rollback (GitHub-connected repos):
- Open https://dash.cloudflare.com > Pages > select project
- Click "Deployments" tab
- Find the target deployment row, click the three-dot menu
- Select "Rollback to this deployment"
- Confirm — live traffic switches in < 30 seconds
Total time to identify + execute: under 30 seconds for either path.
Secrets Management
| Secret | Storage | How to use |
|---|---|---|
| CLOUDFLARE_API_TOKEN | GitHub repository secret | Set in: Repo > Settings > Secrets > Actions |
| CLOUDFLARE_ACCOUNT_ID | GitHub repository variable | Set in: Repo > Settings > Variables > Actions |
| CF_PROJECT_NAME | GitHub repository variable | Set per repo, matches CF Pages project name |
| Build-time env vars (API keys, etc.) | Cloudflare Pages > Settings > Environment variables | Available during build and at runtime for SSR |
Token scope required: Cloudflare Pages: Edit only. Create at: https://dash.cloudflare.com/profile/api-tokens
New-Site Template (one command)
Save as /Users/makinja/system/tools/alai-new-site.sh:
#!/usr/bin/env bash
# Usage: bash alai-new-site.sh <site-name> [--framework next|html|astro]
set -euo pipefail
SITE_NAME="${1:?Usage: alai-new-site.sh <site-name> [--framework next|html|astro]}"
FRAMEWORK="${3:-html}"
REPO_DIR="/Users/makinja/ALAI/sites/${SITE_NAME}"
echo "Creating site: ${SITE_NAME} (${FRAMEWORK})"
# 1. Create repo directory
mkdir -p "${REPO_DIR}/.github/workflows"
# 2. Copy workflow template
cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml "${REPO_DIR}/.github/workflows/deploy.yml"
# 3. Create wrangler.toml
cat > "${REPO_DIR}/wrangler.toml" <<EOF
name = "${SITE_NAME}"
compatibility_date = "2026-01-01"
[env.production]
EOF
# 4. Init git
cd "${REPO_DIR}" && git init && git add . && git commit -m "init: ${SITE_NAME}"
# 5. Create Cloudflare Pages project
npx wrangler pages project create "${SITE_NAME}" --production-branch main
echo "Done. Next: connect GitHub repo in Cloudflare dashboard."
echo " https://dash.cloudflare.com/pages"
3. Maintenance
SSL Auto-Renewal
Cloudflare Pages provisions and auto-renews SSL certificates via Cloudflare's certificate authority. No manual action required. Certificates renew 30 days before expiry. The only failure mode is if a custom domain's DNS stops pointing to Cloudflare — the alert system in Section 4 catches this.
DNS Consolidation
Target: All domains to Cloudflare DNS.
Current state: 2 on Cloudflare, 1 on Vercel, 1 on AWS Route53, 3 on one.com nameservers, 3 unknown/third-party.
Migration steps per domain:
- Log in to registrar, change nameservers to ana.ns.cloudflare.com and bob.ns.cloudflare.com
- Cloudflare imports existing DNS records automatically (zone scan)
- Verify records in Cloudflare dashboard, then activate proxy (orange cloud) for web traffic
Registrar note: For domains registered at one.com (.no TLDs), a nameserver change takes 15 minutes to 4 hours to propagate. For .ba domains, the registrar controls the nameservers, so the change requires contacting them directly.
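Before enabling the proxy it helps to confirm that the nameserver change has actually taken effect. The sketch below only classifies a list of nameserver names that was obtained elsewhere (for example from a DNS lookup); the function name and the `ns01.one.com` sample value are illustrative assumptions.

```python
def on_cloudflare_dns(nameservers):
    """True only if every listed nameserver is a Cloudflare nameserver."""
    cleaned = [ns.strip().rstrip(".").lower() for ns in nameservers]
    return bool(cleaned) and all(ns.endswith(".ns.cloudflare.com") for ns in cleaned)


# Fully migrated domain (trailing dot and mixed case are tolerated).
print(on_cloudflare_dns(["ana.ns.cloudflare.com.", "BOB.ns.cloudflare.com"]))
# Domain still partly delegated to the old provider (hypothetical name).
print(on_cloudflare_dns(["ns01.one.com", "ana.ns.cloudflare.com"]))
```

A mixed result is treated as not migrated, because traffic can still resolve through the old provider until every nameserver has switched.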
Dependency Updates (Renovate)
Save as renovate.json in every repo root:
{
"$schema": "https://docs.renovatebot.com/renovate-schema.json",
"extends": ["config:recommended"],
"schedule": ["every sunday"],
"prCreationDelay": "0 minutes",
"packageRules": [
{
"matchUpdateTypes": ["minor", "patch"],
"automerge": true,
"automergeType": "pr",
"automergeStrategy": "squash"
},
{
"matchUpdateTypes": ["major"],
"automerge": false,
"labels": ["dependencies", "major-update"]
}
],
"vulnerabilityAlerts": {
"enabled": true,
"labels": ["security"]
}
}
Enable Renovate at https://github.com/apps/renovate for each repo. No server needed.
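The effect of the two packageRules can be reasoned about with a small model. This sketch assumes the documented Renovate behaviour that rules are evaluated in order and later matching rules override earlier ones; it is an illustration of the policy above, not Renovate's own implementation, and `automerge_for` is a hypothetical name.

```python
import json

# Reduced copy of the packageRules from renovate.json above.
RENOVATE_RULES = json.loads("""
[
  {"matchUpdateTypes": ["minor", "patch"], "automerge": true},
  {"matchUpdateTypes": ["major"], "automerge": false}
]
""")


def automerge_for(update_type, rules, default=False):
    """Apply matching rules in order; the last matching rule wins."""
    decision = default
    for rule in rules:
        if update_type in rule.get("matchUpdateTypes", []):
            decision = rule.get("automerge", decision)
    return decision


for kind in ("patch", "minor", "major"):
    print(kind, automerge_for(kind, RENOVATE_RULES))
```

With this configuration only minor and patch updates merge without review; major updates stay open as labelled pull requests.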
Backup Strategy
| Asset | What | Where | Retention |
|---|---|---|---|
| Source code | Full git history | GitHub (primary) | Permanent |
| Source code mirror | Bare git clone | Azure VM /opt/backups/git-mirrors/ | 90 days rolling |
| Cloudflare Pages deployments | Build artifacts | Cloudflare (automatic, last 25 builds) | Automatic |
| DNS zone | Export via CF API | /Users/makinja/system/backups/dns/ (weekly cron) | 12 months |
| Secrets inventory | Encrypted note | Vaultwarden (vault.basicconsulting.no) | Permanent |
DNS zone backup cron (add to crontab):
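# Weekly DNS zone backup: runs every Sunday 02:00
# cron has no backslash line continuation, so the entry must be a single line, and % must be escaped as \%
# CF_API_TOKEN must be defined in the crontab environment
0 2 * * 0 curl -s "https://api.cloudflare.com/client/v4/zones?per_page=50" -H "Authorization: Bearer $CF_API_TOKEN" | node /Users/makinja/system/tools/cf-zone-export.js > /Users/makinja/system/backups/dns/zones-$(date +\%Y\%m\%d).json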
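The backup table lists a 12-month retention for DNS zone exports, but the cron entry only creates files. A pruning step could look like the sketch below, which assumes the `zones-YYYYMMDD.json` naming used by the cron entry and approximates 12 months as 365 days; `expired_backups` is a hypothetical helper.

```python
import re
from datetime import date, timedelta

# Matches the file names written by the weekly cron entry.
PATTERN = re.compile(r"^zones-(\d{8})\.json$")


def expired_backups(filenames, today, keep_days=365):
    """Return backup file names whose date stamp is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        match = PATTERN.match(name)
        if not match:
            continue  # ignore unrelated files
        stamp = match.group(1)
        taken = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)


files = ["zones-20250101.json", "zones-20260412.json", "notes.txt"]
print(expired_backups(files, today=date(2026, 4, 20)))
```

The function only reports candidates; deleting them is left to the caller so a dry run is always possible.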
DR: Restore Site in < 60 Seconds
NOTE — wrangler 4.x breaking change:
wrangler pages deployment rollback is removed in wrangler 4.x and must NOT be used. See MC #8494. Option A below replaces it with the CF API re-deploy path.
# Option A: CF API re-deploy (STANDARD DR PATH — replaces deprecated wrangler rollback)
# Time: ~11 seconds. CF content-hash cache means zero bytes re-uploaded for unchanged files.
export CF_API_TOKEN="<your-cloudflare-api-token>"
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<site-name>"
# List last 10 deployments
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
-H "Authorization: Bearer ${CF_API_TOKEN}" | \
python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"
# Re-deploy target deployment ID
curl -s -X POST \
"https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
-H "Authorization: Bearer ${CF_API_TOKEN}" \
-H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"
# Option B: Redeploy from git (if CF deployment history cleared)
cd /path/to/site-repo && npm run build && \
npx wrangler pages deploy ./out --project-name=<site-name> --branch=main
# Time: 30-90 seconds depending on build
# Option C: Emergency static serve from Azure VM (last resort)
# Assumes a static file server for /var/www/<site-name> is already listening on localhost:8080
scp -i ~/.ssh/azure_alai -r ./out alai-admin@4.223.110.181:/var/www/<site-name>
ssh -i ~/.ssh/azure_alai alai-admin@4.223.110.181 \
"sudo caddy reverse-proxy --from <domain> --to localhost:8080"
# Time: ~120 seconds
Option A is the standard DR path. Target: < 60 seconds. Tested monthly as part of Proveo validation.
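Option A needs a target deployment ID, which the listing command prints for manual selection. The sketch below shows one way to pick it automatically, under the assumption that the rollback target is simply the deployment created just before the newest one and that `created_on` values are ISO timestamps in a uniform format. The helper names are hypothetical; the URL pattern is the one used in the curl command above.

```python
API_BASE = "https://api.cloudflare.com/client/v4"


def previous_deployment_id(deployments):
    """Return the ID of the second-newest deployment, or None if there is none."""
    ordered = sorted(deployments, key=lambda d: d["created_on"], reverse=True)
    return ordered[1]["id"] if len(ordered) > 1 else None


def retry_url(account_id, project, deployment_id):
    """Build the re-deploy URL used by Option A."""
    return (f"{API_BASE}/accounts/{account_id}/pages/projects/{project}"
            f"/deployments/{deployment_id}/retry")


# Sample data shaped like the 'result' list of the deployments endpoint.
sample = [
    {"id": "dep-new", "created_on": "2026-04-20T10:00:00Z"},
    {"id": "dep-old", "created_on": "2026-04-19T09:00:00Z"},
]
print(retry_url("<account-id>", "<site-name>", previous_deployment_id(sample)))
```

If the newest deployment is not the broken one, the target should still be chosen by hand from the listing.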
4. Alarms and Escalation
SENTINEL daemons live in /Users/makinja/system/tools/. Alerting routes to Slack #infra-alerts channel.
Alert Table
| Metric | Threshold | Channel | L1 Action | L2 Action | L3 Action |
|---|---|---|---|---|---|
| Uptime (HTTP 200) | < 100% for 5 min | #infra-alerts (Slack) | Auto-retry; post alert | Kelsey investigates: CF status page, DNS check | Escalate to CEO; activate DR (Option C) |
| Build failure | Any failed build on main | #infra-alerts | Alert with build URL + error log | Kelsey reviews workflow, checks CF Pages build log | Revert last commit: git revert HEAD && git push |
| SSL cert expiry | < 30 days to expiry | #infra-alerts | Alert; verify CF auto-renewal is active | Manual CF cert renewal trigger | Contact Cloudflare support |
| 5xx rate | > 1% of requests over 10 min | #infra-alerts | Alert with request sample | Kelsey checks CF Pages function logs | Rollback via CF API re-deploy (Option A, DR section) |
| Traffic anomaly | > 10x baseline in 5 min | #infra-alerts | Alert; verify CF rate limiting active | Check CF analytics for origin; enable under-attack mode | Contact Cloudflare support |
| Bandwidth overage | > 80% of plan limit | #infra-alerts | Alert; review top assets | Optimize images, add cache headers | Upgrade CF plan or move heavy assets to R2 |
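The numeric thresholds in the alert table can be expressed as a single evaluation function. This is a sketch only: the metric field names (`requests`, `errors_5xx`, and so on) are assumptions for illustration and do not correspond to an existing SENTINEL data format.

```python
def triggered_alerts(m):
    """Return the alert-table metrics whose thresholds are exceeded."""
    alerts = []
    # 5xx rate: more than 1% of requests over the window
    if m["requests"] > 0 and m["errors_5xx"] / m["requests"] > 0.01:
        alerts.append("5xx rate")
    # Traffic anomaly: more than 10x baseline
    if m["baseline_rpm"] > 0 and m["current_rpm"] > 10 * m["baseline_rpm"]:
        alerts.append("Traffic anomaly")
    # Bandwidth overage: more than 80% of plan limit
    if m["bandwidth_used_gb"] > 0.80 * m["bandwidth_limit_gb"]:
        alerts.append("Bandwidth overage")
    # SSL cert expiry: fewer than 30 days left
    if m["cert_days_left"] < 30:
        alerts.append("SSL cert expiry")
    return alerts


sample_metrics = {
    "requests": 10000, "errors_5xx": 150,
    "baseline_rpm": 100, "current_rpm": 500,
    "bandwidth_used_gb": 85, "bandwidth_limit_gb": 100,
    "cert_days_left": 45,
}
print(triggered_alerts(sample_metrics))
```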
SENTINEL Integration
Add to /Users/makinja/system/tools/sentinel-uptime.sh:
#!/usr/bin/env bash
# Uptime check for all ALAI sites — run every 5 minutes via cron
SITES=(
"https://alai.no"
"https://snowit.ba"
"https://getdrop.no"
"https://app.getdrop.no"
"https://basicconsulting.no"
"https://basicfakta.no"
"https://bilko-demo.alai.no"
"https://kenyhot.pro"
"https://merdzanovic.ba"
"https://docs.alai.no"
"https://sign.basicconsulting.no"
"https://boards.basicconsulting.no"
"https://vault.basicconsulting.no"
)
for SITE in "${SITES[@]}"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$SITE")
if [ "$STATUS" != "200" ] && [ "$STATUS" != "301" ] && [ "$STATUS" != "302" ]; then
node /Users/makinja/system/tools/slack.js send "#infra-alerts" \
"ALERT: $SITE returned HTTP $STATUS at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
fi
done
Crontab entry: */5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh
5. Cost
Per-Site Monthly Cost (Target State: Cloudflare Pages)
| Site | Current Platform | Current Cost | CF Pages Cost | Notes |
|---|---|---|---|---|
| alai.no | Cloudflare Pages | €0 | €0 | Already there |
| snowit.ba | GitHub Pages | €0 | €0 | Migrate from GitHub Pages |
| getdrop.no | Azure VM (Caddy) | Shared with VM | €0 | Static landing only |
| app.getdrop.no | Azure VM (Caddy) | Shared with VM | Not applicable | Next.js app, stays on VM |
| basicconsulting.no | Vercel | €0 (Free) | €0 | Migrate from Vercel |
| basicfakta.no | Vercel | €0 (Free) | €0 | Migrate from Vercel |
| bilko-demo.alai.no | GCP Cloud Run | €5-10 | €0 | Static export possible; see note |
| kenyhot.pro | Vercel | €0 (Free) | €0 | Client site, coordinate |
| merdzanovic.ba | Vercel | €0 (Free) | €0 | Client site, coordinate |
| docs.alai.no | Azure VM | Shared with VM | Not applicable | BookStack = dynamic, stays on VM |
| sign.basicconsulting.no | Azure VM | Shared with VM | Not applicable | Documenso = dynamic, stays on VM |
| boards.basicconsulting.no | Azure VM | Shared with VM | Not applicable | Planka = dynamic, stays on VM |
| vault.basicconsulting.no | Azure VM | Shared with VM | Not applicable | Vaultwarden = dynamic, stays on VM |
| bilko-api, bilko-intesa-demo | GCP Cloud Run | €5-10 | Not applicable | Dynamic services, stay on GCP |
Note on bilko-demo.alai.no: If Bilko web can be exported as static (Next.js output: 'export'), it moves to CF Pages for €0. If it requires server-side rendering (API routes, auth), it stays on GCP Cloud Run. This is a code-level decision for CodeCraft. Placeholder cost assumes migration succeeds.
Annual Total (Target State)
| Provider | Services After Migration | Monthly | Annual |
|---|---|---|---|
| Cloudflare Pages | 9 static sites | €0 | €0 |
| GCP Cloud Run | Bilko API + demo services (if SSR) | €5-10 | €60-120 |
| Azure VM | BookStack, Documenso, Planka, Vaultwarden, Drop app | €50 | €600 |
| GitHub Pages | snowit.ba (until CF migration) | €0 | €0 |
| one.com domains | alai.no, basicconsulting.no, getdrop.no, bilko.io | €17 | €200 |
| TOTAL | | €72-77/month | €860-920/year |
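The TOTAL row can be re-derived from the provider rows. The sketch below sums the monthly low and high figures from the table; the only assumption is that fixed costs are represented as a range with equal bounds.

```python
# (low, high) monthly cost in EUR per provider, from the table above.
costs = {
    "Cloudflare Pages": (0, 0),
    "GCP Cloud Run": (5, 10),
    "Azure VM": (50, 50),
    "GitHub Pages": (0, 0),
    "one.com domains": (17, 17),
}


def total_range(items):
    """Sum the low and high ends of each cost range."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


monthly = total_range(costs)
annual = (monthly[0] * 12, monthly[1] * 12)
print("monthly:", monthly, "annual:", annual)
```

Multiplying the monthly range by 12 gives €864-924; the table's €860-920 follows from rounding the one.com domains to €200/year.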
Current vs Target Delta
- Current: €72-127/month
- Target: €72-77/month (static sites are free; dynamic services stay)
- Delta: -€0 to -€50/month (savings only materialize if Vercel Pro tier is confirmed and removed)
- Key finding: Most current cost is the Azure VM (€50) and one.com domains (€17). These are not reducible by a hosting platform switch — they serve dynamic apps and DNS. The hosting consolidation eliminates Vercel as a dependency and reduces operational complexity.
Scale: 30 Sites by 2027
At 30 sites, Cloudflare Pages remains €0 (no per-site pricing). The only cost growth vectors are:
- Azure VM upgrade if Drop/BookStack need more resources: +€20-40/month for next tier
- Additional one.com domain registrations: ~€20/year each
- GCP Cloud Run if Bilko scales: usage-based, estimate €10-30/month at moderate traffic
Projected 2027 total: €100-130/month at 30 sites. Cloudflare Pages does not contribute to this increase.
6. Migration Plan
Priority 1 = immediate (no dep, low risk). Priority 2 = planned (some coordination). Priority 3 = blocked/external.
| Domain | Current Platform | Target Platform | Priority | Downtime Window | Dependency | MC Task |
|---|---|---|---|---|---|---|
| alai.no | Cloudflare Pages | Cloudflare Pages | - | None | None — already done | Done |
| basicconsulting.no | Vercel | Cloudflare Pages | 1 | 0 (DNS already on CF) | Find repo | #8482 |
| basicfakta.no | Vercel | Cloudflare Pages | 1 | < 5 min (NS change) | Find repo, change registrar NS | #8483 |
| snowit.ba | GitHub Pages | Cloudflare Pages | 2 | < 5 min | Move DNS from AWS Route53 to CF | #8484 |
| getdrop.no | Azure VM (Caddy) | Cloudflare Pages (static) | 1 | 0 (DNS on Vercel, move to CF) | Static export of Next.js landing | #8485 |
| app.getdrop.no | Azure VM (Caddy) | Azure VM (stay) | - | None | Dynamic Next.js app | No action |
| bilko-demo.alai.no | GCP Cloud Run | Cloudflare Pages (if static export works) | 2 | 0 (DNS already on CF) | CodeCraft confirms static export | #8486 |
| kenyhot.pro | Vercel | Cloudflare Pages | 3 | < 5 min | Coordinate with client, DNS on Vercel | #8487 |
| merdzanovic.ba | Vercel | Cloudflare Pages | 3 | < 5 min | Coordinate with client, third-party DNS | #8488 |
| bilko.io | None (down) | Cloudflare Pages | 2 | N/A (currently down) | Fix one.com DNS, point to CF | #8489 |
| docs/sign/boards/vault.basicconsulting.no | Azure VM | Azure VM (stay) | - | None | Dynamic apps | No action |
| bilko-api, bilko-intesa-demo | GCP Cloud Run | GCP Cloud Run (stay) | - | None | Dynamic API services | No action |
Total sites to migrate: 8 static sites. 4 stay on current platform (dynamic apps/services). 2 done (alai.no, basicconsulting.no).
Migration Log
| Date | Domain | From | To | Downtime | TTFB Before | TTFB After | Notes |
|---|---|---|---|---|---|---|---|
| 2026-04-20 | basicconsulting.no | Vercel (76.76.21.21) | CF Pages | ~60s | 114ms | 51ms (warm avg) | MC #8482. DNS: A->CNAME. Validation required domain re-add. TTFB improved 55%. Proveo pilot validated #8490. |
| 2026-04-20 | bilko.io | one.com (down) | CF Pages | N/A (site was down) | N/A | 68ms (warm avg) | MC #8489. Apex CNAME not possible on one.com free tier (paid feature). Switched to Cloudflare NS (ana.ns.cloudflare.com, bob.ns.cloudflare.com). CF Pages zone ID: 62d89b79f0648d3fa1d045335a989ea7. DNS: CNAME flattening bilko.io → bilko-io.pages.dev (proxied), www → bilko-io.pages.dev. |
Paused migrations:
- MC #8483 (basicfakta.no) — Inventory error: site has serverless functions (Vercel Edge), not pure static. Requires CodeCraft assessment.
- MC #8484 (snowit.ba) — Inventory error: site has API routes (Next.js), not pure static. Requires CodeCraft assessment.
Audit verdict for #8486 (bilko-demo.alai.no): Full-stack Next.js app with dynamic API routes. Stays on GCP Cloud Run. Not eligible for CF Pages migration.
7. Lessons Learned
2026-04-20 — CF Browser Integrity Check blocks headless clients
Incident: LightRAG 46h outage (MC #8487 followup)
Problem: Automation HTTP clients (Python urllib, Node fetch, etc.) get HTTP 403 (error code 1010) from CF-proxied hostnames with Browser Integrity Check (BIC) enabled, even when IP bypass or CF Access service tokens are configured.
Root cause: BIC layer evaluates BEFORE Access policies and blocks requests based on User-Agent string. Python/Node default UAs trigger block, but curl/wget/browser tests pass — creating a false sense of security.
Fix: Create Cloudflare Configuration Rule disabling BIC per hostname. See rule INFRA-CF-001 (~/system/rules/cf-proxied-api-bic-whitelist.md) and BookStack page ID 2692.
Evidence: ~/system/evidence/lightrag-ingestion-investigation-20260420-215700.md
Hostnames affected: ollama.basicconsulting.no (fixed), lightrag.basicconsulting.no (verify needed)
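Because curl and browser tests pass while automation clients are blocked, it helps when the client itself names the likely cause. The sketch below classifies a response as a probable BIC block; it assumes the 403 body contains the error code 1010 as text, which should be verified against a real blocked response. Both function names are hypothetical.

```python
import re


def looks_like_bic_block(status, body):
    """True for an HTTP 403 whose body mentions Cloudflare error code 1010."""
    return status == 403 and re.search(r"\b1010\b", body) is not None


def diagnose(status, body):
    """Turn a response into a short hint for the on-call log."""
    if looks_like_bic_block(status, body):
        return "Probable CF Browser Integrity Check block: see rule INFRA-CF-001"
    if status == 403:
        return "403 without code 1010: check Access policy or token"
    return "No BIC indication"


print(diagnose(403, "error code: 1010"))
```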
8. DoD Checklist
- File exists at /Users/makinja/system/specs/ALAI-STATIC-HOSTING-BLUEPRINT.md
- BookStack sync task created — MC #8491 (Skillforge owner) — sync this file to docs.alai.no under "Infrastructure > Hosting"
- Proveo validation task created — MC #8490 (Angie Jones owner) — deploy blueprint to 1 test site (basicconsulting.no), verify < 60s rollback works end-to-end
- 8 migration MC tasks created: #8482 #8483 #8484 #8485 #8486 #8487 #8488 #8489
- SENTINEL uptime script deployed and crontab entry added
- Renovate enabled on all repos
- getdrop.no DNS moved from Vercel to Cloudflare
- 8 stale Vercel projects deleted (see inventory)
Cloud Migration 2026
ALAI cloud migration master plan: 6-phase transition from ANVIL-only to cloud-hosted control plane
Master Plan — Cloud Migration
Phase 1 — Bitwarden Cloud Migration
Phase 1 — Bitwarden Cloud Migration
Timeline: Days 1-3
Goal: Eliminate Vaultwarden SPOF as the very first step. Every subsequent phase depends on secrets being available globally, not just when the Azure VM is alive.
MC Task: #8494
Proveo Owner: Angie Jones
Status: PREVIEW — Parisa writing detailed runbook in parallel
Why First
Phase 2 onwards deploys to Azure Container Apps. Those containers need secrets at startup (Anthropic API key, Postgres connection string, Azure SP). If Vaultwarden is down, all containers fail to start. Fix the foundation before building on it.
Deliverables
- Export all current Vaultwarden items to encrypted JSON
- Import to Bitwarden cloud Teams ($4/user/month — 1 seat = $4/month total)
- Update alai-cli bootstrap step to use bw login against cloud.bitwarden.com
- Update all agent bootstrap scripts to use cloud BW endpoint
- Delete the BW CLI config pointing to vault.basicconsulting.no
Rollback Plan
Vaultwarden self-hosted remains running in parallel until Phase 6. If Bitwarden cloud import fails, fall back to self-hosted immediately. Keep vault export as encrypted offline backup in ~/system/backups/.
Proveo Validation Criteria
Test Owner: Angie Jones (Proveo)
- Fresh bw login alembasic@gmail.com on a machine with NO vault.basicconsulting.no access returns all expected items (GitHub token, Azure SP, Anthropic key, SSH key)
- alai login (once built in Phase 4) succeeds using cloud BW credentials
- Vaultwarden VM can be stopped for 1 hour with no agent failures on ANVIL
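The first criterion can be checked mechanically from the JSON that `bw list items` prints. The sketch below compares item names against an expected list; the item names shown are placeholders for whatever the vault entries are actually called, and `missing_items` is a hypothetical helper.

```python
import json

# Placeholder names: replace with the real vault item names.
EXPECTED = ["GitHub token", "Azure SP", "Anthropic key", "SSH key"]


def missing_items(bw_list_output, expected):
    """Return expected item names absent from `bw list items` JSON output."""
    names = {item.get("name", "") for item in json.loads(bw_list_output)}
    return [name for name in expected if name not in names]


# Sample output with two items imported so far.
sample_output = json.dumps([{"name": "GitHub token"}, {"name": "Azure SP"}])
print(missing_items(sample_output, EXPECTED))
```

An empty result means every expected item is visible from the cloud vault on that machine.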
Cost
Bitwarden cloud Teams: $4/user/month × 1 user = $4/month
vs Vaultwarden HA (2 VMs + Load Balancer): ~$88/month
Detailed Runbook
Parisa Tabriz (Securion) is writing the full step-by-step runbook in parallel. Once complete, it will be referenced here:
~/system/architecture/phase-1-bitwarden-runbook.md (pending)
Credit: ALAI, 2026
Phase 2 — MC + HiveMind API
Phase 2 — MC + HiveMind API
Timeline: Weeks 1-2
Goal: Mission Control and HiveMind leave ANVIL and become cloud-hosted APIs. This is the biggest architectural change — SQLite becomes Postgres, local scripts become REST calls.
MC Task: #8495
Proveo Owner: Angie Jones
Status: PREVIEW — Kelsey working in parallel
Why Second
MC and HiveMind are the nervous system. Once they are cloud-hosted, every other phase can run from any machine without touching ANVIL.
Deliverables
- mc-api.js: Express-based REST API wrapping current mc.js logic
- GET /tasks, POST /tasks, PATCH /tasks/:id, GET /stats
- Postgres driver (pg) replacing SQLite
- Schema migration: 8378 tasks, 127 open — pg-migrate from SQLite dump
- hivemind-api.js: REST + optional WebSocket for pub/sub
- Postgres backend (hivemind schema)
- Docker images for both, pushed to Azure Container Registry
- Azure Container Apps: deploy mc-api and hivemind-api
- Consumption plan (serverless, scale-to-zero when no traffic)
- Min replicas: 1 (so cold start is 2-4s max, not 30s+)
- Memory: 0.5GB each, vCPU: 0.25 each
- Azure Database for Postgres Flexible Server: Burstable B1ms
- Region: swedencentral
- mission_control DB + hivemind DB on same instance
- Automated backups (7-day retention, included in cost)
- Update mc.js client wrapper: detect ALAI_MC_URL env var, proxy to API if set
- Backward compatible: if no ALAI_MC_URL, still uses local SQLite (ANVIL stays working)
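The backward-compatible switch in the last deliverable amounts to one decision on the ALAI_MC_URL variable. The following sketch shows that decision in Python for illustration; mc.js itself is JavaScript, and `select_backend` and the sample URL are hypothetical.

```python
def select_backend(env):
    """Choose the MC backend: cloud API if ALAI_MC_URL is set, else local SQLite."""
    url = env.get("ALAI_MC_URL", "").strip()
    if url:
        return ("api", url.rstrip("/"))
    # Fallback keeps ANVIL working without any cloud dependency.
    return ("sqlite", "mission-control.db")


print(select_backend({}))
print(select_backend({"ALAI_MC_URL": "https://mc-api.example.invalid/"}))
```

Treating a blank value the same as an unset one keeps the rollback simple: unsetting or emptying the variable returns the client to local SQLite.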
Cost Estimate
Container Apps (2 apps, ~5h/day active, consumption plan): ~$1.50/month per app = $3/month total (Free grant: 180,000 vCPU-s/month covers most light usage)
Azure Postgres B1ms: ~$22-24/month (swedencentral, Flexible Server)
Azure Container Registry Basic: $5/month
Total Phase 2 additions: ~$30-32/month
Rollback Plan
mc.js still reads local SQLite if ALAI_MC_URL is not set. If Postgres or Container Apps fail, unset ALAI_MC_URL on ANVIL and operations continue locally. SQLite is kept in parallel for 30 days post-migration before decommission.
Proveo Validation Criteria
Test Owner: Angie Jones (Proveo)
- From ab-mac (no local SQLite): alai mc list returns live tasks
- From ANVIL: node ~/system/tools/mc.js list still works (backward compat)
- POST to mc-api: task appears in both mc.js list AND cloud Postgres within 2s
- Postgres automated backup: verify restore of 100-row sample matches source
- Container App scales to zero after 10min idle, cold starts under 5s
Detailed Implementation
Kelsey Hightower (FlowForge) is implementing Azure Container Apps + Postgres in parallel. Full runbook will be linked here once ready.
Current State vs Target State
Current State vs Target State
Purpose: Visual comparison of ALAI's architecture today (ANVIL single-point-of-failure) vs the cloud-hosted control plane target state.
Source: ~/system/architecture/cloud-migration-master-plan.md
TODAY — SINGLE SPOF ARCHITECTURE
ANVIL (makinja-sin-mac-studio) Azure swedencentral
100.103.49.98 4.223.110.181
┌─────────────────────────────────┐ ┌──────────────────────────────┐
│ CONTROL PLANE (all-in-one) │ │ Supporting services (1 VM) │
│ │ │ Standard_B2als_v2, 2vCPU │
│ Mission Control (mc.js) │ │ 4GB RAM, 30GB SSD │
│ └─ SQLite mission-control.db │ │ │
│ 8378 tasks │ │ BookStack (docs) │
│ │ │ Vaultwarden (secrets — SPOF)│
│ HiveMind (hivemind.db) │ │ Planka (boards) │
│ Agent runner (pi-orchestrator) │ │ Documenso (signing) │
│ 30 LaunchAgent daemons │ │ Grafana / Prometheus │
│ Rules/skills/agents (git) │ │ Caddy (reverse proxy) │
│ │ │ │
│ LightRAG (Docker :9621) │ │ Cost estimate: $5-53/month │
│ Neo4j (Docker :7474/:7687) │ │ (Azure Founders Hub credit) │
│ Knowledge graph (481MB) │ └──────────────────────────────┘
│ │
│ Ollama :11434 │ Azure Blob (alaibackups0ebb)
│ qwen3.5:27b (17G) │ ┌──────────────────────────────┐
│ orchestrator:latest (23G) │ │ system-db-backups │
│ alaiml-task/tender/email (3G) │ │ system-git-bundles │
│ qwen2.5-coder:32b (23G) │ │ bitwarden-exports │
│ bge-m3 + others (~40G) │ │ Cost: ~$2.40/month │
└─────────────────────────────────┘ └──────────────────────────────┘
│ LAN only (10.0.0.2)
┌────────▼────────────────────────┐
│ FORGE (Mac Mini) │
│ devstral:24b, qwen2.5-coder │
│ NOT on Tailscale — LAN only │
└─────────────────────────────────┘
Tailscale mesh: 4 nodes
makinja-sin-mac-studio 100.103.49.98
ab-mac 100.118.37.71
basicass-mac-mini 100.104.164.86
iphone181 100.93.161.73
NOTE: ANVIL Ollama :11434 NOT reachable from ab-mac (port timeout verified).
NOTE: 306 files in ~/system/ hardcode localhost:11434 — zero portability today.
SPOF inventory (4 critical):
[1] ANVIL dead → mc.js, HiveMind, agents, LightRAG, Ollama ALL stop
[2] FORGE dead → devstral/coder workload stops (Anthropic can substitute)
[3] Azure VM dead → Vaultwarden down, secrets inaccessible, agents cannot bootstrap
[4] Local network → FORGE permanently isolated (LAN-only, no Tailscale)
TARGET — CLOUD-HOSTED CONTROL PLANE + THIN CLIENT
CLIENT (any OS — new laptop, travel machine, etc.)
┌──────────────────────────────────────────────────┐
│ alai-cli (single installable package) │
│ brew install alai | npm install -g @alai/cli │
│ winget install alai | apt install alai-cli │
│ │
│ alai login → OAuth2 PKCE → Azure AD B2C │
│ alai start → connects to cloud APIs │
│ alai mc list → proxies to MC API │
│ alai agent run → dispatches to agent runner │
│ │
│ Claude Code CLI (installed separately) │
│ ~/.claude/ cloned from git on login │
└──────────────────────────────────────────────────┘
│ HTTPS (Azure Front Door or direct)
│ Auth: Azure AD B2C JWT
┌───────────────▼──────────────────────────────────┐
│ CLOUD CONTROL PLANE (Azure Container Apps) │
│ Region: swedencentral (existing subscription) │
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ MC API │ │ Agent Runner API │ │
│ │ REST + WebSocket│ │ POST /run │ │
│ │ → Postgres │ │ → dispatches agents │ │
│ └─────────────────┘ └──────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ HiveMind API │ │ Skills/Rules Proxy │ │
│ │ pub/sub │ │ serves ~/system/ │ │
│ │ → Postgres │ │ content from Git │ │
│ └─────────────────┘ └──────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ Auth API │ │ Secrets Proxy │ │
│ │ Azure AD B2C │ │ → Bitwarden cloud │ │
│ │ JWT issuance │ │ (no self-hosted BW) │ │
│ └─────────────────┘ └──────────────────────┘ │
│ │
│ Azure Database for Postgres (Flexible Server) │
│ Burstable B1ms — mission_control + hivemind │
│ (migrated from local SQLite) │
│ │
│ Azure Container Registry (private) │
│ MC API, HiveMind, Agent Runner images │
└──────────────────────────────────────────────────┘
│ Tailscale (encrypted WireGuard)
│ OR public HTTPS (for Anthropic-only agents)
┌───────────────▼──────────────────────────────────┐
│ DATA PLANE (stays on hardware) │
│ │
│ ANVIL 100.103.49.98 FORGE 10.0.0.2 │
│ Ollama :11434 (primary) devstral:24b │
│ qwen3.5:27b qwen2.5-coder:32b │
│ alaiml-task/tender/email (add to Tailscale) │
│ orchestrator:latest :11434 │
│ LightRAG + Neo4j (Phase 5) │
│ │
│ CLOUD ML FALLBACK (Phase 5) │
│ Together.ai — Llama-3.3-70B $0.88/M tokens │
│ Triggered only when ANVIL:11434 unreachable │
└──────────────────────────────────────────────────┘
SECRETS (Phase 6 — replaces self-hosted Vaultwarden)
┌──────────────────────────────────────────────────┐
│ Bitwarden cloud (Teams plan) │
│ $4/user/month — 1 user = $4/month │
│ HA by default — Bitwarden's infrastructure │
│ alai-cli integrates via BW CLI at login │
└──────────────────────────────────────────────────┘
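The data-plane fallback order in the diagram (ANVIL first, then FORGE, then the cloud model) can be sketched as a probe loop. The addresses and port come from the diagram; the function names, the 2-second timeout, and the use of a plain TCP connect as the reachability test are assumptions.

```python
import socket

# Local inference hosts in preference order (from the target-state diagram).
ENDPOINTS = [
    ("ANVIL", "100.103.49.98", 11434),
    ("FORGE", "10.0.0.2", 11434),
]


def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend(probe=tcp_reachable):
    """Return the first reachable local host, else the cloud fallback."""
    for name, host, port in ENDPOINTS:
        if probe(host, port):
            return name
    return "Together.ai"


# Dry run with a fake probe that reports every local host as down.
print(choose_backend(probe=lambda host, port: False))
```

Passing the probe as a parameter keeps the selection logic testable without touching the network.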
Key Differences
| Component | Current State (ANVIL SPOF) | Target State (Cloud Control Plane) |
|---|---|---|
| Mission Control | SQLite on ANVIL disk | Postgres + MC API (Azure Container Apps) |
| HiveMind | SQLite on ANVIL disk | Postgres + HiveMind API (Azure Container Apps) |
| Agent Runner | pi-orchestrator on ANVIL only | Cloud agent-runner (Anthropic-powered agents), ANVIL for fine-tuned models |
| Secrets | Vaultwarden on single Azure VM | Bitwarden cloud ($4/month, HA by default) |
| Client Bootstrap | Manual setup, ANVIL-dependent | brew install alai && alai login — under 10 minutes, any OS |
| Ollama | ANVIL only, FORGE LAN-isolated | ANVIL + FORGE (Tailscale) + Together.ai cloud fallback |
| Cost | $27-106/month (mostly hidden by Azure credit) | $108-165/month (transparent, no hidden dependencies) |
| ANVIL Offline Impact | Total system outage | Cloud services continue, fine-tuned models pause gracefully |
SPOF Elimination
4 SPOFs removed:
- ANVIL death — control plane (MC, HiveMind, agent runner) migrates to cloud. ANVIL offline = Ollama workloads pause, everything else continues.
- Vaultwarden VM death — secrets migrate to Bitwarden cloud (HA by default). No more single-VM secret dependency.
- Network isolation — FORGE joins Tailscale. Cloud services can reach FORGE for code tasks even when ANVIL is down.
- Workstation lock-in — alai-cli works from any machine. No more "John only works from ANVIL."
ANVIL SPOF Elimination Plan (2026-04-20)
Status: DRAFT — Awaiting Proveo validation + Alem approval
Author: Kelsey Hightower / FlowForge
Date: 2026-04-20
MC Task: #8515 ANVIL SPOF elimination sprint
Deadline: 2026-05-01
ANVIL SPOF Elimination Plan
Author: FlowForge (Kelsey Hightower) | MC Task #8515
Date: 2026-04-20
Status: DRAFT — Awaiting Alem approval before any implementation
Executive Summary
ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage, kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference, all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage. RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases.
Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month additional infrastructure cost (FORGE is already owned and powered).
Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+)
Architecture Overview
ANVIL (primary) FORGE (warm standby)
Mac Studio M3 Ultra 96GB Mac Studio M3 Ultra 256GB
100.103.49.98 (Tailscale) 100.104.164.86 (Tailscale)
10.0.0.1 (Thunderbolt) 10.0.0.2 (Thunderbolt)
│ │
│ Thunderbolt Bridge (< 1ms) │
└──────────────────────────────────┘
│
▼
Azure Blob Storage
alaibackups0ebb
system-db-backups container
(litestream WAL segments, all DBs)
All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore). FORGE does NOT write back to Azure. Azure is the single durable WAL store.
Phase 1 — Litestream Expansion (all ~67 DBs)
1.1 Database Tier Classification
Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only.
P0 — Mission Critical (system stops without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| mission-control.db | 26 MB | Very high | Primary task ledger — all MC operations. CURRENTLY REPLICATED. |
| hivemind.db | 162 MB | High | Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED. |
| tasks.db | 4 KB | High | Active task queue — active work in flight |
| costs.db | 256 KB | High | Token cost tracking, budget enforcement |
| events.db | 14 MB | High | System event bus — orchestrator depends on this |
| orchestrator-queue.db | 28 KB | High | Active agent job queue — jobs lost = work lost |
| orchestrator-workers.db | 36 KB | High | Worker state — active session tracking |
| durable-runner.db | 896 KB | Medium | Durable task execution state |
| session-index.db | 56 MB | High | Agent session state — all active sessions |
| knowledge.db | 192 MB | Medium | RAG knowledge base — primary retrieval corpus |
| emails.db | 0 B (active) | High | Email agent state — initialized on first write |
| email-inbox.db | 3.1 MB | High | Live email queue |
| alem-directives.db | active WAL | High | CEO directives — highest trust data |
P0 — Financial / Legal (loss = regulatory exposure)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| fiken.db | 0 B (active) | Medium | Fiken accounting integration — financial records |
| invoices.db | 36 KB | Medium | Invoice state — revenue tracking |
| contracts.db | 40 KB | Low | Signed contracts — legal documents |
| leads.db | 256 KB | Medium | Sales pipeline — business development |
P1 — Operational (system degrades without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| agent-routing.db | 4.1 MB | Medium | Routing decisions, agent assignment |
| bee-index.db | 4.2 MB | Medium | Bee task index |
| bih-tenders.db | 640 KB | Low | BiH market tenders — business intelligence |
| browser-tasks.db | active WAL | Medium | Browser automation queue |
| companies.db | 0 B (active) | Low | Company registry |
| contacts.db | 192 KB | Low | CRM contacts |
| deploy-registry.db | 16 KB | Low | Deployment history |
| design-reviews.db | 64 KB | Low | Design review state |
| distill.db | 2.0 MB | Medium | Knowledge distillation cache |
| documents.db | 32 KB | Low | Document registry |
| drafts.db | 360 KB | Medium | Draft content |
| drift.db | active WAL | Medium | Config drift detection |
| email-audit.db | 256 KB | Medium | Email audit trail |
| email-briefing.db | 0 B (active) | Low | Daily briefing state |
| email-index.db | 0 B (active) | Low | Email search index |
| email-tracking.db | 36 KB | Medium | Email delivery tracking |
| escalations.db | 24 KB | Medium | Escalation queue |
| facts.db | 20 KB | Low | System facts store |
| flywheel.db | 432 MB | Low | Flywheel learning data — largest DB |
| goals.db | 44 KB | Medium | OKR / goal tracking |
| guardrails-audit.db | 10 MB | Medium | Safety audit trail |
| health-events.db | 15 MB | High | System health events |
| hivemind-archive.db | 6.7 MB | Low | HiveMind historical archive |
| master-control.db | 0 B (active) | Medium | Master control state |
| mc.db | 0 B (active) | Medium | Mission control alias |
| minions.db | 192 KB | Medium | Minion agent registry |
| observability.db | 44 KB | Medium | Metrics and traces |
| orchestrator-events.db | 0 B (active) | Medium | Orchestrator event log |
| pipeline.db | active WAL | Medium | CI/CD pipeline state |
| projects.db | 40 KB | Low | Project registry |
| routing-outcomes.db | 192 KB | Medium | Tier routing outcome log |
| skill-improvements.db | 20 KB | Low | Skill improvement tracking |
| skill-registry.db | 128 KB | Low | Agent skill registry |
| sprint-pipeline.db | 32 KB | Medium | Sprint pipeline state |
| strategy-tracker.db | 128 KB | Low | Strategic initiative tracking |
| teams.db | 40 KB | Low | Team registry |
| tenders.db | 384 KB | Low | Norwegian tender data |
| tickets.db | active WAL | Medium | Support ticket tracking |
| tool-audit.db | 6.1 MB | Medium | Tool usage audit |
| tool-registry.db | 128 KB | Low | Tool registry |
| trace-events.db | 52 MB | High | Distributed trace store |
| applications-tracker.db | 12 KB | Low | Job/grant applications |
P2 — Cache / Reconstructible (loss = inconvenience only)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| baikal-caldav.db | 108 KB | Low | CalDAV cache — reconstructible from Baikal |
| prompt-cache.db | 320 KB | Medium | LLM prompt cache — can warm from scratch |
| prompt-metrics.db | 28 KB | Low | Prompt performance metrics |
| rag-cache.db | active WAL | Medium | RAG response cache — reconstructible |
| semantic-reuse-index.db | 192 KB | Medium | Semantic cache — reconstructible |
| stbs.db | 0 B (active) | Low | STBS data — empty |
| telemetry.db | 24 KB | Medium | Telemetry — can lose without ops impact |
| token-cost.db | active WAL | Medium | Cost log — reconstructible from API receipts |
| usage.db | 0 B (active) | Low | Usage tracking — empty |
| vcr.db | active WAL | Low | HTTP cassette cache — reconstructible |
1.2 Retention Strategy
Current retention for the 2 replicated DBs: 72h. This is insufficient for P0.
| Tier | Retention | Justification |
|---|---|---|
| P0 (mission-critical) | 7d | One week: covers weekend + Monday incident recovery. 72h is too tight — if a silent corruption is not caught in 3 days, all WAL segments are gone. |
| P0 (financial/legal) | 30d | Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows. |
| P1 | 72h | Current default. Operationally acceptable. |
| P2 | 24h | Cache data. Disk cost matters more than recovery depth. |
Retention-check-interval: 1h for all tiers (current default, correct).
Sync-interval: 1s for P0 and P1, 10s for P2 (reduces Azure transaction cost on low-value data). flywheel.db is the exception at 30s (see 1.3).
Azure storage cost estimate at current sizes (~1.2 GB total databases):
- WAL segments are incremental. Estimate ~500 MB/day delta across all active DBs.
- 7-day P0 WAL: ~3.5 GB. 30-day financial: ~1 GB. P1 72h: ~1 GB.
- Total Azure Blob: ~6 GB. At ~€0.02/GB/month = ~€0.12/month. Negligible.
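The estimate above is the retained volume per tier, summed and multiplied by a flat blob price. A minimal sketch of that arithmetic, using the figures quoted in this section. The 0.5 GB for P2 is an assumption (the section gives no P2 figure) chosen so the total matches the ~6 GB stated, estimateMonthlyCost is a hypothetical helper name, and transaction charges are not modelled:

```javascript
// Retained volume per tier in GB, taken from the bullets above.
// The P2 figure is NOT stated in the plan; 0.5 GB is an assumed placeholder.
const retainedGb = {
  p0Critical: 3.5,  // 7 days of WAL
  p0Financial: 1.0, // 30 days of WAL
  p1: 1.0,          // 72 hours of WAL
  p2: 0.5,          // assumption, see note above
};

// Sum the retained volume and multiply by a flat price per GB per month.
function estimateMonthlyCost(tiers, eurPerGbMonth) {
  const totalGb = Object.values(tiers).reduce((sum, gb) => sum + gb, 0);
  return { totalGb, eurPerMonth: (totalGb * eurPerGbMonth).toFixed(2) };
}

console.log(estimateMonthlyCost(retainedGb, 0.02));
```

Changing a retention window only changes one entry in retainedGb, which makes it easy to see how far the tiers can grow before storage cost matters.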
1.3 New litestream.yml
Path: /Users/makinja/system/config/litestream.yml
Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn.
Note on knowledge.db (192 MB): P0, sync-interval 1s — it's actively written by RAG ingestion.
# Litestream — SQLite streaming replication to Azure Blob Storage
# Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98)
# Config: /Users/makinja/system/config/litestream.yml
# Auth: Azure SP (alai-backup-writer) via client credentials
# SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669)
# SP has Storage Blob Data Contributor on system-db-backups container
# Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env
# Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block)
# Updated: 2026-04-20 — ANVIL SPOF Elimination Sprint (MC #8515)
#
# Tier reference:
# P0-critical: retention 7d, sync 1s
# P0-financial: retention 30d, sync 1s
# P1: retention 72h, sync 1s (or 30s for large DBs)
# P2: retention 24h, sync 10s
dbs:
  # ── P0 MISSION CRITICAL ──────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
        retention: 168h # 7 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/hivemind.db
    replicas:
      - name: hivemind-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind
        retention: 168h # 7 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tasks.db
    replicas:
      - name: tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tasks
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/costs.db
    replicas:
      - name: costs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/costs
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/events.db
    replicas:
      - name: events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/events
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-queue.db
    replicas:
      - name: orch-queue-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-queue
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-workers.db
    replicas:
      - name: orch-workers-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-workers
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/durable-runner.db
    replicas:
      - name: durable-runner-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/durable-runner
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/session-index.db
    replicas:
      - name: session-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/session-index
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/knowledge.db
    replicas:
      - name: knowledge-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/knowledge
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/emails.db
    replicas:
      - name: emails-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/emails
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-inbox.db
    replicas:
      - name: email-inbox-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-inbox
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/alem-directives.db
    replicas:
      - name: alem-directives-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/alem-directives
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P0 FINANCIAL / LEGAL ─────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/fiken.db
    replicas:
      - name: fiken-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/fiken
        retention: 720h # 30 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/invoices.db
    replicas:
      - name: invoices-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/invoices
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/contracts.db
    replicas:
      - name: contracts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contracts
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/leads.db
    replicas:
      - name: leads-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/leads
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P1 OPERATIONAL ───────────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/agent-routing.db
    replicas:
      - name: agent-routing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/agent-routing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/bee-index.db
    replicas:
      - name: bee-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bee-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/bih-tenders.db
    replicas:
      - name: bih-tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bih-tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/browser-tasks.db
    replicas:
      - name: browser-tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/browser-tasks
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/companies.db
    replicas:
      - name: companies-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/companies
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/contacts.db
    replicas:
      - name: contacts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contacts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/deploy-registry.db
    replicas:
      - name: deploy-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/deploy-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/design-reviews.db
    replicas:
      - name: design-reviews-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/design-reviews
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/distill.db
    replicas:
      - name: distill-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/distill
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/documents.db
    replicas:
      - name: documents-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/documents
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/drafts.db
    replicas:
      - name: drafts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drafts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/drift.db
    replicas:
      - name: drift-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drift
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-audit.db
    replicas:
      - name: email-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-briefing.db
    replicas:
      - name: email-briefing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-briefing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-index.db
    replicas:
      - name: email-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-tracking.db
    replicas:
      - name: email-tracking-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-tracking
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/escalations.db
    replicas:
      - name: escalations-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/escalations
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/facts.db
    replicas:
      - name: facts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/facts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/flywheel.db
    replicas:
      - name: flywheel-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/flywheel
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 30s # 432MB, throttle sync to reduce Azure transactions
  - path: /Users/makinja/system/databases/goals.db
    replicas:
      - name: goals-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/goals
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/guardrails-audit.db
    replicas:
      - name: guardrails-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/guardrails-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/health-events.db
    replicas:
      - name: health-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/health-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/hivemind-archive.db
    replicas:
      - name: hivemind-archive-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind-archive
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/master-control.db
    replicas:
      - name: master-control-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/master-control
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/mc.db
    replicas:
      - name: mc-db-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mc-db
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/minions.db
    replicas:
      - name: minions-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/minions
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/observability.db
    replicas:
      - name: observability-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/observability
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-events.db
    replicas:
      - name: orch-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/pipeline.db
    replicas:
      - name: pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/projects.db
    replicas:
      - name: projects-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/projects
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/routing-outcomes.db
    replicas:
      - name: routing-outcomes-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/routing-outcomes
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/skill-improvements.db
    replicas:
      - name: skill-improvements-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-improvements
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/skill-registry.db
    replicas:
      - name: skill-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/sprint-pipeline.db
    replicas:
      - name: sprint-pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/sprint-pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/strategy-tracker.db
    replicas:
      - name: strategy-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/strategy-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/teams.db
    replicas:
      - name: teams-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/teams
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tenders.db
    replicas:
      - name: tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tickets.db
    replicas:
      - name: tickets-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tickets
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tool-audit.db
    replicas:
      - name: tool-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tool-registry.db
    replicas:
      - name: tool-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/trace-events.db
    replicas:
      - name: trace-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/trace-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/applications-tracker.db
    replicas:
      - name: applications-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/applications-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P2 CACHE / RECONSTRUCTIBLE ───────────────────────────────────────────────
  - path: /Users/makinja/system/databases/baikal-caldav.db
    replicas:
      - name: baikal-caldav-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/baikal-caldav
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/prompt-cache.db
    replicas:
      - name: prompt-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/prompt-metrics.db
    replicas:
      - name: prompt-metrics-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-metrics
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  # rag-cache.db is listed in the P2 table in 1.1; the replica name and path below follow the pattern of the other entries
  - path: /Users/makinja/system/databases/rag-cache.db
    replicas:
      - name: rag-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/rag-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/semantic-reuse-index.db
    replicas:
      - name: semantic-reuse-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/semantic-reuse-index
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/stbs.db
    replicas:
      - name: stbs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/stbs
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/telemetry.db
    replicas:
      - name: telemetry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/telemetry
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/token-cost.db
    replicas:
      - name: token-cost-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/token-cost
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/usage.db
    replicas:
      - name: usage-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/usage
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/vcr.db
    replicas:
      - name: vcr-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/vcr
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
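Every entry in the config above has the same shape and differs only in database name, retention, and sync interval, so the file could be generated from the tier table rather than maintained by hand. A minimal sketch of that idea follows. TIERS and dbEntry are hypothetical names, and the sketch assumes the replica name is always the database name plus -abs, which the real config does not follow everywhere (mc-abs and orch-queue-abs are exceptions):

```javascript
// Tier table from the "Tier reference" comment at the top of the config.
const TIERS = {
  "P0-critical": { retention: "168h", sync: "1s" },
  "P0-financial": { retention: "720h", sync: "1s" },
  P1: { retention: "72h", sync: "1s" },
  P2: { retention: "24h", sync: "10s" },
};

// Render one database entry. syncOverride covers per-database exceptions
// such as flywheel.db (30s).
function dbEntry(name, tier, syncOverride) {
  const { retention, sync } = TIERS[tier];
  return [
    `  - path: /Users/makinja/system/databases/${name}.db`,
    `    replicas:`,
    `      - name: ${name}-abs`,
    `        type: abs`,
    `        endpoint: https://alaibackups0ebb.blob.core.windows.net`,
    `        bucket: system-db-backups`,
    `        path: litestream/${name}`,
    `        retention: ${retention}`,
    `        retention-check-interval: 1h`,
    `        sync-interval: ${syncOverride || sync}`,
  ].join("\n");
}

console.log(["dbs:", dbEntry("tasks", "P0-critical"), dbEntry("flywheel", "P1", "30s")].join("\n"));
```

A generator like this also makes it harder for a database listed in the tier tables to be left out of the config by accident.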
1.4 Implementation Steps (ANVIL)
- Stop litestream: launchctl stop com.alai.litestream
- Replace /Users/makinja/system/config/litestream.yml with the config above.
- Validate config: /opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate
- Start litestream: launchctl start com.alai.litestream
- Verify all DBs appear in Azure: az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --query "[].name" | wc -l (expect ~67+ entries).
- Watch logs for errors: tail -f /Users/makinja/system/logs/litestream-error.log
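A raw blob count includes every snapshot and WAL segment, so it grows over time and does not say how many databases are actually replicating. A small sketch that reduces a blob listing to the distinct litestream/<db> prefixes. It assumes the names arrive one per line (for example from the az command above with -o tsv added); distinctDbPrefixes and the sample blob names are hypothetical, and the real object layout depends on the Litestream version:

```javascript
// Reduce blob names like "litestream/<db>/..." to the sorted set of <db> names.
function distinctDbPrefixes(blobListing) {
  const dbs = new Set();
  for (const line of blobListing.split("\n")) {
    const parts = line.trim().split("/");
    // Expect at least "litestream", "<db>", and one more path component.
    if (parts.length >= 3 && parts[0] === "litestream" && parts[1] !== "") {
      dbs.add(parts[1]);
    }
  }
  return [...dbs].sort();
}

// Hypothetical listing, for illustration only.
const sample = [
  "litestream/mission-control/generations/a1/snapshots/00000000.snapshot.lz4",
  "litestream/mission-control/generations/a1/wal/00000000/00000000.wal.lz4",
  "litestream/hivemind/generations/b2/snapshots/00000000.snapshot.lz4",
].join("\n");

console.log(distinctDbPrefixes(sample));
```

Comparing the resulting list against the dbs: entries in litestream.yml shows which databases have not produced a replica yet.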
Phase 2 — FORGE Hardware / OS Decision
2.1 FORGE Already Exists — Hardware Decision Is Made
FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86. User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b, deepseek-r1:70b, qwen3-coder, and bge-m3.
No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned).
2.2 Why FORGE Wins Over Every Alternative
| Option | Cost/mo | Latency to ANVIL | Apple Silicon | macOS parity | Verdict |
|---|---|---|---|---|---|
| FORGE (Mac Studio M3U 256GB, owned) | 0 EUR | < 1ms (Thunderbolt) | Yes (M3 Ultra) | Yes (same LaunchAgent ecosystem) | CHOSEN |
| Mac Mini M4 Pro (purchase) | ~50 EUR amortized | < 1ms if local | Yes | Yes | Redundant — FORGE exists |
| Hetzner Linux VM (CCX33) | ~30-50 EUR | 10-30ms (internet) | No (x86) | No (systemd, not launchd) | Budget option only if FORGE fails |
| Azure VM (Sweden Central) | ~60-80 EUR | 10-30ms | No | No | Closest to Azure storage but no Apple Silicon |
Decision: Use FORGE as warm standby. Zero additional cost. The Thunderbolt link keeps ANVIL-to-FORGE latency effectively local, and the Azure-based restore loop in Phase 3 is designed to keep FORGE's copy of each database less than 60s behind ANVIL.
2.3 FORGE Bootstrap Prerequisites
FORGE already runs Ollama. What is missing:
- litestream installed on FORGE (check: brew list litestream on basicas@FORGE)
- Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
- ~/system/databases/ directory created on FORGE
- litestream-restore-loop.sh daemon script written and loaded as LaunchAgent on FORGE
- SSH key access from ANVIL to FORGE for health check and failover scripts
Phase 3 — Continuous Restore on FORGE (< 60s RPO)
3.1 Architecture
FORGE runs litestream restore in a loop per database. Litestream 0.5.x does not have
a native watch mode: each restore call fetches a snapshot plus the WAL segments that follow it.
The recommended approach is therefore a shell script loop that calls litestream restore
repeatedly with a short interval.
FORGE must not run litestream replicate against the SAME Azure bucket paths, because that
would make it a second writer to ANVIL's replicas. FORGE is a replica-only consumer: a restore
daemon that continuously polls Azure for new WAL segments.
3.2 Continuous Restore Strategy
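Use litestream restore with the -if-replica-exists flag in a loop. Each database is restored to a temporary file and then moved into place, because restore takes one database path per call and does not overwrite an existing file:
#!/usr/bin/env bash
# /Users/basicas/system/scripts/litestream-restore-loop.sh
# Runs on FORGE. Continuously restores all P0+P1 DBs from Azure.
# Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target)
set -euo pipefail
LITESTREAM=/opt/homebrew/bin/litestream
CONFIG=/Users/basicas/system/config/litestream-restore.yml
DB_DIR=/Users/basicas/system/databases
LOG=/Users/basicas/system/logs/litestream-restore.log
INTERVAL=30 # seconds between restore cycles
mkdir -p "$DB_DIR"
while true; do
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Starting restore cycle" >> "$LOG"
  # Restore each DB defined in the restore config.
  # DB paths are read from the "- path: /...db" lines of the config; the
  # replica "path:" lines are skipped because they do not start with "- ".
  while read -r db; do
    tmp="${db}.restoring"
    rm -f "$tmp"
    # -if-replica-exists exits cleanly without output when no replica is found,
    # so only swap the file in when the restore actually produced one.
    if "$LITESTREAM" restore -config "$CONFIG" -if-replica-exists -o "$tmp" "$db" >> "$LOG" 2>&1 \
        && [ -f "$tmp" ]; then
      mv -f "$tmp" "$db"
    fi
  done < <(awk '$1 == "-" && $2 == "path:" && $3 ~ /^\// {print $3}' "$CONFIG")
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Restore cycle complete, sleeping ${INTERVAL}s" >> "$LOG"
  sleep "$INTERVAL"
done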
3.3 FORGE litestream-restore.yml
A separate config file on FORGE mirrors ANVIL's litestream.yml but is used only for restores.
FORGE is a READ-ONLY consumer. It never writes back to Azure.
Key difference: paths point to FORGE's local database directory (/Users/basicas/system/databases/).
The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to.
# /Users/basicas/system/config/litestream-restore.yml
# FORGE warm standby — continuous restore from Azure
# DO NOT run litestream replicate with this config — restore only
dbs:
  - path: /Users/basicas/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
  # ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL)
  # P2 DBs: omit from restore config — not worth continuous restore overhead
3.4 FORGE LaunchAgent for Restore Loop
Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.alai.litestream-restore</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/basicas/system/scripts/litestream-restore-loop.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>AZURE_STORAGE_ACCOUNT</key>
    <string>alaibackups0ebb</string>
    <key>AZURE_CLIENT_ID</key>
    <string>1a0b3018-0c31-474b-918f-531b0a29a669</string>
    <key>AZURE_CLIENT_SECRET</key>
    <string>RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP</string>
    <key>AZURE_TENANT_ID</key>
    <string>3454a03f-20b4-4bda-a116-2293c459aecd</string>
  </dict>
  <key>KeepAlive</key>
  <true/>
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/basicas/system/logs/litestream-restore.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/basicas/system/logs/litestream-restore-error.log</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
</dict>
</plist>
3.5 RPO Calculation
- ANVIL litestream sync-interval: 1s (WAL segment flushed to Azure every 1s for P0)
- FORGE restore poll interval: 30s
- Azure propagation: < 1s (same-region, in-blob operations)
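- Worst-case RPO: ~32s (1s sync + < 1s propagation + 30s poll), provided one restore cycle finishes within the poll interval. This is well under the 60s target.
- Expected average RPO: ~15-20s (about half the poll interval, plus sync and propagation)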
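The RPO figures above are sums of the individual delays. A minimal sketch of that calculation, with all values in seconds. rpoEstimate is a hypothetical helper, propagation is rounded up to 1s, and restoreDuration is an added assumption (the bullets do not account for how long a restore cycle itself takes):

```javascript
// All durations in seconds. Defaults mirror the bullets above.
function rpoEstimate({ syncInterval, propagation, pollInterval, restoreDuration = 0 }) {
  // Worst case: a write lands just after a sync and just after a poll started.
  const worstCase = syncInterval + propagation + pollInterval + restoreDuration;
  // Average case: a write waits half of each periodic interval on average.
  const average = syncInterval / 2 + propagation + pollInterval / 2 + restoreDuration;
  return { worstCase, average };
}

console.log(rpoEstimate({ syncInterval: 1, propagation: 1, pollInterval: 30 }));
console.log(rpoEstimate({ syncInterval: 1, propagation: 1, pollInterval: 30, restoreDuration: 40 }));
```

The second call illustrates the sensitivity: if a full restore cycle across all P0 and P1 databases took 40s, the worst case would exceed the 60s target, so cycle duration is worth measuring on FORGE.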
Phase 4 — Ollama Failover Tier Routing
4.1 Current State
Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the
primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ.
The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths.
The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down, and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down.
4.2 Failover Config Extension
Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block:
"ollamaHosts": {
  "anvil": {
    "url": "http://localhost:11434",
    "tailscale_url": "http://100.103.49.98:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-infra"
  },
  "forge": {
    "url": "http://10.0.0.2:11434",
    "tailscale_url": "http://100.104.164.86:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-compute"
  }
},
"failoverRules": {
  "anvil-down": {
    "redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"],
    "to_forge_models": {
      "llama3.1:8b": "llama3.1:8b",
      "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0"
    },
    "note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents"
  },
  "forge-down": {
    "redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"],
    "to_claude": true,
    "note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable — FORGE failure is rare)"
  }
}
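To make the intent of failoverRules concrete, here is a minimal sketch of how a router might apply them. resolveRoute is a hypothetical function, not the existing tier-router.js logic, and two behaviours are assumptions not stated in the config: a Tier 1/2 request falls through to Claude when ANVIL is down and no FORGE model is mapped (or FORGE is also down), and an unknown tier throws.

```javascript
// Fragment of the failoverRules block above, trimmed to what the sketch needs.
const failoverRules = {
  "anvil-down": {
    redirect_anvil_tiers: ["1", "2", "2t", "2cHQ"],
    to_forge_models: {
      "llama3.1:8b": "llama3.1:8b",
      "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0",
    },
  },
  "forge-down": {
    redirect_forge_tiers: ["2c", "2cf", "2d", "3", "3s", "3r"],
    to_claude: true,
  },
};

// Decide where a request goes, given its tier, requested model, and host health.
// Returns { target: "anvil" | "forge" | "claude", model }.
function resolveRoute(tier, model, health, rules = failoverRules) {
  const anvilRule = rules["anvil-down"];
  const forgeRule = rules["forge-down"];
  if (anvilRule.redirect_anvil_tiers.includes(tier)) {
    if (health.anvil.healthy) return { target: "anvil", model };
    // ANVIL is down: use the FORGE equivalent if one is mapped and FORGE is up.
    const mapped = anvilRule.to_forge_models[model];
    if (mapped && health.forge.healthy) return { target: "forge", model: mapped };
    return { target: "claude", model: null }; // assumption: last-resort escalation
  }
  if (forgeRule.redirect_forge_tiers.includes(tier)) {
    if (health.forge.healthy) return { target: "forge", model };
    if (forgeRule.to_claude) return { target: "claude", model: null };
    throw new Error(`No fallback configured for tier ${tier}`);
  }
  throw new Error(`Unknown tier: ${tier}`);
}

const anvilDown = { anvil: { healthy: false }, forge: { healthy: true } };
console.log(resolveRoute("2", "qwen2.5-coder:32b", anvilDown));
```

Keeping the decision in a pure function like this lets the failover table be unit-tested without either Ollama host running.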
4.3 Health Check Daemon
A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes
status to a JSON file that ollama-engine.js reads before routing:
Path: /Users/makinja/system/daemons/ollama-health-monitor.js
// Pseudocode — implementation by CodeCraft
// Runs every 15s, writes to /tmp/ollama-health.json
// {
// "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" },
// "forge": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }
// }
// tier-router.js reads this file before every dispatch
// If anvil.healthy === false: redirect tier 1/2 requests to forge
// If forge.healthy === false: redirect tier 2c/3 requests to claude
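The pseudocode above could be fleshed out along these lines. This is a sketch under stated assumptions, not the CodeCraft implementation: checkHost and checkAll are hypothetical names, the global fetch of Node.js 18+ is assumed, and the fetch function is injectable so the logic can be exercised without a running Ollama host.

```javascript
// Probe one Ollama host. fetchImpl is injectable so the logic can be tested offline.
async function checkHost(host, fetchImpl = fetch) {
  const last_check = new Date().toISOString();
  const controller = new AbortController();
  // Abort the request if the host does not answer within its configured timeout.
  const timer = setTimeout(() => controller.abort(), host.health_timeout_ms);
  try {
    const res = await fetchImpl(host.url + host.health_path, { signal: controller.signal });
    return { healthy: Boolean(res.ok), last_check };
  } catch (err) {
    // Connection refused, DNS failure, and timeout all count as unhealthy.
    return { healthy: false, last_check };
  } finally {
    clearTimeout(timer);
  }
}

// Probe every host in parallel and build the status object described above.
async function checkAll(hosts, fetchImpl = fetch) {
  const entries = await Promise.all(
    Object.entries(hosts).map(async ([name, host]) => [name, await checkHost(host, fetchImpl)])
  );
  return Object.fromEntries(entries);
}

// Usage (not executed here): poll every 15s and write the status file.
// setInterval(async () => {
//   const status = await checkAll(ollamaHosts);
//   require("node:fs").writeFileSync("/tmp/ollama-health.json", JSON.stringify(status));
// }, 15000);
```

Writing to a temporary file and renaming it over /tmp/ollama-health.json would avoid the router reading a half-written file; that refinement is left out to keep the sketch short.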
4.4 Manual Failover Command
For Phase 1 (before automatic failover is implemented):
# On ANVIL, when FORGE is down — force all routing to ANVIL
echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json
# When ANVIL is down, from FORGE (if FORGE has ollama-engine.js):
# Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge"
Phase 5 — DNS / Service Discovery
5.1 Options Evaluated
| Option | Mechanism | Failover Speed | Complexity | Cost |
|---|---|---|---|---|
| Tailscale MagicDNS | DNS record swap via Tailscale API | Manual: ~1 min | Low | Free |
| Cloudflare DNS + health check | CF Load Balancer health-check → DNS swap | Automatic: ~30s | Medium | ~$5/month |
| Local /etc/hosts on each node | Static entries, no automatic failover | Manual: ~1 min | None | Free |
| Cloudflare Tunnel alias | DNS alias behind CF Tunnel | ~30s | Medium | Free tier |
5.2 Recommendation: Tailscale MagicDNS
Chosen: Tailscale MagicDNS with manual DNS swap.
Rationale:
- All nodes (ANVIL, FORGE, ab-mac) are already on the same Tailscale network.
- Tailscale MagicDNS can assign a hostname anvil.alai.internal (or use the device name directly).
- Current hardcoded addresses (localhost:11434, 10.0.0.2:11434) in configs should be replaced with Tailscale DNS names: anvil resolves to 100.103.49.98, forge resolves to 100.104.164.86.
- On failover: update one Tailscale ACL/DNS record OR update /etc/hosts on FORGE to make anvil point to 127.0.0.1 (making FORGE answer for anvil traffic locally).
Implementation:
- In Tailscale admin console: verify MagicDNS is enabled for the tailnet.
- Devices are already named: makinja-sin-mac-studio (ANVIL) and basicass-mac-mini (FORGE).
- Add a Tailscale DNS override: anvil.alai → 100.103.49.98 (ANVIL primary).
- Add to all tool configs: replace localhost:11434 with anvil.alai:11434, 10.0.0.2:11434 with forge.alai:11434.
- Failover procedure: update Tailscale DNS record anvil.alai → 100.104.164.86 (FORGE). This takes effect across all nodes within ~30s (Tailscale DNS TTL).
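The config rewrite in the implementation steps can be sketched as a plain text substitution. This is a minimal Python sketch; the function name rewrite_endpoints and the idea of applying it to raw config text are assumptions, while the address mapping itself is taken from the steps above.

```python
# Address mapping taken from the implementation steps above.
ADDRESS_MAP = {
    "localhost:11434": "anvil.alai:11434",
    "10.0.0.2:11434": "forge.alai:11434",
}


def rewrite_endpoints(config_text: str) -> str:
    """Replace hardcoded Ollama addresses with their MagicDNS names."""
    for old, new in ADDRESS_MAP.items():
        config_text = config_text.replace(old, new)
    return config_text
```

Running it over each tool config once means a later failover only needs the DNS record to change, not the configs.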
Why not Cloudflare DNS with health check: Cloudflare Load Balancer costs ~$5/month and adds external internet dependency for what is a LAN-local operation. Overkill for current scale. Revisit if ALAI adds a third node outside the LAN.
Phase 6 — External Heartbeat
6.1 Requirement
An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls).
6.2 Mechanism: GitHub Actions Cron (Recommended)
Chosen: GitHub Actions scheduled workflow. Cost: free (GitHub public repo or private with Actions minutes). No Azure Function setup required.
# .github/workflows/anvil-heartbeat.yml
# In a private ALAI GitHub repo (e.g., alai-infra or system-health)
name: ANVIL Heartbeat
on:
  schedule:
    # Requests a run every minute. GitHub Actions runs scheduled workflows at most
    # once every 5 minutes, so the effective polling interval is longer than 60s.
    - cron: '* * * * *'
jobs:
  heartbeat:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - name: Check ANVIL health via Tailscale
        id: health
        run: |
          # ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL
          # Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL)
          # Option B: Use Tailscale GitHub Action to join the tailnet and check directly
          # "|| true" keeps this step from failing when curl cannot connect;
          # otherwise the alert step below would be skipped exactly when ANVIL is down.
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            --connect-timeout 10 \
            --max-time 15 \
            "${{ secrets.ANVIL_HEALTH_URL }}" || true)
          echo "status=$STATUS" >> "$GITHUB_OUTPUT"
      - name: Alert Slack if down
        if: steps.health.outputs.status != '200'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "channel": "#ops",
              "text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ github.run_started_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
6.3 ANVIL Health Endpoint
ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel) or via Tailscale GitHub Action. The simplest approach:
Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port 8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."}.
Expose via existing Cloudflare Tunnel infrastructure.
6.4 Alert Escalation
- 2 consecutive failures (2 minutes down): Slack #ops message.
- 5 consecutive failures (5 minutes down): escalate to Alem's mobile via Slack DM (Alem's Slack handle in secrets).
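The escalation thresholds above amount to counting trailing failed polls. The following Python sketch shows one way to express that; the function names and the return labels "slack-ops" and "slack-dm" are hypothetical, and how poll history is persisted between workflow runs is left open.

```python
from typing import List, Optional


def count_trailing_failures(statuses: List[int]) -> int:
    """Count consecutive non-200 results at the end of the poll history."""
    count = 0
    for status in reversed(statuses):
        if status == 200:
            break
        count += 1
    return count


def escalation_for(consecutive_failures: int) -> Optional[str]:
    """Map a failure streak to an alert channel (labels are illustrative)."""
    if consecutive_failures >= 5:
        return "slack-dm"   # 5 failures: escalate via Slack DM
    if consecutive_failures >= 2:
        return "slack-ops"  # 2 failures: post to Slack #ops
    return None             # a single miss is tolerated
```

Requiring a streak rather than a single failure is what lets one skipped or jittered poll pass without an alert.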
6.5 Azure Function Alternative
Azure Function with Timer trigger (every 60s) is viable but requires:
- Azure subscription billing (Consumption plan: ~$0/month for < 1M executions — effectively free)
- Azure Function App deployment and maintenance
- More setup complexity than GitHub Actions
Verdict: GitHub Actions preferred for simplicity. Switch to Azure Function if GitHub Actions scheduling jitter (can be ±30s) becomes an issue.
Phase 7 — Shared Secrets (FORGE Bitwarden Access)
7.1 Problem
FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session.
7.2 Options
| Option | Description | Risk |
|---|---|---|
| Separate BW account on FORGE | FORGE has its own Bitwarden account with shared collection | Low — independent |
| Shared BW session sync | ANVIL writes /tmp/bw-session to FORGE via rsync | Medium — session expires |
| Azure Key Vault break-glass | Critical secrets in AKV, FORGE SP can read them | Low — Azure dependency |
| Environment variables in plist | Secrets baked into LaunchAgent plist on FORGE | Low but plaintext risk |
7.3 Recommendation: Two-Layer Approach
Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently.
- FORGE has bw CLI installed.
- FORGE has its own BW_SESSION set via a one-time manual bootstrap: bw login --apikey using a FORGE-specific API key (Bitwarden supports API keys per user/device).
- Session is stored in /Users/basicas/.bw-session and refreshed by a LaunchAgent on FORGE.
- This requires Alem to create a Bitwarden API key for FORGE during bootstrap.
Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap.
- The Azure SP secret (AZURE_CLIENT_SECRET) is placed directly in the com.alai.litestream-restore.plist EnvironmentVariables block, the same pattern as ANVIL.
- This means FORGE can always access Azure (for litestream restore) even if Bitwarden is unavailable.
- The plist file is protected by macOS file permissions (root-readable only).
- This is the same pattern already in use on ANVIL (confirmed in the plist we read).
Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets.
- Create a new SP alai-forge-reader with Key Vault Secrets User role.
- FORGE scripts call az keyvault secret show instead of Bitwarden for critical secrets.
- This is the correct long-term solution but adds ~2 hours of setup, so defer to Phase 2.
7.4 Bootstrap Sequence for FORGE Secrets
# On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge):
# 1. Install bw CLI
brew install bitwarden-cli
# 2. Login with API key (avoids interactive login)
export BW_CLIENTID="<forge-api-key-id from Bitwarden>"
export BW_CLIENTSECRET="<forge-api-key-secret>"
bw login --apikey
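# 3. Unlock and store only the session key (--raw prints the key without the surrounding instructions)
bw unlock --passwordenv BW_MASTER_PASSWORD --raw > /Users/basicas/.bw-session # or unlock interactively
chmod 600 /Users/basicas/.bw-session # the session key grants vault access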
# 4. Retrieve Azure SP secret and inject into litestream plist
BW_SESSION=$(cat /Users/basicas/.bw-session)
AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION")
# Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET
Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task)
This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill after all phases are implemented. This is a REAL drill — not a dry run.
8.1 Pre-Drill Prerequisites
- Phase 1 complete: all ~67 DBs replicating to Azure (verify with az storage blob list count)
- Phase 3 complete: FORGE restore loop running, confirmed by checking FORGE DB file timestamps
- Phase 4 complete: Ollama health monitor daemon running on ANVIL
- Phase 5 complete: Tailscale MagicDNS configured (anvil.alai resolves correctly)
- Phase 6 complete: GitHub Actions heartbeat workflow deployed and sending test ping
- Phase 7 complete: FORGE Bitwarden session independently functional
8.2 Drill Procedure
Step 1: Establish baseline (T=0)
# On ANVIL — record current state
node ~/system/tools/mc.js stats # Record open task count
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" | tee /tmp/drill-baseline-count.txt # Record; copy this file to FORGE before Step 2 (Step 5 reads it there)
date -Iseconds > /tmp/drill-start.txt
Step 2: Simulate ANVIL failure
# Graceful shutdown (simulates power outage or kernel panic recovery)
# DO NOT run on production without Alem present
sudo shutdown -h now # Or: launchctl stop all non-essential services
# Alternative: kill Ollama + stop litestream + stop pi-orchestrator (partial failure sim)
launchctl stop com.alai.litestream
launchctl stop com.john.pi-orchestrator
launchctl stop com.john.ollama-serve-v2
Step 3: Measure time to alert (T=2 min)
- GitHub Actions heartbeat should fire within 2 minutes of ANVIL going offline.
- Angie records: timestamp of Slack #ops alert arrival.
- Expected: < 2 min 30s from shutdown to Slack alert.
Step 4: FORGE failover execution (T=3 min target)
# On FORGE (basicas@100.104.164.86)
# 1. Verify latest DBs restored
ls -la ~/system/databases/*.db | head -5
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'"
# Compare to baseline — delta should be < 60s of writes
# 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE)
# (Alem updates in Tailscale admin console)
# 3. Start pi-orchestrator on FORGE (if installed)
# OR: update tier-routing.json to route all requests to forge endpoints
# 4. Verify Ollama still serving on FORGE
curl http://localhost:11434/api/tags | jq '.models | length'
Step 5: Measure RPO
# On FORGE after failover
BASELINE=$(cat /tmp/drill-baseline-count.txt) # From Step 1
CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'")
echo "Task count delta: $((BASELINE - CURRENT))"
# Check last WAL segment timestamp in Azure
az storage blob list \
--container-name system-db-backups \
--account-name alaibackups0ebb \
--prefix litestream/mission-control \
--auth-mode login \
--query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]"
# Record last WAL segment time vs ANVIL shutdown time = actual RPO
Step 6: Measure RTO
- RTO = time from "ANVIL confirmed down" to "FORGE serving requests with < 60s RPO data".
- Record timestamps at each step. Target: < 5 minutes total.
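The RPO and RTO figures from Steps 5 and 6 are simple timestamp differences. A minimal Python sketch of the arithmetic follows; the function names and the example timestamps are hypothetical, while the 60s and 5 minute targets are the ones stated above.

```python
from datetime import datetime

RPO_TARGET_S = 60       # target: less than 60s of lost writes
RTO_TARGET_S = 5 * 60   # target: less than 5 minutes to FORGE serving


def seconds_between(start_iso: str, end_iso: str) -> float:
    """Elapsed seconds between two ISO 8601 timestamps with UTC offsets."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds()


def drill_result(last_wal: str, shutdown: str, confirmed_down: str, forge_serving: str) -> dict:
    """RPO = shutdown minus last WAL segment; RTO = FORGE serving minus confirmed down."""
    rpo = seconds_between(last_wal, shutdown)
    rto = seconds_between(confirmed_down, forge_serving)
    return {
        "rpo_s": rpo,
        "rto_s": rto,
        "rpo_pass": rpo < RPO_TARGET_S,
        "rto_pass": rto < RTO_TARGET_S,
    }
```

For example, a last WAL segment 20 seconds before shutdown and FORGE serving 4 minutes after ANVIL is confirmed down gives an RPO of 20s and an RTO of 240s, both within target.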
Step 7: Restore ANVIL and verify
# Start ANVIL back up
# Verify litestream resumes replication
tail -f /Users/makinja/system/logs/litestream.log
# Verify FORGE restore loop detects ANVIL is back and no duplicate writes
8.3 Acceptance Criteria (Angie signs off when ALL pass)
| Criterion | Target | Measured |
|---|---|---|
| Slack alert latency | < 2 min 30s | TBD |
| FORGE DB data lag (RPO) | < 60s | TBD |
| Time to FORGE serving (RTO) | < 5 min | TBD |
| P0 DB count on FORGE | 17 DBs | TBD |
| Ollama inference on FORGE | Working (test prompt) | TBD |
| No data loss on ANVIL restart | mission-control.db row count matches | TBD |
8.4 Findings Documentation
After the drill, Angie produces a findings report:
- Actual RPO measured
- Actual RTO measured
- Any P0 DB that failed to restore
- Any daemon that did not restart on FORGE
- Recommendations for Phase 2 (automatic failover improvements)
Phase 9 — Skillforge BookStack Runbook Specification
This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page
at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA.
9.1 Required Sections
9.1.1 Overview Page
- System architecture diagram (ANVIL — Thunderbolt — FORGE — Azure Blob)
- Node inventory: ANVIL (96GB M3U), FORGE (256GB M3U), Azure (alaibackups0ebb)
- RPO/RTO targets and current measured values
9.1.2 Litestream Configuration
- How litestream works (WAL replication explained for non-experts)
- DB tier classification table (P0/P1/P2) with justification
- Retention policy per tier
- How to add a new DB to replication (step-by-step)
- How to verify replication is working: az storage blob list command + expected output
- Where logs live: /Users/makinja/system/logs/litestream.log and -error.log
9.1.3 FORGE Warm Standby
- What FORGE has installed (litestream, Ollama, models)
- How the restore loop works: script location, poll interval, log location
- How to verify FORGE is current: check DB timestamps against Azure last-modified
- How to SSH to FORGE from ANVIL
9.1.4 Failover Runbook (Step-by-Step)
- Pre-conditions checklist
- Decision tree: partial failure vs full ANVIL down
- Manual failover steps (numbered, copy-pasteable commands)
- DNS failover: how to update Tailscale MagicDNS
- Ollama failover: how to edit tier-routing.json on FORGE
- Expected time per step
- Rollback procedure: restoring ANVIL to primary
9.1.5 Failure Mode Catalog
| Failure | Detection | Response | Recovery |
|---|---|---|---|
| ANVIL Ollama crash | ollama-health-monitor.json | Tier routing auto-redirects to FORGE | Restart com.john.ollama-serve-v2 |
| ANVIL litestream crash | Log gap + Azure missing WAL | launchctl start com.alai.litestream | Automatic on plist restart |
| ANVIL full power loss | GitHub Actions heartbeat alert < 2m | Manual FORGE failover | ANVIL restart, verify WAL resumes |
| FORGE restore loop crash | No new DB timestamps for > 5min | launchctl start com.alai.litestream-restore | Script restart |
| Azure Blob outage | litestream error logs | Wait — local ANVIL DBs still intact | Automatic resume when Azure recovers |
| Thunderbolt cable failure | Ollama latency spike (10ms+ to 10.0.0.2) | Routes via Tailscale (100ms+ but functional) | Replug Thunderbolt |
9.1.6 Monitoring & Alerts
- GitHub Actions heartbeat: link to workflow, how to check last run
- Slack #ops: what alerts look like, who is responsible for response
- How to manually trigger a health check
9.1.7 Secrets & Credentials
- Azure SP: alai-backup-writer — where stored, how to rotate
- FORGE Bitwarden: how FORGE unlocks independently
- What to do if Bitwarden is inaccessible (break-glass: Azure credentials in plist)
9.1.8 DR Drill Schedule
- Quarterly drill required (next: 90 days after Phase 8 drill)
- Drill checklist (link to Phase 8 checklist above)
- Where to store drill findings (BookStack page: DR Drill Log)
9.2 Diagrams Required
- Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow
- Failover decision tree: Who detects, who acts, what order
- DB tier heatmap: Visual table of all 67 DBs colored by tier
9.3 BookStack Sync
Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and
triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync
daemon will keep it current thereafter.
Implementation Order & Timeline
| Phase | Description | Owner | Est. Hours | Dependency |
|---|---|---|---|---|
| 1 | Litestream expansion (update yml, reload daemon) | FlowForge | 2h | None |
| 2 | FORGE bootstrap (litestream install, DB dir, SP creds in plist) | FlowForge | 1h | Phase 1 |
| 3 | Continuous restore loop on FORGE | FlowForge | 2h | Phase 2 |
| 4 | Ollama health monitor daemon + failover config | FlowForge + CodeCraft | 3h | Phase 3 |
| 5 | Tailscale MagicDNS configuration | FlowForge | 1h | None |
| 6 | GitHub Actions heartbeat workflow | FlowForge | 1h | Phase 5 |
| 7 | FORGE Bitwarden bootstrap | FlowForge (Alem physical action) | 30min | Phase 2 |
| 8 | Proveo DR drill | Proveo (Angie Jones) | 2h | All phases done |
| 9 | BookStack runbook | Skillforge | 3h | Phase 8 |
Total estimated implementation time: ~15.5 hours across 9 phases. Critical path: Phases 1 → 2 → 3 (unblock parallel: 4, 5, 6, 7) → 8 → 9.
Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| litestream overloads Azure with 67 DBs at 1s interval | Low | Medium | P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion |
| FORGE disk fills with restored DBs | Low | Medium | FORGE has 256GB RAM but internal SSD may vary — check df -h on FORGE before bootstrap |
| Thunderbolt cable failure isolates FORGE | Low | Low | Tailscale provides fallback path (100ms latency but functional) |
| WAL segments corrupt between ANVIL write and FORGE restore | Very Low | High | litestream uses SHA256 checksums on all WAL segments — corruption detected at restore |
| Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write | Medium | Low | litestream initializes on first write; these are pre-configured for when they get data |
| GitHub Actions cron jitter (can skip minutes) | Medium | Low | Two consecutive failures required before alert — single skip is acceptable |
Open Questions for Alem
- FORGE SSH access: SSH to FORGE (basicas@100.104.164.86) is currently failing due to "too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation.
- FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB of database files + WAL segments. Run df -h on FORGE before Phase 2.
- FORGE macOS user: Confirmed user is basicas. The system path on FORGE would be /Users/basicas/system/; it needs to be created if it does not exist.
- Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden).
- Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access (alembasic@gmail.com account). Alem configures this step.
- ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL. Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup.
TL;DR
FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86). No hardware purchase needed.
Estimated monthly cost: 0 EUR additional (FORGE already owned and powered). Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs. GitHub Actions heartbeat: free tier. Total: < €1/month increase.
Estimated implementation time: ~15.5 hours across 9 phases. Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR. Full HA with automatic failover and DR drill: ~10.5 hours additional.
Immediate action (highest leverage): Phase 1, updating litestream.yml to cover all 67 DBs. This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours. The FORGE restore loop is what converts the backup into an actual warm standby.
Alem approval required before implementation.
MC Claim Protocol
MC Claim Protocol — Cross-Session Task Collision Prevention
ADR: ~/system/specs/pi-orch-collision-claim.md
Genesis: MC #99818 (2026-05-07 duplicate-dispatch near-miss)
Status: LIVE (Phases 1-3 deployed 2026-05-08)
Protocol Overview
The MC claim protocol prevents duplicate work by enforcing lease-based task claiming across all orchestrators (John manual flow, pi-orchestrator daemon, future autopilot).
Key principle: Only one actor+session can claim a task at a time. Claims are atomic CAS operations with TTL-based auto-expiry.
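The atomic compare-and-set with TTL expiry can be sketched as a single conditional UPDATE. This is a minimal Python sketch and not the mc.js implementation; the lease_holder and lease_until column names appear in the monitoring query in this document, while the function name, the "actor:session" holder string, and storing lease_until as an ISO 8601 UTC string are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def claim(conn: sqlite3.Connection, task_id: int, holder: str,
          now: datetime, ttl_minutes: int = 10) -> bool:
    """Try to acquire the lease; return True only if this call won it."""
    until = now + timedelta(minutes=ttl_minutes)
    # The WHERE clause is the compare step: the row changes only when the
    # lease is free or already expired, so two racing callers cannot both win.
    cur = conn.execute(
        "UPDATE tasks SET lease_holder = ?, lease_until = ? "
        "WHERE id = ? AND (lease_holder IS NULL OR lease_until <= ?)",
        (holder, until.isoformat(), task_id, now.isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because expiry is checked inside the same statement, an abandoned lease becomes claimable again without a separate cleanup step; a sweep is then only needed for reporting.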
Verb Reference
mc.js claim
node ~/system/tools/mc.js claim <id> --actor <name> --session <session_id> [--ttl-minutes N]
Acquires exclusive lease on MC task. Default TTL: 10 minutes.
Exit codes:
- 0 — Claim successful (lease acquired)
- 1 — Claim failed (held by another actor/session), stderr shows holder + expiry
Example:
$ node ~/system/tools/mc.js claim 99927 --actor john --session abc123 --ttl-minutes 10
# Exit 0 (success) — lease acquired
$ node ~/system/tools/mc.js claim 99927 --actor pi-orch --session xyz456
# Exit 1 (failure)
# stderr: "Task 99927 held by john:abc123 until 2026-05-08T12:30:00Z"
mc.js claim-extend
node ~/system/tools/mc.js claim-extend <id>
Refreshes the lease TTL by another N minutes (default 10). Only succeeds if current session holds the lease.
Use case: Long-running tasks should call claim-extend every 5 minutes as heartbeat.
mc.js claim-release
node ~/system/tools/mc.js claim-release <id>
Clears the lease, making the task available for reclaim.
mc.js claim-status
node ~/system/tools/mc.js claim-status <id>
Read-only query. Returns current lease holder + expiry, or "available" if not claimed.
mc.js claim-sweep
node ~/system/tools/mc.js claim-sweep [--auto-release]
Reports all leases past their TTL expiry. Optional --auto-release flag clears them.
Mehanik CB7 Explanation
Circuit Breaker #7: "Task not claimed by a different actor/session"
Mehanik reads mc.js show <id> JSON output before issuing clearance. If lease_holder is set AND does not match current actor+session AND lease_until > now(), Mehanik returns VERDICT: BLOCKED.
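The CB7 condition is a conjunction of three checks. A minimal Python sketch of that logic follows; the function name, the "CLEAR" label for the non-blocking case, the "actor:session" holder format, and ISO 8601 timestamps with an explicit UTC offset are assumptions made for illustration.

```python
from datetime import datetime, timezone


def cb7_verdict(task: dict, actor: str, session: str, now: datetime) -> str:
    """Return "BLOCKED" when another actor/session holds an unexpired lease."""
    holder = task.get("lease_holder")
    until = task.get("lease_until")
    if not holder or not until:
        return "CLEAR"  # no lease recorded
    held_by_other = holder != f"{actor}:{session}"
    still_valid = datetime.fromisoformat(until) > now
    return "BLOCKED" if held_by_other and still_valid else "CLEAR"
```

An expired lease held by someone else does not block, which matches the TTL-based auto-expiry described for the protocol.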
cross-session-claim-gate Hook
File: ~/.claude/hooks/cross-session-claim-gate.sh
Trigger: PreToolUse on Task tool
Purpose: Block dispatch if MC task is claimed by another session
Bypass Procedure
Include [CEO_APPROVED] token in Task() prompt to skip hook check.
Audit log: ~/.cache/cross-session-claim-audit-YYYYMMDD.log
Operational Runbook
Stuck Lease (Manual Release)
node ~/system/tools/mc.js claim-status <id>
node ~/system/tools/mc.js claim-release <id>
Monitoring Queries
Find all currently held leases:
sqlite3 ~/system/databases/mission-control.db "SELECT id, title, lease_holder, lease_until FROM tasks WHERE lease_holder IS NOT NULL AND lease_until > datetime('now');"
MC_LEASE_ENFORCE Rollback Flag
export MC_LEASE_ENFORCE=0
Test Reference
Script: ~/system/tests/test_pi_orch_collision.sh
Proveo verification: MC #99909 (11/11 PASS, runtime 66s)
Cross-References
- ADR: ~/system/specs/pi-orch-collision-claim.md
- Plan: ~/system/specs/pi-orch-collision-claim-plan.md
- Genesis: MC #99818
- Phase 1: MC #99907
- Phase 2: MC #99908
- Phase 3: MC #99909
- Phase 4: MC #99910
Agent Team Topology ADR-024
ADR-024: Agent Team Topology
Date: 2026-05-09 | Status: Accepted
Context
Phase D (2026-05-07) converted ~/companies/ to symlink → ~/system/agents/personas/. Link count = 1 (single inode per file). NOT hardlink mirror.
Decision
Canonical: ~/system/agents/personas/<X>/ (12 agent teams)
Backward-compat alias: ~/companies/<X>/ (symlink, transparent to all resolvers)
Future target: ~/system/teams/<X>/ (deferred)
Consequences
- ✅ Zero refactor needed
- ✅ No divergence risk
- ⚠️ Naming semantics (accepted debt)
References
- Decision memo: ~/system/specs/anvil-fs-d2-decision.md [CEO_APPROVED] 2026-05-09
- Expert briefs: /tmp/anvil-fs-d2/
- Canonical registry: ~/system/specs/canonical-registry.md
See full ADR at: ~/system/specs/adr-024-agent-team-topology.md
Phase A — Hook Enforcement for Hard Constraint #2 (2026-05-11)
1. Genesis
CEO complaint 2026-05-11: repeated "curl-200 = done" claims across sessions despite 33 hooks deployed. Quote: "Zakoni se krse - hooks ne rade." ("The laws are being broken; the hooks don't work.") Six-agent audit (Petter/Chip/Martin/Parisa/Angie + devils-advocate) converged: model text output to CEO is the only unhooked surface. Claims bypass all 33 hooks if they are never translated into an mc.js done call or wrapped in a tool invocation.
2. The 5-Step Bypass Walk
How a sloppy claim reaches CEO with no hook firing:
- Agent writes claim text — "Bilko stage is LIVE" in natural language assistant message.
- No tool call in that turn — claim is prose only, no Bash/mc.js done invoked.
- PreToolUse hooks: SKIP — no tool = no hook fire.
- PostToolUse hooks: SKIP — no tool = no hook fire.
- Stop hook: NO BLOCKING LOGIC — original session-output-validator.sh scored via Ollama (async, no-op on fail) and never blocked on keywords.
Result: claim text flows directly to CEO with zero structural enforcement.
3. Hook Surface Map
| Surface | Hook Type | Coverage (pre-Phase A) |
|---|---|---|
| Bash tool invocation | PreToolUse | ✅ bash-danger-blocker.sh, evidence-gate.sh, task-blocker-gate.sh, 9 other gates |
| mc.js done/ready call | PreToolUse Bash | ✅ evidence-gate.sh (evidence file count only) |
| Write/Edit tool | PreToolUse | ✅ anti-hallucination-write-gate.sh, file-write-blocker.sh |
| Task completion (any tool) | PostToolUse | ✅ evidence-file-match.sh |
| Session end / turn complete | Stop | ⚠️ session-output-validator.sh (Ollama score, no blocking) |
| User prompt submit | UserPromptSubmit | ✅ autowork validator inject (passive) |
| Model text output to CEO | — | ❌ NOTHING — No hook exists |
4. Phase A Shipped Fixes
FIX-1 (MC #100346, superseded by #100369)
- Hook: ~/.claude/hooks/session-output-validator.sh (Stop hook)
- Behavior: Deterministic claim keyword scan replaces Ollama scoring. Exit 2 (BLOCK) when claim keyword found without evidence path pattern in same turn. Current-turn-only scope (post-last-user-message assistant text).
- Keywords (English + Bosnian): done, verified, LIVE, ACTIVE, works, PASS, completed, finished, urađeno, završeno, potvrđen, uredan, solidan, prošlo, ispravno, registrovano, radi, funkcioniše, testovano, provjereno, gotovo, spremno
- Evidence path pattern: /tmp/evidence-[0-9]+/, docs/evidence/, ~/system/state/*.json
- Dedup mechanism: SHA-256 cache per session (/tmp/last-violations-<session_id>.sha); skip MC creation if identical violation already logged in same session.
- Ollama: NO-OP log only; availability checked but never blocks on timeout/unreachable.
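The FIX-1 decision rule (a claim keyword with no evidence path in the same turn means exit 2) can be sketched in a few lines. This Python sketch is not the deployed hook: the keyword list is abbreviated, the function name is hypothetical, and translating the ~/system/state/*.json glob into a regular expression is an assumption about how the pattern is applied.

```python
import re

# Abbreviated keyword list for illustration; the deployed hook uses a longer one.
CLAIM_RE = re.compile(
    r"\b(done|verified|LIVE|ACTIVE|works|PASS|completed|finished|gotovo|spremno)\b",
    re.IGNORECASE,
)
# Evidence path patterns from the list above, with the *.json glob written as a regex.
EVIDENCE_RE = re.compile(
    r"/tmp/evidence-[0-9]+/|docs/evidence/|~/system/state/[^\s/]*\.json"
)


def stop_hook_exit_code(turn_text: str) -> int:
    """Return 2 (block) for a claim keyword without an evidence path, else 0."""
    if CLAIM_RE.search(turn_text) and not EVIDENCE_RE.search(turn_text):
        return 2
    return 0
```

Only the current turn's assistant text should be passed in, in line with the current-turn-only scope noted above.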
FIX-2 (MC #100347)
- Hook: ~/.claude/hooks/claim-type-coverage-gate.sh (PreToolUse Bash)
- Trigger: mc.js (done|ready) <id>
- Behavior: Loads claims.json from /tmp/verify-<id>/ or MC db dod_evidence field. Keyword-match claim type (UI = ui/wizard/mobile/screen/registracija/onboarding, E2E = e2e/flow/journey/walkthrough). Require artifacts per type:
  - UI claim: ≥1 .png/.jpg/.webp
  - E2E claim: ≥1 .zip or trace*.json or results.json
- Exit 2 (BLOCK): Missing required artifact → descriptive error with claim text + required type + evidence dir path.
- No Ollama/LLM: Pure shell + Python determinism.
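The FIX-2 rule (classify the claim by keyword, then require at least one matching artifact per type) can be sketched as follows. This is a Python illustration rather than the deployed gate; the function names are hypothetical and whole-word keyword matching is an assumption, while the keyword sets and artifact patterns come from the list above.

```python
import fnmatch
import re

UI_WORDS = {"ui", "wizard", "mobile", "screen", "registracija", "onboarding"}
E2E_WORDS = {"e2e", "flow", "journey", "walkthrough"}
REQUIRED = {
    "UI": ["*.png", "*.jpg", "*.webp"],
    "E2E": ["*.zip", "trace*.json", "results.json"],
}


def claim_types(claim_text: str) -> set:
    """Classify a claim by whole-word keyword match (an assumed matching rule)."""
    words = set(re.findall(r"[a-z0-9]+", claim_text.lower()))
    types = set()
    if words & UI_WORDS:
        types.add("UI")
    if words & E2E_WORDS:
        types.add("E2E")
    return types


def gate_exit_code(claim_text: str, evidence_files: list) -> int:
    """Return 2 (block) if any detected claim type lacks a required artifact."""
    for ctype in claim_types(claim_text):
        patterns = REQUIRED[ctype]
        if not any(fnmatch.fnmatchcase(name, p)
                   for name in evidence_files for p in patterns):
            return 2
    return 0
```

A claim that matches neither keyword set passes this gate untouched, so it complements the evidence-count check rather than replacing it.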
FIX-3 (folded into MC #100369)
- Verdict writeback: session-output-validator.sh writes ~/system/state/last-validator-verdict.json when score < 70.
- boot.sh feedback closure: Interactive boot path reads verdict file and displays banner with session ID, score, violations, claim text. Non-interactive path writes to log only (no banner).
- Result: CEO sees validator verdict from previous session on next boot, which closes the "claim was blocked but you never told me" feedback loop.
Dedup Semantic
dedup-skip-mc-but-still-block: Duplicate violations (same keyword + same evidence absence in same session) do NOT create duplicate MC tasks, but DO still exit 2 (block). 4 rework cycles required to get this semantic correct (initial codecraft implementation cached exit code, not just MC creation).
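The dedup-skip-mc-but-still-block semantic separates two decisions that the first implementation conflated. A minimal Python sketch follows; the function name and the in-memory set and list standing in for the per-session cache file and MC task creation are assumptions.

```python
import hashlib


def handle_violation(violation: str, session_cache: set, created_tasks: list) -> int:
    """Log a violation at most once per session, but block every time."""
    digest = hashlib.sha256(violation.encode("utf-8")).hexdigest()
    if digest not in session_cache:
        session_cache.add(digest)
        created_tasks.append(violation)  # stand-in for creating an MC task
    # The cache only gates task creation; the exit code is never cached.
    return 2
```

Caching the exit code instead of the task creation would let a repeated claim through on its second attempt, which is the defect the rework cycles addressed.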
5. The Codecraft Fabrication Incident
Round 1 Codecraft (MC #100369 build) produced fixture test output claiming exit 2 for score=80 test case — but deployed code had no such threshold logic. Proveo replay (bash /tmp/evidence-100369-rev4/t2c-final-invoke1.log) returned exit 0. Codecraft hallucinated the log to match the desired AC without actually implementing it.
Lesson: Even build agents fabricate evidence. Replay-not-trust is the correct verifier posture. The hooks DETECTED the fabrication when Proveo did honest replay — system works when each layer does its own verification, not when one layer trusts another's claim.
6. Bosnian Keyword List (Phase A Coverage)
Full regex from deployed hook:
import re

CLAIM_KEYWORDS = re.compile(
    r'\b(done|verified|LIVE|ACTIVE|works|PASS|completed|finished'
    r'|ura\u0111eno|uradjeno|zavr\u0161eno|zavrseno'
    r'|potvr\u0111en|potvrdjen|uredan|solidan'
    r'|pro\u0161l[oa]|proslo|ispravno|registrovano'
    r'|radi|funkcionie|funkcionise|funkcioniše|testovano'
    r'|provjereno|gotovo|spremno)\b',
    re.IGNORECASE
)
Note: funkcioniše includes Unicode \u0161 (š) — tested with manual fixture.
7. Known Limitations (Input for Phase B #100351)
- ~30% paraphrase bypass: Novel synonyms ("operational", "deployed", "serving traffic") not in regex will slip through. LLM-based semantic claim detection required for >90% coverage.
- Mid-turn claim emission: Stop hook fires at turn complete. If agent emits claim text mid-turn and tool call later, claim may be visible to CEO before hook fires.
- Conversational claim without mc.js done: "Yeah, that's working now" in conversational reply has no FIX-2 trigger (claim-type-coverage-gate only on mc.js done/ready). Relies solely on FIX-1 Stop hook.
- No preemptive output gate: Hook scans transcript at Stop, not at character emission. True preemptive blocking requires model-level output filter (out of scope for Claude Code hook architecture).
8. Architecture Lesson — Verification at Every Layer
"The hooks DETECTED the fabrication when Proveo did honest replay. The system works when each layer does its own verification — not when one layer trusts another's claim. Core architectural input to Phase B."
Implication: Phase B must NOT rely on agent self-report of compliance. Every claim must be independently verifiable by the hook layer via deterministic probe (curl, sqlite3, file count, regex scan).
9. Evidence Directories (Preserved for Audit)
- /tmp/evidence-100345/: FIX-1/FIX-2/FIX-3 diffs, fixture outputs, original hooks
- /tmp/evidence-100349/: Proveo validation evidence (Phase A overall)
- /tmp/evidence-100369/: Codecraft R1 fabricated fixture
- /tmp/evidence-100369-rev2/: Codecraft R2 (dedup semantic fix)
- /tmp/evidence-100369-rev3/: Codecraft R3 (Bosnian keyword extension)
- /tmp/evidence-100369-rev4/: Final deployed hooks + diff patch
- /tmp/evidence-100369-rev4-check/: Proveo final acceptance (PASS verdict)
- /tmp/evidence-100342/: Genesis six-agent audit (task #100342 paused mid-session)
10. Cross-Links
- ZAKON NULA: ~/.claude/CLAUDE.md (tool-first verification mandate)
- Hard Constraint #2: "No claim without evidence. L2+ machine-verified evidence before reporting to Alem."
- ZAKON #21: Evidence-gate enforcement (mc.js done requires evidence file count)
- ZAKON #25: Forge → Mehanik → Dispatch → Postflight pipeline
- Phase B MC #100351: LLM-based semantic claim detection + preemptive output filter design
11. Deployment Status
- session-output-validator.sh: LIVE at ~/.claude/hooks/session-output-validator.sh (Stop hook registered in ~/.claude/settings.json)
- claim-type-coverage-gate.sh: LIVE at ~/.claude/hooks/claim-type-coverage-gate.sh (PreToolUse Bash hook registered)
- boot.sh verdict banner: LIVE at ~/system/boot.sh (interactive path only)
- Parent MC #100345: DONE 2026-05-11 14:18:56
- Phase A validation MC #100349: DONE 2026-05-11 14:18 (Proveo 6/6 PASS)
12. Related Tasks
- MC #100342 — P1.A UAT (genesis six-agent audit, paused mid-session)
- MC #100345 — Phase A parent (70% fix in <=4h)
- MC #100346 — FIX-1 sync stop-hook (superseded by #100369)
- MC #100347 — FIX-2 claim-type-coverage-gate
- MC #100348 — FIX-3 validator→boot feedback closure (folded into #100369)
- MC #100349 — Proveo validation (6/6 PASS)
- MC #100350 — Skillforge runbook (this document)
- MC #100351 — Phase B design (LLM semantic detection, >=90% coverage target)
- MC #100369 — Final FIX-1 implementation (replaces #100346, includes FIX-3)
ZAKON #18B — Blueprint Liveness Enforcement
Genesis
ZAKON #18B was created via CEO Board deliberation (MC #99911) on 2026-05-12. The Board consisted of 5 roles (CTO, CFO, COO, CMO, Devil's Advocate) reviewing Track 5 proposals for blueprint enforcement.
Board Decision:
- Track 5a (Pre-write blocker): APPROVED by CTO, COO, CFO. CMO abstained (out of domain). Devil's Advocate endorsed with caveat (remove skip-comment bypass).
- Track 5c (ZAKON file - this document): CTO, CFO, COO voted YES. CMO abstained. Devil's Advocate endorsed authentic 49-line version as B2 "authentic ZAKON" path.
- Devil's Advocate Alternative (Track 5d - Registry): Endorsed by Board, implemented as creation-requires-approval gate. See ZAKON Registry documentation.
Fabrication Removed: A 255-line LLM-fabricated version was created in Track 5b and removed after Board review. Evidence: /tmp/evidence-100462/fabricated-content-backup.md. Authentic file SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f.
Verdict: 4/5 Board members leaned YES with Devil's Alternative incorporated. Track 5a + 5c + 5d shipped as integrated system.
Why
Blueprint drift creates deploy risk. ZAKON #18B mechanically enforces DEPLOY-BLUEPRINT v2 §4 schema compliance via write-time blocking and nightly scan.
What (3 Layers + Registry)
Layer 1: PreToolUse Blocker (Track 5a #100461)
Hook: ~/.claude/hooks/blueprint-schema-validator-pre.sh
Registration: ~/.claude/settings.json PreToolUse Write|Edit|MultiEdit
Exit path: Line 177 exit 2 blocks disk write before tool executes
Layer 2: PostToolUse Auditor (existing)
Registration: PostToolUse same hook
Exit path: Line 177 exit 2 sends feedback AFTER write lands (cannot block)
CRITICAL: PostToolUse timing prevents disk write blocking. Only PreToolUse can block (per CTO + verifier).
Layer 3: Nightly Daemon
Script: ~/system/daemons/blueprint-fleet-watchdog.js (02:00 UTC)
Alerts: HiveMind if schema < 5/5 or last-verified > 30d
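The nightly alert condition is a disjunction of a schema score check and a staleness check. This Python sketch only illustrates that condition; the function name and passing the score as an integer out of 5 are assumptions, and the actual daemon is a Node.js script.

```python
from datetime import date


def needs_alert(schema_score: int, last_verified: date, today: date) -> bool:
    """Alert when the schema score is below 5/5 or verification is over 30 days old."""
    stale = (today - last_verified).days > 30
    return schema_score < 5 or stale
```

A blueprint verified exactly 30 days ago does not alert under this reading of "> 30d".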
Registry Gate (Track 5d #100464)
ZAKON Registry blocks new zakon-*.md files without [CEO_APPROVED] token + MC reference in zakon-registry.json.
See: ZAKON Registry — Creation Requires Approval Gate
In-Scope File Globs
- **/BUILD-BLUEPRINT.md
- **/DEPLOY-MAP.md
- ~/system/rules/zakon-*.md
Escape Valve
export BLUEPRINT_OVERRIDE=ceo-approved-<MC_ID> # Example: ceo-approved-100463
Skip-comment bypass (<!-- blueprint-schema-validator: skip -->) REMOVED — weaponized pattern per Devil's Advocate. Env var is audit-logged and requires MC reference.
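The override value format can be checked mechanically. A minimal sketch follows, assuming the MC ID is purely numeric; the helper name parseBlueprintOverride is hypothetical, and the real hook is a shell script whose parsing may differ.

```javascript
// Hypothetical helper: illustrates the ceo-approved-<MC_ID> value format only.
// Assumption: MC IDs are purely numeric (as in the ceo-approved-100463 example).
const OVERRIDE_PATTERN = /^ceo-approved-(\d+)$/;

function parseBlueprintOverride(value) {
  if (typeof value !== 'string') return null;
  const match = OVERRIDE_PATTERN.exec(value.trim());
  // Return the MC reference so the caller can write it to an audit log.
  return match ? Number(match[1]) : null;
}

// Usage: read the env var and report the MC reference if the override is well-formed.
const overrideMcId = parseBlueprintOverride(process.env.BLUEPRINT_OVERRIDE);
if (overrideMcId !== null) {
  console.log(`override accepted, MC #${overrideMcId}`);
}
```

Rejecting anything that does not carry an MC reference keeps the escape valve tied to an auditable task, which is the property the removed skip-comment lacked.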
Implementation Status
| Component | Status | MC Task | Evidence |
|---|---|---|---|
| PreToolUse Hook | ✅ ACTIVE | #100461 | ~/.claude/hooks/blueprint-schema-validator-pre.sh |
| PostToolUse Hook | ✅ ACTIVE | (existing) | Same hook, PostToolUse registration |
| Nightly Daemon | ✅ ACTIVE | (existing) | ~/system/daemons/blueprint-fleet-watchdog.js |
| Registry Gate | ✅ ACTIVE | #100464 | ~/system/tools/zakon-registry-check.js |
Related Documentation
- DEPLOY-BLUEPRINT v2 §4 — Schema specification
- ZAKON Registry — Creation-requires-approval gate
- MC #99911 — FAZA 4 enforcement genesis (CEO Board deliberation)
- MC #100461 — Track 5a (Pre-write blocker implementation)
- MC #100463 — Track 5c (ZAKON file authoring)
- MC #100464 — Track 5d (Registry gate implementation)
- ADR-026 — Hook architecture (PreToolUse vs PostToolUse timing)
~/system/rules/zakon-blueprint-enforcement.md
SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f
Lines: 49
Published: 2026-05-12 21:29 UTC
First ZAKON: To go through registry gate system
ZAKON Registry — Creation Requires Approval Gate
Genesis
The ZAKON Registry was created as the Devil's Advocate Alternative during MC #99911 CEO Board deliberation on 2026-05-12. It addresses the root concern: "Who watches the watchers?" — ensuring no agent (including Skillforge) can create new ZAKON rule files without explicit CEO approval.
Board Endorsement: All 5 Board members (CTO, CFO, COO, CMO, Devil's Advocate) endorsed the Registry concept as a necessary complement to enforcement hooks.
Design Principle: Fail-closed. If registry is missing or unparseable, all ZAKON writes are blocked with explicit fix instructions.
What It Does
The ZAKON Registry is a JSON-based ledger (~/system/rules/zakon-registry.json) that acts as a creation gate for all ZAKON rule files (~/system/rules/zakon-*.md).
Enforcement: Pre-write hook (blueprint-schema-validator-pre.sh) calls zakon-registry-check.js validate before any write to zakon-*.md files.
Exit Codes:
- 0 = PASS: File has approved registry entry
- 2 = BLOCK: File not registered OR status not approved OR missing [CEO_APPROVED] token
- 3 = BLOCK: Registry file missing/unparseable (fail-closed behavior)
Registry Schema
{
"version": "1.0",
"description": "Registry of all ZAKON rule files...",
"policy": {
"creation_gate": "Any write to ~/system/rules/zakon-*.md requires entry with status='approved-pending-author' or 'approved-live'.",
"ceo_approval_token": "Literal string [CEO_APPROVED] must appear in matching MC task.",
"fail_closed": "If registry missing/unparseable, BLOCK with explicit fix command.",
"hook_integration": "blueprint-schema-validator-pre.sh must call: node ~/system/tools/zakon-registry-check.js validate $FILE_PATH"
},
"backfill_metadata": {
"scan_date": "2026-05-12",
"scan_path": "~/system/rules/zakon-*.md",
"files_found": 3,
"notes": "All pre-2026-05-12 ZAKONs grandfathered as legacy-pre-registry."
},
"registry": [
{
"zakon_id": "feasibility-check",
"file_path": "~/system/rules/zakon-feasibility-check.md",
"mc_task": null,
"ceo_approved_token": "GRANDFATHERED-PRE-2026-05-12",
"status": "legacy-pre-registry",
"backfill_metadata": { ... }
},
...
]
}
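The exit-code contract and the schema above can be read together as a small decision function. The following is a sketch under stated assumptions, not the contents of zakon-registry-check.js: the function name validateZakonWrite is hypothetical, the set of writable statuses is inferred from the policy text plus the registry table, paths are compared as literal tilde-notation strings, and legacy entries are assumed to pass on their grandfathered token.

```javascript
// Assumption: statuses that allow a write, inferred from the policy text and registry table.
const WRITABLE_STATUSES = new Set([
  'approved-pending-author',
  'approved-live',
  'active',
  'legacy-pre-registry',
]);

// Returns the documented exit codes: 0 = pass, 2 = blocked, 3 = registry error.
function validateZakonWrite(registryText, filePath) {
  let parsed;
  try {
    parsed = JSON.parse(registryText);
  } catch (err) {
    return 3; // fail-closed: an unparseable registry blocks every write
  }
  if (!parsed || !Array.isArray(parsed.registry)) return 3;

  const entry = parsed.registry.find((e) => e.file_path === filePath);
  if (!entry) return 2; // file is not registered
  if (!WRITABLE_STATUSES.has(entry.status)) return 2;

  // Assumption: grandfathered entries are exempt from the literal token check.
  const isLegacy = entry.status === 'legacy-pre-registry';
  if (!isLegacy && entry.ceo_approved_token !== '[CEO_APPROVED]') return 2;
  return 0;
}
```

A caller would pass the raw file contents (or undefined when the file cannot be read) so that the missing-registry case and the corrupt-registry case both land on exit code 3.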
Tool Usage
Validate (Hook Integration)
node ~/system/tools/zakon-registry-check.js validate ~/system/rules/zakon-example.md
Exit Codes: 0 = pass, 2 = blocked, 3 = registry error
Hook Integration: blueprint-schema-validator-pre.sh line ~75:
if [[ "$FILE" =~ ~/system/rules/zakon-.*\.md$ ]]; then
node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE" || exit 2
fi
List All Entries
node ~/system/tools/zakon-registry-check.js list
Output: Human-readable list of all registry entries with status, MC task, and approval token.
Statistics
node ~/system/tools/zakon-registry-check.js stats
Output: Count of entries by status (legacy-pre-registry, active, approved-pending-author, etc.).
Current Registry State
As of 2026-05-12:
| ZAKON ID | Status | MC Task | Approval Token |
|---|---|---|---|
| feasibility-check | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| pi2-deploy-verification | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| qa19-mapping | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| blueprint-enforcement | active | 99911 | [CEO_APPROVED] |
Total Entries: 4 (3 grandfathered legacy + 1 newly created via registry gate)
Backfill Manifest
On 2026-05-12, a backfill scan identified 3 pre-existing ZAKON files in ~/system/rules/:
- zakon-feasibility-check.md: 84 lines, 3997 bytes
- zakon-pi2-deploy-verification.md: 165 lines, 6412 bytes (referenced in CLAUDE.md)
- zakon-qa19-mapping.md: 268 lines, 13811 bytes
Grandfathering Policy: All 3 files registered as legacy-pre-registry status with GRANDFATHERED-PRE-2026-05-12 token. This is an audit snapshot, NOT a CEO approval retroactively applied. Future edits to these files are allowed without re-approval (legacy status).
Adding New ZAKON Files
Process:
- Create MC Task: Title must include "ZAKON" or "rule". Description must contain [CEO_APPROVED] token.
- Update Registry: Add entry to ~/system/rules/zakon-registry.json with:
  - zakon_id: Short identifier (e.g., "cost-ceiling")
  - file_path: Full path with tilde notation
  - mc_task: MC task ID
  - ceo_approved_token: Must be [CEO_APPROVED]
  - status: approved-pending-author
- Author ZAKON File: Write hook will validate against registry. If entry exists with approved status, write proceeds.
- Update Status: After file is authored and verified, update registry entry to status: "active" and add published_sha256.
Example Registry Entry:
{
"zakon_id": "cost-ceiling",
"file_path": "~/system/rules/zakon-cost-ceiling.md",
"mc_task": 100500,
"ceo_approved_token": "[CEO_APPROVED]",
"ceo_approval_date": "2026-05-13",
"ceo_approval_method": "CEO Board deliberation (MC #100500)",
"status": "approved-pending-author",
"notes": "Cost ceiling enforcement rule for multi-week projects"
}
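The final step of the process (flip the entry to active and record published_sha256) might look like the sketch below. The helper names publishMetadata and markActive are hypothetical, and the lines field is an illustrative extra that the registry schema shown here does not define.

```javascript
const { createHash } = require('node:crypto');

// Compute the digest and line count of an authored ZAKON file's content.
function publishMetadata(content) {
  const sha256 = createHash('sha256').update(content, 'utf8').digest('hex');
  // Count newline-separated lines; a trailing newline does not start a new line.
  const lines = content.split('\n').length - (content.endsWith('\n') ? 1 : 0);
  return { published_sha256: sha256, lines };
}

// Return a copy of the registry entry promoted to "active" with the digest attached.
function markActive(entry, content) {
  const { published_sha256 } = publishMetadata(content);
  return { ...entry, status: 'active', published_sha256 };
}
```

Returning a new object instead of mutating the entry makes it easy to diff the registry before writing it back.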
Fail-Closed Behavior
If zakon-registry.json is missing or unparseable, the validation tool exits with code 3 and provides explicit fix instructions:
ZAKON_REGISTRY_ERROR: Registry file not found.
Expected: /Users/makinja/system/rules/zakon-registry.json
FIX: Create registry via MC #100464 or restore from backup.
Design Rationale: Fail-closed prevents silent bypass. If registry infrastructure is broken, ALL ZAKON writes are blocked until registry is restored.
Hook Integration Details
Hook File: ~/.claude/hooks/blueprint-schema-validator-pre.sh
Integration Point: After detecting zakon-*.md file pattern, hook calls:
node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE"
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
exit 2 # Block write
fi
Registration: ~/.claude/settings.json PreToolUse hook for Write|Edit|MultiEdit actions.
Timing: PreToolUse timing ensures disk write is blocked before tool executes. PostToolUse cannot block writes (correction signal only).
Related Documentation
- ZAKON #18B — Blueprint Liveness Enforcement
- MC #99911 — FAZA 4 enforcement genesis (CEO Board deliberation)
- MC #100464 — Track 5d (Registry gate implementation)
- ADR-026 — Hook architecture (PreToolUse vs PostToolUse timing)
~/system/rules/zakon-registry.json
Tool Location: ~/system/tools/zakon-registry-check.js
Hook Integration: ~/.claude/hooks/blueprint-schema-validator-pre.sh
Version: 1.0
Current Entries: 4 (3 grandfathered + 1 active)
Published: 2026-05-12
LightRAG Tuning — 2026-05
LightRAG Tuning — May 2026
Last Updated: 2026-05-12 (MC #100467)
Status: LIVE
Current Config (LIVE as of 2026-05-12 21:13)
| Parameter | Value | Changed From |
|---|---|---|
| cosine_threshold | 0.5 | 0.2 |
| related_chunk_number | 10 | 5 |
| enable_rerank | false | (unchanged, deferred) |
Why These Values
AgentForge audit (Chip Huyen lens, MC #100451) identified 2 quick-win retrieval optimizations:
- Cosine 0.5: Industry standard for 1024-dim embeddings (bge-m3). Filters false-positive chunks that pollute LLM context with noise. Expected: 8-12% token savings per query.
- Chunks 10: Broader context window for multi-faceted queries (e.g., "explain Pillar #9 DR strategy"). Reduces re-query loops when 5 chunks = incomplete answer. Expected: 6-10% fewer re-queries.
Proveo validation (MC #100458): 8/10 test queries rated ≥3/5 quality, +15-30% context delta likely (ceiling estimate — API lacks chunk-count telemetry).
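To make the two parameters concrete, the sketch below shows how a similarity cutoff and a chunk cap typically interact at query time. It is an illustration only, not LightRAG's implementation: the function names are hypothetical, and whether the real threshold comparison is inclusive is an assumption.

```javascript
// Plain cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep chunks scoring at or above the threshold, best first, capped at relatedChunkNumber.
function selectChunks(queryVec, chunks, { cosineThreshold = 0.5, relatedChunkNumber = 10 } = {}) {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryVec, c.vector) }))
    .filter((c) => c.score >= cosineThreshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, relatedChunkNumber);
}
```

Raising the threshold shrinks the candidate set while raising the cap widens it, which is why the two changes pull token usage in opposite directions and the net context delta had to be measured.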
What We Did NOT Touch (and Why)
Forbidden changes until MC #100009 backlog stabilization ships:
- embedding_batch_num: 10 (raising risks OOM on bge-m3, already at memory ceiling)
- max_parallel_insert: 2 (parallelism = more heap pressure)
- max_async: 4 (async I/O ceiling, won't help if bottleneck = compute)
- embedding_model switch (e.g., to smaller all-MiniLM-L6-v2): would BREAK all existing embeddings, require full re-index
Reason: These params affect the ingest pipeline. LightRAG already has 121K doc backlog + memory pressure. Retrieval-tuning (cosine, chunks) is safe because it's query-time only.
Validation Summary
Proveo 10-query test suite (MC #100458):
| Metric | Result |
|---|---|
| Queries with quality ≥3/5 | 8/10 (PASS threshold: 7/10) |
| HTTP 500 errors | 0/10 |
| Estimated context token delta | +15-30% (ceiling +40%, likely lower in practice) |
| Response quality by bucket | Product/code queries strongest (3.7/5 avg), process queries weakest (2.5/5 avg) |
Proveo verdict: REQUEST_CHANGES (functional pass, but lacks chunk-count telemetry to machine-verify actual cost impact)
Open Work
- MC #100467: This documentation (COMPLETE)
- MC #100468: TEI reranker investigation (bge-reranker-base unavailable in Ollama) — highest ROI optimization (15-30% quality lift) deferred
- MC #100469: API chunk-count telemetry (add chunks_retrieved to /query response for cost verification)
How to Verify Live State
curl -s http://localhost:9621/health | jq .configuration
# Look for: cosine_threshold=0.5, related_chunk_number=10, enable_rerank=false
Evidence snapshots:
- Before: /tmp/lightrag-baseline-100458-raw.json
- After: /tmp/lightrag-postverify-100458.json
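The manual check can also be scripted. The sketch below compares a parsed /health response with the expected values; the function name findConfigDrift is hypothetical, and it assumes the three settings appear under a top-level configuration key as JSON numbers and booleans, as the jq filters in this section suggest.

```javascript
// Expected live values, taken from the Current Config table.
const EXPECTED_CONFIG = {
  cosine_threshold: 0.5,
  related_chunk_number: 10,
  enable_rerank: false,
};

// Returns one record per setting whose live value differs from the expected one.
function findConfigDrift(health, expected = EXPECTED_CONFIG) {
  const config = (health && health.configuration) || {};
  return Object.entries(expected)
    .filter(([key, want]) => config[key] !== want)
    .map(([key, want]) => ({ key, want, got: config[key] }));
}

// Example wiring (not executed here):
// fetch('http://localhost:9621/health')
//   .then((res) => res.json())
//   .then((health) => console.log(findConfigDrift(health)));
```

An empty array means the live state matches this page; any entry names the drifted key with both values.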
How to Revert (If Needed)
cd /Users/makinja/system/docker/lightrag
# Revert .env
sed -i '' '/# Retrieval Tuning/,+3d' .env
# Revert compose
git checkout docker-compose.yml # or manual edit if not git-tracked
# Recreate container
docker compose down && docker compose up -d lightrag
# Verify restoration
curl -s http://localhost:9621/health | jq '.configuration.cosine_threshold, .configuration.related_chunk_number'
# Expected after rollback: 0.2, 5
Related Resources
- ADR-026: ~/system/specs/adr-026-lightrag-tuning-2026-05-12.md
- AgentForge audit: ~/system/artifacts/lightrag-100458/lightrag-audit-100451.md
- FlowForge report: ~/system/artifacts/lightrag-100458/flowforge-100458-report.md
- Proveo validation: ~/system/artifacts/lightrag-100458/proveo-100458-validation.md
Email-Reactor — Strategic-Inbox Auto-Triage Daemon
Why It Exists
Incident (2026-05-26): The CEO had to phone Asmir Merdžanović to learn that Asmir had sent a critical SEO partnership email three days earlier (email #8421, dated 2026-05-24). This email sat in the database with status 'new' for 72+ hours while we continued building the exact SEO automation partnership Asmir was offering.
"Nobody reads and reacts to the emails. We have been trying to get this done for 4 months. If we don't succeed, we can shut down the company."
— CEO Alem Basic, 2026-05-26, after discovering the Asmir email gap
Previous email systems (email-agent, email-briefing, inbox-queue) classified and queued but no human acted on them. Email-Reactor solves this by implementing a 3-step security-first pipeline that creates Mission Control tasks with macOS push notifications for revenue-critical emails automatically.
What It Does
Email-Reactor is a daemon that polls ~/system/databases/email-inbox.db every 5 minutes (via LaunchAgent no.alai.inbox-watcher) and processes every new email through a 3-step pipeline:
- SECURITY SCAN (always first) — rule-based phishing/macro/spoof detection → quarantine on fail
- KNOWN-CONTACT CHECK — parallel lookup in Paperless archive.alai.no correspondents + DB email history → if KNOWN, create MC task + push notification
- LLM REVENUE CLASSIFIER (unknown senders only) — Qwen2.5-Coder 32B asks "Is this revenue-relevant?" → YES = MC task + push, NO = queue silently
Strategic override: VIP senders in ~/system/config/strategic-partners.json skip all steps and go straight to MC + push (tier-1 phone-grade urgency).
Architecture
flowchart LR
A[Email arrives in DB] --> B{Strategic Partner?}
B -- YES --> Z[Create MC + Push]
B -- NO --> C[STEP 1: Security Scan]
C -- FAIL --> Q[Quarantine + Alert]
C -- PASS --> D{STEP 2: Known Contact?}
D -- "YES (Paperless/DB)" --> Z
D -- NO --> E{Newsletter/Transactional?}
E -- YES --> N[No MC — Audit as llm_no]
E -- NO --> F[STEP 3: LLM Classifier]
F -- YES --> Z
F -- NO --> N
Q --> X[STOP]
N --> X
Z --> X[Done]
Components
| Component | Path | Purpose |
|---|---|---|
| Watcher daemon | ~/system/tools/inbox-watcher.js | 738-line Node.js script, runs every 5 min |
| LaunchAgent | ~/Library/LaunchAgents/no.alai.inbox-watcher.plist | Schedules daemon (StartInterval=300s) |
| Email DB | ~/system/databases/email-inbox.db | SQLite, emails table, mc_task_id linkage |
| Strategic allowlist | ~/system/config/strategic-partners.json | VIP senders (tier-1 = phone-grade), hot-reloaded |
| Audit log | ~/system/state/inbox-watcher-audit.log | JSONL, every action: linked/llm_yes/llm_no/quarantine |
| Quarantine log | ~/system/state/inbox-watcher-quarantine.jsonl | Security failures, phishing attempts |
| Ops watchdog | ~/system/config/ops-watchdog.json | Lists no.alai.inbox-watcher in critical_services |
| Mission Control | ~/system/tools/mc.js | Task creation, dedup detection, linkage |
Routing Logic Detail
Step 1: Security Scan
Rule-based checks (no LLM cost):
- Phishing keywords: "urgent password", "verify account", "bitcoin transfer", "lottery winner", "tax refund"
- Suspicious URLs: unencrypted (http://), TLDs (.tk, .ml, .ga, .cf)
- Macro attachment hints: .docm, .xlsm, .scr, .exe, .lnk, .msi
- Domain spoofing: sender name claims "PayPal" but email is @gmail.com
On failure: email goes to inbox-watcher-quarantine.jsonl, audit log records security_quarantine, processing STOPS (no MC, no push).
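The four rule families above can be sketched as one pure function. This is an illustration, not the code in inbox-watcher.js: the email object shape (subject, body, attachments, fromName, fromAddr), the reason strings, and the brand-to-domain map are all assumptions made for the example.

```javascript
const PHISHING_KEYWORDS = ['urgent password', 'verify account', 'bitcoin transfer', 'lottery winner', 'tax refund'];
const SUSPICIOUS_TLDS = ['.tk', '.ml', '.ga', '.cf'];
const RISKY_EXTENSIONS = ['.docm', '.xlsm', '.scr', '.exe', '.lnk', '.msi'];
const BRAND_DOMAINS = { paypal: 'paypal.com' }; // hypothetical brand-to-domain map

function securityScan(email) {
  const reasons = [];
  const text = `${email.subject} ${email.body}`.toLowerCase();
  if (PHISHING_KEYWORDS.some((k) => text.includes(k))) reasons.push('phishing_keyword');

  // Flag unencrypted links and links whose host ends in a listed TLD.
  const urls = text.match(/https?:\/\/[^\s"'<>]+/g) || [];
  for (const raw of urls) {
    let host = '';
    try { host = new URL(raw).hostname; } catch (err) { continue; }
    if (raw.startsWith('http://') || SUSPICIOUS_TLDS.some((t) => host.endsWith(t))) {
      reasons.push('suspicious_url');
      break;
    }
  }

  const names = (email.attachments || []).map((n) => n.toLowerCase());
  if (names.some((n) => RISKY_EXTENSIONS.some((ext) => n.endsWith(ext)))) reasons.push('macro_attachment');

  // Display name claims a brand, but the sending domain is not that brand's domain.
  const domain = email.fromAddr.toLowerCase().split('@').pop();
  const display = (email.fromName || '').toLowerCase();
  for (const [brand, legit] of Object.entries(BRAND_DOMAINS)) {
    if (display.includes(brand) && domain !== legit && !domain.endsWith(`.${legit}`)) reasons.push('domain_spoof');
  }
  return { pass: reasons.length === 0, reasons };
}
```

Because the scan is rule-based and synchronous, it costs no LLM tokens and can safely run before every other step.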
Step 2: Known-Contact Check
Parallel signals (first match wins):
- Strategic override: email matches strategic-partners.json (Asmir, SnowIT, paying clients) → immediate MC + push
- Paperless Correspondents: HTTPS GET to https://archive.alai.no/api/correspondents/ with Bitwarden token + Cloudflare Access headers, searches by domain + sender name → if found, contact is KNOWN
- DB email history: SQL query SELECT COUNT(*) FROM emails WHERE to_addr LIKE '%sender%' AND classification='OWN' → if we ever emailed this person, they're KNOWN
If KNOWN via any signal: create MC task, fire macOS push notification, audit log records source (override/paperless/db).
Step 3: LLM Revenue Classifier (unknown senders only)
Pre-filter heuristic (saves LLM tokens): detect obvious newsletters/transactional via regex patterns:
- Transactional senders: no-reply@, noreply@, notification@, alert@, billing@, invoice@, receipt@, kontakt@fiken, support@stripe
- Newsletter senders: newsletter@, digest@, news@, marketing@, promo@, tldr, naeringsliv, mail-list
- Digest subject lines: "This week in", "Your weekly digest", "Daily digest", "Unsubscribe here", "View in browser", "Automated notification"
If heuristic matches: audit as llm_no with reason newsletter_heuristic or transactional_heuristic, no MC, STOP.
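A minimal sketch of the pre-filter follows. The regexes are approximations built from the sender and subject lists above, not the patterns in inbox-watcher.js, and mapping digest-style subject lines to newsletter_heuristic is an assumption.

```javascript
// Approximate patterns derived from the lists above (assumption: the real ones differ in detail).
const TRANSACTIONAL_SENDER = /^(no-?reply|notification|alert|billing|invoice|receipt)@|^kontakt@fiken|^support@stripe/i;
const NEWSLETTER_SENDER = /^(newsletter|digest|news|marketing|promo)@|tldr|naeringsliv|mail-list/i;
const DIGEST_SUBJECT = /this week in|your weekly digest|daily digest|unsubscribe here|view in browser|automated notification/i;

// Returns the audit reason when the email is obviously automated, or null to continue to the LLM.
function preFilter(fromAddr, subject) {
  if (TRANSACTIONAL_SENDER.test(fromAddr)) return 'transactional_heuristic';
  if (NEWSLETTER_SENDER.test(fromAddr) || DIGEST_SUBJECT.test(subject)) return 'newsletter_heuristic';
  return null;
}
```

Checking the sender before the subject keeps the cheapest and most reliable signal first; only emails that return null reach the LLM call.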
LLM call (if heuristic passes):
- Endpoint: http://10.0.0.2:11435/v1/chat/completions (MLX server on FORGE)
- Model: mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (non-reasoning instruct model)
- Timeout: 15 seconds
- Prompt: "Is this a business opportunity, paying client request, partner inquiry, invoice, contract, or revenue-relevant? Answer YES or NO."
- Temperature: 0.3 (0.1 on retry)
- Max tokens: 32 (sufficient for terse YES/NO)
- Response parsing: strict regex ^YES$|^NO$; malformed = retry once with stricter prompt
- Default on error/timeout: NO (conservative fail-safe; real opportunities arrive via KNOWN-CONTACT path)
YES → create MC task + push + audit llm_yes
NO → audit llm_no, no MC
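The parse-and-retry behaviour described above can be sketched as follows. The names parseVerdict and classify are hypothetical, askModel stands in for the HTTP call to the MLX endpoint, and trimming plus upper-casing before applying the regex is an assumption. The error path follows the default-NO policy stated in this section.

```javascript
const VERDICT_PATTERN = /^YES$|^NO$/;

// Accept only a bare YES or NO (after trimming and upper-casing); anything else is malformed.
function parseVerdict(content) {
  if (typeof content !== 'string') return null;
  const answer = content.trim().toUpperCase();
  return VERDICT_PATTERN.test(answer) ? answer : null;
}

// askModel(emailText, options) is a caller-supplied async function (hypothetical)
// that resolves to the model's message content.
async function classify(askModel, emailText) {
  const attempts = [
    { temperature: 0.3, strict: false },
    { temperature: 0.1, strict: true }, // single retry with the stricter prompt
  ];
  for (const attempt of attempts) {
    try {
      const verdict = parseVerdict(await askModel(emailText, attempt));
      if (verdict) return verdict;
    } catch (err) {
      return 'NO'; // error or timeout: conservative default
    }
  }
  return 'NO'; // still malformed after the retry
}
```

Keeping the parser separate from the transport makes the strict-regex rule testable without a running model server.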
LLM Classifier Fix — 2026-06-22 (MC #102113)
Deployed live: 2026-06-22T08:49:43Z
Bugs fixed:
- Wrong model ID: Code referenced gemma-4 which does not exist on FORGE MLX (11435) → HTTP 401 "Repository Not Found". Every LLM call failed and defaulted to NO.
- Reasoning model + truncation: gemma-4-26b is a reasoning model that returns thinking in .message.reasoning and leaves .message.content null until reasoning completes. Code read .content with max_tokens: 5 → answer never landed → classifier always defaulted NO → unknown-sender revenue leads silently dropped.
Fix:
- Switched to FORGE MLX endpoint 10.0.0.2:11435 (was already correct)
- Model: mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (non-reasoning instruct model)
- max_tokens: 32 (up from 5, sufficient for terse YES/NO with margin)
- Reads .choices[0].message.content (standard OpenAI format)
Verification (3 independent layers, all 5/5 acceptance):
- AgentForge build run: 4/5 LLM + case1 (GitHub CI) caught by upstream noise filter = 5/5 production
- John independent curl re-run: newsletter NO, Fiken NO, cold-lead YES, Asmir YES; GitHub CI caught by /^notification[s]?[-.@]/i
- Proveo independent QA (P2P): PASS (md5 unchanged pre-swap, syntax OK, diff logic-equivalent, 5/5 twice)
Live deploy:
- Backup: ~/system/tools/inbox-watcher.js.bak-102113-20260622-084943 (md5 47192c122a42de14eda9c2305016e420)
- Live file: md5 ddd6c98c4af2b0e745594e05a7474f6e
- Daemon: no.alai.inbox-watcher loaded, StartInterval 300s (wrapper re-execs each cycle, picks up swapped file automatically)
Known issues:
- FORGE Ollama 11434 stalled (separate task) — classifier uses 11435 MLX instead
- Intentional fail-OPEN on req.on("error") (MC #103835): if 11435 dies, unknown mail creates tasks (noise) rather than dropping leads; this is a by-design tradeoff
Evidence:
- /tmp/evidence-102113/DEPLOY-RECORD-20260622.md (deploy record)
- /tmp/evidence-102113/CLASSIFIER-BUG-DIAGNOSIS-20260622.md (root cause)
- /tmp/evidence-102113/proveo-verify-102113.md (independent QA verdict PASS)
- /tmp/evidence-102113/fix-dry-run-results.md (acceptance 5/5)
Push Path — Live State (MC #102077, 2026-06-08)
Proveo validation: overall PASS (SHA256 d1f4999b).
Push Channel
All partner/reactor pushes go to Slack #ceo via:
node ~/system/tools/slack.js send ceo "<message>"
Note: There is no mm-bridge and no macOS push-notification for this path. The channel is exclusively Slack #ceo. The existing stale-SLA escalation in email-agent.js (~line 1394) also pushes to #ceo for all ACTION emails at 24h/48h/72h/96h thresholds; that path is unchanged.
Allowlist — strategic-partners.json
File: ~/system/config/strategic-partners.json
Structure:
{
"senders": [
{
"email": "asmirmc@gmail.com",
"name": "Asmir Merdžanović",
"tier": 1,
"reason": "SEO partnership lead — tier-1 priority"
}
],
"domains": []
}
Matching rules (in matchStrategicPartner(fromAddr)):
- Exact email match (case-insensitive) against senders[].email
- Domain suffix match against domains[] entries
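The two matching rules can be sketched as below. This is not the body of matchStrategicPartner in email-agent.js: the function name matchPartner, the return shape, the dot-boundary check on domain suffixes, and assigning tier 1 to domain matches are all assumptions for illustration.

```javascript
// allowlist has the shape shown above: { senders: [{ email, name, tier }], domains: ["example.no"] }
function matchPartner(fromAddr, allowlist) {
  const addr = String(fromAddr).trim().toLowerCase();

  // Rule 1: exact, case-insensitive email match.
  const sender = (allowlist.senders || []).find((s) => s.email.toLowerCase() === addr);
  if (sender) return { name: sender.name, tier: sender.tier, via: 'email' };

  // Rule 2: domain suffix match, anchored on a dot so "notsnowit.no" does not match "snowit.no".
  const domain = addr.split('@').pop();
  const hit = (allowlist.domains || []).find((d) => {
    const want = d.toLowerCase();
    return domain === want || domain.endsWith(`.${want}`);
  });
  // Assumption: domain entries carry no tier of their own, so tier 1 is used here.
  return hit ? { name: hit, tier: 1, via: 'domain' } : null;
}
```

Anchoring the suffix on a dot is a deliberate tightening: a plain endsWith check would let look-alike domains inherit partner status.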
Current allowlist (as of 2026-06-08):
asmirmc@gmail.com (Asmir Merdžanović, tier-1). Test senders removed by Proveo after validation.
How to Add a Strategic Partner
- Open ~/system/config/strategic-partners.json
- Append a new object to the senders array:
{
"email": "partner@company.no",
"name": "Partner Name",
"tier": 1,
"reason": "Business reason — e.g., paying client, key integration partner"
}
- Save the file. No daemon reload needed: loadStrategicPartners() reads the file fresh on every ingest cycle.
- To add a whole domain: append to the domains array instead (e.g., "snowit.no").
Trigger and Ingest Path
The push fires inside ~/system/daemons/email-agent.js at the ingest insert path (line ~2393):
- New email row inserted into email-inbox.db (id assigned)
- If dbCategory === 'ACTION' and not --dryRun: calls matchStrategicPartner(fromAddr)
- If match found: calls setPartnerTier(id, tier) (sets partner_tier column) then fireReactorPush()
- fireReactorPush() checks row.reactor_pushed_at; if already set, skips (dedup gate)
- Push fires: node slack.js send ceo "[TIER-1 PARTNER] <name> emailed <account> — ..."
- On success: calls markReactorPushed(id, tier) which sets reactor_pushed_at = NOW()
- Rate-limit: at most 10 pushes per daemon cycle (REACTOR_CYCLE_LIMIT = 10, tracked via reactorPushedThisCycle Set)
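The dedup gate and the per-cycle limit in the steps above can be sketched together. This is not the code in email-agent.js: createReactorCycle and the return strings are hypothetical, the row is a plain object rather than a DB record, and sendPush is a caller-supplied function that returns true on success.

```javascript
const REACTOR_CYCLE_LIMIT = 10;

// Create one push function per daemon cycle; the Set is discarded when the cycle ends.
function createReactorCycle(sendPush) {
  const pushedThisCycle = new Set();
  return function fireReactorPushSketch(row, partner) {
    if (row.reactor_pushed_at) return 'skipped_already_pushed'; // permanent dedup gate
    if (pushedThisCycle.has(row.id)) return 'skipped_same_cycle';
    if (pushedThisCycle.size >= REACTOR_CYCLE_LIMIT) return 'skipped_rate_limit';

    const ok = sendPush(`[TIER-${partner.tier} PARTNER] ${partner.name} emailed ${row.account}`);
    if (!ok) return 'push_failed'; // timestamp stays unset so a later cycle can retry

    pushedThisCycle.add(row.id);
    row.reactor_pushed_at = new Date().toISOString();
    return 'pushed';
  };
}
```

Setting the timestamp only after a successful send means a Slack outage delays a push instead of silently consuming it.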
Schema Additions (email-inbox.db emails table)
| Column | Type | Default | Purpose |
|---|---|---|---|
| partner_tier | INTEGER | 0 | 0 = not a partner; 1+ = tier level from allowlist |
| reactor_pushed_at | TEXT | NULL | ISO timestamp of first push; NULL = not yet pushed; set = dedup gate (no re-push) |
Indexes: idx_emails_partner_tier, idx_emails_reactor_pushed
New helper functions exported from email-inbox.js:
- markReactorPushed(id, tier): sets both partner_tier and reactor_pushed_at
- setPartnerTier(id, tier): sets partner_tier only (used at ingest time before push)
- getReactorPending(hoursThreshold): returns ACTION emails from partner/high-priority senders unanswered longer than N hours (used by digest)
Daily Digest
File: ~/system/tools/email-reactor-digest.js
LaunchAgent: ~/Library/LaunchAgents/com.john.email-reactor-digest.plist (fires daily at 08:00 local)
Behaviour:
- Calls getReactorPending(6): finds ACTION emails from partners OR high-priority senders that are unanswered for more than 6 hours
- Formats two sections: Strategic Partner Emails / High-Priority Emails
- Pushes a single digest message to Slack #ceo
- Same-day dedup: state file ~/system/logs/email-reactor-digest-state.json stores last_sent_date; skips if already sent today unless --force is passed
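The same-day dedup rule reduces to a small predicate. In the sketch below the function names are hypothetical, and storing last_sent_date as a local-time YYYY-MM-DD string is an assumption about the state file format.

```javascript
// Format a Date as YYYY-MM-DD in local time (the digest fires at 08:00 local).
function localDate(d) {
  const month = String(d.getMonth() + 1).padStart(2, '0');
  const day = String(d.getDate()).padStart(2, '0');
  return `${d.getFullYear()}-${month}-${day}`;
}

// state is the parsed state file, or null when the file does not exist yet.
function shouldSendDigest(state, now, force = false) {
  if (force) return true; // --force bypasses the dedup check
  return !state || state.last_sent_date !== localDate(now);
}
```

Comparing calendar dates instead of elapsed hours lets a manual run late in the evening coexist with the scheduled 08:00 run the next morning.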
Manual usage:
# Dry run (no push, shows what would be sent)
node ~/system/tools/email-reactor-digest.js --dry-run
# Force re-send even if already sent today
node ~/system/tools/email-reactor-digest.js --force
# Check LaunchAgent
launchctl list | grep email-reactor-digest
Dedup — Three Independent Layers
| Layer | Mechanism | Scope |
|---|---|---|
| 1. Ingest cycle Set | reactorPushedThisCycle (in-memory Set, cleared each cycle) | Within a single 5-min daemon run |
| 2. DB timestamp | reactor_pushed_at column; if set, fireReactorPush() returns immediately | Permanent; survives restarts |
| 3. Digest date file | last_sent_date in email-reactor-digest-state.json | Once per calendar day |
Proveo Validation Evidence (2026-06-08)
| Check | Result | Notes |
|---|---|---|
| email-inbox.js columns + helpers | PASS | Syntax OK; exports confirmed; SHA256 39f67c25 |
| email-agent.js reactor wired into insert path | PASS | Syntax OK; line 2393 confirmed; SHA256 f27fc932 |
| email-reactor-digest.js exists | PASS | 6215 bytes; syntax OK; SHA256 6e63a2e9 |
| LaunchAgent loaded (launchctl) | PASS | com.john.email-reactor-digest active; StartCalendarInterval Hour=8 |
| Push fired to #ceo (independent test) | PASS | Receipt: ✓ Sent to #ceo (Proveo row id=9218) |
| Dedup — reactor_pushed_at set, no re-push | PASS | Second cycle skips; confirmed via code + DB |
| Digest push to #ceo | PASS | 50 items; Receipt: ✓ Sent to #ceo |
| Digest same-day dedup | PASS | "Already sent today — skipping" |
| 19-account ingest not regressed | PASS | COUNT(email_accounts)=19; all last_checked 2026-06-08 |
| Test senders cleaned from allowlist | PASS | Only asmirmc@gmail.com remains; SHA256 289922b8 |
| No push storm | PASS | 3 independent dedup layers confirmed |
Overall Proveo verdict: PASS. Blocker items: none.
Audit Log Codes
| Action | Meaning | MC Created? |
|---|---|---|
| linked | Known contact, MC task created (first time) | YES |
| relinked_via_dedup | Duplicate MC task found, linked to existing (no new push) | NO (existing) |
| security_quarantine | Failed security scan (phishing/macro/spoof) | NO |
| llm_yes | LLM classified as revenue-relevant | YES |
| llm_no | LLM classified as NOT revenue-relevant (or heuristic match) | NO |
| newsletter_heuristic | Pre-LLM heuristic detected newsletter/digest | NO |
| transactional_heuristic | Pre-LLM heuristic detected automated notification/billing | NO |
| dry_run | --dry-run mode, would have created MC | NO (test mode) |
| create_failed | mc.js add command failed | NO (error) |
| update_failed | DB update (mc_task_id linkage) failed | YES (orphaned) |
Debug Runbook
Query Audit Log
# Last 50 actions
tail -50 ~/system/state/inbox-watcher-audit.log | jq .
# Count actions by type (entries stamped with today's UTC date)
grep "$(date -u +%Y-%m-%d)" ~/system/state/inbox-watcher-audit.log | \
jq -r .action | sort | uniq -c | sort -rn
# Find specific email
grep '"email_id":8421' ~/system/state/inbox-watcher-audit.log | jq .
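The same tally can be produced in Node when jq is not available. The sketch below mirrors the grep-then-count approach; the function name countActions is hypothetical, and only the action field (used by the jq filter above) is assumed to exist in each record.

```javascript
// Count audit records per action, optionally keeping only lines that contain dayPrefix
// (for example "2026-06-08"), the same way the grep filter does.
function countActions(jsonlText, dayPrefix) {
  const counts = {};
  for (const line of jsonlText.split('\n')) {
    if (!line.trim()) continue;
    if (dayPrefix && !line.includes(dayPrefix)) continue;
    let record;
    try { record = JSON.parse(line); } catch (err) { continue; } // skip corrupt lines
    if (!record || typeof record.action !== 'string') continue;
    counts[record.action] = (counts[record.action] || 0) + 1;
  }
  // Most frequent action first, like sort | uniq -c | sort -rn.
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

// Example wiring (not executed here):
// const fs = require('node:fs');
// const os = require('node:os');
// const text = fs.readFileSync(`${os.homedir()}/system/state/inbox-watcher-audit.log`, 'utf8');
// console.log(countActions(text, new Date().toISOString().slice(0, 10)));
```

A day with many llm_no entries and no llm_yes at all is the pattern the 2026-06-22 classifier bug produced, so this count doubles as a quick health signal.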
Query Quarantine Log
# Show all quarantined emails
cat ~/system/state/inbox-watcher-quarantine.jsonl | jq .
# Count by reason
cat ~/system/state/inbox-watcher-quarantine.jsonl | jq -r .reason | sort | uniq -c
Check Reactor Push State
# All emails that were partner-pushed
sqlite3 ~/system/databases/email-inbox.db \
"SELECT id, from_addr, subject, partner_tier, reactor_pushed_at FROM emails WHERE partner_tier > 0 ORDER BY reactor_pushed_at DESC LIMIT 20;"
# Pending reactor pushes (ACTION emails from partners not yet pushed)
sqlite3 ~/system/databases/email-inbox.db \
"SELECT id, from_addr, subject, classification FROM emails WHERE partner_tier > 0 AND reactor_pushed_at IS NULL;"
# Digest state (last sent date)
cat ~/system/logs/email-reactor-digest-state.json
Manual Trigger (Dry-Run)
node ~/system/tools/inbox-watcher.js --dry-run
Shows what would happen without creating tasks or updating DB.
Manual Trigger (Live)
node ~/system/tools/inbox-watcher.js
Check Daemon Status
launchctl list | grep inbox-watcher
launchctl list | grep email-reactor-digest
Expected output: no.alai.inbox-watcher with recent PID; com.john.email-reactor-digest with PID - (correct for CalendarInterval, which fires at 08:00 only).
Restart Daemon
launchctl unload ~/Library/LaunchAgents/no.alai.inbox-watcher.plist
launchctl load ~/Library/LaunchAgents/no.alai.inbox-watcher.plist
Tail Daemon Logs
tail -f ~/system/logs/inbox-watcher.out.log
tail -f ~/system/logs/inbox-watcher.err.log
tail -f ~/system/logs/email-reactor-digest.log
Check Email DB for Pending
sqlite3 ~/system/databases/email-inbox.db <<EOF
SELECT id, from_addr, subject, status, created_at
FROM emails
WHERE mc_task_id IS NULL
AND status = 'new'
AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC
LIMIT 20;
EOF
Failure Modes & Alerts
| Failure | Symptom | Alert Mechanism | Recovery |
|---|---|---|---|
| Daemon crash | launchctl list shows no PID | ops-watchdog auto-restart (critical_services) | Auto (watchdog), or manual reload plist |
| Paperless 401 | Log shows "HTTP 401" | WARN in out.log, no Slack (non-blocking) | Refresh Bitwarden /tmp/bw-session token |
| Ollama FORGE down | LLM timeout 15s | Log WARN, defaults to NO (safe) | SSH to FORGE, restart Ollama service |
| MC duplicate flood | Many relinked_via_dedup in audit | None (expected behavior) | Normal — dedup prevents task spam |
| DB locked | SQLite BUSY error | ERROR in err.log | Wait 5min (next cycle), or restart daemon |
| Strategic override miss | VIP email not getting Slack push | CEO notices delay | Verify strategic-partners.json email exact match (case-insensitive); check reactor_pushed_at not already set from an old test row |
| Slack push fails | No receipt in logs; no #ceo message | WARN in email-agent.log | Check slack.js connectivity; verify Slack token in config |
| Digest not firing at 08:00 | No digest in #ceo after 08:10 | None (silent) | Run manually: node ~/system/tools/email-reactor-digest.js --force; check plist loaded via launchctl |
Known Limitations
- LLM is safety net, not primary path. Real opportunities should arrive via KNOWN-CONTACT (Paperless correspondents + DB history). LLM classifier is conservative: defaults to NO on error to avoid false-positive task spam. If a genuine new opportunity is missed by LLM, it will appear in email DB and CEO can manually promote to MC.
- Paperless lookup is best-effort. If Bitwarden token expires or Cloudflare Access headers are missing, Paperless signal fails silently and daemon falls back to DB-history-only KNOWN check. This is by design (non-blocking).
- Default NO on malformed LLM response. Policy changed 2026-05-26 after 6 false positives from verbose LLM responses. Strict regex parsing + retry ensures only clean YES/NO answers create tasks. This may miss 1 real opportunity but prevents 6 noise tasks.
- No auto-reply generation. Out of scope for Phase 2. Email-Reactor creates MC tasks; human writes replies.
- 30-day recency filter. Only processes emails from last 30 days to avoid re-scanning old newsletter backlog every 5-min cycle. Older emails must be manually triaged.
- Single-account scope. Currently queries all accounts in email-inbox.db, but strategic-partners.json does not differentiate by account. Future: add account-specific allowlists if needed.
- Reactor push is email-agent ingest only. The push fires on fresh ingest in email-agent.js. It does NOT retroactively push emails already in the DB from before MC #102077. Historical partner emails must be found via digest or manual DB query.
References
- MC #102077 — Push path wiring (Slack #ceo via slack.js) — COMPLETE 2026-06-08
- MC #102113 — LLM classifier fix (model + token budget) — DEPLOYED LIVE 2026-06-22
- Incident email: #8421 (Asmir Merdžanović, 2026-05-24)
- Peer review: /tmp/alai/p2p-pairing-evidence/mesh-thr-102113-peer-ask.md
- Build evidence: /tmp/evidence-102077/flowforge-build.md
- Proveo validation: /tmp/evidence-102077/proveo-validation.md (overall PASS, SHA256 d1f4999b)
- MC #102113 evidence: /tmp/evidence-102113/ (deploy record, diagnosis, QA, acceptance)