# Go-Live Runbook

# Go-Live Runbook

> **Project:** {{PROJECT_NAME}}
> **Version:** {{VERSION}}
> **Date:** {{DATE}}
> **Author:** {{AUTHOR}}
> **Status:** Draft | In Review | Approved
> **Reviewers:** {{REVIEWERS}}

## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | {{DATE}} | {{AUTHOR}} | Initial draft |

---

## 1. Go-Live Overview

<!-- GUIDANCE: Summarize the go-live event. Who is going live with what, when, and why now. -->

**What:** {{PROJECT_NAME}} v{{VERSION}} production launch
**When:** {{LAUNCH_DATE}} at {{LAUNCH_TIME}} {{TIMEZONE}}
**Deployment window:** {{WINDOW_START}} – {{WINDOW_END}} ({{WINDOW_DURATION}}h window)
**Go-Live Type:** {{TYPE}} <!-- New launch / Major release / Migration / Cutover -->

**Incident Commander:** {{IC}} (primary), {{IC_BACKUP}} (backup)
**Technical Lead:** {{TECH_LEAD}}
**Communications Lead:** {{COMMS_LEAD}}
**War Room:** {{WAR_ROOM_LINK}}
**Status Page:** {{STATUS_PAGE_URL}}

---

## 2. Pre-Launch Checklist

### T-7 Days: Infrastructure Verification

<!-- GUIDANCE: One week before go-live, confirm all infrastructure is provisioned and stable. -->

- [ ] All production infrastructure provisioned and tested
- [ ] Load balancer health checks passing for all instances
- [ ] Auto-scaling groups configured and tested (scale-up + scale-down)
- [ ] Database replicas in sync and replication lag < {{REPLICATION_LAG}}s
- [ ] Backup jobs running successfully (last backup verified: {{VERIFY_DATE}})
- [ ] CDN configured and serving assets correctly
- [ ] All IAM roles and permissions verified
- [ ] Infrastructure monitoring dashboards showing green
- [ ] Estimated cost reviewed and within budget

**Owner:** {{INFRA_OWNER}} | **Due:** T-7 days

---

### T-5 Days: DNS Configuration

- [ ] DNS records created/updated in {{DNS_PROVIDER}}
  - `{{DOMAIN}}` → Load balancer (TTL set to {{LOW_TTL}} for easy rollback)
  - `api.{{DOMAIN}}` → API load balancer
  - `www.{{DOMAIN}}` → Redirect to `{{DOMAIN}}`
- [ ] DNS propagation verified (check from multiple regions)
- [ ] DNS failover routing configured (if applicable)
- [ ] Old DNS records documented (for rollback reference)

**Owner:** {{DNS_OWNER}} | **Due:** T-5 days

---

### T-5 Days: SSL Certificates

- [ ] TLS certificates provisioned for all domains
  - `{{DOMAIN}}` ✅
  - `*.{{DOMAIN}}` ✅
- [ ] Certificate expiry > 90 days from go-live date
- [ ] HTTPS redirect configured (HTTP → HTTPS)
- [ ] HSTS header configured
- [ ] SSL Labs test: Grade A or better ({{SSL_TEST_LINK}})

**Owner:** {{SSL_OWNER}} | **Due:** T-5 days

---

### T-3 Days: CDN Configuration

- [ ] CDN distribution pointing to production origin
- [ ] Cache behaviors configured per specification
- [ ] Static asset cache headers correct (1yr for fingerprinted assets)
- [ ] CDN WAF rules enabled and tested
- [ ] CDN purge command tested and documented
- [ ] CDN performance verified from target geographies

**Owner:** {{CDN_OWNER}} | **Due:** T-3 days

---

### T-3 Days: Database Migration

- [ ] Final migration scripts reviewed and approved
- [ ] Migration tested on staging with production-sized data (timing recorded: {{MIGRATION_TIME}}min)
- [ ] Rollback/down migration tested
- [ ] Migration script idempotent (safe to run twice)
- [ ] Database backup taken immediately before migration window
- [ ] Data integrity checks script prepared (`scripts/verify-migration.sh`)

**Owner:** {{DB_OWNER}} | **Due:** T-3 days

---

### T-2 Days: Feature Flags

- [ ] All new features behind feature flags
- [ ] Feature flags defaulting to OFF in production
- [ ] Flag rollout plan documented (which flags, in what order, with what criteria)
- [ ] Kill switch flags configured (disable any feature immediately if needed)

**Owner:** {{FF_OWNER}} | **Due:** T-2 days

---

### T-2 Days: Third-Party Integrations

- [ ] {{INTEGRATION_1}} — live API keys configured in secrets manager
- [ ] {{INTEGRATION_2}} — live API keys configured in secrets manager
- [ ] Payment gateway: live mode activated and tested with real card (refunded)
- [ ] Email service: sending domain authenticated (SPF, DKIM, DMARC)
- [ ] All integrations tested in production with smoke tests
- [ ] Webhook URLs updated to production endpoints

**Owner:** {{INTEGRATION_OWNER}} | **Due:** T-2 days

---

### T-1 Day: Monitoring & Alerting

- [ ] All alert rules deployed to production monitoring
- [ ] Alert routing configured — PagerDuty / on-call active
- [ ] Dashboards showing production data
- [ ] Log aggregation capturing production logs
- [ ] Distributed tracing enabled
- [ ] Synthetic monitoring configured (uptime checks every 1 min)
- [ ] Alert test fired and received by on-call

**Owner:** {{MONITORING_OWNER}} | **Due:** T-1 day

---

### T-1 Day: Backup Verification

- [ ] Production backup job running on schedule
- [ ] Last backup restored to test environment and verified
- [ ] Backup storage has sufficient capacity (> {{BACKUP_DAYS}} days)
- [ ] Point-in-time recovery tested

**Owner:** {{BACKUP_OWNER}} | **Due:** T-1 day

---

### T-1 Day: Legal / Compliance Sign-off

- [ ] Privacy policy published and linked
- [ ] Terms of service published and linked
- [ ] Cookie consent banner implemented (if required by jurisdiction)
- [ ] GDPR data processing inventory updated
- [ ] Security assessment completed and any findings resolved or accepted
- [ ] Legal sign-off obtained: {{LEGAL_SIGNOFF}} on {{DATE}}

**Owner:** {{LEGAL_OWNER}} | **Due:** T-1 day

---

### T-0: Pre-Launch Final Checks (Within 2 Hours of Launch)

- [ ] Staging smoke tests passing (last run: {{TIMESTAMP}})
- [ ] All engineers briefed and available
- [ ] War room open and all participants joined
- [ ] Rollback procedure rehearsed mentally
- [ ] Monitoring dashboards open
- [ ] Status page updated: "Scheduled maintenance: {{TIME}} - {{END_TIME}}"
- [ ] Customer support briefed on launch features and potential issues
- [ ] Deployment script / CI pipeline ready to trigger

---

## 3. Launch Day Procedure (Hour by Hour)

<!-- GUIDANCE: This section becomes the live execution log during the launch. Times are offsets from launch start. -->

### H-0: Deployment Start

| Time | Action | Owner | Status | Notes |
|------|--------|-------|--------|-------|
| H+0:00 | Announce in war room: "Deployment started" | {{IC}} | | |
| H+0:00 | Take final pre-deploy database backup | {{DB_OWNER}} | | |
| H+0:05 | Enable maintenance mode (if applicable) | {{DEPLOY_OWNER}} | | |
| H+0:10 | Trigger production deployment pipeline | {{DEPLOY_OWNER}} | | Pipeline: {{PIPELINE_LINK}} |
| H+0:15 | Monitor deployment progress | {{TECH_LEAD}} | | |

### H+0:15 → H+0:45: Database Migration Execution

| Time | Action | Owner | Status |
|------|--------|-------|--------|
| H+0:15 | Confirm deployment artifact ready | {{DEPLOY_OWNER}} | |
| H+0:20 | Run database migrations: `bash scripts/migrate-prod.sh` | {{DB_OWNER}} | |
| H+0:25 | Verify migration completed: `bash scripts/verify-migration.sh` | {{DB_OWNER}} | |
| H+0:30 | Confirm new application instances healthy | {{TECH_LEAD}} | |
| H+0:40 | Deploy new application version to all instances | {{DEPLOY_OWNER}} | |

### H+0:45 → H+1:00: DNS Cutover

| Time | Action | Owner | Status |
|------|--------|-------|--------|
| H+0:45 | Point DNS to production load balancer | {{DNS_OWNER}} | |
| H+0:50 | Monitor DNS propagation | {{DNS_OWNER}} | |
| H+0:55 | Confirm HTTPS working from external network | {{TECH_LEAD}} | |
| H+1:00 | Disable maintenance mode | {{DEPLOY_OWNER}} | |

### H+1:00 → H+1:30: Smoke Tests

| Time | Action | Owner | Status |
|------|--------|-------|--------|
| H+1:00 | Run automated smoke tests: `bash scripts/smoke-tests.sh production` | {{QA_OWNER}} | |
| H+1:10 | Manual smoke test — critical user journey 1 | {{QA_OWNER}} | |
| H+1:15 | Manual smoke test — critical user journey 2 | {{QA_OWNER}} | |
| H+1:20 | Verify payment processing (test transaction) | {{QA_OWNER}} | |
| H+1:25 | Verify email delivery (test email) | {{QA_OWNER}} | |
| H+1:30 | All smoke tests PASS → proceed to monitoring | {{IC}} | |

### H+1:30 → H+2:00: Monitoring Verification

| Time | Action | Owner | Status |
|------|--------|-------|--------|
| H+1:30 | Verify error rate < {{ERROR_THRESHOLD}}% | {{TECH_LEAD}} | |
| H+1:35 | Verify P99 latency < {{P99_THRESHOLD}}ms | {{TECH_LEAD}} | |
| H+1:40 | Verify no unexpected spikes in DB CPU/connections | {{DB_OWNER}} | |
| H+1:50 | Begin enabling feature flags (per rollout plan) | {{FF_OWNER}} | |
| H+2:00 | Declare go-live successful | {{IC}} | |

---

## 4. Post-Launch Monitoring (T+1 to T+7)

### Enhanced Monitoring Period

<!-- GUIDANCE: First week after go-live requires closer attention than normal operations. -->

**Duration:** {{POST_LAUNCH_MONITORING}}h enhanced monitoring
**Monitoring cadence:** Every 30 min for first 4h, then hourly for 24h, then normal

| Period | Check Frequency | Responsible |
|--------|-----------------|-------------|
| H+0 to H+4 | Every 30 min | On-call engineer |
| H+4 to H+24 | Every 60 min | On-call engineer |
| Day 2-7 | Standard monitoring | On-call rotation |

**Metrics to watch during enhanced monitoring:**
- Error rate (target: < {{ERROR_THRESHOLD}}%)
- P99 latency (target: < {{P99_THRESHOLD}}ms)
- DB connection pool utilization (target: < {{DB_POOL}}%)
- Cache hit rate (target: > {{CACHE_HIT}}%)
- Memory trend (should be stable, not growing)

### Support Escalation Procedures

| Issue Type | First Contact | Escalation |
|------------|--------------|------------|
| User-facing errors | Customer support → Engineering | On-call engineer |
| Performance degradation | On-call engineer | Tech lead + Eng manager |
| Data issues | On-call engineer | DB owner + Engineering lead |
| Security concern | Security contact → CISO | Immediate escalation |

### Performance Baseline Comparison

Compare post-launch metrics to pre-launch staging baseline:

| Metric | Staging Baseline | Production Actual | Delta | Status |
|--------|-----------------|-------------------|-------|--------|
| P95 latency | {{STG_P95}}ms | TBD | TBD | TBD |
| Error rate | {{STG_ERR}}% | TBD | TBD | TBD |
| Throughput | {{STG_RPS}} rps | TBD | TBD | TBD |

---

## 5. Rollback Triggers & Procedure

<!-- GUIDANCE: Define objective criteria for rolling back. Remove ambiguity from the decision. -->

### Rollback Decision Criteria

**Automatic rollback triggers:**
- Smoke tests fail after deployment
- Error rate > {{ROLLBACK_ERROR_RATE}}% for {{ROLLBACK_DURATION}} consecutive minutes
- Database migration causes data integrity issues

**Manual rollback triggers (decision by {{ROLLBACK_AUTHORITY}}):**
- P99 latency > {{ROLLBACK_P99}}ms sustained for {{ROLLBACK_LATENCY_DURATION}} min
- Critical feature broken with no quick fix available
- Security vulnerability discovered in new release

### Rollback Procedure (Quick Reference)

1. Announce in war room: "Initiating rollback"
2. Update status page: "We are investigating an issue and may revert recent changes"
3. Run: `bash scripts/rollback.sh production` (or trigger CI pipeline rollback)
4. Monitor health checks — confirm previous version healthy
5. If DB migration included: run down migration `bash scripts/migrate-down.sh production`
6. Verify all smoke tests pass on previous version
7. Update status page: "Issue resolved, system restored"
8. Notify stakeholders

**Full rollback procedure:** See [rollback-plan.md](../RELEASE/rollback-plan.md)

---

## 6. Communication Plan

### Pre-Launch Communications

| Audience | Channel | When | Message |
|----------|---------|------|---------|
| Internal team | Slack #launches | T-3 days | Launch schedule and plan |
| Customer support | Briefing doc + Slack | T-2 days | Features, FAQ, escalation path |
| Existing users | Email / in-app banner | T-1 day | "Exciting updates coming" |
| Status page subscribers | Status page | T-4 hours | Scheduled maintenance notification |

### Launch Day Communications

| Audience | Channel | When | Message |
|----------|---------|------|---------|
| Status page | status page | T-0 | "Scheduled deployment in progress" |
| Internal | Slack #launches | At success | "🚀 {{PROJECT}} is live!" |
| Users | Email / in-app | H+1 after success | Launch announcement |
| Status page | status page | H+1 | "Deployment complete — all systems normal" |

---

## 7. Stakeholder Notification Timeline

| Milestone | Notify | Channel | Owner |
|-----------|--------|---------|-------|
| Deployment started | Engineering team | Slack war room | {{IC}} |
| Smoke tests pass | Engineering + Product | Slack | {{IC}} |
| Go-live declared | All stakeholders | Email + Slack | {{COMMS_LEAD}} |
| Rollback initiated | All stakeholders + Management | Immediate call + Slack | {{IC}} |

---

## Related Documents

- [Deployment Checklist](../RELEASE/deployment-checklist.md)
- [Rollback Plan](../RELEASE/rollback-plan.md)
- [Operational Runbook](./operational-runbook.md)
- [Monitoring & Observability](../INFRASTRUCTURE/monitoring-observability.md)
- [Disaster Recovery Plan](../INFRASTRUCTURE/disaster-recovery-plan.md)

---

## Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Reviewer | | | |
| Approver | | | |