Go-Live Runbook
Go-Live Runbook
Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Go-Live Overview
What: {{PROJECT_NAME}} v{{VERSION}} production launch When: {{LAUNCH_DATE}} at {{LAUNCH_TIME}} {{TIMEZONE}} Deployment window: {{WINDOW_START}} – {{WINDOW_END}} ({{WINDOW_DURATION}}h window) Go-Live Type: {{TYPE}}
Incident Commander: {{IC}} (primary), {{IC_BACKUP}} (backup) Technical Lead: {{TECH_LEAD}} Communications Lead: {{COMMS_LEAD}} War Room: {{WAR_ROOM_LINK}} Status Page: {{STATUS_PAGE_URL}}
2. Pre-Launch Checklist
T-7 Days: Infrastructure Verification
- All production infrastructure provisioned and tested
- Load balancer health checks passing for all instances
- Auto-scaling groups configured and tested (scale-up + scale-down)
- Database replicas in sync and replication lag < {{REPLICATION_LAG}}s
- Backup jobs running successfully (last backup verified: {{VERIFY_DATE}})
- CDN configured and serving assets correctly
- All IAM roles and permissions verified
- Infrastructure monitoring dashboards showing green
- Estimated cost reviewed and within budget
Owner: {{INFRA_OWNER}} | Due: T-7 days
T-5 Days: DNS Configuration
- DNS records created/updated in {{DNS_PROVIDER}}
{{DOMAIN}}→ Load balancer (TTL set to {{LOW_TTL}} for easy rollback)api.{{DOMAIN}}→ API load balancerwww.{{DOMAIN}}→ Redirect to{{DOMAIN}}
- DNS propagation verified (check from multiple regions)
- DNS failover routing configured (if applicable)
- Old DNS records documented (for rollback reference)
Owner: {{DNS_OWNER}} | Due: T-5 days
T-5 Days: SSL Certificates
- TLS certificates provisioned for all domains
{{DOMAIN}}✅*.{{DOMAIN}}✅
- Certificate expiry > 90 days from go-live date
- HTTPS redirect configured (HTTP → HTTPS)
- HSTS header configured
- SSL Labs test: Grade A or better ({{SSL_TEST_LINK}})
Owner: {{SSL_OWNER}} | Due: T-5 days
T-3 Days: CDN Configuration
- CDN distribution pointing to production origin
- Cache behaviors configured per specification
- Static asset cache headers correct (1yr for fingerprinted assets)
- CDN WAF rules enabled and tested
- CDN purge command tested and documented
- CDN performance verified from target geographies
Owner: {{CDN_OWNER}} | Due: T-3 days
T-3 Days: Database Migration
- Final migration scripts reviewed and approved
- Migration tested on staging with production-sized data (timing recorded: {{MIGRATION_TIME}}min)
- Rollback/down migration tested
- Migration script idempotent (safe to run twice)
- Database backup taken immediately before migration window
- Data integrity checks script prepared (
scripts/verify-migration.sh)
Owner: {{DB_OWNER}} | Due: T-3 days
T-2 Days: Feature Flags
- All new features behind feature flags
- Feature flags defaulting to OFF in production
- Flag rollout plan documented (which flags, in what order, with what criteria)
- Kill switch flags configured (disable any feature immediately if needed)
Owner: {{FF_OWNER}} | Due: T-2 days
T-2 Days: Third-Party Integrations
- {{INTEGRATION_1}} — live API keys configured in secrets manager
- {{INTEGRATION_2}} — live API keys configured in secrets manager
- Payment gateway: live mode activated and tested with real card (refunded)
- Email service: sending domain authenticated (SPF, DKIM, DMARC)
- All integrations tested in production with smoke tests
- Webhook URLs updated to production endpoints
Owner: {{INTEGRATION_OWNER}} | Due: T-2 days
T-1 Day: Monitoring & Alerting
- All alert rules deployed to production monitoring
- Alert routing configured — PagerDuty / on-call active
- Dashboards showing production data
- Log aggregation capturing production logs
- Distributed tracing enabled
- Synthetic monitoring configured (uptime checks every 1 min)
- Alert test fired and received by on-call
Owner: {{MONITORING_OWNER}} | Due: T-1 day
T-1 Day: Backup Verification
- Production backup job running on schedule
- Last backup restored to test environment and verified
- Backup storage has sufficient capacity (> {{BACKUP_DAYS}} days)
- Point-in-time recovery tested
Owner: {{BACKUP_OWNER}} | Due: T-1 day
T-1 Day: Legal / Compliance Sign-off
- Privacy policy published and linked
- Terms of service published and linked
- Cookie consent banner implemented (if required by jurisdiction)
- GDPR data processing inventory updated
- Security assessment completed and any findings resolved or accepted
- Legal sign-off obtained: {{LEGAL_SIGNOFF}} on {{DATE}}
Owner: {{LEGAL_OWNER}} | Due: T-1 day
T-0: Pre-Launch Final Checks (Within 2 Hours of Launch)
- Staging smoke tests passing (last run: {{TIMESTAMP}})
- All engineers briefed and available
- War room open and all participants joined
- Rollback procedure rehearsed mentally
- Monitoring dashboards open
- Status page updated: "Scheduled maintenance: {{TIME}} - {{END_TIME}}"
- Customer support briefed on launch features and potential issues
- Deployment script / CI pipeline ready to trigger
3. Launch Day Procedure (Hour by Hour)
H-0: Deployment Start
| Time | Action | Owner | Status | Notes |
|---|---|---|---|---|
| H+0:00 | Announce in war room: "Deployment started" | {{IC}} | ||
| H+0:00 | Take final pre-deploy database backup | {{DB_OWNER}} | ||
| H+0:05 | Enable maintenance mode (if applicable) | {{DEPLOY_OWNER}} | ||
| H+0:10 | Trigger production deployment pipeline | {{DEPLOY_OWNER}} | Pipeline: {{PIPELINE_LINK}} | |
| H+0:15 | Monitor deployment progress | {{TECH_LEAD}} |
H+0:15 → H+0:45: Database Migration Execution
| Time | Action | Owner | Status |
|---|---|---|---|
| H+0:15 | Confirm deployment artifact ready | {{DEPLOY_OWNER}} | |
| H+0:20 | Run database migrations: bash scripts/migrate-prod.sh |
{{DB_OWNER}} | |
| H+0:25 | Verify migration completed: bash scripts/verify-migration.sh |
{{DB_OWNER}} | |
| H+0:30 | Confirm new application instances healthy | {{TECH_LEAD}} | |
| H+0:40 | Deploy new application version to all instances | {{DEPLOY_OWNER}} |
H+0:45 → H+1:00: DNS Cutover
| Time | Action | Owner | Status |
|---|---|---|---|
| H+0:45 | Point DNS to production load balancer | {{DNS_OWNER}} | |
| H+0:50 | Monitor DNS propagation | {{DNS_OWNER}} | |
| H+0:55 | Confirm HTTPS working from external network | {{TECH_LEAD}} | |
| H+1:00 | Disable maintenance mode | {{DEPLOY_OWNER}} |
H+1:00 → H+1:30: Smoke Tests
| Time | Action | Owner | Status |
|---|---|---|---|
| H+1:00 | Run automated smoke tests: bash scripts/smoke-tests.sh production |
{{QA_OWNER}} | |
| H+1:10 | Manual smoke test — critical user journey 1 | {{QA_OWNER}} | |
| H+1:15 | Manual smoke test — critical user journey 2 | {{QA_OWNER}} | |
| H+1:20 | Verify payment processing (test transaction) | {{QA_OWNER}} | |
| H+1:25 | Verify email delivery (test email) | {{QA_OWNER}} | |
| H+1:30 | All smoke tests PASS → proceed to monitoring | {{IC}} |
H+1:30 → H+2:00: Monitoring Verification
| Time | Action | Owner | Status |
|---|---|---|---|
| H+1:30 | Verify error rate < {{ERROR_THRESHOLD}}% | {{TECH_LEAD}} | |
| H+1:35 | Verify P99 latency < {{P99_THRESHOLD}}ms | {{TECH_LEAD}} | |
| H+1:40 | Verify no unexpected spikes in DB CPU/connections | {{DB_OWNER}} | |
| H+1:50 | Begin enabling feature flags (per rollout plan) | {{FF_OWNER}} | |
| H+2:00 | Declare go-live successful | {{IC}} |
4. Post-Launch Monitoring (T+1 to T+7)
Enhanced Monitoring Period
Duration: {{POST_LAUNCH_MONITORING}}h enhanced monitoring Monitoring cadence: Every 30 min for first 4h, then hourly for 24h, then normal
| Period | Check Frequency | Responsible |
|---|---|---|
| H+0 to H+4 | Every 30 min | On-call engineer |
| H+4 to H+24 | Every 60 min | On-call engineer |
| Day 2-7 | Standard monitoring | On-call rotation |
Metrics to watch during enhanced monitoring:
- Error rate (target: < {{ERROR_THRESHOLD}}%)
- P99 latency (target: < {{P99_THRESHOLD}}ms)
- DB connection pool utilization (target: < {{DB_POOL}}%)
- Cache hit rate (target: > {{CACHE_HIT}}%)
- Memory trend (should be stable, not growing)
Support Escalation Procedures
| Issue Type | First Contact | Escalation |
|---|---|---|
| User-facing errors | Customer support → Engineering | On-call engineer |
| Performance degradation | On-call engineer | Tech lead + Eng manager |
| Data issues | On-call engineer | DB owner + Engineering lead |
| Security concern | Security contact → CISO | Immediate escalation |
Performance Baseline Comparison
Compare post-launch metrics to pre-launch staging baseline:
| Metric | Staging Baseline | Production Actual | Delta | Status |
|---|---|---|---|---|
| P95 latency | {{STG_P95}}ms | TBD | TBD | TBD |
| Error rate | {{STG_ERR}}% | TBD | TBD | TBD |
| Throughput | {{STG_RPS}} rps | TBD | TBD | TBD |
5. Rollback Triggers & Procedure
Rollback Decision Criteria
Automatic rollback triggers:
- Smoke tests fail after deployment
- Error rate > {{ROLLBACK_ERROR_RATE}}% for {{ROLLBACK_DURATION}} consecutive minutes
- Database migration causes data integrity issues
Manual rollback triggers (decision by {{ROLLBACK_AUTHORITY}}):
- P99 latency > {{ROLLBACK_P99}}ms sustained for {{ROLLBACK_LATENCY_DURATION}} min
- Critical feature broken with no quick fix available
- Security vulnerability discovered in new release
Rollback Procedure (Quick Reference)
- Announce in war room: "Initiating rollback"
- Update status page: "We are investigating an issue and may revert recent changes"
- Run:
bash scripts/rollback.sh production(or trigger CI pipeline rollback) - Monitor health checks — confirm previous version healthy
- If DB migration included: run down migration
bash scripts/migrate-down.sh production - Verify all smoke tests pass on previous version
- Update status page: "Issue resolved, system restored"
- Notify stakeholders
Full rollback procedure: See rollback-plan.md
6. Communication Plan
Pre-Launch Communications
| Audience | Channel | When | Message |
|---|---|---|---|
| Internal team | Slack #launches | T-3 days | Launch schedule and plan |
| Customer support | Briefing doc + Slack | T-2 days | Features, FAQ, escalation path |
| Existing users | Email / in-app banner | T-1 day | "Exciting updates coming" |
| Status page subscribers | Status page | T-4 hours | Scheduled maintenance notification |
Launch Day Communications
| Audience | Channel | When | Message |
|---|---|---|---|
| Status page | status page | T-0 | "Scheduled deployment in progress" |
| Internal | Slack #launches | At success | "🚀 {{PROJECT}} is live!" |
| Users | Email / in-app | H+1 after success | Launch announcement |
| Status page | status page | H+1 | "Deployment complete — all systems normal" |
7. Stakeholder Notification Timeline
| Milestone | Notify | Channel | Owner |
|---|---|---|---|
| Deployment started | Engineering team | Slack war room | {{IC}} |
| Smoke tests pass | Engineering + Product | Slack | {{IC}} |
| Go-live declared | All stakeholders | Email + Slack | {{COMMS_LEAD}} |
| Rollback initiated | All stakeholders + Management | Immediate call + Slack | {{IC}} |
Related Documents
- Deployment Checklist
- Rollback Plan
- Operational Runbook
- Monitoring & Observability
- Disaster Recovery Plan
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | |||
| Reviewer | |||
| Approver |