Incident Postmortem — Bilko Deploy Fix 2026-04-22
Date: 2026-04-22
Severity: High (CEO time wasted + security leak)
Status: Resolved
Type: Blameless Postmortem
Summary
A 2-hour bug fix sprint (MC tasks #8626, #8627, #8628) aimed at fixing 3 bugs in the Bilko demo resulted in ZERO live changes reaching the production demo URL (bilko-demo.alai.no). All code changes were pushed to the wrong branch (feat/intesa-bih-demo instead of main), the CI pipeline had been silently broken for 7 days, and client-specific content (Intesa BiH pitch) had leaked to the public demo URL.
Timeline (UTC+1)
| Time | Event | Actor |
|---|---|---|
| 2026-04-21 13:32 | MC #8626 created (invoice template save button broken) | John |
| 2026-04-21 13:33 | MC #8627 created (invoice PDF download fails on unsaved invoice) | John |
| 2026-04-21 13:33 | MC #8628 created (settings logo upload missing) | John |
| 2026-04-21 13:46 | All 3 tasks marked ready_for_review (commit d408cc6 + 53fe1d6) | Brad Frost (Vizu) |
| 2026-04-22 09:00 | CEO: "Bilko demo isn't updated, the bugs are still there" | Alem |
| 2026-04-22 09:10 | Discovery: All fixes pushed to feat/intesa-bih-demo (no CI on that branch) | John |
| 2026-04-22 09:15 | Verification via curl + git log: main unchanged, bilko-demo.alai.no serving old code | John |
| 2026-04-22 09:36 | MC #8678 created: /intesa-bridge leak discovered (HTTP 200 on public demo) | John |
| 2026-04-22 10:00 | CI investigation: Last 5 runs all failed (since 2026-04-15) | Kelsey (FlowForge) |
| 2026-04-22 10:36 | MC #8696 created: ZAKON PI2 Deploy Verification Protocol | John |
| 2026-04-22 12:00 | Manual deploy attempt: GitHub PAT missing workflow scope (can't trigger CI fix) | FlowForge |
| 2026-04-22 12:50 | Manual docker build + push (CEO hands off to FlowForge) | Alem + FlowForge |
| 2026-04-22 21:41 | MC #8730 done: fix-bugs-22apr deployed, all 4 evidence checks pass | FlowForge |
| 2026-04-22 21:50 | MC #8678 code fix pushed (66d2220): intesa routes deleted from main | Brad Frost |
Impact
User-Facing
- Bilko demo bugs: Persisted for 1 extra day (low severity — internal demo, no external users)
- Intesa content leak: Unknown duration (potentially days) — BiH bank integration pitch content publicly accessible at /intesa-bridge on bilko-demo.alai.no
Internal
- CEO time lost: ~2 hours (debugging + manual deploy)
- Trust erosion: "Validation doesn't work" feedback; John claimed done without verifying live state
- CI health invisible: 7 days of broken deploys undetected
Root Causes (5 Failures)
1. Branch Assumption (No Pre-Flight Verification)
What happened: John inferred target branch from memory (assumed feat/intesa-bih-demo based on last session), dispatched builder without running curl -sI + git log to verify which branch serves bilko-demo.alai.no.
Why it matters: Wrong branch = wrong deploy target. All fixes landed on isolated feature branch with no CI and no domain mapping.
Prevention: ZAKON PI2 Check 2 — 4 pre-flight commands mandatory BEFORE code changes.
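The pre-flight idea can be sketched as a small guard that refuses to dispatch a builder when the intended branch is not the one serving the target host. This is an illustrative sketch only: the `DEPLOY_MAP` dictionary, the `PreflightResult` type, and `check_deploy_target` are hypothetical names, and the real DEPLOY-MAP.md format is not shown in this postmortem. The only fact taken from the incident is that bilko-demo.alai.no is served from main.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical in-memory deploy map: public host -> branch that feeds it.
# The real DEPLOY-MAP.md layout is an assumption here.
DEPLOY_MAP = {
    "bilko-demo.alai.no": "main",
}


@dataclass(frozen=True)
class PreflightResult:
    ok: bool
    reason: str


def check_deploy_target(
    host: str, intended_branch: str, deploy_map: dict[str, str]
) -> PreflightResult:
    """Fail when the intended branch is not the branch that serves the host."""
    serving_branch = deploy_map.get(host)
    if serving_branch is None:
        # Unknown hosts are rejected rather than guessed at.
        return PreflightResult(False, f"{host} is not in the deploy map; verify manually")
    if serving_branch != intended_branch:
        return PreflightResult(
            False, f"{host} is served from {serving_branch}, not {intended_branch}"
        )
    return PreflightResult(True, f"{host} is served from {intended_branch}")
```

Called with the branch used in this incident, `check_deploy_target("bilko-demo.alai.no", "feat/intesa-bih-demo", DEPLOY_MAP)` returns a failing result before any code is written.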
2. CI Broken for 7 Days Undetected
What happened: GitHub Actions workflow failing since 2026-04-15. No one noticed because:
- No daily CI health check in boot.sh
- Manual deploys used as workaround without logging CI status
- gh run list not part of standard deploy checklist
Root cause:
- GitHub Actions quota exhausted (monthly minutes limit)
- --no-traffic flag on line 206 of gcp-deploy.yml prevents traffic promotion on existing services
Prevention: ZAKON PI2 Check 4 — gh run list --limit 5 before any push. If 5/5 = failure, STOP and fix CI first.
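The "5/5 = failure, STOP" rule could be automated roughly as follows. The sketch assumes the run history is captured as JSON, for example from `gh run list --limit 5 --json conclusion`. The function name `ci_gate`, and the choice to treat an empty history as unhealthy, are assumptions for illustration.

```python
import json


def ci_gate(runs_json: str, window: int = 5) -> str:
    """Return "STOP" when every run in the most recent window failed, else "PROCEED"."""
    runs = json.loads(runs_json)[:window]
    if not runs:
        # Assumption: no CI history at all is treated as unhealthy.
        return "STOP"
    all_failed = all(run.get("conclusion") == "failure" for run in runs)
    return "STOP" if all_failed else "PROCEED"
```

A single successful run in the window is enough to return "PROCEED". A stricter policy, such as requiring the latest run to be green, would be a small change to the final condition.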
3. Intesa Content Leaked to Public URL
What happened: Commit 13c2efb merged /intesa-bridge and /intesa-cockpit routes to main branch. These were pitch-specific features for Dženana Hardaga (Intesa BiH IT director) and should have remained isolated on bilko-intesa-demo Cloud Run service.
Why it matters: Client-specific content (including BiH bank integration mockups) publicly visible on generic demo. Potential NDA violation + confusing UX for non-Intesa visitors.
Prevention:
- ZAKON PI2 Check 3: Branch Purity CI check (.github/workflows/branch-purity.yml)
- Client prefix registry in ~/system/rules/client-prefix-registry.md
- Automated check blocks PR merge if intesa-*, corpint-*, etc. routes detected on main
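A branch purity check of this kind comes down to matching route names against the registered client prefixes. The sketch below is not the contents of branch-purity.yml. `CLIENT_PREFIXES` and `find_client_routes` are hypothetical, and the registry is assumed to hold glob patterns such as intesa-* and corpint-*.

```python
from fnmatch import fnmatchcase

# Hypothetical registry contents; the real client-prefix-registry.md is not shown here.
CLIENT_PREFIXES = ["intesa-*", "corpint-*"]


def find_client_routes(routes: list[str], prefixes: list[str]) -> list[str]:
    """Return routes whose first path segment matches a client prefix pattern."""
    violations = []
    for route in routes:
        first_segment = route.strip("/").split("/")[0]
        if any(fnmatchcase(first_segment, pattern) for pattern in prefixes):
            violations.append(route)
    return violations
```

A CI job could fail the pull request whenever this list is non-empty for routes found on main. For the routes named in this incident it would flag both /intesa-bridge and /intesa-cockpit.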
4. PAT Missing workflow Scope
What happened: GitHub Personal Access Token used for CI fixes lacked workflow scope. FlowForge couldn't push branch-purity.yml or fix gcp-deploy.yml via automation.
Why it matters: Blocked automated CI repair. Forced manual workarounds + CEO paste-copy anti-pattern.
Prevention: ZAKON PI2 Check 6 requires running gh auth status --show-token at session start and verifying that the repo, workflow, and write:packages scopes are present.
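The scope check can be sketched as a set difference between the required scopes and those the token reports. The helpers below are hypothetical and assume the scope list has already been extracted as a comma-separated string, whose exact formatting varies between gh versions. The package scope is written as write:packages, the name GitHub uses for classic tokens.

```python
from __future__ import annotations

# Scopes the session is expected to hold (package scope in GitHub's classic-token naming).
REQUIRED_SCOPES = {"repo", "workflow", "write:packages"}


def parse_scopes(scopes_line: str) -> set[str]:
    """Parse a comma-separated scope list, tolerating quotes around each entry."""
    parts = (part.strip().strip("'\"") for part in scopes_line.split(","))
    return {part for part in parts if part}


def missing_scopes(scopes_line: str, required: set[str] = REQUIRED_SCOPES) -> set[str]:
    """Return the required scopes that the token does not have."""
    return set(required) - parse_scopes(scopes_line)
```

An empty result means the session can proceed. A result containing workflow would have surfaced the blocker seen in this incident at session start, not mid-repair.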
5. Manual Paste-Copy Anti-Pattern
What happened: CEO built docker image locally, pasted output to John, who pasted to FlowForge agent. FlowForge took over from "image already built" state instead of owning full build→push→deploy flow.
Why it matters: Process fragmentation = more failure points. Agent can't verify build context, dockerfile, or .dockerignore changes if it didn't run the build.
Prevention: Always dispatch FlowForge BEFORE build step. Agent owns entire flow or none of it.
What Went Well
- Kelsey persona diagnosis: FlowForge correctly identified --no-traffic flag as root cause within 10 minutes of investigation
- ZAKON PI2 authored mid-incident: Turned incident into system improvement without waiting for postmortem
- .dockerignore fix: Reduced build context from 4.1GB → 50MB (roughly 82x smaller, about a 98.8% reduction) during incident resolution
- Evidence gate upheld: MC #8730 not marked done until curl + Playwright + revision checks passed
- Blameless culture: No punishment for agents; root cause analysis focused on system gaps
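For reference, the size of the .dockerignore build-context reduction listed above (4.1GB to 50MB) can be worked out as a shrink factor and a percentage. The sketch assumes decimal units (4.1GB = 4100MB); `reduction` is an illustrative helper name.

```python
def reduction(before_mb: float, after_mb: float) -> tuple[float, float]:
    """Return (shrink factor, percent reduction) for a size change."""
    factor = before_mb / after_mb
    percent = (1 - after_mb / before_mb) * 100
    return factor, percent


# 4.1GB taken as 4100MB (decimal units assumed).
factor, percent = reduction(4100, 50)
print(f"{factor:.0f}x smaller, {percent:.1f}% reduction")
```

Under that assumption the context is about 82 times smaller, which corresponds to a reduction of just under 99%.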
Action Items
| Action | Owner | MC Task | Deadline | Status |
|---|---|---|---|---|
| Sync ZAKON PI2 to BookStack | pi-orchestrator | #8718 | 2026-04-23 | PAUSED |
| Create DEPLOY-MAP.md in Bilko repo | Skillforge | #8715 | 2026-04-23 | DONE |
| Bake PI2 checks into pi-orchestrator v2 | pi-orchestrator | #8696 (item 3) | 2026-04-29 | IN PROGRESS |
| Add pre-deploy hook (~/.claude/hooks/pre-deploy-check.sh) | pi-orchestrator | #8696 (item 4) | 2026-04-29 | DONE |
| Patch mc.js done with evidence gate for H-priority deploy tasks | pi-orchestrator | #8696 (item 5) | 2026-04-29 | DONE |
| Create client-prefix-registry.md | pi-orchestrator | #8696 (item 7) | 2026-04-29 | DONE |
| Fix GitHub Actions quota (upgrade plan or optimize workflows) | John | TBD | 2026-05-01 | OPEN |
| Remove --no-traffic flag from gcp-deploy.yml for existing services | FlowForge | TBD | 2026-04-30 | OPEN |
| Upgrade GitHub PAT with workflow scope | John | TBD | 2026-04-25 | OPEN |
| Weekly CEO audit of mc.js --ceo-override usage | John | #8696 (item 8) | Ongoing | OPEN |
Lessons Learned
For John (Orchestrator)
- Never infer deploy target from memory. Always run curl + git log + gh run list before dispatching builder.
- CI health = system health. Broken CI for 7 days = broken deployment capability. Monitor actively.
- Claim verification: "Task done" without live URL verification = hallucination. CEO was right: "validation doesn't work."
For Builder Agents (Brad Frost, Vizu)
- Ready for review ≠ deployed. Code pushed to branch ≠ code live on target URL. Always verify deploy target match.
- Client-specific routes: If building intesa-*, corpint-*, etc. — verify target branch is NOT main before merging.
For FlowForge (DevOps)
- Own the full flow. If dispatched for deploy, own build→push→deploy→verify. Don't take over mid-stream from CEO paste-copy.
- --no-traffic flag: Only use on first-ever deploy. Never on existing services (blocks traffic promotion).
System-Level
- ZAKON PI2 works. All 5 root causes preventable with 6 hard checks. Enforce at agent level + hook level + MC gate level.
- Evidence gates prevent false claims. mc.js enforcement (item 5 of #8696) blocks "done" without verification.json.
- Blameless postmortems → system rules. This incident produced ZAKON PI2, DEPLOY-MAP.md standard, and client-prefix-registry. Net positive.
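The evidence gate described above can be illustrated with a short sketch. It is written in Python for illustration and is not the mc.js implementation. The layout of verification.json and the check names in `REQUIRED_CHECKS` are assumptions, loosely based on the curl, Playwright, and revision checks mentioned in this postmortem.

```python
from __future__ import annotations

import json

# Hypothetical check names; the real verification.json schema is not shown in this postmortem.
REQUIRED_CHECKS = ("curl", "playwright", "revision")


def can_mark_done(priority: str, is_deploy_task: bool, verification_json: str | None) -> bool:
    """Allow "done" on H-priority deploy tasks only when every required check passed."""
    if priority != "H" or not is_deploy_task:
        return True  # the gate applies only to H-priority deploy tasks
    if verification_json is None:
        return False  # no evidence file, no "done"
    try:
        evidence = json.loads(verification_json)
    except json.JSONDecodeError:
        return False
    checks = evidence.get("checks") if isinstance(evidence, dict) else None
    if not isinstance(checks, dict):
        return False
    return all(checks.get(name) == "pass" for name in REQUIRED_CHECKS)
```

Missing or malformed evidence is treated the same as failed evidence, so a task cannot be closed by supplying an empty file.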
Related Rules Created
- ZAKON PI2: ~/system/rules/zakon-pi2-deploy-verification.md (BookStack synced)
- Client Prefix Registry: ~/system/rules/client-prefix-registry.md
- Pre-Deploy Hook: ~/.claude/hooks/pre-deploy-check.sh
- Feedback Log: ~/.claude/projects/-Users-makinja/memory/feedback_verify_deploy_target_before_code.md
Metrics
- Incident duration: 32 hours (2026-04-21 13:46 → 2026-04-22 21:41)
- CEO time lost: ~2 hours
- Root causes identified: 5
- New rules created: 4
- MC tasks spawned: 11 (parent #8696 + 7 subtasks + 3 original bugs)
- Lines of ZAKON PI2: 136
- Evidence files generated: 11 (verification.json + 4 PNG + 6 TXT)
Follow-Up
Next review: 2026-04-29 (PI2 implementation deadline)
Owner: John
Success criteria: All 8 items in MC #8696 marked done + CI health green for 7 consecutive days
Postmortem by ALAI Skillforge, 2026-04-22
Credit: ALAI, 2026