Alert Routing — Channel Mapping & SLA
Purpose: Who gets what alert, on which channel, with what latency target.
Audience: John (orchestrator), Alem (CEO), ops-watchdog daemon, agent builders
Last updated: 2026-04-19 (SENTINEL Sprint Task 7)
Alert Severity Table
| Severity | Channel | Target Audience | Latency SLA | Retry Logic | Example Alerts |
|---|---|---|---|---|---|
| P0 Critical | Slack #ops + Email fallback | Alem + John | ≤ 60s | Retry 3x, then email | Public surface 502 (≥2 cycles), Cloudflared tunnel down, Slack bot SIGKILL |
| P1 High | Slack #ops | John (on-call) | ≤ 3 min | Retry 2x, then DLQ | Daemon exit nonzero (critical services), Email DLQ > 5 entries, TLS cert expiry ≤ 7 days |
| P2 Info | john-daily-digest | Alem (morning review) | Daily 08:00 CET | Buffered, no retry | New skill proposal, briefing summary, task ready for review, HiveMind research |
| P3 Debug | Log file only | Archive (no human) | n/a | Write once | Heartbeat OK pulses, ops-watchdog check passed, daemon start/stop routine |
Key principle: P0/P1 alerts MUST be actionable. If no action is needed → downgrade to P2 or P3. Alert fatigue = blind system.
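The severity table can also be held as data, so routing code and this page stay in sync. The following is a minimal sketch, not the ops-watchdog implementation: `Route`, `ROUTES`, and `route_for` are hypothetical names, and downgrading a non-actionable alert to P2 (rather than P3) is an assumption. Channels, SLAs, and retry counts are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    channel: str                # primary delivery channel
    sla_seconds: Optional[int]  # None = no real-time latency target
    retries: int                # delivery retries before the fallback applies
    fallback: Optional[str]     # where the alert goes once retries are exhausted


# Hypothetical lookup table mirroring the severity table above.
ROUTES = {
    "P0": Route("Slack #ops", 60, 3, "email"),
    "P1": Route("Slack #ops", 180, 2, "dlq"),
    "P2": Route("john-daily-digest", None, 0, None),
    "P3": Route("log file", None, 0, None),
}


def route_for(severity: str, actionable: bool = True) -> Route:
    """Return the route for a severity; a non-actionable P0/P1 is downgraded to P2."""
    if severity in ("P0", "P1") and not actionable:
        severity = "P2"  # assumption: P2 is chosen; the page allows P2 or P3
    return ROUTES[severity]
```

For example, `route_for("P1", actionable=False).channel` evaluates to `"john-daily-digest"`.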
Channel Routing Details
1. Slack #ops (Primary Technical Channel)
Purpose: Real-time technical alerts requiring immediate investigation or fix.
Routing sources:
- BetterStack webhook (external monitors: 7 public endpoints)
- ops-watchdog Slack bot (internal monitors: 17 critical daemons + 6 public endpoints)
- HiveMind kind=alert subscriber (agent-generated alerts, e.g., security scan fail, cost budget exceeded)
Target audience:
- John (orchestrator) — primary on-call
- FlowForge/CodeCraft agents (when delegated)
- Alem (if John offline or P0 escalation)
Message format:
[SOURCE] Severity: Alert Title
Details: <brief description>
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/<name>.md (if available)
Example:
[SENTINEL ALERT] P0: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502
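The four-line layout above is simple enough to render with one helper. This is a sketch under assumptions: `format_alert` is a hypothetical function name, and the timestamp passed in is assumed to already be in local time, so the "CET" label is appended as literal text.

```python
from datetime import datetime
from typing import Optional


def format_alert(source: str, severity: str, title: str, details: str,
                 when: datetime, runbook: Optional[str] = None) -> str:
    """Render an alert in the #ops layout; the Runbook line is optional."""
    lines = [
        f"[{source}] {severity}: {title}",
        f"Details: {details}",
        # Assumes `when` is already local time; no timezone conversion is done.
        f"Time: {when:%Y-%m-%d %H:%M:%S} CET",
    ]
    if runbook:
        lines.append(f"Runbook: {runbook}")
    return "\n".join(lines)
```

Leaving `runbook` unset drops the last line, which matches the "(if available)" wording in the format.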
Cooldown: Same alert within 15 min = suppressed (prevents spam from flapping services). After 15 min silence, next occurrence fires new alert.
Alert count limit: If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem ("Repeated failure — may need architectural fix").
2. Email Fallback ([email protected])
Purpose: Backup channel when Slack #ops is unreachable OR Slack bot (com.john.slack-bot) is dead.
Trigger conditions:
- Slack bot PID = "-" (daemon stopped/killed); ops-watchdog detects this via the critical_services check
- Slack API returns 5xx error for 3 consecutive attempts (Slack platform outage)
- Ops-watchdog config email_fallback.enabled = true (set after SENTINEL sprint)
Routing logic (ops-watchdog):
# Pseudocode from ops-watchdog daemon:
if slack_bot_dead() or slack_api_unavailable():
    send_email(
        to="[email protected]",
        subject="[SENTINEL FALLBACK] Alert: <title>",
        body="Slack #ops unreachable. Alert details:\n<full alert message>",
    )
Latency SLA: ≤ 90s from alert trigger (Slack primary is 60s, email fallback is 30s slower due to SMTP handshake).
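The trigger conditions and the pseudocode can be turned into a small runnable sketch. It is illustrative only: `should_fall_back` and `build_fallback_email` are hypothetical names, the inputs (a PID string and a list of recent Slack HTTP status codes) are assumed shapes, and the message is built with the standard library `email.message.EmailMessage` without being sent.

```python
from email.message import EmailMessage


def should_fall_back(slack_bot_pid: str, recent_slack_statuses: list[int],
                     enabled: bool = True) -> bool:
    """True when fallback is enabled and the bot is dead or Slack returned 5xx 3 times in a row."""
    bot_dead = slack_bot_pid == "-"
    last_three = recent_slack_statuses[-3:]
    api_down = len(last_three) == 3 and all(500 <= s <= 599 for s in last_three)
    return enabled and (bot_dead or api_down)


def build_fallback_email(to_addr: str, title: str, alert_text: str) -> EmailMessage:
    """Build (but do not send) the fallback email with the documented subject and body."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"[SENTINEL FALLBACK] Alert: {title}"
    msg.set_content(f"Slack #ops unreachable. Alert details:\n{alert_text}")
    return msg
```

Delivery could then go through `smtplib.SMTP(...).send_message(msg)`; the SMTP host and credentials are deployment details not covered here.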
Example email:
Subject: [SENTINEL FALLBACK] P0: PUBLIC SURFACE DOWN: alai.no
Body:
Slack #ops is unreachable (slack-bot SIGKILL'd).
Alert routed via email fallback.
Alert: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Source: ops-watchdog (internal monitor)
Action: Restart cloudflared tunnel + verify origin
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502
— ops-watchdog daemon
Severity handling in fallback mode: All P0 alerts go to email. P1 alerts are buffered to the DLQ (~/system/logs/alert-dlq.jsonl) until the Slack bot is restored, then replayed to Slack #ops.
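Buffering P1 alerts to a JSON Lines file and replaying them later could look like the sketch below. The record fields and the function names `buffer_to_dlq` and `replay_dlq` are hypothetical, since the actual DLQ schema is not described on this page; `send` stands in for whatever posts to Slack #ops and is assumed to return True on success.

```python
import json
from pathlib import Path
from typing import Callable


def buffer_to_dlq(dlq_path: Path, alert: dict) -> None:
    """Append one alert to the DLQ file as a single JSON line."""
    dlq_path.parent.mkdir(parents=True, exist_ok=True)
    with dlq_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(alert) + "\n")


def replay_dlq(dlq_path: Path, send: Callable[[dict], bool]) -> int:
    """Replay buffered alerts in order; keep any that still fail. Returns the number sent."""
    if not dlq_path.exists():
        return 0
    raw = dlq_path.read_text(encoding="utf-8")
    alerts = [json.loads(line) for line in raw.splitlines() if line.strip()]
    remaining = [a for a in alerts if not send(a)]
    # Rewrite the file so only undelivered alerts stay queued.
    dlq_path.write_text("".join(json.dumps(a) + "\n" for a in remaining),
                        encoding="utf-8")
    return len(alerts) - len(remaining)
```

Keeping failed entries in the file means a replay that is interrupted by another Slack outage loses nothing.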
3. john-daily-digest (Summary Layer)
Purpose: Non-urgent aggregated summary for Alem's morning review (08:00 CET).
Content sources:
- Overnight task completions (mc.js done events)
- New intake (email, Slack, contact forms) classified as P2
- HiveMind kind=briefing events (daily briefing, weekly summaries)
- Cost tracker daily spend (if > 50 USD)
- Skill proposals from agents (new cookbook entries, tool upgrades)
Delivery:
- Channel: Slack DM to Alem (private message, NOT #ops or #exec)
- Schedule: Daily 08:00 CET (launchctl StartCalendarInterval)
- Format: Markdown summary, max 500 words, grouped by category
Example digest:
Good morning Alem. Overnight summary (2026-04-18 18:00 → 2026-04-19 08:00 CET):
## Tasks Completed (3)
- #8370: SENTINEL T2a BetterStack 6 monitors (FlowForge) — 6 new public endpoint monitors added
- #8371: SENTINEL T3b Email DLQ (CodeCraft) — Dead-letter queue operational, tested with vault failure
- #8372: SENTINEL T7 BookStack 3 runbooks (Skillforge) — Documentation complete
## New Intake (2)
- Email from prospect (forwarded by John): Inquiring about AI consulting for retail chain (200 stores)
- Slack message from partner: Entur wants to schedule follow-up call for RAG demo
## Cost Alert (1)
- Yesterday spend: 67 USD (above 50 USD threshold)
  - Azure VM: 22 USD
  - OpenAI API: 38 USD (Opus 4 tasks)
  - Vercel: 7 USD
## System Health
- Dead daemons: 12 (down from 16 yesterday — 4 fixed)
- Public surfaces: 6 of 7 green (snowit.ba still NXDOMAIN)
- Email DLQ: 1 entry (from validation test)
Next: Phase 2 sprint planning (secondary tunnel + 12 dead daemons).
— John
Opt-out: Alem can pause digest via node ~/system/tools/mc.js config set digest.enabled false (not recommended — digest is designed to prevent morning blind spots).
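The grouping by category and the 500-word cap described under Delivery can be sketched as follows. `build_digest` is a hypothetical helper, the word count is a rough whitespace split, and dropping whole categories once the cap is reached is an assumption about how truncation should behave.

```python
def build_digest(items: dict[str, list[str]], max_words: int = 500) -> str:
    """Group entries under '## Category (n)' headings, stopping before max_words is exceeded."""
    lines: list[str] = []
    words_used = 0
    for category, entries in items.items():
        if not entries:
            continue  # empty categories are omitted from the digest
        block = [f"## {category} ({len(entries)})"] + [f"- {e}" for e in entries]
        block_words = sum(len(line.split()) for line in block)
        if words_used + block_words > max_words:
            lines.append("(truncated: word limit reached)")
            break
        lines.extend(block)
        words_used += block_words
    return "\n".join(lines)
```

Because categories are added in dictionary order, the caller decides which sections survive truncation by ordering the input.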
Alert Routing by Source
BetterStack (External SaaS Monitors)
| Monitor Name | URL | Check Interval | Alert Channel | Escalation |
|---|---|---|---|---|
| Drop Landing Page | https://getdrop.no | 3 min | Slack #ops | P0 if down > 10 min (revenue event) |
| alai.no Landing | https://alai.no | 3 min | Slack #ops | P0 if down > 5 min |
| lumiscare.alai.no | https://lumiscare.alai.no | 3 min | Slack #ops | P1 (demo, not production) |
| BookStack docs | https://docs.alai.no | 3 min | Slack #ops | P1 (internal wiki, not customer-facing) |
| Vaultwarden vault | https://vault.basicconsulting.no | 3 min | Slack #ops | P0 (email intake depends on it) |
| Documenso sign | https://sign.basicconsulting.no | 3 min | Slack #ops | P1 (signing, not immediate revenue) |
| snowit.ba | https://snowit.ba | 3 min | Slack #ops | P2 (currently NXDOMAIN, owner decision pending) |
Alert message format from BetterStack:
[BetterStack] Monitor DOWN: <Monitor Name>
URL: <URL>
Status: <HTTP status code or DNS error>
Duration: <time since first failure>
Dashboard: https://betterstack.com/uptime
Cooldown: BetterStack has built-in "confirmation period" (30s) — waits 30s after first failure before firing alert (prevents transient network blip alerts).
ops-watchdog (Internal Daemon Monitors)
| Service | Check Type | Interval | Alert Channel | Consecutive Failures Required |
|---|---|---|---|---|
| com.john.slack-bot | PID check | 2 min | Email fallback (if dead, can't alert via Slack) | 2 |
| com.john.cloudflared | PID + exit status | 2 min | Slack #ops + Email | 2 |
| com.john.ops-watchdog | Self-health check | 2 min | Email (watchdog can't alert itself via Slack if dead) | 2 |
| com.john.email-agent | PID + last-success file age | 2 min | Slack #ops | 2 |
| com.john.mc-dashboard | PID + curl :3030 | 2 min | Slack #ops | 2 |
| com.john.bookstack-sync | PID | 2 min | Slack #ops | 3 (less critical) |
| 11 other critical daemons | PID check | 2 min | Slack #ops | 2 |
Public endpoint health checks (curl-based):
| Endpoint | Check Command | Alert Channel | Consecutive Failures |
|---|---|---|---|
| alai.no | curl -sf https://alai.no \| grep 'ALAI Holding' | Slack #ops | 2 |
| lumiscare.alai.no | curl -sf https://lumiscare.alai.no \| grep 'LumisCare' | Slack #ops | 2 |
| getdrop.no | curl -sfL https://getdrop.no \| grep 'Send penger' | Slack #ops | 2 |
| docs.alai.no | curl -sf https://docs.alai.no \| grep 'BookStack' | Slack #ops | 2 |
| vault.basicconsulting.no | curl -sf https://vault.basicconsulting.no \| grep 'Vaultwarden' | Slack #ops | 2 |
| sign.basicconsulting.no | curl -s -o /dev/null -w '%{http_code}' https://sign.basicconsulting.no \| grep -E '^(200\|301\|302)' | Slack #ops | 2 |
Why 2 consecutive failures: Prevents false alerts from transient network hiccups. With a 2-minute check interval, the second failed cycle confirms the outage, so an alert fires after at most about 4 min of downtime (2 min × 2 cycles).
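The consecutive-failure rule amounts to a per-service streak counter. This is a minimal sketch with a hypothetical `FailureCounter` class. It reports every check at or above the threshold as alert-worthy and leaves repeat suppression to the 15-minute cooldown, which is an assumption about how the two mechanisms are layered.

```python
from collections import defaultdict


class FailureCounter:
    """Track consecutive failed checks per service."""

    def __init__(self) -> None:
        self._streak: dict[str, int] = defaultdict(int)

    def record(self, service: str, ok: bool, required: int = 2) -> bool:
        """Record one check result; return True once `required` failures have occurred in a row."""
        if ok:
            self._streak[service] = 0  # any success resets the streak
            return False
        self._streak[service] += 1
        return self._streak[service] >= required
```

A service such as com.john.bookstack-sync would call `record(..., required=3)` to match its row in the table.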
Alert message format from ops-watchdog:
[SENTINEL ALERT] P<severity>: <Service Name> <Status>
Details: <exit code / curl error / PID missing>
Last check: 2026-04-19 10:24:15 CET
Config: ~/system/config/ops-watchdog.json
Runbook: ~/system/docs/runbooks/incident-response-playbook.md
HiveMind Event Bus (Agent-Generated Alerts)
| Event Kind | Subscriber | Alert Channel | Latency | Example |
|---|---|---|---|---|
| kind=alert | hivemind-alert-relay.js | Slack #ops | ≤ 10s | Security scan fail, cost budget exceeded, agent loop detected |
| kind=intake | hivemind-intake-mc-bridge.js | MC auto-task + john-daily-digest | ≤ 30s | Email classified as support request, contact form submission |
| kind=briefing | john-daily-digest | Slack DM to Alem (08:00 CET) | Daily | Overnight summary, weekly report |
| kind=research | (no subscriber yet) | None | n/a | Agent research outcomes stored but not alerted |
| kind=skill_proposal | john-daily-digest | Slack DM to Alem (08:00 CET) | Daily | New skill added to library, cookbook entry |
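The event-kind table maps directly onto a lookup structure. This sketch is written in Python for consistency with the other examples on this page, although the subscribers themselves are .js scripts. `SUBSCRIBERS` and `destination_for` are hypothetical names; the entries are copied from the table.

```python
from typing import Optional

# Event kind -> (subscriber, destination). None means nothing subscribes yet.
SUBSCRIBERS: dict[str, Optional[tuple[str, str]]] = {
    "alert": ("hivemind-alert-relay.js", "Slack #ops"),
    "intake": ("hivemind-intake-mc-bridge.js", "MC auto-task + john-daily-digest"),
    "briefing": ("john-daily-digest", "Slack DM to Alem"),
    "research": None,
    "skill_proposal": ("john-daily-digest", "Slack DM to Alem"),
}


def destination_for(kind: str) -> Optional[str]:
    """Return where an event of this kind is delivered, or None if it is only stored."""
    entry = SUBSCRIBERS.get(kind)
    return entry[1] if entry else None
```

Unknown kinds return None as well, so a newly introduced kind is stored without alerting until a subscriber is registered.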
Alert message format from HiveMind:
[HM-ALERT] agent: <agent_name> | kind: <event_kind>
Message: <alert_message>
Timestamp: 2026-04-19T08:24:15Z
Evidence: <evidence_uri> (if available)
Action: <suggested_action> (if available)
Example:
[HM-ALERT] agent: securion-sentinel | kind: alert
Message: Public GitHub repo detected with potential ALAI internal code
Timestamp: 2026-04-19T08:24:15Z
Evidence: https://github.com/unknown-user/alai-leaked-repo
Action: Verify if repo is authorized OR issue DMCA takedown
TLS Cert Expiry Monitor (Scheduled Daily)
| Domain | Check Schedule | Alert Thresholds | Channel | Escalation |
|---|---|---|---|---|
| alai.no | Daily 07:00 CET | 30d, 14d, 7d before expiry | Slack #ops | P0 at 7d (outage imminent) |
| lumiscare.alai.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (demo, not production) |
| getdrop.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P0 (revenue app) |
| docs.alai.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (internal wiki) |
| vault.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P0 (email intake depends on it) |
| sign.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (signing tool) |
| bilko-demo.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P2 (demo, not used) |
| snowit.ba | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P2 (currently NXDOMAIN) |
| 2 internal domains | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 |
Alert message format:
[CERT-EXPIRY] P<severity>: <domain> expires in <days> days
Expiry date: <YYYY-MM-DD HH:MM:SS UTC>
Current cert issuer: <Let's Encrypt / Cloudflare / etc>
Action: Verify auto-renewal OR renew manually
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#8-tls-cert-expiry
Why daily schedule: Cert renewal is not urgent (30d, 14d, 7d warnings). Checking every 2 min (like ops-watchdog) is wasteful. Daily check at 07:00 CET catches issues before business hours.
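The threshold logic behind the 30d/14d/7d warnings can be sketched as two pure functions. The names are hypothetical, and fetching the certificate's notAfter timestamp is left out. Whether a warning repeats every day below a threshold or fires once per threshold is not stated on this page, so the sketch only reports which threshold has been reached.

```python
from datetime import datetime, timezone
from typing import Optional

THRESHOLDS_DAYS = (7, 14, 30)  # ascending, so the tightest threshold matches first


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining until the certificate's notAfter timestamp."""
    return (not_after - now).days


def crossed_threshold(days_left: int) -> Optional[int]:
    """Return the tightest warning threshold reached (7, 14 or 30), or None if above 30 days."""
    for limit in THRESHOLDS_DAYS:
        if days_left <= limit:
            return limit
    return None
```

A certificate with 12 days left reports the 14-day threshold; the caller would then map threshold and domain to P0/P1/P2 using the table above.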
Alert Cooldowns & Rate Limiting
Goal: Prevent alert fatigue from flapping services or repeated failures.
Same-Alert Cooldown (15 min)
If same alert (same service + same failure type) fires within 15 min of previous alert → suppressed.
Example:
- 10:00: "lumiscare.alai.no 502" → alert fires
- 10:02: "lumiscare.alai.no 502" → suppressed (within 15 min)
- 10:04: "lumiscare.alai.no 502" → suppressed
- 10:16: "lumiscare.alai.no 502" → new alert fires (15 min elapsed)
Exception: If service recovers and then fails again → new alert immediately (no cooldown on recovery → failure transition).
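The cooldown and its recovery exception can be expressed as a small state holder keyed by service and failure type. This is a sketch under assumptions: `Cooldown`, `should_fire`, and `mark_recovered` are hypothetical names, and timestamps are passed in explicitly so the behaviour is easy to test.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=15)


class Cooldown:
    """Suppress repeats of the same (service, failure type) alert inside the cooldown window."""

    def __init__(self) -> None:
        self._last_fired: dict[tuple[str, str], datetime] = {}

    def should_fire(self, service: str, failure_type: str, now: datetime) -> bool:
        """Return True and start a new window, or False if still inside the previous one."""
        key = (service, failure_type)
        last = self._last_fired.get(key)
        if last is not None and now - last < COOLDOWN:
            return False
        self._last_fired[key] = now
        return True

    def mark_recovered(self, service: str) -> None:
        """Clear the service's windows so the next failure alerts immediately."""
        for key in [k for k in self._last_fired if k[0] == service]:
            del self._last_fired[key]
```

Replaying the 10:00 / 10:02 / 10:04 / 10:16 example through `should_fire` yields fire, suppress, suppress, fire.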
Repeated-Alert Escalation (5 alerts in 1 hour)
If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem in Slack.
Example:
- 10:00, 10:10, 10:20, 10:30, 10:40, 10:50: "lumiscare.alai.no 502" (6 alerts in 50 min; the service recovers briefly between failures, so the 15-min cooldown does not suppress them)
- 10:50 alert message: "[ESCALATED] P0: lumiscare.alai.no 502 — REPEATED FAILURE (6th alert in 50 min). Tagging @Alem — may need architectural fix or Azure migration."
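The escalation rule is a sliding-window count of fired alerts per service. This is a minimal sketch with a hypothetical `EscalationTracker` class. It counts only alerts that actually fired (after cooldown suppression) and applies the 1-hour window strictly; both are assumptions about the intended behaviour.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
LIMIT = 5  # escalate when a service exceeds this many alerts inside WINDOW


class EscalationTracker:
    """Count fired alerts per service over a rolling one-hour window."""

    def __init__(self) -> None:
        self._fired: dict[str, deque] = defaultdict(deque)

    def record_alert(self, service: str, now: datetime) -> bool:
        """Record a fired alert; return True if it should be escalated to P0."""
        times = self._fired[service]
        times.append(now)
        # Drop alerts that have fallen out of the rolling window.
        while times and now - times[0] > WINDOW:
            times.popleft()
        return len(times) > LIMIT
```

Under this strict reading, alerts spaced 15 minutes or more apart by the cooldown never exceed five in one hour, so the threshold is reached when a flapping service bypasses the cooldown through recovery and failure transitions.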
Email Fallback Rate Limit (10 emails per hour)
If Slack bot is dead and email fallback is active, limit emails to 10 per hour (prevents inbox flood during incident storm).
After 10 emails in 1 hour:
- Next email: "[SENTINEL FALLBACK RATE LIMIT] 10 alerts sent in last hour. Further alerts buffered to ~/system/logs/alert-dlq.jsonl. Check Slack bot status."
Buffered alerts replay to Slack #ops once bot is restored.
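The 10-per-hour limit can be sketched as a rolling-window limiter that returns a decision for each alert. `EmailRateLimiter` and its return values are hypothetical, and it is assumed that the alert which triggers the rate-limit notice is itself buffered to the DLQ along with later ones.

```python
from collections import deque
from datetime import datetime, timedelta


class EmailRateLimiter:
    """Allow at most `limit` fallback emails per rolling hour; buffer the rest."""

    def __init__(self, limit: int = 10) -> None:
        self.limit = limit
        self._sent: deque = deque()
        self._notice_sent = False

    def decide(self, now: datetime) -> str:
        """Return 'send', 'send_limit_notice' (once per burst), or 'buffer'."""
        # Forget emails that are an hour old or older.
        while self._sent and now - self._sent[0] >= timedelta(hours=1):
            self._sent.popleft()
        if len(self._sent) < self.limit:
            self._sent.append(now)
            self._notice_sent = False
            return "send"
        if not self._notice_sent:
            self._notice_sent = True
            return "send_limit_notice"
        return "buffer"
```

On `send_limit_notice` the caller would email the rate-limit message and write the alert to ~/system/logs/alert-dlq.jsonl; on `buffer` it would only write to the DLQ.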
Which Daemons Send to Which Channel
| Daemon | Alert Channel | Reason |
|---|---|---|
| com.john.ops-watchdog | Slack #ops OR Email (if slack-bot dead) | Core monitoring daemon — alerts about OTHER services |
| com.john.slack-bot | Email only | Can't alert itself via Slack (messenger is dead), must use email fallback |
| com.alai.john-daily-digest | Slack DM to Alem | Summary layer, not real-time alert |
| com.john.email-agent | Slack #ops | P1 if down (email intake stops) |
| com.john.cloudflared | Slack #ops + Email | P0 SPOF (26 hostnames die if tunnel down) |
| com.john.mc-dashboard | Slack #ops | P1 (internal dashboard, not customer-facing) |
| com.john.bookstack-sync | Slack #ops | P2 (wiki sync can lag 10 min without issue) |
| com.alai.cert-expiry-monitor | Slack #ops | P1 at 30d/14d, P0 at 7d |
| com.john.event-dispatcher | Slack #ops | P1 (HiveMind event bus — if dead, agent alerts stop flowing) |
| com.john.hook-daemon | Slack #ops | P0 (security enforcement — ZAKON NULA anti-hallucination gate) |
| 7 other daemons | Slack #ops | P1 or P2 depending on criticality |
Adding New Alert Routes
Step 1: Identify alert source (BetterStack, ops-watchdog, HiveMind, or new daemon).
Step 2: Determine severity (P0/P1/P2/P3) based on:
- P0: Customer-facing outage OR security breach OR revenue impact
- P1: Internal service down OR data pipeline broken
- P2: Non-urgent issue OR daily summary
- P3: Debug/trace logs only
Step 3: Choose channel:
- P0/P1: Slack #ops (primary) + Email fallback (if critical SPOF like cloudflared)
- P2: john-daily-digest (08:00 CET summary)
- P3: Log file only (no human alert)
Step 4: Update routing config:
- BetterStack: Add monitor via dashboard (https://betterstack.com/uptime) → reuses existing Slack webhook
- ops-watchdog: Edit ~/system/config/ops-watchdog.json → add to critical_services or custom_health_checks
- HiveMind: Register subscriber script (example: ~/system/tools/hivemind-<kind>-relay.js) → write to events table with kind=<new_kind>
Step 5: Test alert delivery:
- Trigger synthetic failure (stop service, disable monitor, post fake HiveMind event)
- Verify alert arrives in target channel within SLA (60s for P0, 3 min for P1)
- Verify cooldown works (trigger same alert within 15 min → should suppress)
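The latency check in Step 5 can be automated by comparing the trigger time of a synthetic failure with the time the alert appeared in the target channel. This is a sketch: `within_sla` is a hypothetical helper, and how the two timestamps are captured (for example from the alert text and the Slack message metadata) is left to the test harness.

```python
from datetime import datetime, timedelta

# Latency targets from the severity table: 60s for P0, 3 min for P1.
SLA = {"P0": timedelta(seconds=60), "P1": timedelta(minutes=3)}


def within_sla(severity: str, triggered_at: datetime, delivered_at: datetime) -> bool:
    """True when a synthetic alert reached its channel inside the latency target."""
    return delivered_at - triggered_at <= SLA[severity]
```

A P0 test alert triggered at 10:00:00 and seen in #ops at 10:00:45 passes; one seen at 10:01:01 fails.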
Cross-References
- SENTINEL Reliability Sprint Overview — System architecture after sprint
- Incident Response Playbook — "When X alert fires, do Y"
- BetterStack Setup Recipe — Step-by-step guide to add monitors
- Email Intake Revival — Vault ETIMEDOUT fix + DLQ replay
Evidence:
- ~/system/evidence/sentinel-triage-2026-04-19/ (Phase 0 triage: 30-day incident ledger)
- ~/system/config/ops-watchdog.json (critical_services + custom_health_checks + email_fallback config)
Alert routing maintained by: Skillforge (SENTINEL Task 7)
Last updated: 2026-04-19 (after SENTINEL sprint validation)
Next review: After Phase 2 sprint (secondary tunnel + 12 dead daemons fixed)