Alert Routing — Channel Mapping & SLA

Purpose: Who gets what alert, on which channel, with what latency target.
Audience: John (orchestrator), Alem (CEO), ops-watchdog daemon, agent builders
Last updated: 2026-04-19 (SENTINEL Sprint Task 7)

Alert Severity Table

Severity	Channel	Target Audience	Latency SLA	Retry Logic	Example Alerts
P0 Critical	Slack #ops + Email fallback	Alem + John	≤ 60s	Retry 3x, then email	Public surface 502 (≥2 cycles), Cloudflared tunnel down, Slack bot SIGKILL
P1 High	Slack #ops	John (on-call)	≤ 3 min	Retry 2x, then DLQ	Daemon exit nonzero (critical services), Email DLQ > 5 entries, TLS cert expiry ≤ 7 days
P2 Info	john-daily-digest	Alem (morning review)	Daily 08:00 CET	Buffered, no retry	New skill proposal, briefing summary, task ready for review, HiveMind research
P3 Debug	Log file only	Archive (no human)	n/a	Write once	Heartbeat OK pulses, ops-watchdog check passed, daemon start/stop routine

Key principle: P0/P1 alerts MUST be actionable. If no action is needed → downgrade to P2 or P3. Alert fatigue = blind system.
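
The severity table can also be read as a lookup structure. The following is a minimal sketch in Python, not the ops-watchdog implementation: the Route class, the ROUTES dict, and route_for are hypothetical names, and only the values (channels, SLAs, retry counts) come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    channel: str
    latency_sla_s: Optional[int]  # None = no real-time SLA (digest or log only)
    max_retries: int
    on_exhausted: Optional[str]   # where the alert goes after the last retry


# Values copied from the severity table; the structure itself is an assumption.
ROUTES = {
    "P0": Route("slack:#ops", 60, 3, "email"),
    "P1": Route("slack:#ops", 180, 2, "dlq"),
    "P2": Route("john-daily-digest", None, 0, None),
    "P3": Route("logfile", None, 0, None),
}


def route_for(severity: str, actionable: bool = True) -> Route:
    """Return the route for a severity; non-actionable P0/P1 is downgraded to P2."""
    if severity in ("P0", "P1") and not actionable:
        severity = "P2"
    return ROUTES[severity]
```

Passing actionable=False models the key principle: a P0/P1 alert that needs no action is routed like a P2.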

Channel Routing Details

1. Slack #ops (Primary Technical Channel)

Purpose: Real-time technical alerts requiring immediate investigation or fix.

Routing sources:

BetterStack webhook (external monitors: 7 public endpoints)
ops-watchdog Slack bot (internal monitors: 17 critical daemons + 6 public endpoints)
HiveMind kind=alert subscriber (agent-generated alerts, e.g., security scan fail, cost budget exceeded)

Target audience:

John (orchestrator) — primary on-call
FlowForge/CodeCraft agents (when delegated)
Alem (if John offline or P0 escalation)

Message format:

[SOURCE] Severity: Alert Title
Details: <brief description>
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/<name>.md (if available)

Example:

[SENTINEL ALERT] P0: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502
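
A formatter for the message layout above might look like the sketch below. The function name format_alert and its parameters are assumptions made for illustration; the "CET" label is appended as plain text, following the examples in this document, and no timezone conversion is attempted.

```python
from datetime import datetime
from typing import Optional


def format_alert(source: str, severity: str, title: str, details: str,
                 when: datetime, runbook: Optional[str] = None) -> str:
    """Render an alert in the Slack #ops layout; the Runbook line is optional."""
    lines = [
        f"[{source}] {severity}: {title}",
        f"Details: {details}",
        f"Time: {when:%Y-%m-%d %H:%M:%S} CET",
    ]
    if runbook:
        lines.append(f"Runbook: {runbook}")
    return "\n".join(lines)
```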

Cooldown: A repeat of the same alert within 15 min of the last delivered alert is suppressed (prevents spam from flapping services). Once 15 min have elapsed, the next occurrence fires a new alert.

Alert count limit: If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem ("Repeated failure — may need architectural fix").

2. Email Fallback ([email protected])

Purpose: Backup channel when Slack #ops is unreachable OR Slack bot (com.john.slack-bot) is dead.

Trigger conditions:

Slack bot PID = "-" (daemon stopped/killed) — ops-watchdog detects this via critical_services check
Slack API returns 5xx error for 3 consecutive attempts (Slack platform outage)
Ops-watchdog config email_fallback.enabled = true (set after SENTINEL sprint)

Routing logic (ops-watchdog):

# Pseudocode from ops-watchdog daemon:
if slack_bot_dead() or slack_api_unavailable():
    send_email(
        to="[email protected]",
        subject="[SENTINEL FALLBACK] Alert: <title>",
        body="Slack #ops unreachable. Alert details:\n<full alert message>"
    )

Latency SLA: ≤ 90s from alert trigger (Slack primary is 60s, email fallback is 30s slower due to SMTP handshake).
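
The trigger conditions can be expressed as one predicate. This is a sketch under stated assumptions: the function name and arguments are hypothetical, and it treats email_fallback.enabled as a precondition for the other two triggers, which is an interpretation of the list above rather than something the ops-watchdog source confirms.

```python
from typing import Sequence


def should_use_email_fallback(slack_bot_pid: str,
                              recent_slack_statuses: Sequence[int],
                              fallback_enabled: bool) -> bool:
    """Decide whether an alert should go out by email instead of Slack."""
    if not fallback_enabled:
        return False
    # A PID of "-" means the daemon is not running.
    bot_dead = slack_bot_pid.strip() == "-"
    # Slack outage: the three most recent API attempts all returned 5xx.
    last_three = list(recent_slack_statuses)[-3:]
    slack_outage = len(last_three) == 3 and all(500 <= s <= 599 for s in last_three)
    return bot_dead or slack_outage
```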

Example email:

Subject: [SENTINEL FALLBACK] P0: PUBLIC SURFACE DOWN: alai.no
Body:
Slack #ops is unreachable (slack-bot SIGKILL'd).
Alert routed via email fallback.

Alert: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Source: ops-watchdog (internal monitor)

Action: Restart cloudflared tunnel + verify origin
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502

— ops-watchdog daemon

Alert handling in fallback mode: All P0 alerts go to email. P1 alerts are buffered to the DLQ (~/system/logs/alert-dlq.jsonl) until the Slack bot is restored. After restoration, the DLQ replays to Slack #ops.
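
Buffering to a JSONL dead-letter queue and replaying it later can be sketched as follows. The helper names buffer_to_dlq and replay_dlq and the send callback are hypothetical; the one-JSON-object-per-line layout is inferred from the .jsonl extension, and the real DLQ schema may differ.

```python
import json
from pathlib import Path
from typing import Callable


def buffer_to_dlq(dlq_path: Path, alert: dict) -> None:
    """Append one alert as a single JSON line."""
    dlq_path.parent.mkdir(parents=True, exist_ok=True)
    with dlq_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(alert) + "\n")


def replay_dlq(dlq_path: Path, send: Callable[[dict], bool]) -> int:
    """Replay buffered alerts in order; keep any that `send` fails to deliver."""
    if not dlq_path.exists():
        return 0
    lines = [ln for ln in dlq_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    remaining, delivered = [], 0
    for line in lines:
        if send(json.loads(line)):
            delivered += 1
        else:
            remaining.append(line)
    # Rewrite the queue with only the undelivered entries.
    dlq_path.write_text("".join(ln + "\n" for ln in remaining), encoding="utf-8")
    return delivered
```

Alerts that still fail during replay stay in the file, so a second replay attempt picks them up again.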

3. john-daily-digest (Summary Layer)

Purpose: Non-urgent aggregated summary for Alem's morning review (08:00 CET).

Content sources:

Overnight task completions (mc.js done events)
New intake (email, Slack, contact forms) classified as P2
HiveMind kind=briefing events (daily briefing, weekly summaries)
Cost tracker daily spend (if > 50 USD)
Skill proposals from agents (new cookbook entries, tool upgrades)

Delivery:

Channel: Slack DM to Alem (private message, NOT #ops or #exec)
Schedule: Daily 08:00 CET (launchd StartCalendarInterval)
Format: Markdown summary, max 500 words, grouped by category

Example digest:

Good morning Alem. Overnight summary (2026-04-18 18:00 → 2026-04-19 08:00 CET):

## Tasks Completed (3)
- #8370: SENTINEL T2a BetterStack 6 monitors (FlowForge) — 6 new public endpoint monitors added
- #8371: SENTINEL T3b Email DLQ (CodeCraft) — Dead-letter queue operational, tested with vault failure
- #8372: SENTINEL T7 BookStack 3 runbooks (Skillforge) — Documentation complete

## New Intake (2)
- Email from prospect (forwarded by John): Inquiring about AI consulting for retail chain (200 stores)
- Slack message from partner: Entur wants to schedule follow-up call for RAG demo

## Cost Alert (1)
- Yesterday spend: 67 USD (above 50 USD threshold)
  - Azure VM: 22 USD
  - OpenAI API: 38 USD (Opus 4 tasks)
  - Vercel: 7 USD

## System Health
- Dead daemons: 12 (down from 16 yesterday — 4 fixed)
- Public surfaces: 6 of 7 green (snowit.ba still NXDOMAIN)
- Email DLQ: 1 entry (from validation test)

Next: Phase 2 sprint planning (secondary tunnel + 12 dead daemons).

— John

Opt-out: Alem can pause digest via node ~/system/tools/mc.js config set digest.enabled false (not recommended — digest is designed to prevent morning blind spots).

Alert Routing by Source

BetterStack (External SaaS Monitors)

Monitor Name	URL	Check Interval	Alert Channel	Escalation
Drop Landing Page	https://getdrop.no	3 min	Slack #ops	P0 if down > 10 min (revenue event)
alai.no Landing	https://alai.no	3 min	Slack #ops	P0 if down > 5 min
lumiscare.alai.no	https://lumiscare.alai.no	3 min	Slack #ops	P1 (demo, not production)
BookStack docs	https://docs.~~basicconsulting.~~alai.no	3 min	Slack #ops	P1 (internal wiki, not customer-facing)
Vaultwarden vault	https://vault.basicconsulting.no	3 min	Slack #ops	P0 (email intake depends on it)
Documenso sign	https://sign.basicconsulting.no	3 min	Slack #ops	P1 (signing, not immediate revenue)
snowit.ba	https://snowit.ba	3 min	Slack #ops	P2 (currently NXDOMAIN, owner decision pending)

Alert message format from BetterStack:

[BetterStack] Monitor DOWN: <Monitor Name>
URL: <URL>
Status: <HTTP status code or DNS error>
Duration: <time since first failure>
Dashboard: https://betterstack.com/uptime

Confirmation period: BetterStack has a built-in "confirmation period" (30s). It waits 30s after the first failure before firing an alert, which filters out transient network blips.

ops-watchdog (Internal Daemon Monitors)

Service	Check Type	Interval	Alert Channel	Consecutive Failures Required
com.john.slack-bot	PID check	2 min	Email fallback (if dead, can't alert via Slack)	2
com.john.cloudflared	PID + exit status	2 min	Slack #ops + Email	2
com.john.ops-watchdog	Self-health check	2 min	Email (watchdog can't alert itself via Slack if dead)	2
com.john.email-agent	PID + last-success file age	2 min	Slack #ops	2
com.john.mc-dashboard	PID + curl :3030	2 min	Slack #ops	2
com.john.bookstack-sync	PID	2 min	Slack #ops	3 (less critical)
11 other critical daemons	PID check	2 min	Slack #ops	2

Public endpoint health checks (curl-based):

Endpoint	Check Command	Alert Channel	Consecutive Failures
alai.no	`curl -sf https://alai.no | grep 'ALAI Holding'`	Slack #ops	2
lumiscare.alai.no	`curl -sf https://lumiscare.alai.no | grep 'LumisCare'`	Slack #ops	2
getdrop.no	`curl -sfL https://getdrop.no | grep 'Send penger'`	Slack #ops	2
docs.~~basicconsulting.~~alai.no	`curl -sf https://docs.basicconsulting.alai.no | grep 'BookStack'`	Slack #ops	2
vault.basicconsulting.no	`curl -sf https://vault.basicconsulting.no | grep 'Vaultwarden'`	Slack #ops	2
sign.basicconsulting.no	`curl -s -o /dev/null -w '%{http_code}' https://sign.basicconsulting.no | grep -E '^(200|301|302)'`	Slack #ops	2

Why 2 consecutive failures: Prevents false alerts from transient network hiccups. With a 2 min check interval, requiring 2 consecutive failures means an outage is alerted between 2 and 4 min after it begins (at most 2 min × 2 cycles).
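
The consecutive-failure rule can be sketched as a small per-service counter. ConsecutiveFailureGate is a hypothetical name and not taken from the ops-watchdog code; the default of 2 and the override of 3 for com.john.bookstack-sync come from the tables above.

```python
from collections import defaultdict


class ConsecutiveFailureGate:
    """Fire an alert only after N failed checks in a row for the same service."""

    def __init__(self, default_required: int = 2):
        self.default_required = default_required
        self.required = {}                 # per-service overrides
        self.failures = defaultdict(int)   # current failure streak per service

    def record(self, service: str, ok: bool) -> bool:
        """Record one check result; return True exactly when the alert should fire."""
        if ok:
            self.failures[service] = 0     # any success resets the streak
            return False
        self.failures[service] += 1
        needed = self.required.get(service, self.default_required)
        return self.failures[service] == needed
```

Because the comparison is an equality, the gate fires once per failure streak; repeat notifications for an ongoing outage are left to the cooldown logic.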

Alert message format from ops-watchdog:

[SENTINEL ALERT] P<severity>: <Service Name> <Status>
Details: <exit code / curl error / PID missing>
Last check: 2026-04-19 10:24:15 CET
Config: ~/system/config/ops-watchdog.json
Runbook: ~/system/docs/runbooks/incident-response-playbook.md

HiveMind Event Bus (Agent-Generated Alerts)

Event Kind	Subscriber	Alert Channel	Latency	Example
`kind=alert`	hivemind-alert-relay.js	Slack #ops	≤ 10s	Security scan fail, cost budget exceeded, agent loop detected
`kind=intake`	hivemind-intake-mc-bridge.js	MC auto-task + john-daily-digest	≤ 30s	Email classified as support request, contact form submission
`kind=briefing`	john-daily-digest	Slack DM to Alem (08:00 CET)	Daily	Overnight summary, weekly report
`kind=research`	(no subscriber yet)	None	n/a	Agent research outcomes stored but not alerted
`kind=skill_proposal`	john-daily-digest	Slack DM to Alem (08:00 CET)	Daily	New skill added to library, cookbook entry

Alert message format from HiveMind:

[HM-ALERT] agent: <agent_name> | kind: <event_kind>
Message: <alert_message>
Timestamp: 2026-04-19T08:24:15Z
Evidence: <evidence_uri> (if available)
Action: <suggested_action> (if available)

Example:

[HM-ALERT] agent: securion-sentinel | kind: alert
Message: Public GitHub repo detected with potential ALAI internal code
Timestamp: 2026-04-19T08:24:15Z
Evidence: https://github.com/unknown-user/alai-leaked-repo
Action: Verify if repo is authorized OR issue DMCA takedown
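
As an illustration of the HiveMind format, the sketch below renders an event dict into the [HM-ALERT] layout and keeps the kind-to-subscriber mapping as data. The dict keys (agent, kind, message, timestamp, evidence, action) and both names HIVEMIND_ROUTES and format_hm_alert are assumptions; the actual event schema is not shown in this document.

```python
# kind -> (subscriber, channel), copied from the table above.
HIVEMIND_ROUTES = {
    "alert": ("hivemind-alert-relay.js", "Slack #ops"),
    "intake": ("hivemind-intake-mc-bridge.js", "MC auto-task + john-daily-digest"),
    "briefing": ("john-daily-digest", "Slack DM to Alem"),
    "skill_proposal": ("john-daily-digest", "Slack DM to Alem"),
    "research": (None, None),  # no subscriber yet: stored, not alerted
}


def format_hm_alert(event: dict) -> str:
    """Render a HiveMind event; Evidence and Action lines appear only if present."""
    lines = [
        f"[HM-ALERT] agent: {event['agent']} | kind: {event['kind']}",
        f"Message: {event['message']}",
        f"Timestamp: {event['timestamp']}",
    ]
    for label, key in (("Evidence", "evidence"), ("Action", "action")):
        if event.get(key):
            lines.append(f"{label}: {event[key]}")
    return "\n".join(lines)
```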

TLS Cert Expiry Monitor (Scheduled Daily)

Domain	Check Schedule	Alert Thresholds	Channel	Escalation
alai.no	Daily 07:00 CET	30d, 14d, 7d before expiry	Slack #ops	P0 at 7d (outage imminent)
lumiscare.alai.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P1 (demo, not production)
getdrop.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P0 (revenue app)
docs.~~basicconsulting.~~alai.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P1 (internal wiki)
vault.basicconsulting.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P0 (email intake depends on it)
sign.basicconsulting.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P1 (signing tool)
bilko-demo.basicconsulting.no	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P2 (demo, not used)
snowit.ba	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P2 (currently NXDOMAIN)
2 internal domains	Daily 07:00 CET	30d, 14d, 7d	Slack #ops	P1

Alert message format:

[CERT-EXPIRY] P<severity>: <domain> expires in <days> days
Expiry date: <YYYY-MM-DD HH:MM:SS UTC>
Current cert issuer: <Let's Encrypt / Cloudflare / etc>
Action: Verify auto-renewal OR renew manually
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#8-tls-cert-expiry

Why daily schedule: Cert renewal is not urgent (30d, 14d, 7d warnings). Checking every 2 min (like ops-watchdog) is wasteful. Daily check at 07:00 CET catches issues before business hours.
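
The threshold check itself is simple date arithmetic. The sketch below assumes the expiry date is available in the text format certificates use for notAfter (for example "Jun  1 12:00:00 2026 GMT"), which the standard library's ssl.cert_time_to_seconds can parse. The function names are hypothetical, and whether the monitor alerts once per threshold or on every day inside it is not specified here.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional

THRESHOLDS_DAYS = (7, 14, 30)  # ascending, from the table above


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate's notAfter."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expiry - now).days


def threshold_crossed(days_left: int) -> Optional[int]:
    """Return the tightest threshold the cert is inside, or None if > 30 days remain."""
    for threshold in THRESHOLDS_DAYS:
        if days_left <= threshold:
            return threshold
    return None
```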

Alert Cooldowns & Rate Limiting

Goal: Prevent alert fatigue from flapping services or repeated failures.

Same-Alert Cooldown (15 min)

If same alert (same service + same failure type) fires within 15 min of previous alert → suppressed.

Example:

10:00: "lumiscare.alai.no 502" → alert fires
10:02: "lumiscare.alai.no 502" → suppressed (within 15 min)
10:04: "lumiscare.alai.no 502" → suppressed
10:16: "lumiscare.alai.no 502" → new alert fires (15 min elapsed)

Exception: If service recovers and then fails again → new alert immediately (no cooldown on recovery → failure transition).
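
A minimal sketch of the cooldown, including the recovery exception, is shown below. CooldownFilter and its method names are hypothetical. It measures the 15 min from the last delivered alert, which matches the 10:00 / 10:16 example above.

```python
from datetime import datetime, timedelta


class CooldownFilter:
    """Suppress repeats of the same (service, failure type) within the cooldown."""

    def __init__(self, cooldown: timedelta = timedelta(minutes=15)):
        self.cooldown = cooldown
        self.last_fired = {}  # (service, failure_type) -> time of last delivered alert

    def should_fire(self, service: str, failure_type: str, now: datetime) -> bool:
        key = (service, failure_type)
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window: suppress
        self.last_fired[key] = now
        return True

    def mark_recovered(self, service: str) -> None:
        """Clear the cooldown so a failure after recovery alerts immediately."""
        for key in [k for k in self.last_fired if k[0] == service]:
            del self.last_fired[key]
```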

Repeated-Alert Escalation (> 5 alerts in 1 hour)

If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem in Slack.

Example:

10:00, 10:09, 10:18, 10:27, 10:36, 10:45: "lumiscare.alai.no 502" (flapping service: each recovery → failure transition bypasses the cooldown, giving 6 alerts in 45 min)
10:45 alert message: "[ESCALATED] P0: lumiscare.alai.no 502 — REPEATED FAILURE (6th alert in 45 min). Tagging @Alem — may need architectural fix or Azure migration."

Email Fallback Rate Limit (10 emails per hour)

If Slack bot is dead and email fallback is active, limit emails to 10 per hour (prevents inbox flood during incident storm).

After 10 emails in 1 hour:

Next email: "[SENTINEL FALLBACK RATE LIMIT] 10 alerts sent in last hour. Further alerts buffered to ~/system/logs/alert-dlq.jsonl. Check Slack bot status."

Buffered alerts replay to Slack #ops once bot is restored.
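
The email rate limit can be sketched the same way. EmailRateLimiter and the three return values are hypothetical. The sketch assumes that the alert which triggers the rate-limit notice is itself buffered, and that the notice is sent once per saturated period.

```python
from collections import deque
from datetime import datetime, timedelta


class EmailRateLimiter:
    """Allow at most `limit` fallback emails per rolling `window`."""

    def __init__(self, limit: int = 10, window: timedelta = timedelta(hours=1)):
        self.limit = limit
        self.window = window
        self.sent = deque()        # timestamps of emails sent inside the window
        self.notice_sent = False   # True once the rate-limit notice has gone out

    def decide(self, now: datetime) -> str:
        """Return 'send', 'send_rate_limit_notice' (and buffer the alert), or 'buffer'."""
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            self.notice_sent = False
            return "send"
        if not self.notice_sent:
            self.notice_sent = True
            return "send_rate_limit_notice"
        return "buffer"
```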

Which Daemons Send to Which Channel

Daemon	Alert Channel	Reason
com.john.ops-watchdog	Slack #ops OR Email (if slack-bot dead)	Core monitoring daemon — alerts about OTHER services
com.john.slack-bot	Email only	Can't alert itself via Slack (messenger is dead), must use email fallback
com.alai.john-daily-digest	Slack DM to Alem	Summary layer, not real-time alert
com.john.email-agent	Slack #ops	P1 if down (email intake stops)
com.john.cloudflared	Slack #ops + Email	P0 SPOF (26 hostnames die if tunnel down)
com.john.mc-dashboard	Slack #ops	P1 (internal dashboard, not customer-facing)
com.john.bookstack-sync	Slack #ops	P2 (wiki sync can lag 10 min without issue)
com.alai.cert-expiry-monitor	Slack #ops	P1 at 30d/14d, P0 at 7d
com.john.event-dispatcher	Slack #ops	P1 (HiveMind event bus — if dead, agent alerts stop flowing)
com.john.hook-daemon	Slack #ops	P0 (security enforcement — ZAKON NULA anti-hallucination gate)
7 other daemons	Slack #ops	P1 or P2 depending on criticality

Adding New Alert Routes

Step 1: Identify alert source (BetterStack, ops-watchdog, HiveMind, or new daemon).

Step 2: Determine severity (P0/P1/P2/P3) based on:

P0: Customer-facing outage OR security breach OR revenue impact
P1: Internal service down OR data pipeline broken
P2: Non-urgent issue OR daily summary
P3: Debug/trace logs only

Step 3: Choose channel:

P0/P1: Slack #ops (primary) + Email fallback (if critical SPOF like cloudflared)
P2: john-daily-digest (08:00 CET summary)
P3: Log file only (no human alert)

Step 4: Update routing config:

BetterStack: Add monitor via dashboard (https://betterstack.com/uptime) → reuses existing Slack webhook
ops-watchdog: Edit ~/system/config/ops-watchdog.json → add to critical_services or custom_health_checks
HiveMind: Register subscriber script (example: ~/system/tools/hivemind-<kind>-relay.js) → write to events table with kind=<new_kind>

Step 5: Test alert delivery:

Trigger synthetic failure (stop service, disable monitor, post fake HiveMind event)
Verify alert arrives in target channel within SLA (60s for P0, 3 min for P1)
Verify cooldown works (trigger same alert within 15 min → should suppress)

Cross-References

SENTINEL Reliability Sprint Overview — System architecture after sprint
Incident Response Playbook — "When X alert fires, do Y"
BetterStack Setup Recipe — Step-by-step guide to add monitors
Email Intake Revival — Vault ETIMEDOUT fix + DLQ replay

Evidence:

~/system/evidence/sentinel-triage-2026-04-19/ (Phase 0 triage: 30-day incident ledger)
~/system/config/ops-watchdog.json (critical_services + custom_health_checks + email_fallback config)

Alert routing maintained by: Skillforge (SENTINEL Task 7)
Last updated: 2026-04-19 (after SENTINEL sprint validation)
Next review: After Phase 2 sprint (secondary tunnel + 12 dead daemons fixed)