Alert Routing — Channel Mapping & SLA

Purpose: Who gets what alert, on which channel, with what latency target.
Audience: John (orchestrator), Alem (CEO), ops-watchdog daemon, agent builders
Last updated: 2026-04-19 (SENTINEL Sprint Task 7)


Alert Severity Table

| Severity | Channel | Target Audience | Latency SLA | Retry Logic | Example Alerts |
|---|---|---|---|---|---|
| P0 Critical | Slack #ops + Email fallback | Alem + John | ≤ 60s | Retry 3x, then email | Public surface 502 (≥2 cycles), Cloudflared tunnel down, Slack bot SIGKILL |
| P1 High | Slack #ops | John (on-call) | ≤ 3 min | Retry 2x, then DLQ | Daemon exit nonzero (critical services), Email DLQ > 5 entries, TLS cert expiry ≤ 7 days |
| P2 Info | john-daily-digest | Alem (morning review) | Daily 08:00 CET | Buffered, no retry | New skill proposal, briefing summary, task ready for review, HiveMind research |
| P3 Debug | Log file only | Archive (no human) | n/a | Write once | Heartbeat OK pulses, ops-watchdog check passed, daemon start/stop routine |

Key principle: P0/P1 alerts MUST be actionable. If no action is needed → downgrade to P2 or P3. Alert fatigue = blind system.
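
The severity table can also be read as a lookup structure. The following is a minimal sketch, not the actual ops-watchdog code: the `Route` class, the `ROUTES` dict, the `route_for` helper and the channel strings such as "slack:#ops" are hypothetical names chosen for illustration. Only the SLA values, retry counts and fallback targets come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    channel: str                  # hypothetical channel identifier
    audience: str
    latency_sla_s: Optional[int]  # None = no real-time SLA (digest or log only)
    retries: int
    on_exhausted: Optional[str]   # where the alert goes once retries run out


# Values copied from the severity table; the structure itself is an assumption.
ROUTES = {
    "P0": Route("slack:#ops", "Alem + John", 60, 3, "email"),
    "P1": Route("slack:#ops", "John (on-call)", 180, 2, "dlq"),
    "P2": Route("john-daily-digest", "Alem (morning review)", None, 0, None),
    "P3": Route("logfile", "archive (no human)", None, 0, None),
}


def route_for(severity: str) -> Route:
    """Look up the route for a severity label such as "P0"."""
    try:
        return ROUTES[severity.upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Keeping the table as data means a downgrade from P1 to P2 is a one-word change at the call site rather than a change to routing logic.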


Channel Routing Details

1. Slack #ops (Primary Technical Channel)

Purpose: Real-time technical alerts requiring immediate investigation or fix.

Routing sources:

  • BetterStack webhook (external monitors: 7 public endpoints)
  • ops-watchdog Slack bot (internal monitors: 17 critical daemons + 6 public endpoints)
  • HiveMind kind=alert subscriber (agent-generated alerts, e.g., security scan fail, cost budget exceeded)

Target audience:

  • John (orchestrator) — primary on-call
  • FlowForge/CodeCraft agents (when delegated)
  • Alem (if John offline or P0 escalation)

Message format:

[SOURCE] Severity: Alert Title
Details: <brief description>
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/<name>.md (if available)

Example:

[SENTINEL ALERT] P0: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502
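
A small formatter can produce the four-line layout above. This is an illustrative sketch: the function name `format_ops_alert` and its parameters are hypothetical, and it assumes the caller passes a timestamp that is already in local time, since the template labels every time as CET.

```python
from datetime import datetime
from typing import Optional


def format_ops_alert(source: str, severity: str, title: str, details: str,
                     when: datetime, runbook: Optional[str] = None) -> str:
    """Render an alert in the Slack #ops message format."""
    lines = [
        f"[{source}] {severity}: {title}",
        f"Details: {details}",
        # Assumption: `when` is already local time; no timezone conversion here.
        f"Time: {when:%Y-%m-%d %H:%M:%S} CET",
    ]
    if runbook:  # the Runbook line is optional ("if available")
        lines.append(f"Runbook: {runbook}")
    return "\n".join(lines)
```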

Cooldown: Same alert within 15 min = suppressed (prevents spam from flapping services). After 15 min silence, next occurrence fires new alert.

Alert count limit: If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem ("Repeated failure — may need architectural fix").


2. Email Fallback ([email protected])

Purpose: Backup channel when Slack #ops is unreachable OR Slack bot (com.john.slack-bot) is dead.

Trigger conditions:

  1. Slack bot PID = "-" (daemon stopped/killed). ops-watchdog detects this via the critical_services check.
  2. Slack API returns a 5xx error on 3 consecutive attempts (Slack platform outage).
  3. Precondition for both: ops-watchdog config has email_fallback.enabled = true (set after the SENTINEL sprint).

Routing logic (ops-watchdog):

# Pseudocode from ops-watchdog daemon:
if slack_bot_dead() or slack_api_unavailable():
    send_email(
        to="[email protected]",
        subject="[SENTINEL FALLBACK] Alert: <title>",
        body="Slack #ops unreachable. Alert details:\n<full alert message>"
    )
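
The decision behind the pseudocode can be sketched as a pure function. Everything named here is an assumption for illustration: `should_use_email_fallback` and its parameters are hypothetical, and the sketch assumes that email_fallback.enabled gates the other two conditions rather than acting as an independent trigger.

```python
from typing import Sequence


def should_use_email_fallback(fallback_enabled: bool, slack_bot_pid: str,
                              recent_slack_statuses: Sequence[int]) -> bool:
    """Decide whether an alert should leave via email instead of Slack."""
    if not fallback_enabled:
        return False
    # A PID of "-" means the daemon is not running.
    bot_dead = slack_bot_pid.strip() == "-"
    # Slack counts as unavailable after 3 consecutive 5xx responses.
    last_three = list(recent_slack_statuses)[-3:]
    api_down = len(last_three) == 3 and all(500 <= s <= 599 for s in last_three)
    return bot_dead or api_down
```

Keeping the decision separate from the actual send makes it easy to unit-test without a mail server.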

Latency SLA: ≤ 90s from alert trigger (Slack primary is 60s, email fallback is 30s slower due to SMTP handshake).

Example email:

Subject: [SENTINEL FALLBACK] P0: PUBLIC SURFACE DOWN: alai.no
Body:
Slack #ops is unreachable (slack-bot SIGKILL'd).
Alert routed via email fallback.

Alert: ⚠️ PUBLIC SURFACE DOWN: alai.no
Details: HTTP 502 — connection refused (detected 2 consecutive cycles)
Time: 2026-04-19 10:24:15 CET
Source: ops-watchdog (internal monitor)

Action: Restart cloudflared tunnel + verify origin
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#1-alai-no-502

— ops-watchdog daemon

Routing in fallback mode: All P0 alerts go to email. P1 alerts are buffered to the DLQ (~/system/logs/alert-dlq.jsonl) until the Slack bot is restored, and the DLQ then replays to Slack #ops.
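
Buffering and replay against a JSONL file might look like the sketch below. The function names `buffer_alert` and `replay_dlq`, the alert dict shape and the `send` callback are hypothetical; only the one-JSON-object-per-line file format is implied by the .jsonl path.

```python
import json
from pathlib import Path
from typing import Callable


def buffer_alert(dlq_path: Path, alert: dict) -> None:
    """Append one alert to the dead-letter queue as a single JSON line."""
    dlq_path.parent.mkdir(parents=True, exist_ok=True)
    with dlq_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(alert, ensure_ascii=False) + "\n")


def replay_dlq(dlq_path: Path, send: Callable[[dict], bool]) -> int:
    """Resend buffered alerts in order; keep the ones that still fail.

    `send` returns True on successful delivery. Returns the number delivered.
    """
    if not dlq_path.exists():
        return 0
    lines = dlq_path.read_text(encoding="utf-8").splitlines()
    pending = [json.loads(line) for line in lines if line.strip()]
    failed = [alert for alert in pending if not send(alert)]
    # Rewrite the file so only undelivered alerts remain.
    dlq_path.write_text(
        "".join(json.dumps(a, ensure_ascii=False) + "\n" for a in failed),
        encoding="utf-8",
    )
    return len(pending) - len(failed)
```

Usage would be along the lines of `replay_dlq(Path.home() / "system/logs/alert-dlq.jsonl", post_to_slack)`, where `post_to_slack` is whatever delivery function the daemon already has.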


3. john-daily-digest (Summary Layer)

Purpose: Non-urgent aggregated summary for Alem's morning review (08:00 CET).

Content sources:

  • Overnight task completions (mc.js done events)
  • New intake (email, Slack, contact forms) classified as P2
  • HiveMind kind=briefing events (daily briefing, weekly summaries)
  • Cost tracker daily spend (if > 50 USD)
  • Skill proposals from agents (new cookbook entries, tool upgrades)

Delivery:

  • Channel: Slack DM to Alem (private message, NOT #ops or #exec)
  • Schedule: Daily 08:00 CET (launchctl StartCalendarInterval)
  • Format: Markdown summary, max 500 words, grouped by category

Example digest:

Good morning Alem. Overnight summary (2026-04-18 18:00 → 2026-04-19 08:00 CET):

## Tasks Completed (3)
- #8370: SENTINEL T2a BetterStack 6 monitors (FlowForge) — 6 new public endpoint monitors added
- #8371: SENTINEL T3b Email DLQ (CodeCraft) — Dead-letter queue operational, tested with vault failure
- #8372: SENTINEL T7 BookStack 3 runbooks (Skillforge) — Documentation complete

## New Intake (2)
- Email from prospect (forwarded by John): Inquiring about AI consulting for retail chain (200 stores)
- Slack message from partner: Entur wants to schedule follow-up call for RAG demo

## Cost Alert (1)
- Yesterday spend: 67 USD (above 50 USD threshold)
  - Azure VM: 22 USD
  - OpenAI API: 38 USD (Opus 4 tasks)
  - Vercel: 7 USD

## System Health
- Dead daemons: 12 (down from 16 yesterday — 4 fixed)
- Public surfaces: 6 of 7 green (snowit.ba still NXDOMAIN)
- Email DLQ: 1 entry (from validation test)

Next: Phase 2 sprint planning (secondary tunnel + 12 dead daemons).

— John

Opt-out: Alem can pause digest via node ~/system/tools/mc.js config set digest.enabled false (not recommended — digest is designed to prevent morning blind spots).


Alert Routing by Source

BetterStack (External SaaS Monitors)

| Monitor Name | URL | Check Interval | Alert Channel | Escalation |
|---|---|---|---|---|
| Drop Landing Page | https://getdrop.no | 3 min | Slack #ops | P0 if down > 10 min (revenue event) |
| alai.no Landing | https://alai.no | 3 min | Slack #ops | P0 if down > 5 min |
| lumiscare.alai.no | https://lumiscare.alai.no | 3 min | Slack #ops | P1 (demo, not production) |
| BookStack docs | https://docs.basicconsulting.alai.no | 3 min | Slack #ops | P1 (internal wiki, not customer-facing) |
| Vaultwarden vault | https://vault.basicconsulting.no | 3 min | Slack #ops | P0 (email intake depends on it) |
| Documenso sign | https://sign.basicconsulting.no | 3 min | Slack #ops | P1 (signing, not immediate revenue) |
| snowit.ba | https://snowit.ba | 3 min | Slack #ops | P2 (currently NXDOMAIN, owner decision pending) |

Alert message format from BetterStack:

[BetterStack] Monitor DOWN: <Monitor Name>
URL: <URL>
Status: <HTTP status code or DNS error>
Duration: <time since first failure>
Dashboard: https://betterstack.com/uptime

Confirmation period: BetterStack's built-in "confirmation period" (30s) delays the alert until 30s after the first failure, which filters out transient network blips.


ops-watchdog (Internal Daemon Monitors)

| Service | Check Type | Interval | Alert Channel | Consecutive Failures Required |
|---|---|---|---|---|
| com.john.slack-bot | PID check | 2 min | Email fallback (if dead, can't alert via Slack) | 2 |
| com.john.cloudflared | PID + exit status | 2 min | Slack #ops + Email | 2 |
| com.john.ops-watchdog | Self-health check | 2 min | Email (watchdog can't alert itself via Slack if dead) | 2 |
| com.john.email-agent | PID + last-success file age | 2 min | Slack #ops | 2 |
| com.john.mc-dashboard | PID + curl :3030 | 2 min | Slack #ops | 2 |
| com.john.bookstack-sync | PID | 2 min | Slack #ops | 3 (less critical) |
| 11 other critical daemons | PID check | 2 min | Slack #ops | 2 |

Public endpoint health checks (curl-based):

Each endpoint below alerts to Slack #ops after 2 consecutive failures. Check command per endpoint:

  • alai.no: curl -sf https://alai.no | grep 'ALAI Holding'
  • lumiscare.alai.no: curl -sf https://lumiscare.alai.no | grep 'LumisCare'
  • getdrop.no: curl -sfL https://getdrop.no | grep 'Send penger'
  • docs.basicconsulting.alai.no: curl -sf https://docs.basicconsulting.alai.no | grep 'BookStack'
  • vault.basicconsulting.no: curl -sf https://vault.basicconsulting.no | grep 'Vaultwarden'
  • sign.basicconsulting.no: curl -s -o /dev/null -w '%{http_code}' https://sign.basicconsulting.no | grep -E '^(200|301|302)'

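Why 2 consecutive failures: Prevents false alerts from transient network hiccups. With a 2 min check interval, the alert fires on the second failed check, so an outage can run for up to 4 min (2 min × 2 cycles) before the alert is triggered. This detection delay is separate from the delivery latency SLA, which is measured from alert trigger.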
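
The consecutive-failure rule amounts to a per-service streak counter. The class below is a hypothetical sketch (`ConsecutiveFailureGate` is not a name from ops-watchdog), assuming each check reports a simple pass/fail result every cycle.

```python
from collections import defaultdict


class ConsecutiveFailureGate:
    """Fire only after N failed checks in a row for the same service."""

    def __init__(self, default_threshold: int = 2):
        self.default_threshold = default_threshold
        self.thresholds = {}             # per-service overrides, e.g. 3
        self._streak = defaultdict(int)  # current run of failures per service

    def record(self, service: str, ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if ok:
            self._streak[service] = 0    # any success resets the streak
            return False
        self._streak[service] += 1
        needed = self.thresholds.get(service, self.default_threshold)
        # Keeps returning True while the outage lasts; suppressing the
        # repeats is left to the separate 15 min cooldown.
        return self._streak[service] >= needed
```

A single passing check between two failures resets the streak, which is what filters out one-off network hiccups.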

Alert message format from ops-watchdog:

[SENTINEL ALERT] P<severity>: <Service Name> <Status>
Details: <exit code / curl error / PID missing>
Last check: 2026-04-19 10:24:15 CET
Config: ~/system/config/ops-watchdog.json
Runbook: ~/system/docs/runbooks/incident-response-playbook.md

HiveMind Event Bus (Agent-Generated Alerts)

| Event Kind | Subscriber | Alert Channel | Latency | Example |
|---|---|---|---|---|
| kind=alert | hivemind-alert-relay.js | Slack #ops | ≤ 10s | Security scan fail, cost budget exceeded, agent loop detected |
| kind=intake | hivemind-intake-mc-bridge.js | MC auto-task + john-daily-digest | ≤ 30s | Email classified as support request, contact form submission |
| kind=briefing | john-daily-digest | Slack DM to Alem (08:00 CET) | Daily | Overnight summary, weekly report |
| kind=research | (no subscriber yet) | None | n/a | Agent research outcomes stored but not alerted |
| kind=skill_proposal | john-daily-digest | Slack DM to Alem (08:00 CET) | Daily | New skill added to library, cookbook entry |
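
Dispatch by event kind can be sketched as a dictionary lookup. The `SUBSCRIBERS` mapping and `route_event` function are hypothetical, and the event is assumed to be a dict carrying a `kind` key; the subscriber and channel names are taken from the table.

```python
from typing import Optional, Tuple

# kind -> (subscriber, alert channel), copied from the table above.
SUBSCRIBERS = {
    "alert": ("hivemind-alert-relay.js", "Slack #ops"),
    "intake": ("hivemind-intake-mc-bridge.js", "MC auto-task + john-daily-digest"),
    "briefing": ("john-daily-digest", "Slack DM to Alem"),
    "skill_proposal": ("john-daily-digest", "Slack DM to Alem"),
    # "research" has no subscriber yet, so it is deliberately absent.
}


def route_event(event: dict) -> Optional[Tuple[str, str]]:
    """Return (subscriber, channel) for an event, or None if nothing subscribes."""
    return SUBSCRIBERS.get(event.get("kind"))
```

Returning None for unsubscribed kinds mirrors the kind=research row: the event is stored but nobody is alerted.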

Alert message format from HiveMind:

[HM-ALERT] agent: <agent_name> | kind: <event_kind>
Message: <alert_message>
Timestamp: 2026-04-19T08:24:15Z
Evidence: <evidence_uri> (if available)
Action: <suggested_action> (if available)

Example:

[HM-ALERT] agent: securion-sentinel | kind: alert
Message: Public GitHub repo detected with potential ALAI internal code
Timestamp: 2026-04-19T08:24:15Z
Evidence: https://github.com/unknown-user/alai-leaked-repo
Action: Verify if repo is authorized OR issue DMCA takedown

TLS Cert Expiry Monitor (Scheduled Daily)

| Domain | Check Schedule | Alert Thresholds | Channel | Escalation |
|---|---|---|---|---|
| alai.no | Daily 07:00 CET | 30d, 14d, 7d before expiry | Slack #ops | P0 at 7d (outage imminent) |
| lumiscare.alai.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (demo, not production) |
| getdrop.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P0 (revenue app) |
| docs.basicconsulting.alai.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (internal wiki) |
| vault.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P0 (email intake depends on it) |
| sign.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 (signing tool) |
| bilko-demo.basicconsulting.no | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P2 (demo, not used) |
| snowit.ba | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P2 (currently NXDOMAIN) |
| 2 internal domains | Daily 07:00 CET | 30d, 14d, 7d | Slack #ops | P1 |

Alert message format:

[CERT-EXPIRY] P<severity>: <domain> expires in <days> days
Expiry date: <YYYY-MM-DD HH:MM:SS UTC>
Current cert issuer: <Let's Encrypt / Cloudflare / etc>
Action: Verify auto-renewal OR renew manually
Runbook: ~/system/docs/runbooks/incident-response-playbook.md#8-tls-cert-expiry

Why daily schedule: Cert renewal is not urgent (30d, 14d, 7d warnings). Checking every 2 min (like ops-watchdog) is wasteful. Daily check at 07:00 CET catches issues before business hours.
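
The threshold logic of the daily check can be sketched without any network access. The helper names `days_until_expiry` and `crossed_threshold` are hypothetical; the sketch assumes the expiry date arrives in the `notAfter` string format that Python's `ssl` module returns from `getpeercert()`, and it does not track which warnings were already sent.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional

THRESHOLDS_DAYS = (30, 14, 7)  # warning points from the table above


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the cert's notAfter time.

    `not_after` looks like "May  3 06:00:00 2026 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).days


def crossed_threshold(days_left: int) -> Optional[int]:
    """Return the tightest warning threshold already reached, or None."""
    reached = [t for t in THRESHOLDS_DAYS if days_left <= t]
    return min(reached) if reached else None
```

A real monitor would fetch `notAfter` over a TLS connection to each domain and map the returned threshold to P1 or P0 according to the escalation column.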


Alert Cooldowns & Rate Limiting

Goal: Prevent alert fatigue from flapping services or repeated failures.

Same-Alert Cooldown (15 min)

If same alert (same service + same failure type) fires within 15 min of previous alert → suppressed.

Example:

  • 10:00: "lumiscare.alai.no 502" → alert fires
  • 10:02: "lumiscare.alai.no 502" → suppressed (within 15 min)
  • 10:04: "lumiscare.alai.no 502" → suppressed
  • 10:16: "lumiscare.alai.no 502" → new alert fires (15 min elapsed)

Exception: If service recovers and then fails again → new alert immediately (no cooldown on recovery → failure transition).
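
The cooldown rule and its recovery exception can be sketched as follows. `CooldownFilter` and its method names are hypothetical; the sketch assumes alerts are keyed by (service, failure type) and that the monitor reports recoveries explicitly.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=15)


class CooldownFilter:
    """Suppress repeats of the same alert inside the cooldown window."""

    def __init__(self):
        self._last_fired = {}  # (service, failure_type) -> time of last sent alert

    def should_fire(self, service: str, failure_type: str, now: datetime) -> bool:
        key = (service, failure_type)
        last = self._last_fired.get(key)
        if last is not None and now - last < COOLDOWN:
            return False               # still inside the 15 min window
        self._last_fired[key] = now
        return True

    def mark_recovered(self, service: str) -> None:
        """Forget cooldown state so the next failure alerts immediately."""
        for key in [k for k in self._last_fired if k[0] == service]:
            del self._last_fired[key]
```

Feeding it the 10:00, 10:02, 10:04 and 10:16 timestamps from the example yields fire, suppress, suppress, fire.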

Repeated-Alert Escalation (5 alerts in 1 hour)

If same service fires > 5 alerts in 1 hour → escalate to P0 + tag Alem in Slack.

Example:

  • 10:00, 10:10, 10:20, 10:30, 10:40, 10:50: "lumiscare.alai.no 502", with the service recovering between failures so the 15 min cooldown does not apply (6 alerts in 50 min)
  • 10:50 alert message: "[ESCALATED] P0: lumiscare.alai.no 502 — REPEATED FAILURE (6th alert in 50 min). Tagging @Alem — may need architectural fix or Azure migration."
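
Counting alerts per service over a sliding one-hour window is one way to implement the escalation rule. This is a sketch under assumptions: `EscalationTracker` is a hypothetical name, and it counts only alerts that were actually sent (after cooldown suppression).

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
MAX_ALERTS = 5  # escalate when a service exceeds this many alerts per window


class EscalationTracker:
    def __init__(self):
        self._fired = defaultdict(deque)  # service -> times of sent alerts

    def record(self, service: str, now: datetime) -> bool:
        """Record a sent alert; return True if it should be escalated to P0."""
        times = self._fired[service]
        times.append(now)
        # Drop alerts that have aged out of the one-hour window.
        while times and now - times[0] > WINDOW:
            times.popleft()
        return len(times) > MAX_ALERTS
```

With this counting rule, alerts spaced a full cooldown apart never exceed 5 per hour, so escalation is reached mainly by flapping services whose recoveries reset the cooldown.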

Email Fallback Rate Limit (10 emails per hour)

If Slack bot is dead and email fallback is active, limit emails to 10 per hour (prevents inbox flood during incident storm).

After 10 emails in 1 hour:

  • Next email: "[SENTINEL FALLBACK RATE LIMIT] 10 alerts sent in last hour. Further alerts buffered to ~/system/logs/alert-dlq.jsonl. Check Slack bot status."

Buffered alerts replay to Slack #ops once bot is restored.
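
The email rate limit can be sketched as a sliding-window counter that emits the rate-limit notice once and then tells the caller to buffer. `EmailRateLimiter` and the three return values are hypothetical; the sketch assumes the alert that triggers the notice is itself buffered to the DLQ.

```python
from collections import deque
from datetime import datetime, timedelta


class EmailRateLimiter:
    def __init__(self, limit: int = 10, window: timedelta = timedelta(hours=1)):
        self.limit = limit
        self.window = window
        self._sent = deque()     # send times of emails inside the window
        self._notified = False   # has the rate-limit notice gone out?

    def decide(self, now: datetime) -> str:
        """Return "send", "notify_limit" (send notice, buffer alert) or "buffer"."""
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            self._sent.append(now)
            self._notified = False
            return "send"
        if not self._notified:
            self._notified = True
            return "notify_limit"
        return "buffer"
```

Once older emails age out of the window, `decide` returns "send" again without any explicit reset.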


Which Daemons Send to Which Channel

| Daemon | Alert Channel | Reason |
|---|---|---|
| com.john.ops-watchdog | Slack #ops OR Email (if slack-bot dead) | Core monitoring daemon; alerts about OTHER services |
| com.john.slack-bot | Email only | Can't alert itself via Slack (messenger is dead), must use email fallback |
| com.alai.john-daily-digest | Slack DM to Alem | Summary layer, not real-time alert |
| com.john.email-agent | Slack #ops | P1 if down (email intake stops) |
| com.john.cloudflared | Slack #ops + Email | P0 SPOF (26 hostnames die if tunnel down) |
| com.john.mc-dashboard | Slack #ops | P1 (internal dashboard, not customer-facing) |
| com.john.bookstack-sync | Slack #ops | P2 (wiki sync can lag 10 min without issue) |
| com.alai.cert-expiry-monitor | Slack #ops | P1 at 30d/14d, P0 at 7d |
| com.john.event-dispatcher | Slack #ops | P1 (HiveMind event bus; if dead, agent alerts stop flowing) |
| com.john.hook-daemon | Slack #ops | P0 (security enforcement: ZAKON NULA anti-hallucination gate) |
| 7 other daemons | Slack #ops | P1 or P2 depending on criticality |

Adding New Alert Routes

Step 1: Identify alert source (BetterStack, ops-watchdog, HiveMind, or new daemon).

Step 2: Determine severity (P0/P1/P2/P3) based on:

  • P0: Customer-facing outage OR security breach OR revenue impact
  • P1: Internal service down OR data pipeline broken
  • P2: Non-urgent issue OR daily summary
  • P3: Debug/trace logs only

Step 3: Choose channel:

  • P0/P1: Slack #ops (primary) + Email fallback (if critical SPOF like cloudflared)
  • P2: john-daily-digest (08:00 CET summary)
  • P3: Log file only (no human alert)

Step 4: Update routing config:

  • BetterStack: Add monitor via dashboard (https://betterstack.com/uptime) → reuses existing Slack webhook
  • ops-watchdog: Edit ~/system/config/ops-watchdog.json → add to critical_services or custom_health_checks
  • HiveMind: Register subscriber script (example: ~/system/tools/hivemind-<kind>-relay.js) → write to events table with kind=<new_kind>

Step 5: Test alert delivery:

  • Trigger synthetic failure (stop service, disable monitor, post fake HiveMind event)
  • Verify alert arrives in target channel within SLA (60s for P0, 3 min for P1)
  • Verify cooldown works (trigger same alert within 15 min → should suppress)

Cross-References

Evidence:

  • ~/system/evidence/sentinel-triage-2026-04-19/ (Phase 0 triage: 30-day incident ledger)
  • ~/system/config/ops-watchdog.json (critical_services + custom_health_checks + email_fallback config)

Alert routing maintained by: Skillforge (SENTINEL Task 7)
Last updated: 2026-04-19 (after SENTINEL sprint validation)
Next review: After Phase 2 sprint (secondary tunnel + 12 dead daemons fixed)