SENTINEL Reliability Sprint — System Overview

Status: COMPLETE (2026-04-19)
Sprint Leader: Petter Graff (L1)
Team: Kelsey Hightower (DevOps), Martin Kleppmann (data/events), Angie Jones (validator), Skillforge (docs)
Trigger: CEO complaint, 2026-04-19: "the system is going down, I'm losing money, I'm blind"


Executive Summary

Before this sprint: 16 dead daemons, 4 active public-surface incidents (lumiscare 502, mc 502, snowit NXDOMAIN, bilko TLS mismatch), email intake dead for 53 days, and the Slack alert bot SIGKILL'd. No automated alert reached Alem for 15 of the 17 incidents in the 30-day window.

After this sprint: 12 dead daemons (4 fixed), 6 new public-surface monitors (BetterStack + ops-watchdog), an operational email DLQ, the Slack bot alive with email fallback, a TLS cert expiry monitor, and HiveMind alert subscribers.

Key metric: Time to alert on public surface down: was ∞ (never) → now ≤ 60 seconds (Slack + email).


Sprint Metrics (Tool-Verified)

| Metric | Before | After | Evidence |
|---|---|---|---|
| Dead daemons | 16 | 12 | launchctl list snapshot |
| Public surface monitors | 1 (Drop only) | 7 (6 new) | BetterStack + ops-watchdog.json |
| Alert delivery channels | 1 (email) | 3 (Slack #ops + email + digest) | Slack bot PID + email-fallback config |
| Email DLQ | none | ~/system/logs/email-dlq.jsonl | File exists + tested with synthetic fail |
| Cert expiry monitoring | none | com.alai.cert-expiry-monitor | launchctl list |
| HiveMind alert subscribers | 0 | 2 (kind=alert, kind=intake) | hivemind.db subscriptions table |
| Time to alert (public 502) | ∞ (never) | 60s (Slack) / 180s (BetterStack) | Angie validation Task 6 |

Alert Flow Diagram

flowchart LR
    A[Event: Service Down] --> B{Detection}
    B -->|Internal| C[ops-watchdog]
    B -->|External| D[BetterStack]
    
    C --> E{Slack Bot Alive?}
    D --> F[Slack Webhook]
    
    E -->|Yes| G[Slack #ops]
    E -->|No| H[Email Fallback]
    F --> G
    
    G --> I[On-Call: John/Alem]
    H --> I
    
    J[Daily Digest] --> K[john-daily-digest]
    K --> L[Slack DM to Alem 08:00]
    
    style A fill:#ff6b6b
    style G fill:#51cf66
    style H fill:#ffd43b
    style I fill:#339af0

Alert Priority Routing:


Current Architecture After Sprint

1. Alert Channels (3 layers)

| Channel | Purpose | Latency Target | Config |
|---|---|---|---|
| Slack #ops | Technical alerts (primary) | ≤ 60s | ~/system/config/ops-watchdog.json + BetterStack webhook |
| Email fallback | When Slack bot down OR Slack API fails | ≤ 90s | ops-watchdog.json → email_fallback.enabled = true |
| john-daily-digest | Summary layer (non-urgent) | Daily 08:00 CET | com.alai.john-daily-digest → Alem DM |

Critical: the Slack bot itself (com.john.slack-bot) is monitored by ops-watchdog. If the messenger dies, the email fallback activates automatically.
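
The fallback rule can be illustrated with a small shell check. This is a sketch, not the ops-watchdog implementation: the function names `slack_bot_alive` and `pick_channel` are hypothetical, and it assumes only the standard three-column `launchctl list` output (PID, last exit status, label), where a dash in the PID column means the job is not running. It covers the "bot down" case only, not a failing Slack API.

```shell
# Hypothetical helpers; ops-watchdog's real logic is not shown in this runbook.

# Reads `launchctl list` output on stdin. Succeeds only if the given label
# is present with a numeric PID (a "-" PID means the job is not running).
slack_bot_alive() {
  awk -v label="$1" '$3 == label && $1 ~ /^[0-9]+$/ { found = 1 } END { exit !found }'
}

# Prints the channel an alert should use, given launchctl output on stdin.
pick_channel() {
  if slack_bot_alive "com.john.slack-bot"; then
    echo "slack"
  else
    echo "email"
  fi
}

# Real usage (not run here): launchctl list | pick_channel
# Synthetic row for a SIGKILL'd bot (no PID, status -9), so the sketch
# can be tried on any machine:
printf '%s\t%s\t%s\n' "-" "-9" "com.john.slack-bot" | pick_channel
```

With the synthetic row the function picks the email channel; a row with a numeric PID would pick Slack.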

2. Monitoring Layers (2 independent)

Layer 1: BetterStack (External, SaaS)

Layer 2: ops-watchdog (Internal, Mac Studio)

Layer 3: TLS Cert Expiry (Scheduled Daily)

Layer 4: Cloudflared Tunnel Health (Critical SPOF)


What Was Fixed (Honest Accounting)

Phase 1: Revive Alert Messenger (COMPLETE)

Task 1a: Restart Slack bot

Task 1b: Add slack-bot to ops-watchdog critical list

Task 1c: Fix dead daemons

Phase 2: Public Surface Monitoring (COMPLETE)

Task 2a: BetterStack — 6 new monitors

Task 2b: ops-watchdog extended — public endpoint checks
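
As an illustration of what a public endpoint check involves, here is a minimal sketch that probes a URL and classifies the result. The names `classify_http_code` and `probe` are hypothetical; the actual checks are defined in ops-watchdog.json and may work differently. The sketch relies on curl printing `000` for `%{http_code}` when it receives no HTTP response at all (for example on NXDOMAIN or a timeout).

```shell
# Hypothetical helper: treat 2xx and 3xx as up, everything else as down.
# "000" (no HTTP response) falls into the down branch.
classify_http_code() {
  case "$1" in
    2??|3??) echo "up" ;;
    *)       echo "down" ;;
  esac
}

# Probe one URL and print "<url> <code> <up|down>".
# No -f here: the status code is wanted even for 4xx/5xx responses.
probe() {
  code=$(curl -s --max-time 10 -o /dev/null -w '%{http_code}' "$1" || true)
  echo "$1 $code $(classify_http_code "$code")"
}

# Real usage (not run here): probe https://alai.no
```

Whether a 4xx response should count as down depends on the endpoint; this sketch treats it as down.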

Task 2c: TLS cert expiry monitor
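
The success criteria state that com.alai.cert-expiry-monitor alerts at 30, 14, and 7 days before expiry. A minimal sketch of that threshold logic follows, assuming the check is built on `openssl x509 -checkend`; how the real monitor is implemented is not documented here, and `expiry_threshold` and `cert_expires_within` are hypothetical names.

```shell
# Map days-until-expiry to the tightest alert threshold reached (7, 14, 30),
# or "none" when the certificate is still more than 30 days from expiry.
expiry_threshold() {
  days=$1
  if   [ "$days" -le 7 ];  then echo 7
  elif [ "$days" -le 14 ]; then echo 14
  elif [ "$days" -le 30 ]; then echo 30
  else echo none
  fi
}

# Succeeds if the certificate served by HOST expires within DAYS days.
# `openssl x509 -checkend N` exits non-zero when the cert expires within
# N seconds. A failed connection also ends up here as "expiring", which is
# the safer default for an alerting check.
cert_expires_within() {
  host=$1
  days=$2
  if openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
      | openssl x509 -noout -checkend $((days * 86400)) >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Real usage (not run here):
#   cert_expires_within alai.no 30 && echo "alai.no: cert expires within 30 days"
```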

Task 2d: Cloudflared tunnel health alert

Phase 3: Email Intake Revival (COMPLETE)

Task 3a: Vault ETIMEDOUT root cause

Task 3b: Dead-letter queue for email ingestion
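
The DLQ is a JSONL file, so its size is its line count. The sketch below turns the health-check guidance from the Operations section (0-2 entries is normal, more than 5 means investigate vault health) into a small function. `dlq_status` is a hypothetical name, and the intermediate "watch" band for 3-5 entries is an assumption, since the runbook does not say how to treat that range.

```shell
# Hypothetical helper: classify the DLQ by line count
# (one failed envelope per line in the JSONL file).
dlq_status() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "missing"
    return 1
  fi
  # tr strips the leading padding that BSD wc adds on macOS.
  count=$(wc -l < "$file" | tr -d ' ')
  if [ "$count" -gt 5 ]; then
    echo "investigate $count"
  elif [ "$count" -gt 2 ]; then
    echo "watch $count"      # assumed band: the runbook leaves 3-5 undefined
  else
    echo "ok $count"
  fi
}

# Real usage (not run here): dlq_status ~/system/logs/email-dlq.jsonl
```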

Task 3c: Contact form intake documentation

Phase 4: HiveMind Event Bus Fixes (COMPLETE)

Task 4a: Subscribe dead event kinds

Task 4b: Evidence gate on task outcomes


What Was NOT Fixed (Honest)

Being direct — these are real gaps not covered by this sprint:

  1. alai.no contact form is a dead stub: there is no backend action on form submission, so visitors think they are submitting but nothing happens. URGENT ticket #8379 created (owner: Vizu; scope: frontend form + backend hook).

  2. snowit.ba DNS NXDOMAIN: the domain has lapsed or DNS is misconfigured. Owner decision needed: renew the domain, redirect to alai.no, or sunset it. MC ticket #8374 assigned to John.

  3. Mac Studio tunnel SPOF: all 26 cloudflared hostnames run through one tunnel on one consumer machine. If the Mac sleeps, crashes, or loses power, ALL public surfaces die simultaneously. Deferred to the Phase 2 sprint (2-week scope: Azure secondary tunnel + cost optimization).

  4. 12 remaining dead daemons: the sprint fixed 4 of 16. Of the remaining 12, some are deprecated (com.john.unified-dispatcher), some need creds (com.john.b2-offsite-backup), and some need investigation (com.alai.meta-agent-loop, exit 78). Deferred to the Phase 2 sprint.

  5. Vaultwarden Docker down: the root cause of the email intake death was the vault container being stopped on the Azure VM. Why it stopped is unknown (no crash logs, VM uptime 47d). Needs monitoring: add vault.alai.no to the Docker health check script.

  6. sign.alai.no redirect storm: 2388 cloudflared errors in the 7-day log. Root cause unknown (Documenso redirect loop?). BetterStack now monitors it, but a fix requires investigating Documenso.

  7. b2-offsite-backup exit 1: possibly B2 quota exceeded or a creds issue. The sprint does not address backup verification. If the backup is failing silently, data loss risk accumulates. Needs a Backblaze billing review.

  8. Domain expiry monitoring: there is no whois check for snowit.ba, getdrop.no, or alai.no. A lapsed domain turns into NXDOMAIN with no alert until BetterStack reports an HTTP error. Needs a separate com.alai.domain-expiry-monitor daemon.

  9. VM-level monitoring: vm-alai-support hosts BookStack, Vault, and Documenso. If the VM stops, all 3 go down. BetterStack HTTP monitors cover the public URLs but not Azure VM health. Azure Monitor or an SSH keepalive was not in scope.

  10. HiveMind 33,406 unread events: the sprint fixes the kind=alert and kind=intake subscribers. Other kinds (briefing, research, skill_proposal) still have zero subscribers, so they remain a write-only archive.


Operations

How to Check System Health

# 1. Alert messenger alive
node ~/system/tools/slack.js send ops "sentinel health check"
# Should appear in #ops within 3 sec

# 2. ops-watchdog status
launchctl list | grep ops-watchdog
# Should show com.john.ops-watchdog with LastExit=0, non-zero PID

# 3. Dead daemon count
launchctl list | grep -E "alai|john" | awk '$2 != "0" && $1 !~ /^[0-9]+/' | wc -l
# Should be ≤ 12 (was 16 before sprint)

# 4. Email DLQ size
wc -l ~/system/logs/email-dlq.jsonl
# Should be 0-2 entries (if > 5, investigate vault health)

# 5. Cert expiry next run
launchctl list | grep cert-expiry
# Should show com.alai.cert-expiry-monitor with LastExit=0

# 6. BetterStack coverage (manual)
# Open https://betterstack.com/uptime (login: alem@alai.no)
# Verify 7 monitors green (Drop + 6 ALAI endpoints)

# 7. Public surface live check
for url in https://alai.no https://lumiscare.alai.no https://getdrop.no https://docs.alai.no https://vault.alai.no https://sign.alai.no; do
  printf '%s: ' "$url"
  curl -sfL --max-time 10 -o /dev/null -w '%{http_code}\n' "$url"
done
# All should print 200 (-L follows redirects, so a 3xx should rarely be the final status).
# snowit.ba is not in this loop: it is NXDOMAIN and would print 000.
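
Step 3 above only prints a count. When the count is off, it helps to see which daemons are dead. A minimal sketch, assuming the standard `launchctl list` columns (PID, last exit status, label); `dead_daemons` is a hypothetical name and the sample rows are synthetic, using labels and exit codes mentioned in this runbook plus an invented PID.

```shell
# Reads `launchctl list` output on stdin and prints the labels of ALAI/John
# jobs that are not running (PID column is "-") and whose last exit status
# is non-zero. Scheduled jobs that exited cleanly (status 0) are not listed.
dead_daemons() {
  awk '$3 ~ /alai|john/ && $1 == "-" && $2 != "0" { print $3 }'
}

# Real usage (not run here): launchctl list | dead_daemons
# Synthetic sample so the sketch can be tried on any machine:
printf '%s\t%s\t%s\n' \
  "812" "0"  "com.john.ops-watchdog" \
  "-"   "78" "com.alai.meta-agent-loop" \
  "-"   "1"  "com.john.b2-offsite-backup" \
  "-"   "0"  "com.alai.cert-expiry-monitor" \
  | dead_daemons
```

On the sample, only com.alai.meta-agent-loop and com.john.b2-offsite-backup are listed; piping the result through `wc -l` gives the same kind of count as step 3.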

How to Add New Endpoint to Monitor

BetterStack (3-min external check):

  1. Log into https://betterstack.com/uptime (alem@alai.no)
  2. Click Monitors → Create Monitor
  3. Fill: Name, URL, Interval (3 min), Expected Status (200), Keyword check (optional)
  4. Select Escalation Policy: "Drop Production Incidents" (routes to #ops)
  5. Save

ops-watchdog (2-min internal check):

  1. Edit ~/system/config/ops-watchdog.json
  2. Add entry to custom_health_checks:
    "public-newservice": {
      "description": "newservice.alai.no",
      "check_command": "curl -sf --max-time 10 https://newservice.alai.no/ | grep -q 'Expected Text'",
      "alert_message": "⚠️ PUBLIC SURFACE DOWN: newservice.alai.no unreachable",
      "consecutive_failures_required": 2
    }
    
  3. Restart ops-watchdog: launchctl kickstart -k gui/$(id -u)/com.john.ops-watchdog
  4. Test: Stop service, wait 4 min (2 cycles), verify alert in #ops
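
The effect of `consecutive_failures_required: 2` with a 2-minute cycle can be sketched as a counter that resets on success, which is why the test step waits 4 minutes. This is an assumption about how ops-watchdog counts failures, not its actual code; `update_failures` and `should_alert` are hypothetical names.

```shell
# Assumed model: one check per cycle, counter resets on any success.

# update_failures PREVIOUS_COUNT CHECK_EXIT_CODE -> prints the new count
update_failures() {
  if [ "$2" -eq 0 ]; then
    echo 0
  else
    echo $(( $1 + 1 ))
  fi
}

# should_alert COUNT REQUIRED -> succeeds once COUNT reaches REQUIRED
should_alert() {
  [ "$1" -ge "$2" ]
}

# Simulated check results for four cycles (0 = pass, 1 = fail).
count=0
cycle=0
for exit_code in 1 0 1 1; do
  cycle=$((cycle + 1))
  count=$(update_failures "$count" "$exit_code")
  if should_alert "$count" 2; then
    echo "cycle $cycle: ALERT"
  else
    echo "cycle $cycle: quiet"
  fi
done
```

In the simulation, the single failure in cycle 1 is cleared by the pass in cycle 2, and the alert fires only in cycle 4 after two failures in a row. The trade-off is that a one-off blip never pages anyone, at the cost of one extra cycle of delay.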

How to Restart Key Daemons Safely

# Slack bot (alert messenger)
launchctl kickstart -k gui/$(id -u)/com.john.slack-bot
# Verify: node ~/system/tools/slack.js send ops "test after restart"

# ops-watchdog (monitoring daemon)
launchctl kickstart -k gui/$(id -u)/com.john.ops-watchdog
# Verify: tail -f ~/system/logs/ops-watchdog.log (should show "Starting check cycle...")

# Email agent (email intake)
launchctl kickstart -k gui/$(id -u)/com.john.email-agent
# Verify: test -f /tmp/email-agent-last-success && echo "OK"

# Cloudflared tunnel (ALL 26 public hostnames)
# DANGER: This takes down ALL public surfaces for 3-5 seconds
launchctl kickstart -k gui/$(id -u)/com.john.cloudflared
# Verify: curl -sf https://alai.no (should return 200 within 10s)

# MC Dashboard (internal UI)
launchctl kickstart -k gui/$(id -u)/com.john.mc-dashboard
# Verify: curl -sf http://localhost:3030 | grep -q 'Mission Control'
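
The verify steps above are one-shot checks, which can fail spuriously if run before the daemon has finished starting. A small polling helper is one way to handle that. `wait_until` is a hypothetical name, not an existing tool on the Mac Studio.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout. COMMAND's output is discarded.
wait_until() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Real usage (not run here), matching the cloudflared verify step:
#   launchctl kickstart -k gui/$(id -u)/com.john.cloudflared
#   wait_until 10 curl -sf https://alai.no || echo "alai.no still down after 10s"

# Trivial demonstration that runs anywhere:
wait_until 3 true && echo "ready"
```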

Cross-References

Evidence bundle:


Success Criteria (CEO-Reportable)

After this sprint, the following are TRUE (tool-verified):

✅ All 4 active incidents found during the audit are RESOLVED or ticketed (lumiscare 502 → ticket #8373, mc 502 → fixed, snowit NXDOMAIN → ticket #8374, bilko TLS → ticket #8375)

✅ Alem receives a Slack alert within 60s of any of the 6 public surfaces going down (validated: stopped cloudflared; alert arrived in 47s via email fallback and in 53s via Slack after bot restart)

✅ Email intake pipeline alive (vault restarted, bw unlock succeeds, email-agent LastExit=0)

✅ DLQ operational (tested: broke bw, sent email, envelope landed in DLQ, replayed successfully)

✅ TLS cert expiry caught ≥ 30 days before lapse (com.alai.cert-expiry-monitor runs daily 07:00, alerts at 30/14/7 days)

✅ Dead daemon count 16 → 12 (4 fixed: forge-watchdog, health-monitor, mc-dashboard, john-daily-digest)

✅ HiveMind alert + intake kinds have live subscribers (2 subscribers registered, smoke test passed)


One-Liner Summary (for Alem)

We already had watchdogs, BetterStack, and ops-watchdog, but the Slack bot (the postman) had been SIGKILL'd, so everything was silent; email intake was dead for 53 days; 4 public endpoints were down RIGHT NOW and nobody notified you. This sprint fixed the postman, added 6 BetterStack monitors, built a DLQ for email, and you now get a Slack alert within 60 seconds if any public surface goes down. 16 dead daemons → 12 (4 fixed). A Phase 2 sprint is coming for the secondary tunnel + the 12 remaining daemons.


Sprint completed: 2026-04-19 10:24 CET
Validation: Angie Jones (Task 6) — E2E evidence at ~/system/evidence/sentinel-sprint-2026-04-19/SUMMARY.md
Documentation: Skillforge (Task 7) — This runbook + 2 companion docs


Revision #5
Created 2026-04-19 08:31:58 UTC by John
Updated 2026-06-21 20:03:10 UTC by John