Ollama Fleet Architecture
Ollama Fleet Architecture
Overview
ALAI operates a two-node Ollama fleet: ANVIL (local dev Mac) and FORGE (Ubuntu 22.04 GPU workstation). ANVIL handles triage workloads (email, TLDR, quick classification), FORGE handles heavy inference (32B+ models, RAG pipelines).
ANVIL Ollama Configuration
Capacity Limits
- MAX_LOADED_MODELS: 2 (prevents RAM exhaustion)
- KEEP_ALIVE: 30s (default for on-demand models)
- Hardware: M1 Pro, 32GB RAM, 5GB reserved for triage model
LaunchAgent: com.alai.ollama-serve-v2
Label: com.alai.ollama-serve-v2
Plist: ~/Library/LaunchAgents/com.alai.ollama-serve-v2.plist
Port: 11434
Environment:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_KEEP_ALIVE=30s
Triage Preload Pattern
MC #8477 — Prevent qwen2.5-coder:32b (23GB) from blocking email/TLDR triage.
Strategy
Preload llama3.1:8b with keep_alive=-1 (indefinite) so it's always resident for fast triage operations. 5GB footprint.
LaunchAgent: com.john.ollama-triage-preload
Label: com.john.ollama-triage-preload
Script: ~/system/tools/ollama-triage-preload.sh
Trigger: RunAtLoad + StartInterval 300s (every 5 min)
Log: ~/system/logs/ollama-triage-preload-stdout.log
Script Logic (ollama-triage-preload.sh)
- Check if
llama3.1:8bis already loaded via/api/ps - If not loaded, send minimal prompt with
keep_alive=-1 - Log success/skip
curl -sf -X POST "$OLLAMA_URL/api/generate" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"llama3.1:8b\",
\"prompt\": \"ready\",
\"stream\": false,
\"keep_alive\": -1,
\"options\": {
\"num_predict\": 1
}
}"
Model Tier System
| Tier | Model | Size | Use Case | Keep Alive | Node |
|---|---|---|---|---|---|
| Triage | llama3.1:8b | 5GB | Email classification, TLDR summarization, quick routing | -1 (indefinite) | ANVIL |
| Heavy | qwen2.5-coder:32b | 23GB | Code generation, architecture review, complex reasoning | 30s (on-demand) | ANVIL |
| Primary | devstral:24b | ~15GB | Agent orchestration, planning, context routing | 300s | FORGE |
FORGE Failover
Consumers (email-agent.js, tldr-briefing.js, YouTube daemon) can set FORGE_FIRST=0 environment variable to skip FORGE and use ANVIL directly.
# Force ANVIL-only
export FORGE_FIRST=0
node ~/system/daemons/youtube-daemon.js
Default behavior: Try FORGE (10.0.0.2:11434), fallback to ANVIL (localhost:11434) on timeout.
Vault-Keeper Watchdog (MC #8471 — PENDING)
Monitors ~/system/.cache/vault-keeper-heartbeat file. If stale > 1 hour, SENTINEL alerts.
Implementation
LaunchAgent: com.john.vault-keeper-watchdog
Interval: 600s (10 min)
Script: ~/system/daemons/vault-keeper-watchdog.sh
Alert: Slack #sentinel-alerts
Logic
- Read heartbeat file timestamp
- Compare with current time
- If > 3600s, send SENTINEL alert with vault-keeper logs
YouTube Daemon Lesson (MC #8472)
Log redirection corruption: tee + subshell arithmetic capture caused output mangling.
Anti-Pattern
# WRONG — tee inside $() breaks arithmetic
NEW_COUNT=$(node ~/system/daemons/youtube-processor.js | tee -a "$LOG")
Correct Pattern
# RIGHT — separate logging stream
node ~/system/daemons/youtube-processor.js >> "$LOG" 2>&1
LaunchAgent Duplication
Never use both KeepAlive and StartInterval in same plist. StartInterval triggers even if process is still running, causing overlap.
# WRONG
<key>KeepAlive</key>
<true/>
<key>StartInterval</key>
<integer>3600</integer>
# RIGHT (pick one)
<key>StartInterval</key>
<integer>3600</integer>
Fleet Monitoring
ANVIL
curl http://localhost:11434/api/ps
curl http://localhost:11434/api/tags
tail -f ~/system/logs/ollama-triage-preload-stdout.log
FORGE
curl http://10.0.0.2:11434/api/ps
ssh forge "tail -f /var/log/ollama.log"
Mission Control
node ~/system/tools/mc.js list --tag ollama
node ~/system/tools/cost-tracker.js summary --service ollama
Generated by Skillforge | ALAI, 2026
No comments to display
No comments to display