LightRAG Stabilization Runbook — 2026-05-08
Genesis
On 2026-05-08 at 14:00, Kelsey Hightower reported LightRAG returning 502 errors. By 19:05 the service had degraded to complete timeout (000). Root cause: MainThread synchronous list comprehension in lightrag/lightrag.py:872 (apipeline_process_enqueue_documents) iterating over 125,341-record JsonDocStatusStorage on the asyncio event loop. A single POST to /documents/text triggered full file rewrite + pipeline iteration over 121K pending docs → CPU pegged at 100% → /health unreachable. The issue was compounded by running sbnb/lightrag:latest amd64-only image under Rosetta on Apple Silicon, incurring 2-3× performance tax.
Six-Step Fix Applied
S1: Disable Runaway Ingest Agents
Stopped LaunchAgents: com.alai.lightrag-outbox-ingest, com.alai.lightrag-migrate-pump, com.alai.lightrag-watchdog. Kept: keepwarm, backup, monitor.
S2: Prune Pending Queue
Stopped container, backed up doc_status.json. Filtered to status=processed only: 8,357 records retained; 116,986 pending/processing/failed quarantined to backup. Restarted container. CPU dropped to 0.31%.
S3: Verify Queryability
Tested naive mode + only_need_context=true (bypasses LLM, returns ALAI corpus chunks). Graph/label endpoint returned 200+ entities. Service functionally restored.
S4: Image Swap for Native ARM64
Replaced sbnb/lightrag:latest (amd64, v1.3.4) with ghcr.io/hkuds/lightrag:latest (native arm64, v1.4.16, official upstream). Verified via docker manifest inspect.
S5: Resource Limits
Added cgroup-enforced limits in compose: cpus: 2.0, memory: 4G.
S6: Re-Ingest Worker Design
Designed (not implemented) re-ingest worker with: batch_size=10, cooldown=60s, health-gate, pre-flight LLM availability check, cursor-based restart safety. Build gated on CEO OCD-3 (aging policy decision).
Verified Post-State
- Container: Up, healthy
- CPU: 4.29%
- Memory: 1.24/4 GiB
/health: HTTP 200 in 3.7ms- Naive query: Returns ALAI documentation chunks
- Knowledge graph: Queryable
Known Follow-Ups
- Child MC #100027 (M, FlowForge):
LLM_MODEL=qwen3:8b-q8_0in.envnot on Ollama; hybrid/local/global modes return 404. Fix: one-line.envchange tollama3.1:8bORollama pull qwen3:8b-q8_0. - CEO OCD-3 aging policy: Decision required before re-ingest build. 100% of 117K backlog has
unknown_source. Option A (drop-all) recommended due to zero provenance utility. - Martin's asyncio event-loop freeze: HKUDS 1.4.16 may have fixed root cause. Verify before relaxing
batch_sizein re-ingest worker.
Evidence Files (Local, Transient)
/tmp/lightrag-stabilization-step1-evidence.txt— launchctl list before/after/tmp/lightrag-stabilization-step2-evidence.txt— doc_status counts, /health timing post-restart/tmp/lightrag-stabilization-step3-evidence.txt— query mode results, graph endpoint outputs/tmp/lightrag-stabilization-step4-evidence.txt— manifest inspect, docker inspect/tmp/lightrag-stabilization-step5-evidence.txt— docker stats with limits, compose diff/tmp/lightrag-stabilization-step6-evidence.txt— re-ingest worker design doc/tmp/lightrag-stabilization-progress.txt— step-by-step progress log/tmp/cache-proxy-99981/— pre-existing Phase 1 hash-cache proxy evidence (separate)
References
- Forge file:
/Users/makinja/system/prompts/forged/99982.md(panel decisions used to drive stabilization) - Mehanik clearance:
/tmp/mehanik-cleared-100009(2026-05-08 19:56) - Genesis SENTINEL v3 audit:
project_sentinel_v3_audit_2026-05-01.md - Image diff:
sbnb/lightrag:latest @ 1.3.4→ghcr.io/hkuds/lightrag:latest @ 1.4.16
No comments to display
No comments to display