LightRAG Stabilization Runbook — 2026-05-08

Genesis

On 2026-05-08 at 14:00, Kelsey Hightower reported LightRAG returning 502 errors. By 19:05 the service had degraded to complete timeout (000). Root cause: MainThread synchronous list comprehension in lightrag/lightrag.py:872 (apipeline_process_enqueue_documents) iterating over 125,341-record JsonDocStatusStorage on the asyncio event loop. A single POST to /documents/text triggered full file rewrite + pipeline iteration over 121K pending docs → CPU pegged at 100% → /health unreachable. The issue was compounded by running sbnb/lightrag:latest amd64-only image under Rosetta on Apple Silicon, incurring 2-3× performance tax.

Six-Step Fix Applied

S1: Disable Runaway Ingest Agents

Stopped LaunchAgents: com.alai.lightrag-outbox-ingest, com.alai.lightrag-migrate-pump, com.alai.lightrag-watchdog. Kept: keepwarm, backup, monitor.

S2: Prune Pending Queue

Stopped container, backed up doc_status.json. Filtered to status=processed only: 8,357 records retained; 116,986 pending/processing/failed quarantined to backup. Restarted container. CPU dropped to 0.31%.

S3: Verify Queryability

Tested naive mode + only_need_context=true (bypasses LLM, returns ALAI corpus chunks). Graph/label endpoint returned 200+ entities. Service functionally restored.

S4: Image Swap for Native ARM64

Replaced sbnb/lightrag:latest (amd64, v1.3.4) with ghcr.io/hkuds/lightrag:latest (native arm64, v1.4.16, official upstream). Verified via docker manifest inspect.

S5: Resource Limits

Added cgroup-enforced limits in compose: cpus: 2.0, memory: 4G.

S6: Re-Ingest Worker Design

Designed (not implemented) re-ingest worker with: batch_size=10, cooldown=60s, health-gate, pre-flight LLM availability check, cursor-based restart safety. Build gated on CEO OCD-3 (aging policy decision).

Verified Post-State

Known Follow-Ups

Evidence Files (Local, Transient)

References


Revision #2
Created 2026-05-08 18:25:40 UTC by John
Updated 2026-06-14 20:02:39 UTC by John