LightRAG Backup (Azure-native + local safety net)

Domain note (2026-05-17): References to lightrag.basicconsulting.no in this doc are the legacy hostname. Current live endpoint: lightrag.alai.no.

Owner: FlowForge (infra)
Implemented: 2026-04-18 (updated for Azure migration 2026-04-18)
Source of Truth: Azure VM vm-alai-lightrag (20.240.61.67)
Schedule: Weekly, Sunday 04:00 CEST
Script: ~/system/tools/lightrag-backup.sh (SSH-based)
Plist: ~/Library/LaunchAgents/com.alai.lightrag-backup.plist
Azure creds: ~/system/config/azure-lightrag-backup.env (mode 0600)

What is backed up

4 Docker volumes (+ checksums + README):

Volume              | Content                          | Typical size
lightrag-data       | LightRAG KV store + inputs       | ~300 MB
lightrag-kg         | Knowledge graph files            | small
lightrag-cache      | LLM response cache               | small
lightrag-neo4j-data | Neo4j graph entities + relations | ~170 MB

Typical total: 500 MB – 1 GB compressed.

How it runs (POST-MIGRATION)

Source: Azure VM vm-alai-lightrag (20.240.61.67)

  1. SSH to Azure VM: ssh -i ~/.ssh/azure_alai [email protected]
  2. docker compose stop lightrag neo4j — graceful shutdown (~30s downtime)
  3. docker run alpine tar czf dumps each volume on VM
  4. docker compose start neo4j lightrag — resume
  5. shasum -a 256 *.tar.gz > MANIFEST.sha256 on VM
  6. Write README.md with restore procedure on VM
  7. SCP from VM to Mac Studio — download snapshot to ~/system/backups/lightrag/ (safety net)
  8. Azure offsite upload — Cool tier blob plockfrontstaging/lightrag-backup/<TS>/
  9. Azure rotation — keep last 8 snapshots (longer offsite retention)
  10. Local rotation — keep last 4 snapshots in ~/system/backups/lightrag/ (7-day safety, then deletable)
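The local rotation in step 10 can be sketched as follows. This is a minimal illustration, not the contents of lightrag-backup.sh: the function name rotate_snapshots is hypothetical, and it assumes snapshot directories are named with sortable timestamps such as 20260418-085317, so that lexical order is also chronological.

```shell
# Hypothetical helper: keep only the newest KEEP snapshot directories in DIR.
# Assumes timestamp-style names (e.g. 20260418-085317), which the shell glob
# returns in lexical, and therefore chronological, order.
rotate_snapshots() (
  dir=$1
  keep=$2

  # Count the snapshot directories that currently exist.
  total=0
  for d in "$dir"/*/; do
    [ -d "$d" ] && total=$((total + 1))
  done

  # Delete the oldest ones until only KEEP remain.
  excess=$((total - keep))
  for d in "$dir"/*/; do
    [ "$excess" -gt 0 ] || break
    [ -d "$d" ] || continue
    rm -rf -- "${d%/}"
    excess=$((excess - 1))
  done
)
```

Under these assumptions, `rotate_snapshots ~/system/backups/lightrag 4` would match the local retention described above. The Azure side (keep last 8) would need the equivalent logic applied to blob prefixes rather than directories.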

Downtime: ~60–90s every Sunday 04:00 (cloud LightRAG unavailable during backup).

Key change: Local Docker volumes are NO LONGER the source of truth. Azure VM volumes are primary. Local backups are now safety net only.

Why NOT docker compose pause

pause freezes LightRAG's async event loop. On unpause, uvicorn still reports as running, but the HTTP handler no longer services new requests and the container reports unhealthy; only a full container restart recovers it. The backup on 2026-04-18 hit this: the backup itself was fine (volumes were at rest during the pause), but the container needed a restart afterwards. Future runs use stop/start instead.
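One way to structure the stop/start approach so that the services come back even when a dump fails is an EXIT trap. The sketch below is an assumption about how this could be written on the VM, not a copy of the real script: the function names backup_with_restart and dump_volumes, and the exact tar flags, are illustrative.

```shell
# Hypothetical: dump each named volume to OUT_DIR as <volume>.tar.gz.
dump_volumes() {
  out_dir=$1
  for vol in lightrag-data lightrag-kg lightrag-cache lightrag-neo4j-data; do
    # Mount the volume read-only and archive its contents with alpine's tar.
    docker run --rm -v "$vol":/src:ro -v "$out_dir":/dst alpine \
      tar czf "/dst/$vol.tar.gz" -C /src . || return 1
  done
}

# Hypothetical: stop services, dump, and always start services again.
# COMPOSE_DIR is the compose project directory; OUT_DIR must be an absolute path.
backup_with_restart() (
  compose_dir=$1
  out_dir=$2
  cd "$compose_dir" || exit 1
  # The trap fires when this subshell exits, whether the dump succeeded or not.
  trap 'docker compose start neo4j lightrag' EXIT
  docker compose stop lightrag neo4j || exit 1
  dump_volumes "$out_dir"
)
```

The function body runs in a subshell, so the trap is scoped to the backup and the caller still sees the dump's exit status.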

Azure storage details

  • Account: plockfrontstaging (swedencentral, Hot storage account)
  • Container: lightrag-backup
  • Resource group: plock-staging-rg
  • Tier per blob: Cool (cheaper storage, ~$0.01/GB/month, suited to rarely read archives)
  • Retention: last 8 snapshots (~8 weeks)
  • Estimated cost: ~$0.05–0.10/month for ~4 GB retained
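The cost estimate can be checked with a quick calculation. The helper below is illustrative only (the name estimate_monthly_cost is made up here); it multiplies the retained snapshot count by the per-snapshot size and the ~$0.01/GB/month figure quoted above, and covers storage only.

```shell
# Hypothetical helper: estimate_monthly_cost SNAPSHOTS GB_PER_SNAPSHOT USD_PER_GB_MONTH
# Storage cost only; transaction and egress charges are not modelled.
estimate_monthly_cost() {
  awk -v n="$1" -v gb="$2" -v price="$3" \
    'BEGIN { printf "%.2f\n", n * gb * price }'
}

estimate_monthly_cost 8 0.5 0.01   # 8 snapshots of 0.5 GB each -> 0.04
estimate_monthly_cost 8 1 0.01     # 8 snapshots of 1 GB each   -> 0.08
```

With 8 snapshots of 0.5 to 1 GB each, storage alone comes to roughly $0.04 to $0.08 per month, which is in line with the estimate above once per-operation charges are allowed for.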

Restore procedure

Restore to Azure VM (primary, production)

# On Mac Studio: pick snapshot
SNAPSHOT=~/system/backups/lightrag/20260418-085317
cd "$SNAPSHOT"
shasum -a 256 -c MANIFEST.sha256 || { echo "checksum mismatch, abort"; exit 1; }

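# SCP to Azure VM (create the target directory first so the snapshot lands in /tmp/restore/<TS>/)
ssh -i ~/.ssh/azure_alai [email protected] 'mkdir -p /tmp/restore'
scp -i ~/.ssh/azure_alai -r "$SNAPSHOT" [email protected]:/tmp/restore/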

# SSH to Azure VM
ssh -i ~/.ssh/azure_alai [email protected]

# On Azure VM ($SNAPSHOT from the Mac shell is not set in this session, so define it again):
SNAPSHOT=/tmp/restore/20260418-085317
cd "$SNAPSHOT"
shasum -a 256 -c MANIFEST.sha256 || { echo "checksum mismatch, abort"; exit 1; }

cd ~/lightrag
docker compose down

for vol in lightrag-data lightrag-kg lightrag-cache lightrag-neo4j-data; do
  docker volume rm $vol || true
  docker volume create $vol
  docker run --rm -v $vol:/dst -v /tmp/restore/$(basename "$SNAPSHOT"):/src alpine tar xzf /src/${vol}.tar.gz -C /dst
done

docker compose up -d

# Verify
curl http://localhost:9621/health
# From Mac Studio:
curl https://lightrag.basicconsulting.no/health
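Because the containers may take a little while to become ready after docker compose up -d, a single curl can fail even when the restore went fine. The following is a sketch of a polling check, assuming the health endpoint returns a body containing {"status":"healthy"}; the function name wait_healthy and the retry defaults are hypothetical.

```shell
# Hypothetical helper: wait_healthy URL [TRIES] [DELAY_SECONDS]
# Polls URL until the response body reports "status":"healthy".
wait_healthy() {
  local url="$1" tries="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f: treat HTTP errors as failures; --max-time: do not hang on a stuck handler
    if curl -fsS --max-time 5 "$url" 2>/dev/null \
        | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_healthy http://localhost:9621/health 20 3 || echo "LightRAG did not come back"` would give the service about a minute to recover before flagging a problem.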

Restore to Mac Studio (rollback/emergency only)

Use case: Azure VM failure, need to restore local LightRAG as emergency fallback.

cd ~/system/docker/lightrag
docker compose down

# Pick a snapshot (local or download from Azure first)
SNAPSHOT=~/system/backups/lightrag/20260418-085317
cd "$SNAPSHOT"
shasum -a 256 -c MANIFEST.sha256 || { echo "checksum mismatch, abort"; exit 1; }

for vol in lightrag-data lightrag-kg lightrag-cache lightrag-neo4j-data; do
  docker volume rm $vol || true
  docker volume create $vol
  docker run --rm -v $vol:/dst -v "$SNAPSHOT":/src alpine tar xzf /src/${vol}.tar.gz -C /dst
done

cd ~/system/docker/lightrag
docker compose up -d

# Verify
curl http://localhost:9621/health

# IMPORTANT: Update consumer files to use localhost:9621 instead of cloud endpoint
# (see azure-lightrag-migration.md rollback procedure)

Azure Blob restore (download offsite backup)

Use case: Local backups lost, need to restore from Azure Blob offsite storage.

source ~/system/config/azure-lightrag-backup.env
TS=20260418-085317
RESTORE_DIR=~/system/backups/lightrag/azure-restore-$TS
mkdir -p "$RESTORE_DIR"
az storage blob download-batch \
  --account-name $AZURE_STORAGE_ACCOUNT \
  --account-key "$AZURE_STORAGE_KEY" \
  --source $AZURE_STORAGE_CONTAINER \
  --destination "$RESTORE_DIR" \
  --pattern "$TS/*"

# Verify checksums
cd "$RESTORE_DIR/$TS"
shasum -a 256 -c MANIFEST.sha256

# Then follow "Restore to Azure VM" or "Restore to Mac Studio" procedure above
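The restore loops remove each Docker volume before extracting its archive, so it is worth confirming that a downloaded snapshot is complete before starting. Below is a minimal pre-flight sketch; the function name snapshot_complete is hypothetical, and it assumes the archive naming used in the restore loops (<volume>.tar.gz plus MANIFEST.sha256).

```shell
# Hypothetical helper: snapshot_complete DIR
# Succeeds only if the manifest and all four volume archives are present and non-empty.
snapshot_complete() {
  local dir="$1" vol missing=0
  [ -f "$dir/MANIFEST.sha256" ] || { echo "missing MANIFEST.sha256" >&2; missing=1; }
  for vol in lightrag-data lightrag-kg lightrag-cache lightrag-neo4j-data; do
    [ -s "$dir/$vol.tar.gz" ] || { echo "missing or empty: $vol.tar.gz" >&2; missing=1; }
  done
  return "$missing"
}
```

A call such as `snapshot_complete "$RESTORE_DIR/$TS" || exit 1` complements the checksum verification: the checksums confirm file contents, while this confirms nothing is absent before any volume is deleted.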

Monitoring

  • Log: ~/system/logs/lightrag-backup.log (on Mac Studio, backup orchestrator)
  • Local backup size (all retained snapshots): du -sh ~/system/backups/lightrag/
  • Azure VM backup size (all retained snapshots): ssh -i ~/.ssh/azure_alai [email protected] 'du -sh ~/lightrag-backups/'
  • Azure blob list:
    source ~/system/config/azure-lightrag-backup.env
    az storage blob list \
      --account-name $AZURE_STORAGE_ACCOUNT \
      --account-key "$AZURE_STORAGE_KEY" \
      --container-name $AZURE_STORAGE_CONTAINER \
      --prefix lightrag-backup/ \
      -o table
    
  • Post-run LightRAG health: logged as last line of each run (should show {"status":"healthy"} from https://lightrag.basicconsulting.no/health)

Manual run

bash ~/system/tools/lightrag-backup.sh

The same 60–90s downtime applies, and output goes to the same log file.

Note: Post-migration (2026-04-18), script must be updated to SSH to Azure VM instead of using local Docker. See script comments for SSH-based backup procedure.



Document Owner: Skillforge
Last Updated: 2026-04-18 (post-Azure migration)
Validated By: Kelsey Hightower (FlowForge), Martin Kleppmann (CodeCraft — data consistency)