archive.alai.no — Paperless-ngx Setup & Operations

URL: https://archive.alai.no
Backend: Paperless-ngx (image ghcr.io/paperless-ngx/paperless-ngx:latest)
Host: Azure VM 4.223.110.181 (alai-admin)
Container: alai-paperless-1 (with redis, gotenberg, tika sidecars)
MC reference: #9546, #9982 (DR backup TODO)

Document management system for all ALAI-related legal, contractual, partnership, research, and financial documents. Provides OCR, full-text search, and a taxonomy.


Access requirements

Both layers of the Cloudflare stack require 92.221.168.61/32 (ALAI LAN egress) in their bypass lists. See CF IP Access Rules — ALAI LAN Bypass.

From the Mac Studio with an active VPN: binding to interface 192.168.68.65 (Deco LAN) bypasses VPN routing:

curl --interface 192.168.68.65 https://archive.alai.no/...
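
For scripts that cannot shell out to curl, the same source-address binding can be approximated in Python. This is a minimal sketch using only the standard library; the helper name open_connection is hypothetical, and it assumes that binding the socket to the Deco LAN address has the same routing effect as curl --interface with an IP address.

```python
import http.client

# Assumption: the same LAN address that the curl example binds to.
BIND_ADDRESS = "192.168.68.65"


def open_connection(host: str, bind_address: str = BIND_ADDRESS) -> http.client.HTTPSConnection:
    """Create an HTTPS connection whose socket will be bound to a local address.

    Port 0 lets the OS pick any free source port. No network traffic
    happens until a request is actually sent on the connection.
    """
    return http.client.HTTPSConnection(
        host, source_address=(bind_address, 0), timeout=30
    )


conn = open_connection("archive.alai.no")
# conn.request("GET", "/api/") would now leave via the bound address.
```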

Mac Air and other machines without a VPN: works directly.


API authentication

Paperless uses DRF (Django REST Framework) token authentication.

The token for the admin user (root@localhost) is stored locally on the Mac Studio:

~/.config/alai/paperless-token.env  (mode 600)
PAPERLESS_TOKEN=c9ec30192db3c95802349335edea4bca864a937a
PAPERLESS_BASE=https://archive.alai.no
PAPERLESS_BIND_INTERFACE=192.168.68.65

All API requests send this header:

Authorization: Token c9ec30192db3c95802349335edea4bca864a937a
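
A script can read the env file instead of hard-coding the token. The sketch below assumes the file contains only simple KEY=VALUE lines as shown above; parse_env, load_settings and auth_headers are hypothetical helper names, not part of Paperless.

```python
from pathlib import Path

# Path taken from this page; adjust if the file lives elsewhere.
DEFAULT_ENV_FILE = Path.home() / ".config" / "alai" / "paperless-token.env"


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_settings(path: Path = DEFAULT_ENV_FILE) -> dict[str, str]:
    """Read and parse the env file from disk."""
    return parse_env(Path(path).read_text())


def auth_headers(settings: dict[str, str]) -> dict[str, str]:
    """Build the DRF token header that every API request carries."""
    return {"Authorization": f"Token {settings['PAPERLESS_TOKEN']}"}
```

Usage: auth_headers(load_settings()) yields the header dictionary to pass to any HTTP client.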

Regenerate the token (if compromised) using the Django shell via docker exec:

ssh -b 192.168.68.65 -i ~/.ssh/azure_alai [email protected] \
  'docker exec alai-paperless-1 python manage.py shell -c "
from rest_framework.authtoken.models import Token
from django.contrib.auth import get_user_model
u = get_user_model().objects.get(username=\"admin\")
Token.objects.filter(user=u).delete()
print(Token.objects.create(user=u).key)
"'

Schema (taxonomy)

Set up on 2026-04-28 via /tmp/paperless-setup.sh. IDs may vary per instance, so use name__iexact for lookups.
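
Because IDs are instance-specific, a script can resolve a name to an ID at runtime. This is a minimal sketch of that lookup; lookup_url and first_id are hypothetical helpers, and it assumes the list endpoints (tags, document_types, storage_paths, correspondents) return the usual paginated body with a results array.

```python
from typing import Optional
from urllib.parse import urlencode


def lookup_url(base: str, resource: str, name: str) -> str:
    """Build a case-insensitive exact-name lookup URL for a taxonomy resource."""
    query = urlencode({"name__iexact": name})
    return f"{base.rstrip('/')}/api/{resource}/?{query}"


def first_id(payload: dict) -> Optional[int]:
    """Return the ID of the first match in a paginated response, or None."""
    results = payload.get("results", [])
    return results[0]["id"] if results else None


print(lookup_url("https://archive.alai.no", "tags", "kuran-19"))
# https://archive.alai.no/api/tags/?name__iexact=kuran-19
```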

Document Types (14)

Contract, LOI, NDA, Registration, Insurance Policy, Research Paper, Invoice, Receipt, Email Archive, Identity Document, Tax Document, Financial Statement, Meeting Notes, Pitch Deck

Tags (23, color-coded)

Cross-cutting (purpose): legal, research, kuran-19, partnership, regulator, contract, nda, loi, invoice, registration, urgent, signed, pending-signature

Company tags: ALAI, Drop, Bilko, Tok, Lobby, LumisCare, Plock, ALAI-Tech-DOO, BasicConsulting, client

Storage Paths (21)

Folder hierarchy by company and function:

/ALAI/legal/{created_year}/{title}
/ALAI/research/kuran-19/{title}
/ALAI/research/general/{created_year}/{title}
/ALAI/partnerships/sintef/{title}
/ALAI/partnerships/intesa/{title}
/ALAI/partnerships/pbz/{title}
/ALAI/regulators/finanstilsynet/{created_year}/{title}
/ALAI/regulators/skatteetaten/{created_year}/{title}
/ALAI/regulators/bronnoysund/{created_year}/{title}
/ALAI/contacts/{title}
/Drop/legal/{created_year}/{title}
/Drop/contracts/{title}
/Bilko/legal/{created_year}/{title}
/Bilko/contracts/{title}
/Tok/legal/{created_year}/{title}
/Lobby/legal/{created_year}/{title}
/LumisCare/legal/{created_year}/{title}
/Plock/legal/{created_year}/{title}
/ALAI-Tech-DOO/legal/{created_year}/{title}
/BasicConsulting/{created_year}/{title}
/clients/Entur/{created_year}/{title}
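
To illustrate how the placeholders in these templates expand, here is a small sketch using Python string formatting. It is not Paperless's own renderer, only an approximation for the two placeholders used on this page; the document title and date are made up.

```python
from datetime import date


def render_storage_path(template: str, title: str, created: date) -> str:
    """Expand {created_year} and {title} in a storage path template."""
    return template.format(title=title, created_year=created.year)


print(render_storage_path(
    "/ALAI/legal/{created_year}/{title}",
    "Shareholder Agreement",   # hypothetical document title
    date(2026, 4, 28),
))
# /ALAI/legal/2026/Shareholder Agreement
```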

Initial Correspondents (25, expanded as docs ingest)

SINTEF, Finanstilsynet, Skatteetaten, Brønnøysundregistrene, PBZ Zagreb, Intesa Sanpaolo, Anthropic, Cloudflare, Tryg, Fiken AS, Entur AS — auto-create on classify match.


Upload workflow

Manual single file

source ~/.config/alai/paperless-token.env
curl -s --interface "$PAPERLESS_BIND_INTERFACE" \
  -H "Authorization: Token $PAPERLESS_TOKEN" \
  -F "title=My Document" \
  -F "storage_path=1" \
  -F "tags=30" -F "tags=17" \
  -F "document=@/path/to/file.pdf" \
  -X POST "$PAPERLESS_BASE/api/documents/post_document/"

Returns a task UUID. Verify success via:

curl ... "$PAPERLESS_BASE/api/tasks/?task_id=<UUID>"
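
When polling from a script, the task response can be reduced to a simple outcome. The sketch below assumes the tasks endpoint returns a JSON list whose entries carry task_id, status and result fields, and that status uses Celery-style state names such as SUCCESS and FAILURE; task_outcome is a hypothetical helper.

```python
def task_outcome(tasks: list[dict], task_id: str) -> str:
    """Summarize the state of one consumption task from a tasks response."""
    for task in tasks:
        if task.get("task_id") != task_id:
            continue
        status = task.get("status")
        if status == "SUCCESS":
            return "done"
        if status == "FAILURE":
            # The result field usually explains the failure (e.g. a duplicate).
            return f"failed: {task.get('result')}"
        return "pending"
    return "unknown"
```

A caller would repeat the request every few seconds until the outcome is no longer "pending".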

Batch upload with classification

Script: /tmp/paperless-classify-v2.py (commit to repo TBD)

python3 /tmp/paperless-classify-v2.py --dry --all     # dry-run all ~/ALAI/*
python3 /tmp/paperless-classify-v2.py --all           # actual upload
python3 /tmp/paperless-classify-v2.py FILE [FILE...]  # specific files

The classifier maps path → (storage_path, correspondent, document_type, tags) according to a rules engine. Before upload it deduplicates by normalized title; Paperless also has its own content-hash dedup (it rejects a file if its content is already present).
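
The actual script is not reproduced on this page. As an illustration only, a path-based rules engine with title dedup could look like the sketch below. The rules, regexes and helper names are hypothetical and are not taken from paperless-classify-v2.py.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional, Tuple


class Classification(NamedTuple):
    storage_path: str
    correspondent: Optional[str]
    document_type: str
    tags: Tuple[str, ...]


# Hypothetical rules: first matching path pattern wins.
RULES = [
    (re.compile(r"/ALAI/partnerships/sintef/", re.I),
     Classification("/ALAI/partnerships/sintef/{title}", "SINTEF",
                    "Contract", ("ALAI", "partnership"))),
    (re.compile(r"/Drop/legal/", re.I),
     Classification("/Drop/legal/{created_year}/{title}", None,
                    "Contract", ("Drop", "legal"))),
]


def classify(path: str) -> Optional[Classification]:
    """Return the classification of the first rule matching the path."""
    for pattern, result in RULES:
        if pattern.search(path):
            return result
    return None


def normalize_title(path: str) -> str:
    """Lowercase the file stem and collapse punctuation to single spaces."""
    stem = PurePosixPath(path).stem
    return re.sub(r"[^a-z0-9]+", " ", stem.lower()).strip()


def is_duplicate(path: str, seen_titles: set) -> bool:
    """Pre-upload dedup: True if a file with the same normalized title was seen."""
    title = normalize_title(path)
    if title in seen_titles:
        return True
    seen_titles.add(title)
    return False
```

Title-based dedup is only a cheap first pass; files with different names but identical content are still caught by Paperless's content-hash check at upload time.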


Operations cheat sheet

# Document count
curl ... "$PAPERLESS_BASE/api/documents/?page_size=1" | jq '.count'

# Latest 10 docs
curl ... "$PAPERLESS_BASE/api/documents/?ordering=-created&page_size=10" | jq '.results[] | {id, title, created}'

# Search by tag
curl ... "$PAPERLESS_BASE/api/documents/?tags__id=17" | jq '.results[].title'

# Search by storage path
curl ... "$PAPERLESS_BASE/api/documents/?storage_path__id=1"

# Full-text search (OCR'd content)
curl ... "$PAPERLESS_BASE/api/documents/?query=finanstilsynet"

# Task queue status
curl ... "$PAPERLESS_BASE/api/tasks/?page_size=200" | jq 'group_by(.status) | map({status: .[0].status, count: length})'

# Failed tasks (often = content duplicates)
curl ... "$PAPERLESS_BASE/api/tasks/" | jq '[.[] | select(.status == "FAILURE") | {file: .task_file_name, reason: .result}]'
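
The two task queries above can also be done in Python when jq is not available. This sketch mirrors the jq filters and, like them, assumes the tasks endpoint returns a plain JSON list with status, task_file_name and result fields; the function names are hypothetical.

```python
from collections import Counter


def summarize_tasks(tasks: list[dict]) -> dict[str, int]:
    """Count tasks per status, like the group_by(.status) jq filter."""
    return dict(Counter(task.get("status", "UNKNOWN") for task in tasks))


def failed_tasks(tasks: list[dict]) -> list[dict]:
    """List file name and failure reason for every failed task."""
    return [
        {"file": task.get("task_file_name"), "reason": task.get("result")}
        for task in tasks
        if task.get("status") == "FAILURE"
    ]
```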

Architecture

[ALAI LAN egress 92.221.168.61]
       │
       ▼
[Cloudflare]
   ├─ IP Access Rule: bypass WAF (Layer 1)
   └─ CF Access policy: bypass Zero Trust (Layer 2)
       │
       ▼
[Caddy on Azure VM 4.223.110.181]
   archive.alai.no → paperless-ngx:8000
       │
       ▼
[alai-paperless-1 container]
   ├─ alai-paperless-redis-1 (queue)
   ├─ alai-paperless-gotenberg-1 (PDF preview)
   └─ alai-paperless-tika-1 (text extraction)
       │
       ▼
[Postgres + media volume on Azure VM]

Outstanding (TODO)

  • MC #9982: DR backup automation (pg_dump cron + media volume snapshot + B2/R2 upload + 30-day retention)
  • Bitwarden token storage: bw create item is blocked by a Node 25 incompatibility (Invalid version error). If the problem persists, add the token manually via the Vaultwarden web UI
  • Token rotation policy: currently no expiry; consider a 90-day rotation for the admin token
  • Per-user tokens: create user-specific tokens for an audit trail (a shared admin token means no per-user audit)
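
The 30-day retention planned for the DR backup item is not implemented yet. As a starting point, the pruning decision could be sketched as below; the backup names and the expired_backups helper are hypothetical, and the actual listing and deletion against B2/R2 are left out.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # retention window from the TODO item


def expired_backups(backups: dict[str, datetime], now: datetime) -> list[str]:
    """Return names of backups taken more than RETENTION before 'now'."""
    cutoff = now - RETENTION
    return sorted(name for name, taken in backups.items() if taken < cutoff)


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
backups = {
    "paperless-2026-04-15.sql.gz": datetime(2026, 4, 15, tzinfo=timezone.utc),
    "paperless-2026-05-20.sql.gz": datetime(2026, 5, 20, tzinfo=timezone.utc),
}
print(expired_backups(backups, now))
# ['paperless-2026-04-15.sql.gz']
```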