# SEO Readiness Portal — Real Audit Engine (2026-06-02)

**Status:** DEPLOYED to production  
**Scope:** MC #102800 / #102801 / #102802 / #102803: real live crawl audit runner (replaces local readiness stub)  
**Deploy date:** 2026-06-02  
**Evidence:** `/tmp/alai/996bd450/evidence-102800/verification.json`, `/tmp/alai/996bd450/evidence-102820/verification.json`  
**Image:** `alairegistry.azurecr.io/seo-readiness-portal:20260602-real-audit`

---

## Overview

The SEO Readiness Portal now performs **real live HTTP crawl audits** against client websites, replacing the previous local form-validation-only stub. The audit engine fetches the home page, robots.txt, and sitemap.xml from the public internet, parses them with **cheerio** (HTML5-aware DOM parser), and emits **P0/P1/P2 findings** based on industry-standard SEO readiness signals.

All findings flow into the backlog system (Phase 4) and feed the client report generator (Phase 5). Reports are exported as Markdown and include a mandatory no-ranking-guarantee disclaimer.

**What changed:** Phase 3 (audit runner), Phase 4 (findings/backlog), and Phase 5 (report generation) are now **REAL**: they operate on live crawl data, not local form fields. The previous Phase 4–11 local readiness workflow is retained as a fallback mode (`mode: "local_readiness"` vs `mode: "live_crawl"`).

---

## Architecture

```mermaid
flowchart LR
    A[Operator Browser] -->|HTTPS + CF Access| B[Cloudflare Access]
    B -->|Authenticated header| C[Azure App Service<br/>seo-readiness-alai]
    C -->|Next.js Server Action| D[Live Crawl Runner]
    D -->|SSRF-guarded fetch| E[Client website]
    D -->|cheerio parse| F[Findings + Backlog]
    F --> G[Report Generator]
    G --> H[Markdown Export]
    C -->|Write| I["/home/data/workspace.json"]
    C -->|Write| J["/home/data/audits/auditId.json"]
```

### Components

| Component | Technology | Purpose | Location |
|-----------|-----------|---------|----------|
| **Live Crawl Runner** | TypeScript + Node.js fetch | Fetch home/robots/sitemap, parse with cheerio, emit findings | `src/lib/audit/runner.ts` |
| **SSRF Guard** | Custom URL validation + AbortController | Block private IPs, enforce 9s per-fetch + 45s total timeout, 2 MB body cap | `src/lib/audit/crawl-guard.ts` |
| **HTML Parser** | cheerio (HTML5 mode) | Parse title, meta, headings, links, canonical, OG tags | `src/lib/audit/crawl-parser.ts` |
| **Findings Engine** | TypeScript | Emit P0/P1/P2 findings with evidence JSON, block forbidden ranking claims | `src/lib/audit/runner.ts` (liveFinding) |
| **Backlog Generator** | TypeScript | Convert findings → backlog items, enforce evidence-URL for done gate | `src/lib/reports/generator.ts` |
| **Report Generator** | TypeScript | Generate client-facing Markdown report with no-ranking disclaimer | `src/lib/reports/generator.ts` |
| **Persistence** | JSON file backend | Atomic write to `/home/data/workspace.json` + `/home/data/audits/<id>.json` | `src/lib/workspace/persistence.ts` |

### Data Flow

1. **Operator triggers audit** (authenticated browser at `https://seo-tools.alai.no/partners`)
2. **Server Action calls runLiveCrawlAudit()** with `client`, `site`, `now`
3. **guardedFetch()** retrieves home page, robots.txt, sitemap.xml with SSRF guard + timeout
4. **cheerio** parses HTML5-compliant DOM (handles broken HTML gracefully)
5. **Findings emitted**: P0/P1/P2 severity, 11 categories (crawlability, indexability, content, technical, metadata, performance, mobile, accessibility, structure, security, evidence)
6. **Atomic write**: audit JSON → `/home/data/audits/<auditId>.json`, workspace update → `/home/data/workspace.json`
7. **Backlog items generated** from findings (operator can convert any finding to a backlog task)
8. **Report generated** from audit + backlog, no-ranking disclaimer injected
9. **Markdown export** with checksum and handoff checklist
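To make the findings step more concrete, here is a minimal TypeScript sketch of what a finding record and a severity ordering could look like. The field names mirror the finding shape this document describes (`id`, `severity`, `category`, `title`, `description`, `recommendation`, `evidence`); the `evidence` type, the `CATEGORIES` constant, and the `sortBySeverity` helper are assumptions made for illustration and are not taken from `runner.ts`.

```typescript
// Illustrative types only; the real definitions live in the portal source.
export type Severity = "P0" | "P1" | "P2";

// The 11 categories named in the data flow above.
export const CATEGORIES = [
  "crawlability", "indexability", "content", "technical", "metadata",
  "performance", "mobile", "accessibility", "structure", "security", "evidence",
] as const;
export type Category = (typeof CATEGORIES)[number];

export interface Finding {
  id: string;
  severity: Severity;
  category: Category;
  title: string;
  description: string;
  recommendation: string;
  evidence: Record<string, unknown>; // assumed: free-form evidence JSON
}

const SEVERITY_RANK: Record<Severity, number> = { P0: 0, P1: 1, P2: 2 };

// Returns a new array ordered P0, then P1, then P2.
// Array.prototype.sort is stable, so ties keep their input order.
export function sortBySeverity(findings: readonly Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity],
  );
}
```

Keeping the category list as a `const` tuple lets the compiler reject a finding tagged with a category outside the eleven listed ones.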

---

## SSRF Guard

The crawl engine protects against **Server-Side Request Forgery (SSRF)** attacks:

### Blocked targets

- Non-http(s) schemes (e.g., `file://`, `ftp://`, `gopher://`)
- Bare IP literals (`http://192.168.1.1/`, `http://[::1]/`)
- Private IPv4 ranges: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `127.0.0.0/8`, `169.254.0.0/16` (includes cloud metadata endpoint `169.254.169.254`)
- Private IPv6 ranges: `::1`, `fc00::/7`, `fe80::/10`
- Numeric/encoded IP hostnames (e.g., `0x7f.0.0.1`, `2130706433`)
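The following is a hedged sketch of how a literal-target check along these lines might be written. It is not the code in `crawl-guard.ts`; `checkCrawlTarget`, `isBlockedIPv4Range`, and `GuardVerdict` are hypothetical names. It relies on the WHATWG `URL` parser, which normalizes hex and decimal IPv4 forms to dotted-quad notation and keeps IPv6 literals in square brackets, so one dotted-quad test covers the encoded cases too.

```typescript
// Hypothetical sketch of a literal-target guard; not the code in crawl-guard.ts.
export type GuardVerdict =
  | { allowed: true }
  | { allowed: false; reason: string };

const DOTTED_QUAD = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;

// True for the IPv4 ranges listed above.
export function isBlockedIPv4Range(octets: readonly number[]): boolean {
  const [a, b] = octets;
  return (
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // 127.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    (a === 169 && b === 254)             // 169.254.0.0/16 (cloud metadata)
  );
}

export function checkCrawlTarget(rawUrl: string): GuardVerdict {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return { allowed: false, reason: "unparseable URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { allowed: false, reason: `scheme ${url.protocol} not allowed` };
  }
  const host = url.hostname;
  // IPv6 literals keep their brackets in url.hostname.
  if (host.startsWith("[")) {
    return { allowed: false, reason: "IPv6 literal" };
  }
  // 0x7f.0.0.1 and 2130706433 have already been normalized to 127.0.0.1 here.
  const match = DOTTED_QUAD.exec(host);
  if (match) {
    const octets = match.slice(1).map(Number);
    const kind = isBlockedIPv4Range(octets) ? "private-range" : "bare";
    return { allowed: false, reason: `${kind} IPv4 literal` };
  }
  // Hostnames pass; nothing is resolved via DNS in this sketch.
  return { allowed: true };
}
```

Because every IP literal is rejected, the range check here only enriches the rejection reason; it would become load-bearing if resolved addresses were ever validated.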

### Timeouts

- **Per-fetch:** 9 seconds (home, robots, sitemap fetched sequentially)
- **Total audit:** 45 seconds hard limit (AbortController abort on timeout)
- **Body size cap:** 2 MB (drains and cancels response body on overflow to prevent socket leaks)
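As a rough illustration of how these limits could be enforced with `AbortController` and a streamed body read, consider the sketch below. The function names (`readBodyCapped`, `fetchWithTimeout`) and constants are hypothetical and the real `guardedFetch()` may be structured differently; the sketch assumes a runtime with global `fetch` and `Response` (Node.js 18 or later).

```typescript
// Hypothetical sketch; names and structure are assumptions, not crawl-guard.ts.
const PER_FETCH_TIMEOUT_MS = 9_000;
const MAX_BODY_BYTES = 2 * 1024 * 1024;

// Reads the body chunk by chunk and stops as soon as the cap is exceeded.
export async function readBodyCapped(
  response: Response,
  maxBytes: number = MAX_BODY_BYTES,
): Promise<Uint8Array> {
  if (!response.body) return new Uint8Array(0);
  const reader = response.body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel(); // release the underlying connection
      throw new Error(`Response body exceeded ${maxBytes} byte cap`);
    }
    chunks.push(value);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}

// Aborts on whichever fires first: the per-fetch timer or the total-audit signal.
export async function fetchWithTimeout(
  url: string,
  totalSignal: AbortSignal,
  timeoutMs: number = PER_FETCH_TIMEOUT_MS,
): Promise<Uint8Array> {
  const controller = new AbortController();
  const onTotalAbort = () => controller.abort();
  totalSignal.addEventListener("abort", onTotalAbort, { once: true });
  if (totalSignal.aborted) controller.abort();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await readBodyCapped(response);
  } finally {
    clearTimeout(timer);
    totalSignal.removeEventListener("abort", onTotalAbort);
  }
}
```

A caller could create one `AbortController` for the whole audit, arm it with a 45-second timer, and pass its `signal` to each of the three sequential fetches.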

### Known limitations (CEO decision: acceptable for MVP)

- DNS rebind protection deferred: the guard covers literal IPs but does not resolve hostnames at validation time (a follow-on MC can add a `dns.lookup` pre-check)
- No per-operator rate limiting (deferred to follow-on MC)
- Single-writer assumption: if two Azure App Service instances concurrently trigger crawls, last write wins on `workspace.json` (Postgres migration is a follow-on MC)

---

## File-backed Persistence

The audit engine writes to **persistent App Service storage** (Azure flag `WEBSITES_ENABLE_APP_SERVICE_STORAGE=true`):

- **Workspace state:** `/home/data/workspace.json` (atomic write with temp + rename, 8 KB typical size)
- **Audit archives:** `/home/data/audits/<auditId>.json` (one file per audit, ~20–50 KB per file)
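The temp-plus-rename write mentioned in the bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `persistence.ts`: `writeJsonAtomic` is a hypothetical helper name, and the cleanup-on-failure step is an addition of this sketch.

```typescript
import { promises as fs } from "node:fs";
import { randomUUID } from "node:crypto";
import { dirname } from "node:path";

// Hypothetical helper: write JSON to a sibling temp file, then rename over the target.
// rename() within one filesystem is atomic on POSIX, so readers see either the
// old file or the new one, never a partially written file.
export async function writeJsonAtomic(
  targetPath: string,
  data: unknown,
): Promise<void> {
  const tmpPath = `${targetPath}.tmp-${randomUUID()}`;
  await fs.mkdir(dirname(targetPath), { recursive: true });
  try {
    await fs.writeFile(tmpPath, JSON.stringify(data, null, 2), "utf8");
    await fs.rename(tmpPath, targetPath);
  } catch (err) {
    // Best-effort cleanup so failed writes do not leave temp files behind.
    await fs.rm(tmpPath, { force: true });
    throw err;
  }
}
```

This keeps a single writer safe from torn files, but it does not serialize concurrent writers: two instances that both read, modify, and rename will still produce a last-write-wins result.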

**Why file-backend for MVP:** CEO decision (a): Postgres migration is a follow-on MC. The file backend is deterministic, testable, and works for the single-operator phase. Concurrent writes from multiple Azure instances are NOT handled (last write wins).

**Atomic write protocol:**

1. Write to temp file: `/home/data/workspace.json.tmp-<uuid>`
2. `fs.rename()` to `/home/data/workspace.json` (atomic on POSIX)
3. Collision-safe audit IDs: `audit-<clientId>-<siteId>-<millisecondTimestamp>-<6charUUID>`

---

## Findings Categories and Severity

The live crawl audit emits **P0 (blocker), P1 (high), P2 (medium)** findings across **11 categories**:

| Category | P0 Findings | P1 Findings | P2 Findings |
|----------|-------------|-------------|-------------|
| **crawlability** | robots.txt blocks all crawlers, home page 403/503/429 | robots.txt fetch failed | Crawl-delay > 60s |
| **indexability** | Home status ≠ 200, robots meta noindex | | |
| **content** | Missing h1, title missing | Title < 30 or > 70 chars, h1 ≠ title | Meta description < 120 or > 160 chars, missing priority services |
| **technical** | | Missing viewport, sitemap index (nested, not flat) | og:image is relative URL |
| **metadata** | | Missing meta description, canonical mismatch | Missing og:title, og:description, or og:image |
| **performance** | | | href=# placeholder links (> 5) |
| **mobile** | | | No viewport |
| **accessibility** | | | Images missing alt (> 5) |
| **structure** | | | External links < 3 (isolation signal) |
| **security** | Canonical URL is http:// (not https://) | | |
| **evidence** | | | Analytics/Search Console status unknown |

**Forbidden claim words:** The generator enforces a hard block on `ranking`, `rankings`, `traffic lift`, `traffic growth`, `guarantee`, `guaranteed` in all finding/backlog/report text. Any match throws an error and aborts the audit.

---

## Findings → Backlog → Report Flow

1. **Audit emits findings**: JSON array with `{ id, severity, category, title, description, recommendation, evidence }`
2. **Operator converts finding to backlog item** (optional; not all findings require action)
3. **Backlog item fields:**
   - `title`: "Resolve {severity} {category} readiness item: {finding.title}"
   - `notes`: "{finding.recommendation} This is a readiness task from local workspace evidence only."
   - `status`: `"open" | "in_progress" | "done" | "wont_fix"`
   - `evidenceUrl`: REQUIRED for `status: "done"` (external proof the issue was fixed)
4. **Report generator** pulls latest audit + backlog, emits Markdown with:
   - Audit metadata (date, mode, status, findings count)
   - Scope section: "This report reflects basic public-page observability. It does not use Google Search Console, Analytics, paid keyword APIs, or private CMS data. Findings are readiness signals only. **This assessment does not predict search ranking, traffic volume, or guaranteed outcomes.**"
   - Findings by severity (P0 → P1 → P2)
   - Backlog summary
   - Recommendations
5. **Export with checksum**: Markdown file + SHA-256 hash stored in export metadata

---

## No-ranking Guardrail

**Every audit** (both `local_readiness` and `live_crawl` modes) stores a `guardrails` array in the audit JSON. The UI renders these unconditionally on every audit detail page.

### live_crawl guardrails

```json
[
  "Live crawl audit only; findings reflect publicly observable signals at crawl time.",
  "No Google Search Console, Analytics, paid keyword APIs, or private CMS data is used.",
  "This audit does not predict search ranking, traffic volume, or guaranteed outcomes.",
  "Findings must not claim ranking or traffic impact.",
  "This is a basic public-page audit. It does not use Google Search Console, Analytics, paid keyword APIs, or private CMS data."
]
```

These are injected into the client report's **Scope section** and displayed on the audit detail page. The generator throws an error if any finding text contains forbidden claim words.

---

## Deploy Path

**Target environment:** Azure App Service (Linux container), Sweden Central  
**Registry:** `alairegistry.azurecr.io`  
**Image tag:** `seo-readiness-portal:20260602-real-audit` (date + purpose semantic tag)  
**Public URLs:**

- `https://seo-tools.alai.no/partners` (Cloudflare Access authenticated)
- `https://seo-tools.snowit.ba/` (custom hostname via MC #102750, Cloudflare TLS termination)

**Origin protection:** Azure App Service origin is IP-locked to Cloudflare ranges (403 on direct access to `seo-readiness-alai.azurewebsites.net` from non-Cloudflare IPs)

### Deploy steps (manual operator path)

```bash
cd /Users/makinja/business/ALAI-Holding-AS/products/SEO-Readiness-Portal

# 1. Local gates (type-check, build, validate)
npm run type-check && npm run build && npm run validate:spec && npm run validate:phase12

# 2. Build image (ACR Tasks, remote build in Azure)
az acr build -r alairegistry -t seo-readiness-portal:20260602-real-audit .

# 3. Update App Service container config
az webapp config container set \
  --resource-group rg-seo-readiness-prod \
  --name seo-readiness-alai \
  --container-image-name alairegistry.azurecr.io/seo-readiness-portal:20260602-real-audit \
  --container-registry-url https://alairegistry.azurecr.io

# 4. Restart App Service
az webapp restart --resource-group rg-seo-readiness-prod --name seo-readiness-alai
```

### Post-deploy verification (ZAKON PI2 Check 4)

```bash
# Confirm new image is active
az webapp config container show -g rg-seo-readiness-prod -n seo-readiness-alai \
  --query "[?name=='DOCKER_CUSTOM_IMAGE_NAME'].value" -o tsv

# Verify public endpoints (expect 302 CF Access redirect)
curl -sI https://seo-tools.alai.no/api/health
curl -sI https://seo-tools.snowit.ba/api/health

# Verify origin is IP-locked (expect 403)
curl -sI https://seo-readiness-alai.azurewebsites.net/api/health

# Confirm Bilko domain untouched
dig +short bilko-demo.alai.no  # expect ghs.googlehosted.com
```

**Final UAT (pending CEO/Proveo):** Authenticated browser through Cloudflare Access → create client → run live audit → verify real findings from actual crawl → export report → confirm no-ranking disclaimer present.

### Rollback

```bash
az webapp config container set \
  --resource-group rg-seo-readiness-prod \
  --name seo-readiness-alai \
  --container-image-name alairegistry.azurecr.io/seo-readiness-portal:20260531-cloud \
  --container-registry-url https://alairegistry.azurecr.io

az webapp restart --resource-group rg-seo-readiness-prod --name seo-readiness-alai
```

Previous known-good image: `20260531-cloud` (pre-A1 local-readiness-only version)

---

## Operator Runbook

### How to run a live audit

1. **Authenticate:** Visit `https://seo-tools.alai.no/partners` with Cloudflare Access credentials
2. **Create client:** Fill intake form (company name, website, services, competitors, Google access status)
3. **Trigger audit:** Click "Run Live Audit" on the client detail page
4. **Wait:** Audit takes 10–45 seconds (home + robots + sitemap fetches)
5. **Review findings:** Navigate to `/clients/[clientId]/audits/[auditId]` to see P0/P1/P2 findings with evidence JSON
6. **Convert to backlog:** Click "Add to Backlog" on any finding that needs operator action
7. **Generate report:** Click "Generate Report" → draft created with scope disclaimer + findings + backlog summary
8. **Export:** Click "Export Markdown" → `.md` file with SHA-256 checksum stored in workspace
9. **Handoff:** Fill checklist (client approved scope, evidence URLs verified, no forbidden claims) → generate handoff summary → generate partner follow-up package

### How to deploy a new version

Follow the **Deploy steps** section above. Always run local gates before building the image. Always verify post-deploy (CF Access 302, origin 403, Bilko untouched).

### How to roll back

Run the **Rollback** command. The previous known-good image is tracked in `DEPLOY-MAP.md`. Verify rollback with the same post-deploy checks.

### Troubleshooting

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| Audit hangs at "running" | SSRF timeout or AbortController not firing | Check Azure logs for timeout errors; verify `TOTAL_AUDIT_TIMEOUT_MS` env var |
| Audit returns empty findings | Site is behind Cloudflare challenge or 403 IP block | Expect P0 "crawl-blocked" finding; client must allowlist ALAI crawler UA or IP |
| "Response body exceeded 2 MB cap" error | Large home page or sitemap | Expected behavior; emit P1 finding "home page too large" |
| workspace.json corruption | Concurrent writes from multiple Azure instances | Restart App Service, restore from `/home/data/workspace.json.backup-<timestamp>` if present |
| Report contains forbidden claim words | Generator failed to catch; regex bypass | Report to John; update `forbiddenClaimWords` regex in `generator.ts` and `runner.ts` |

---

## Google Integration (Deferred)

**Status:** NOT IMPLEMENTED  
**Scope:** MC #102806 (B1 from `REAL-AUDIT-ENGINE-PLAN-2026-06-02.md`)  
**Requirements:** Google Cloud OAuth client ID + secret, consent screen approval, token store (file or Postgres)  
**Blocked until:** CEO provides/approves Google Cloud project + OAuth credentials

The current live crawl audit does **NOT** fetch Google Search Console impressions/clicks/queries or Google Analytics (GA4) page views/conversions. The `searchConsoleStatus` and `analyticsStatus` fields in the intake form are **metadata-only**: they record the client's access status but do not connect to Google APIs.

When Google integration is implemented (follow-on MC), the audit will:

- Fetch impressions/clicks/queries from Search Console (last 90 days)
- Fetch page views/conversions from GA4 (last 90 days)
- Emit P0 findings if indexing errors are detected (e.g., "Discovered - currently not indexed")
- Emit P1 findings if query CTR < 2% for top-impression queries

The no-ranking-guarantee disclaimer will be updated to: "This report includes Google Search Console and Analytics data. Findings reflect historical performance only. **We do not guarantee future ranking, traffic volume, or conversion outcomes.**"

---

## Technical Decisions Log

### CEO decisions (2026-06-02, "everything recommended, go")

| Decision | Rationale | Known limit | Follow-on |
|----------|-----------|-------------|-----------|
| **(a) File backend** | Deterministic, testable, works for single-operator phase | Last write wins on concurrent access | Postgres migration MC |
| **(b) Sync Server Action** | MVP path, fits Azure 230s request ceiling | Max 45s total for 3 fetches; concurrent operators share slots | Async job queue MC |
| **(c) Pure TS + cheerio** | Lea Verou panel feedback: regex = hard no; cheerio handles broken HTML | None | None |
| **(d) Existing audit detail route** | Reuse `/clients/[clientId]/audits/[auditId]`; no new route | None | None |
| **(e) Max one live audit in-flight per client** | Enforced in `runLiveCrawlForClient()` | If operator triggers two audits rapidly, second is rejected | Queue or parallel-audit MC |
| **(f) 403/CF challenge → P0 finding** | Caller detects HTTP status, emits P0 "crawl-blocked" | No retry logic | Follow-on MC if retry needed |

### Correctness over Python parity

The TS implementation **fixes bugs** present in the Python reference (`run-basic-seo-audit.py`):

1. **Charset detection**: Python defaults to UTF-8 without checking `Content-Type` or `<meta charset>`; TS uses `TextDecoder` with sniffing
2. **og:image relative URL**: Python omits og:image entirely; TS detects relative URLs and emits P2 finding
3. **sitemapindex nesting**: Python silently ignores `<sitemapindex>`; TS detects and emits P1 finding
4. **Canonical vs final URL**: Python compares canonical against requested URL; TS compares against `response.url` (after redirects)

### Proveo verification outcome

All 3 child MCs (A1 #102801, A2 #102802, A3 #102803) were **independently verified by Proveo (Angie Jones)** after CodeCraft build:

- A1: type-check/build/validate EXIT 0, additive files intact, SSRF guard coverage confirmed
- A2: findings-to-backlog widening verified, evidence-URL done gate confirmed
- A3: **Bug caught in verification**: `forbiddenClaimWords` regex threw on `live_crawl` scope text ("ranking", "guaranteed"). CodeCraft fixed + added `validate:phase12` regression test. Proveo re-verified PASS.

Evidence: `/tmp/alai/996bd450/evidence-102800/verification.json`, `/tmp/alai/996bd450/evidence-102803/fix-verification.json`

---

## Open Items and Follow-on MCs

| Item | Priority | Description | Tracking |
|------|----------|-------------|----------|
| DNS-rebind SSRF guard | M | Runtime `dns.lookup` check before fetch (currently only literal IPs blocked) | Follow-on MC |
| Per-operator rate limiting | M | Prevent abuse: max 10 audits/hour per partner | Follow-on MC |
| Postgres migration | H | Replace file backend with Postgres for findings/backlog/audits | Follow-on MC |
| Async job queue | H | Move crawl to background worker (Redis/BullMQ) to unblock Server Action thread | Follow-on MC |
| Google Search Console integration | H (BLOCKED) | OAuth + impressions/clicks/queries (needs CEO-provided credentials) | MC #102806 |
| Google Analytics (GA4) integration | M (BLOCKED) | OAuth + page views/conversions (needs CEO-provided credentials) | MC #102806 |
| Playwright authenticated UAT | H | Browser through CF Access → run audit → verify findings (pending CEO login) | MC #102804 |
| Retry logic for 403/503 | L | Exponential backoff + retry on transient errors | Follow-on MC |
| Concurrent audit limit per partner | M | Allow 3 audits in-flight per partner (vs current 1 per client) | Follow-on MC |

---

## References

- **Plan:** `/Users/makinja/business/ALAI-Holding-AS/products/SEO-Readiness-Portal/REAL-AUDIT-ENGINE-PLAN-2026-06-02.md`
- **BUILD-BLUEPRINT:** `/Users/makinja/business/ALAI-Holding-AS/products/SEO-Readiness-Portal/BUILD-BLUEPRINT.md`
- **DEPLOY-MAP:** `/Users/makinja/business/ALAI-Holding-AS/products/SEO-Readiness-Portal/DEPLOY-MAP.md`
- **Evidence:** `/tmp/alai/996bd450/evidence-102800/verification.json` (A1/A2/A3 Proveo PASS), `/tmp/alai/996bd450/evidence-102820/verification.json` (deploy)
- **Python reference:** `~/business/ALAI-Holding-AS/sales/seo-automation/run-basic-seo-audit.py` (277 lines, public-URL crawl)
- **Validation script:** `scripts/validate-phase12.ts` (regression test for A3 fix)

---

**Last updated:** 2026-06-02  
**Owner:** Skillforge (docs) / CodeCraft (implementation) / Proveo (verification)  
**Status:** DEPLOYED to production, pending authenticated browser UAT (MC #102804)