Bilko Backoffice — Support & Fix Loop Runbook
Scope: Operational runbook for ALAI staff handling live customer problems on app.bilko.cloud.
Environment: backend bilko-api-demo, frontend bilko-web-demo, database bilko-demo-db (GCP Cloud Run / Cloud SQL).
Related pages: BookStack 3100 (Backend MVP), BookStack 3104 (Prod Topology).
MC: #103327 (docs), parent #103322 (Support & Fix Loop feature).
1. Overview — The Support Loop
The support loop is the full chain from a customer-visible error to a closed ticket:
- Error occurs: The customer hits an accounting error in the browser (e.g. an invoice save fails, a VAT calculation returns 500).
- Toast notification: The frontend renders an error toast with the `errorCode` (e.g. `INFRA_001`) and a "Report this problem" CTA button. The toast carries the `requestId` and `errorCode` from the RFC 7807 ProblemDetail response body, not from response headers.
- SupportIntakeForm Dialog: Clicking the CTA opens a focus-trapped `role=alertdialog` Dialog (never embedded in the toast itself). The form pre-fills the `errorCode`, `requestId`, and a `contextBundle` (10 allowlisted fields: IDs and codes, no PII).
- Ticket submission: `POST /support/tickets` creates a row in `support_tickets` (V73 migration). The backend validates the `contextBundle` allowlist server-side before insert. On a duplicate `(org_id, request_id)` the API returns HTTP 409 and the form shows "A ticket for this error has already been filed."
- Admin queue: Staff open `/admin/support` (Admin Support Queue page) to see all open tickets, paginated 50 per page, filterable by status.
- Triage: Staff click a ticket to open the detail view (`/admin/support/{id}`), read the `contextBundle` and `customerDescription`, and join to the audit trail by `request_id`.
- Impersonate: From the detail view, staff start a time-limited impersonation session (read-only, reason pre-locked to `support:{ticketId}`) to reproduce the issue in the customer's org context. All actions during impersonation are audited.
- Fix: Staff apply a fix (DB correction, config change, user education) using safe data-correction procedures.
- Status transition: Staff call `PATCH /admin/support/tickets/{id}` to advance the ticket status. Every status change writes an `audit_log` row with the `request_id` threaded through.
- Close: Ticket moves to `RESOLVED` (then optionally `CLOSED`). Both require a `resolutionNote`.
Note: This is a human admin triage queue. There is no automated resolution. No AI agent writes to triage_json at MVP — that column is reserved for V2 AI triage scope.
2. Component Map
2.1 Sentry Capture — apps/api/.../plugins/Sentry.kt
- DSN guard: `Sentry.init` is only called when the `SENTRY_DSN` env var is non-blank. When absent (local dev, CI, Testcontainers), the SDK is a safe no-op: all `captureException` calls are silently discarded.
- Cloud Run metadata: `options.release = K_REVISION` (e.g. `bilko-api-00028-abc`), `options.serverName = K_SERVICE` (e.g. `bilko-api-demo`).
- beforeSend PII scrub: `event.request.data` is cleared (strips invoice fields, amounts, emails). `event.breadcrumbs` are cleared. Extra context is filtered to the allowlist: `errorCode`, `requestId`, `orgId`, `httpStatus`, `instancePath`.
- OCD-1: The `bilko-sentry-dsn` and `bilko-web-sentry-dsn` secrets are provisioned as empty strings in Secret Manager; CEO action is required to populate them. Until populated, Sentry is inert in all environments.
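The allowlist step of the `beforeSend` scrub can be illustrated with a minimal TypeScript sketch. The backend plugin itself is Kotlin and its code is not reproduced in this runbook, so the event shape `SketchEvent` and the function name `scrubEvent` are hypothetical; only the cleared fields and the five allowlisted keys come from the description above.

```typescript
// Hypothetical, simplified event shape; the real Sentry event type is richer.
interface SketchEvent {
  request?: { data?: unknown };
  breadcrumbs?: unknown[];
  extra?: Record<string, unknown>;
}

// Allowlisted extra-context keys, taken from the runbook text.
const EXTRA_ALLOWLIST: readonly string[] = [
  "errorCode",
  "requestId",
  "orgId",
  "httpStatus",
  "instancePath",
];

// Returns a scrubbed copy: request data dropped, breadcrumbs emptied,
// and extra context reduced to the allowlisted keys only.
function scrubEvent(event: SketchEvent): SketchEvent {
  const source = event.extra ?? {};
  const extra: Record<string, unknown> = {};
  for (const key of EXTRA_ALLOWLIST) {
    if (key in source) {
      extra[key] = source[key];
    }
  }
  const scrubbed: SketchEvent = { breadcrumbs: [], extra };
  if (event.request !== undefined) {
    scrubbed.request = {}; // request.data is intentionally not copied
  }
  return scrubbed;
}
```

Building the result from an explicit allowlist, rather than deleting known-bad keys, means a newly added extra field stays out of Sentry until someone deliberately allowlists it.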
2.2 StatusPages — apps/api/.../plugins/StatusPages.kt
- RFC 7807 ProblemDetail responses: All error responses emit `Content-Type: application/problem+json` with fields `type`, `title`, `status`, `detail`, `instance`, `errorCode`, `requestId`.
- Sentry fires in the `Throwable` catch-all only (line 237). Named handlers (`BadRequestException`, `ConflictException`, `UnauthorizedException`, etc.) do NOT call `captureException`, which prevents flooding Sentry with 4xx user-error noise. Ktor dispatches named handlers first (exact type or nearest supertype), so the `Throwable` handler only fires for genuine INFRA/unexpected exceptions.
- `requestId` in the response body comes from `call.callId` (CallId plugin: reads the `X-Request-ID` header, generates a UUID when absent). This is the single canonical source used consistently across StatusPages, AdminPortalRoutes, ImpersonationService, and SupportTicketRoutes. A mixed source (raw header vs `call.callId`) would produce two different `requestId` values for the same headerless request, breaking the join-by-requestId diagnostic chain.
- `requestId` is a BODY field: the backend does not emit it as a response header. The frontend must read it from the parsed JSON body, not from response headers.
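Because `requestId` and `errorCode` travel in the body, a client has to parse the ProblemDetail JSON before it can show them. The following is a minimal sketch of that step, not the actual frontend code: the `ProblemDetail` interface mirrors the field list above, while the helper name `extractSupportFields` is hypothetical.

```typescript
// Field names as listed for the RFC 7807 response body in this section.
interface ProblemDetail {
  type: string;
  title: string;
  status: number;
  detail: string;
  instance: string;
  errorCode: string;
  requestId: string;
}

type SupportFields = Pick<ProblemDetail, "errorCode" | "requestId">;

// Hypothetical helper: takes an already-parsed JSON body and returns the two
// fields the toast needs, or null when the body is not a usable ProblemDetail.
function extractSupportFields(body: unknown): SupportFields | null {
  if (typeof body !== "object" || body === null) {
    return null;
  }
  const record = body as Record<string, unknown>;
  const errorCode = record["errorCode"];
  const requestId = record["requestId"];
  if (typeof errorCode !== "string" || typeof requestId !== "string") {
    return null;
  }
  return { errorCode, requestId };
}
```

The helper never looks at response headers, which matches the rule that the body is the only source of `requestId`.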
2.3 V72 — audit_log.request_id
Migration: apps/api/src/main/resources/db/migration/V72__audit_log_request_id.sql
- Adds a nullable `TEXT` column `request_id` to `audit_log`. No NOT NULL constraint: background/internal audit actions have no HTTP request context.
- Partial index `idx_audit_log_request_id` on `(request_id) WHERE request_id IS NOT NULL` for correlation queries. Plain `CREATE INDEX` (not `CONCURRENTLY`): Flyway wraps migrations in a transaction, and `CONCURRENTLY` is prohibited inside a transaction block.
- Correlation only: `request_id` is a debuggability handle, NOT a tamper-evidence mechanism. The append-only guarantee comes from the `block_audit_mutation()` trigger (V51).
- No unique constraint on `(org_id, request_id)`: one HTTP request legitimately produces multiple audit rows (e.g. impersonation start + org update). The idempotency constraint lives on `support_tickets`, not `audit_log`.
2.4 V73 — support_tickets table
Migration: apps/api/src/main/resources/db/migration/V73__support_tickets.sql
| Column | Type | Notes |
|---|---|---|
| `id` | UUID PK | `DEFAULT gen_random_uuid()` |
| `org_id` | UUID NOT NULL | FK to `organizations.id`, CASCADE DELETE |
| `user_id` | UUID NOT NULL | FK to `users.id` |
| `error_code` | TEXT nullable | e.g. `INFRA_001` |
| `request_id` | TEXT nullable | Correlation ID from the originating failed request. NOT a FK to `audit_log.request_id`; join via equality. |
| `context_bundle` | JSONB NOT NULL | 10-key allowlist: `requestId`, `errorCode`, `httpStatus`, `instancePath`, `orgId`, `userId`, `appRoute`, `planTier`, `country`, `auditRef`. `CHECK jsonb_typeof = 'object'`. Server-side validated. |
| `customer_description` | TEXT nullable | Free text from the customer, min 10 chars (enforced in the frontend) |
| `status` | TEXT NOT NULL | `CHECK IN (OPEN, TRIAGED, IN_PROGRESS, RESOLVED, CLOSED)`. Default `OPEN`. |
| `triage_json` | JSONB nullable | V2 AI triage output. NULL = not yet triaged at MVP. |
| `created_at` | TIMESTAMPTZ NOT NULL | `DEFAULT now()` |
| `updated_at` | TIMESTAMPTZ NOT NULL | Auto-updated by trigger on BEFORE UPDATE. |
| `resolution_note` | TEXT nullable | Required for RESOLVED/CLOSED (enforced in route handler) |
| `external_ref` | TEXT nullable | V2 Zendesk/Linear sync reference. Blank at MVP. |
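The `context_bundle` key allowlist from the table above can be sketched as a small validation function. This is an illustration in TypeScript, not the Kotlin route handler: the function names are hypothetical, and the sketch assumes a bundle may carry any subset of the ten keys, which the runbook does not state either way.

```typescript
// The ten allowlisted keys, copied from the context_bundle row above.
const CONTEXT_BUNDLE_KEYS: readonly string[] = [
  "requestId",
  "errorCode",
  "httpStatus",
  "instancePath",
  "orgId",
  "userId",
  "appRoute",
  "planTier",
  "country",
  "auditRef",
];

// Returns every key that is not on the allowlist (empty array = clean).
function disallowedKeys(bundle: Record<string, unknown>): string[] {
  return Object.keys(bundle).filter((key) => !CONTEXT_BUNDLE_KEYS.includes(key));
}

// Mirrors the two checks described above: the value must be a JSON object
// (like the jsonb_typeof = 'object' CHECK) and must contain only allowlisted keys.
// Assumption: a subset of the ten keys is acceptable.
function isValidContextBundle(bundle: unknown): boolean {
  const isPlainObject =
    typeof bundle === "object" && bundle !== null && !Array.isArray(bundle);
  if (!isPlainObject) {
    return false;
  }
  return disallowedKeys(bundle as Record<string, unknown>).length === 0;
}
```

Value length and charset are deliberately not checked here, matching the current key-only enforcement.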
Idempotency: UNIQUE INDEX idx_support_tickets_org_request_id_unique ON support_tickets(org_id, request_id) WHERE request_id IS NOT NULL — one customer request produces at most one ticket. Returns HTTP 409 on violation.
Status transitions (enforced in route handler, not DB):
- OPEN → TRIAGED | CLOSED
- TRIAGED → IN_PROGRESS | CLOSED
- IN_PROGRESS → RESOLVED | CLOSED
- RESOLVED → CLOSED
- CLOSED → (terminal, no further transitions)
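The transition rules above form a small state machine. A minimal TypeScript sketch follows; the real enforcement lives in the Kotlin route handler, so the names `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical and only the transition table itself is taken from this section.

```typescript
type TicketStatus = "OPEN" | "TRIAGED" | "IN_PROGRESS" | "RESOLVED" | "CLOSED";

// Each status maps to the statuses it may move to; CLOSED is terminal.
const ALLOWED_TRANSITIONS: Record<TicketStatus, readonly TicketStatus[]> = {
  OPEN: ["TRIAGED", "CLOSED"],
  TRIAGED: ["IN_PROGRESS", "CLOSED"],
  IN_PROGRESS: ["RESOLVED", "CLOSED"],
  RESOLVED: ["CLOSED"],
  CLOSED: [],
};

// True when the move from `from` to `to` appears in the table above.
function canTransition(from: TicketStatus, to: TicketStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

For example, `canTransition("TRIAGED", "RESOLVED")` is false: a ticket has to pass through `IN_PROGRESS` before it can be resolved.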
RLS policies:
- `support_tickets_customer_insert`: `FOR INSERT WITH CHECK (org_id = current_setting('app.current_org_id', true)::uuid)`. Prevents any authenticated DB connection from inserting a ticket for another org.
- `support_tickets_customer_select`: customers see only their own org's tickets.
- `support_tickets_admin_all`: platform-admin bypass via `current_setting('app.is_platform_admin', true)::boolean = true`. Must be `SET LOCAL` per transaction: `SET LOCAL` is safe under pgBouncer transaction-mode pooling, whereas a session-level GUC would leak across pooled connections.
- No UPDATE/DELETE policy for customers: deny-by-default after submit. Tickets are never deleted.
2.5 SupportTicketRoutes — apps/api/.../routes/SupportTicketRoutes.kt
| Endpoint | Auth | Notes |
|---|---|---|
| `POST /support/tickets` | JWT (customer) | Creates ticket. `org_id`/`user_id` come from `BilkoPrincipal`, NOT from the request body. `contextBundle` is validated against the server-side allowlist. Returns `{ id, status: "OPEN" }`. |
| `GET /admin/support/tickets` | Platform admin | Paginated list. Query params: `limit` (1–100, default 50), `offset`, `status` (optional filter), `orgId` (optional filter). Returns `{ data, meta: { total, limit, offset } }`. |
| `GET /admin/support/tickets/{id}` | Platform admin | Single ticket detail. |
| `PATCH /admin/support/tickets/{id}` | Platform admin | Status transition + optional `resolutionNote`, `triageJson`, `externalRef`. Invalid transitions → HTTP 422. Every status change writes an `audit_log` row. |
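The list endpoint's query parameters can be illustrated with a small path builder. This is a sketch under stated assumptions: `buildTicketListPath` and `clampInt` are hypothetical names, and clamping out-of-range values on the client is an assumption, since the runbook does not say whether the server clamps or rejects them.

```typescript
interface TicketListQuery {
  limit?: number;
  offset?: number;
  status?: string;
  orgId?: string;
}

// Truncates to an integer and clamps into [min, max]; falls back when the
// value is missing or not a finite number.
function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (value === undefined || !Number.isFinite(value)) {
    return fallback;
  }
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

// Builds the request path for GET /admin/support/tickets using the documented
// parameter names and the documented limit range (1-100, default 50).
function buildTicketListPath(query: TicketListQuery = {}): string {
  const parts: string[] = [
    `limit=${clampInt(query.limit, 50, 1, 100)}`,
    `offset=${clampInt(query.offset, 0, 0, Number.MAX_SAFE_INTEGER)}`,
  ];
  if (query.status !== undefined) {
    parts.push(`status=${encodeURIComponent(query.status)}`);
  }
  if (query.orgId !== undefined) {
    parts.push(`orgId=${encodeURIComponent(query.orgId)}`);
  }
  return `/admin/support/tickets?${parts.join("&")}`;
}
```

The second page of open tickets at the default page size would be requested with `buildTicketListPath({ offset: 50, status: "OPEN" })`.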
2.6 Frontend — SupportIntakeForm
Source: apps/web/components/support/SupportIntakeForm.tsx
- Radix Dialog with `role="alertdialog"` and `aria-modal="true"`. Not embedded in a toast.
- Props: `{ open, onClose, prefill?: Partial<ContextBundle & { customerDescription? }> }`.
- Pre-fills `errorCode` and `requestId` as read-only display. The customer fills in `customerDescription` (min 10 chars).
- Assembles `contextBundle` using `buildContextBundle()` from `lib/api-support.ts`: typed, explicit field picks from `BilkoApiError` and the auth store; never spreads the raw error object.
- On HTTP 409: shows inline "A ticket for this error has already been filed." Does not silently retry.
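The "explicit field picks, never spread" rule can be illustrated as follows. This is not the real `buildContextBundle()`: the input shapes `SketchApiError` and `SketchAuthState`, the nullable value types, the mapping of `instance` to `instancePath`, and the `null` placeholder for `auditRef` are all assumptions made for the example.

```typescript
// The ten allowlisted keys; nullable values are an assumption of this sketch.
interface ContextBundle {
  requestId: string | null;
  errorCode: string | null;
  httpStatus: number | null;
  instancePath: string | null;
  orgId: string | null;
  userId: string | null;
  appRoute: string | null;
  planTier: string | null;
  country: string | null;
  auditRef: string | null;
}

// Hypothetical input shapes; both contain fields that must NOT reach the bundle.
interface SketchApiError {
  requestId?: string;
  errorCode?: string;
  status?: number;
  instance?: string;
  detail?: string; // may echo user input, so it is never picked
}
interface SketchAuthState {
  orgId?: string;
  userId?: string;
  planTier?: string;
  country?: string;
  email?: string; // PII, so it is never picked
}

// Picks each allowlisted field by name. No object spread is used, so extra
// fields on the inputs cannot leak into the result.
function buildContextBundleSketch(
  error: SketchApiError,
  auth: SketchAuthState,
  appRoute: string,
): ContextBundle {
  return {
    requestId: error.requestId ?? null,
    errorCode: error.errorCode ?? null,
    httpStatus: error.status ?? null,
    instancePath: error.instance ?? null,
    orgId: auth.orgId ?? null,
    userId: auth.userId ?? null,
    appRoute,
    planTier: auth.planTier ?? null,
    country: auth.country ?? null,
    auditRef: null, // the runbook does not say where auditRef comes from
  };
}
```

Writing `{ ...error, ...auth }` instead would copy `detail` and `email` into the bundle and fail the server-side allowlist check.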
2.7 Frontend — Admin Support Queue
Source: apps/web/app/(admin)/admin/support/page.tsx and [id]/page.tsx
- Queue page: DataTable with columns Ticket ID, Org, Error Code, Status (colored badge), Created, Actions. Status filter dropdown. Server-side pagination (limit/offset), 50 per page. Client-side page navigation.
- Detail page: full ticket fields, `contextBundle` parsed and rendered as text only (no `dangerouslySetInnerHTML`). Status transition controls with `resolutionNote` required for RESOLVED/CLOSED. Impersonation shortcut.
- Security boundary note: The admin layout's `platformAdmin` guard in `app/(admin)/admin/layout.tsx` is a client-side UX redirect only. The actual security boundary is the backend `AdminAuthPlugin.kt`, which enforces the `isPlatformAdmin` JWT claim on every request.
3. Operator Workflow
Step 1 — Find the ticket
Step 2 — Read the context bundle and audit trail
- Click a ticket to open the detail view at `/admin/support/{id}`.
- The `contextBundle` section shows all 10 allowlisted fields. Key fields for diagnosis:
  - `requestId`: the correlation handle. Use this to join to the audit trail and Cloud Logging.
  - `errorCode`: the error category.
  - `httpStatus`: HTTP status code of the failed request.
  - `appRoute`: the frontend route where the error occurred (e.g. `/invoices/new`).
  - `orgId`: the customer's organization UUID.
- Join to the audit trail using the `requestId`:
```sql
-- Diagnostic query: find audit rows for the failing request
SELECT
  id,
  created_at,
  portal_action,
  acting_user_id,
  payload,
  request_id
FROM audit_log
WHERE request_id = '<requestId from ticket>'
ORDER BY created_at ASC;
```
This query shows every audit event associated with the customer's specific failing HTTP request — impersonation starts, invoice mutations, status changes — in chronological order.
Step 3 — Triage and impersonate
- On the detail page, scroll to the Update Status section. Transition the ticket from `OPEN` to `TRIAGED` to signal the ticket is being investigated.
- To reproduce the issue in the customer's org context, click Impersonate Org in the yellow panel. The impersonation dialog:
  - Pre-fills `reason = support:{ticketId}`. This field is READ-ONLY and cannot be edited.
  - Resolves `orgId` from `ticket.orgId`, NOT the ticketId (the backend endpoint `POST /admin/orgs/{orgId}/impersonate` takes the org UUID).
  - Lets you choose a duration (15/30/60 minutes).
- Tab isolation warning: The impersonation token replaces the module-level access token (`_accessToken` in `lib/api.ts`). All tabs in this browser origin are affected for the duration of the session. A banner confirms "Impersonation active — all tabs in this origin are affected."
- All actions during impersonation are audited in `audit_log` with `reason = support:{ticketId}`.
Step 4 — Apply a fix
For data corrections, see Section 4 (Data-Correction Safety). For configuration or code fixes, follow the standard ALAI deployment pipeline (DEPLOY-MAP.md). After the fix is applied, end the impersonation session using the End Impersonation button. This atomically:
- Calls the backend `POST /admin/impersonate/end`
- Calls `setAccessToken(null)` to clear the module-level token
- Removes all three sessionStorage keys (`bilko_impersonation_token`, `bilko_impersonation_org`, `bilko_impersonation_expires`) in a single synchronous block
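The three teardown steps can be sketched as below. This is an illustration, not the code behind the End Impersonation button: `KeyValueStore`, `clearImpersonationState`, `endImpersonation`, and the injected `postEnd` callback are hypothetical, and the sketch simply propagates a backend failure because the runbook does not specify what happens in that case.

```typescript
// Minimal subset of the sessionStorage API this sketch needs.
interface KeyValueStore {
  removeItem(key: string): void;
}

// Storage keys named in the runbook.
const IMPERSONATION_KEYS: readonly string[] = [
  "bilko_impersonation_token",
  "bilko_impersonation_org",
  "bilko_impersonation_expires",
];

// Synchronous block: there is no await between clearing the token and removing
// the keys, so no other code can observe a half-cleared state.
function clearImpersonationState(
  setAccessToken: (token: string | null) => void,
  store: KeyValueStore,
): void {
  setAccessToken(null);
  for (const key of IMPERSONATION_KEYS) {
    store.removeItem(key);
  }
}

// postEnd is assumed to perform POST /admin/impersonate/end.
async function endImpersonation(
  postEnd: () => Promise<void>,
  setAccessToken: (token: string | null) => void,
  store: KeyValueStore,
): Promise<void> {
  await postEnd();
  clearImpersonationState(setAccessToken, store);
}
```

In a browser the `store` argument would be `sessionStorage`; passing it in keeps the function testable without a DOM.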
Step 5 — Transition and close
- Return to the detail view. Transition status to `IN_PROGRESS` while work is ongoing, then to `RESOLVED` once the issue is fixed. Per the transition table in Section 2.4, `RESOLVED` is reachable only from `IN_PROGRESS`.
- `RESOLVED` and `CLOSED` both require a `resolutionNote` (enforced by the API: HTTP 422 if blank). Write a brief note describing what was fixed.
- The PATCH call audits the status change in `audit_log` with the admin's `request_id` threaded through for correlation.
- The ticket moves to `CLOSED` as the terminal state. No further transitions are possible.
4. Data-Correction Safety
Impersonation scope
- Impersonation is RLS-scoped: the impersonation token sets the `app.current_org_id` GUC to the target org's UUID via `SET LOCAL` (pgBouncer transaction-mode safe).
- All DB operations during impersonation run under the target org's RLS policies, so the admin cannot access other orgs' data even if they craft a direct API call.
- All impersonation actions are audited. The reason field (`support:{ticketId}`) is locked and stored verbatim in `audit_log`.
Direct data corrections (raw SQL)
If a data correction requires direct SQL on the database (e.g. correcting a journal entry double-entry imbalance, fixing a corrupted exchange rate):
- Run the preflight script before any SQL on prod/demo: `bash scripts/ops/bilko-support-fix-preflight.sh <orgId> <ticketId>`. This script creates a point-in-time backup annotation, logs the operator identity and timestamp, and prints a confirmation token required for the correction script.
- Never run raw SQL on `bilko-demo-db` without the preflight backup pattern. Cloud SQL supports point-in-time recovery within a 7-day window.
- Double-entry corrections must preserve the ledger balance: every credit must have a matching debit. The platform enforces `NUMERIC(19,4)` for all monetary amounts.
- After any direct SQL correction, run a reconciliation query to verify the affected org's trial balance still balances.
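The balance rule (total debits equal total credits) can be checked before a correction is applied. The sketch below is an illustration only: the `LedgerLine` shape and both function names are hypothetical and do not reflect the actual journal schema. Amounts are handled as decimal strings scaled to four fractional digits, in line with `NUMERIC(19,4)`, and summed as `bigint` so no floating-point rounding is involved.

```typescript
// Hypothetical line shape; amount is a non-negative decimal string such as "60.50".
interface LedgerLine {
  side: "DEBIT" | "CREDIT";
  amount: string;
}

// Converts "60.50" to 605000n, i.e. the amount in units of 0.0001.
// Rejects anything with more than four fractional digits or a sign.
function toScaledUnits(amount: string): bigint {
  const match = /^(\d+)(?:\.(\d{1,4}))?$/.exec(amount);
  if (match === null) {
    throw new Error(`Unsupported amount: ${amount}`);
  }
  const whole = match[1] ?? "0";
  const fraction = (match[2] ?? "").padEnd(4, "0");
  return BigInt(whole + fraction);
}

// True when the sum of debits equals the sum of credits.
function isBalanced(lines: readonly LedgerLine[]): boolean {
  let net = BigInt(0);
  for (const line of lines) {
    const units = toScaledUnits(line.amount);
    net += line.side === "DEBIT" ? units : -units;
  }
  return net === BigInt(0);
}
```

A check like this covers only the lines of a single correction; the reconciliation query against the org's full trial balance is still required afterwards.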
5. Cloud Logging — Saved Views
Three saved log views are available in GCP Cloud Logging for the bilko-api-demo service:
| View Name | Purpose | Key Filter |
|---|---|---|
| `bilko-error-by-org` | All ERROR/CRITICAL log entries for a specific org, newest-first. Use when the customer provides their org UUID and you need to see all errors in context. | `resource.type="cloud_run_revision" severity>=ERROR jsonPayload.orgId="<orgId>"` |
| `bilko-request-trace` | All log entries for a specific `requestId` across all severity levels. Gives the complete request lifecycle: auth, route entry, DB queries, response. Use after reading the `requestId` from the support ticket's `contextBundle`. | `resource.type="cloud_run_revision" jsonPayload.requestId="<requestId>"` |
| `bilko-5xx-demo` | All 5xx responses from `bilko-api-demo` in the last 24 hours. Use for proactive triage: catch errors before customers report them. | `resource.type="cloud_run_revision" resource.labels.service_name="bilko-api-demo" httpRequest.status>=500` |
To use these views: GCP Console → Cloud Logging → Log Explorer → Saved Queries. Select the relevant view, substitute the variable in angle brackets with the value from the support ticket.
Correlating a support ticket to logs:
- Take the `requestId` from the ticket's `contextBundle` (or from the `request_id` field on the ticket row).
- Open the `bilko-request-trace` view and paste the `requestId` into the filter.
- You will see the full request lifecycle, including the Kotlin `[SupportTickets]` structured log line that confirms when the ticket was created.
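When the saved views are not at hand, the same filters can be assembled from a ticket's values. The helpers below are a minimal sketch: the function names are hypothetical, the filter text is copied from the table in this section, and the escaping of backslashes and double quotes is a defensive assumption for values pasted from untrusted sources.

```typescript
// Escapes a value for use inside a double-quoted filter string.
function quoteFilterValue(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

// Same filter as the bilko-request-trace view.
function requestTraceFilter(requestId: string): string {
  return `resource.type="cloud_run_revision" jsonPayload.requestId=${quoteFilterValue(requestId)}`;
}

// Same filter as the bilko-error-by-org view.
function errorByOrgFilter(orgId: string): string {
  return `resource.type="cloud_run_revision" severity>=ERROR jsonPayload.orgId=${quoteFilterValue(orgId)}`;
}
```

The resulting string can be pasted into the Log Explorer query box in place of the saved view.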
6. Known Gaps and Follow-ups
| Gap | Description | Tracking |
|---|---|---|
| Sentry DSN not provisioned | `bilko-sentry-dsn` and `bilko-web-sentry-dsn` secrets exist in Secret Manager but are empty. Sentry is inert (no-op) in all environments until the CEO provisions the Sentry project and populates the secrets. Until then, rely on Cloud Logging for error observability. | OCD-1 (open CEO decision) |
| Error-code taxonomy V2 | Most support tickets at MVP will carry generic codes (`INFRA_001`, `VAL_001`) because domain-specific codes (e.g. `BILKO-VAT-001`, `BILKO-FISK-001`) are not yet defined. Triage is partly blind until V2 error codes land. Staff should use the `appRoute` and `httpStatus` from the `contextBundle` alongside the generic code to narrow the domain. | MC #103333 (V2 error code taxonomy) |
| Backend hardening follow-on | Server-side `context_bundle` value length constraints (max ~256 chars per key, charset ASCII printable) are a follow-on hardening item. Current implementation enforces key allowlist but not value length. | MC #103338 (backend hardening) |
| Sentry `setTag('httpStatus')` minor follow-on | The `Throwable` catch-all in `StatusPages.kt` sets scope tags `requestId`, `orgId`, and `errorCode`. The `httpStatus` tag is not yet set. The Sentry `beforeSend` filter uses it to suppress 4xx events, but for the `INFRA_001` catch-all (always 500) this is a minor gap only. A one-line follow-on to add `scope.setTag("httpStatus", "500")` in the catch-all handler is tracked as a minor improvement. | Follow-on to MC #103323 |
| Admin portal flash (`middleware.ts`) | `middleware.ts` checks only for cookie presence (`bilko_refresh_token` or `bilko_auth` marker), not for the `platformAdmin` JWT claim. A non-admin authenticated user who navigates directly to `/admin/support` will see admin HTML briefly before the client-side `useEffect` redirect fires. The real security boundary is the backend: `AdminAuthPlugin.kt` enforces the claim on every API call. The UI flash is a UX gap open for CEO decision. | OCD (open CEO decision) |
| Customer-facing ticket status | Customers have no way to view the status of their submitted ticket or receive a notification when it is resolved. The backend `POST /support/tickets` returns the ticket ID; a customer-facing `GET /support/tickets` endpoint and notification flow are deferred to V2. | V2 scope |
Published by Skillforge (MC #103327). Complements BookStack 3100 (Backend MVP) and BookStack 3104 (Prod Topology). Last updated: 2026-06-11.