# Bilko Observability (GCP-native) 2026-06-10

## Status

LIVE and Proveo-verified as of 2026-06-10. GCP project **tribal-sign-487920-k0**, region **europe-north1**. MC #103329 (FlowForge implementation) + MC #103331 (Proveo independent verification). Parent MC #103328.

**Related:** [Bilko Sentinel — Tier-0 Self-Healing Agent 2026-06-10](/books/bilko-balkan-accounting-saas/page/bilko-sentinel-tier-0-self-healing-agent-2026-06-10)

## Environment Topology

The naming is confusing for legacy reasons, so read carefully:

<table id="bkmrk-logical-rolecloud-ru"> <thead> <tr><th>Logical role</th><th>Cloud Run services</th><th>Cloud SQL instance</th><th>URLs</th><th>Notes</th></tr> </thead> <tbody> <tr> <td>**PROD (customer trial)**</td> <td>`bilko-api-demo`, `bilko-web-demo`</td> <td>`bilko-demo-db`</td> <td>app.bilko.cloud / bilko-demo.alai.no</td> <td>Named "-demo" for legacy reasons. This is the **functionally live surface** — real customer self-serve trial traffic.</td> </tr> <tr> <td>**STAGE (internal CI/E2E)**</td> <td>`bilko-api-stage`, `bilko-web-stage`</td> <td>`bilko-staging-db`</td> <td>bilko-\*-stage.run.app</td> <td>Internal only. Used for CI validation and E2E test runs. Not customer-facing.</td> </tr> <tr> <td>**Reserved shell (dormant)**</td> <td>`bilko-web` (rev 00001)</td> <td>`bilko-db`</td> <td>N/A</td> <td>Dormant. Excluded from all alerting. Do not SLO-bind until activated.</td> </tr> </tbody></table>
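Because the "-demo" names are easy to misread, it may help to encode the table above as a lookup when scripting against these resources. This is a minimal sketch only; the function name `resources_for_env` and the environment keys `prod`, `stage`, and `reserved` are hypothetical and not part of any existing tooling.

```shell
# Hypothetical lookup: maps a logical environment to its Cloud Run services
# and Cloud SQL instance, exactly as listed in the topology table.
resources_for_env() {
  case "$1" in
    prod)     echo "bilko-api-demo bilko-web-demo bilko-demo-db" ;;
    stage)    echo "bilko-api-stage bilko-web-stage bilko-staging-db" ;;
    reserved) echo "bilko-web bilko-db" ;;
    *)        echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Prints the resources behind the customer-facing surface.
resources_for_env prod
```

Keeping the mapping in one place avoids hard-coding "-demo" names in scripts that are meant to target production.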

## Uptime Checks (4 active)

<table id="bkmrk-%23display-namehost-%2F-"> <thead> <tr><th>\#</th><th>Display name</th><th>Host / Path</th><th>Period</th><th>Regions</th><th>Env</th></tr> </thead> <tbody> <tr> <td>1</td> <td>Bilko Web Prod (app.bilko.cloud)</td> <td>app.bilko.cloud /</td> <td>60s</td> <td>EUROPE, USA\_VIRGINIA, ASIA\_PACIFIC</td> <td>prod</td> </tr> <tr> <td>2</td> <td>Bilko API Prod (app-api.bilko.cloud/api/v1/health)</td> <td>app-api.bilko.cloud /api/v1/health</td> <td>60s</td> <td>EUROPE, USA\_VIRGINIA, ASIA\_PACIFIC</td> <td>prod</td> </tr> <tr> <td>3</td> <td>Bilko Web Stage</td> <td>bilko-web-stage-dh4m46blja-lz.a.run.app /</td> <td>300s</td> <td>EUROPE, USA\_VIRGINIA, ASIA\_PACIFIC</td> <td>stage</td> </tr> <tr> <td>4</td> <td>Bilko API Stage</td> <td>bilko-api-stage-dh4m46blja-lz.a.run.app /api/v1/health</td> <td>300s</td> <td>EUROPE, USA\_VIRGINIA, ASIA\_PACIFIC</td> <td>stage</td> </tr> </tbody></table>

**Note on API health check:** A GET to `app-api.bilko.cloud` (the Cloudflare proxy) returns HTTP 405; this is expected. The underlying Cloud Run service returns 200. The uptime check therefore accepts both 200 and 405.
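The acceptance rule in the note above can be sketched as a small shell helper for manual probing. This is an illustration, not the uptime check's actual configuration; the function name `classify_health_status` is hypothetical.

```shell
# Hypothetical helper: classifies an HTTP status code the way the API
# uptime check does (200 from Cloud Run and 405 via the proxy both pass).
classify_health_status() {
  case "$1" in
    200|405) echo "healthy" ;;
    *)       echo "unexpected" ;;
  esac
}

# Manual probe (needs network access, so it is left commented out):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://app-api.bilko.cloud/api/v1/health)
# classify_health_status "$code"
```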

## Alert Policies (7 active, MC #103329)

<table id="bkmrk-policy-nameservices-"> <thead> <tr><th>Policy name</th><th>Services / instances</th><th>Threshold</th><th>Policy ID</th></tr> </thead> <tbody> <tr> <td>Bilko Prod — HTTP 5xx rate high on bilko-api-demo</td> <td>bilko-api-demo</td> <td>&gt;1% 5xx rate over 5-min window (ALIGN\_RATE, REDUCE\_SUM)</td> <td>11502345168057990272</td> </tr> <tr> <td>Bilko Prod — HTTP 5xx rate high on bilko-web-demo</td> <td>bilko-web-demo</td> <td>&gt;1% 5xx rate over 5-min window</td> <td>13840551641108771864</td> </tr> <tr> <td>Bilko Prod — Request latency P95 high on prod services</td> <td>bilko-api-demo, bilko-web-demo</td> <td>API P95 &gt;3000ms; Web P95 &gt;5000ms</td> <td>13840551641108772022</td> </tr> <tr> <td>Bilko Prod — Container restart/crash on prod services</td> <td>bilko-api-demo, bilko-web-demo</td> <td>starting-state instance count MEAN &gt;3 in 5-min window (crash-loop indicator)</td> <td>10038710534975650645</td> </tr> <tr> <td>Bilko — Cloud SQL CPU utilization high (prod + stage)</td> <td>bilko-demo-db, bilko-staging-db</td> <td>bilko-demo-db &gt;70% CPU for 5min; bilko-staging-db &gt;85% for 5min</td> <td>1002243302492516643</td> </tr> <tr> <td>Bilko Prod — Cloud SQL connections near max on bilko-demo-db</td> <td>bilko-demo-db</td> <td>num\_backends &gt;20 (80% of max 25 for db-f1-micro)</td> <td>606613461467816964</td> </tr> <tr> <td>Bilko Prod — Uptime check failed</td> <td>app.bilko.cloud, app-api.bilko.cloud</td> <td>REDUCE\_COUNT\_FALSE &gt;1 for 120s duration (2+ regions failing)</td> <td>8433909893104140357</td> </tr> </tbody></table>
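As a worked example of the ">1% 5xx rate" threshold in the table above, the comparison can be sketched with integer arithmetic. This assumes the rate is the share of 5xx responses among all requests in the window; the live policies compute it inside Cloud Monitoring, and the function name `check_5xx_rate` is hypothetical.

```shell
# Hypothetical helper: prints "breach" when 5xx responses exceed 1% of all
# requests in a window, otherwise "ok". Integer math avoids floats:
# errors/total > 1/100 is equivalent to errors*100 > total.
check_5xx_rate() {
  total="$1"
  errors="$2"
  if [ "$total" -gt 0 ] && [ $((errors * 100)) -gt "$total" ]; then
    echo "breach"
  else
    echo "ok"
  fi
}

check_5xx_rate 1000 10   # exactly 1%, not above the threshold
check_5xx_rate 1000 11   # 1.1%, above the threshold
```

Note that the comparison is strict, so a window sitting at exactly 1% does not fire.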

There is also one pre-existing legacy policy from MC #103245: **Bilko CIAM — High 429 rate on bilko-api-demo** (policy ID 4279915624784430014). It was kept unchanged and already had Slack and email channels attached.

## Notification Channels

<table id="bkmrk-channeltypegcp-chann"> <thead> <tr><th>Channel</th><th>Type</th><th>GCP channel ID</th><th>Attached to</th></tr> </thead> <tbody> <tr> <td>Slack **\#ceo** (ALAI workspace T0AELHU0E13)</td> <td>Slack (GCP-native OAuth)</td> <td>17620748118296880307</td> <td>All 7 MC#103329 policies + legacy CIAM policy</td> </tr> <tr> <td>**alem@alai.no**</td> <td>Email</td> <td>16578157527237754053</td> <td>All 7 MC#103329 policies</td> </tr> <tr> <td>**dev@alai.no**</td> <td>Email (pre-existing)</td> <td>2103834221134748174</td> <td>All 7 MC#103329 policies</td> </tr> </tbody></table>
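The Monitoring API refers to notification channels by full resource name rather than by bare ID. The sketch below expands the channel IDs from the table above into the standard `projects/PROJECT_ID/notificationChannels/CHANNEL_ID` form; the function name `channel_resource_name` is hypothetical.

```shell
# Hypothetical helper: prints the full resource name for a channel ID.
channel_resource_name() {
  printf 'projects/%s/notificationChannels/%s\n' "$1" "$2"
}

PROJECT="tribal-sign-487920-k0"
# Slack #ceo, alem@alai.no, dev@alai.no (IDs taken from the table above)
for id in 17620748118296880307 16578157527237754053 2103834221134748174; do
  channel_resource_name "$PROJECT" "$id"
done
```

These names are what appear in a policy's `notificationChannels` list when a policy is exported as JSON or YAML.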

## Dashboard

Display name: **Bilko Observability — Prod + Stage (MC #103329)**  
Dashboard ID: `070613fa-a0b6-41e1-8606-ccdf0e52a87a`  
[Open in GCP Console](https://console.cloud.google.com/monitoring/dashboards/custom/070613fa-a0b6-41e1-8606-ccdf0e52a87a?project=tribal-sign-487920-k0)

Dashboard tiles:

- Prod API Request Rate by response class
- Prod API Latency P50/P95/P99
- Prod Container Instance Count
- Prod DB CPU Utilization (bilko-demo-db)
- Prod DB Active Connections
- Uptime Check Pass Rate (prod web + api)
- Stage API Request Rate
- Stage API Latency P95
- Stage DB CPU Utilization (bilko-staging-db)

## Proveo Verification (End-to-End Alert Delivery)

Proveo (MC #103331) ran an independent end-to-end proof:

- Created a temporary uptime probe pointing at a non-existent URL guaranteed to return 404
- GCP confirmed REDUCE\_COUNT\_FALSE=3 (threshold breached) within ~90 seconds
- Slack #ceo received a native GCP alert message; incident ID `0.o8uwptg3xflh`, channel type confirmed as `channelType=slack`
- Email delivery structurally proven: GCP fires all attached channels from the same alert event; email channel is `enabled: true` and correctly attached
- Both test artifacts (probe + policy) deleted after verification; zero regression on prod services

**Verdict: PASS.**
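The threshold logic exercised by this test (more than one region reporting a failed check) can be sketched as follows. This is a simplified model of the `REDUCE_COUNT_FALSE > 1` condition, not the policy definition itself; the function name `uptime_verdict` and the `pass`/`fail` inputs are hypothetical.

```shell
# Hypothetical helper: counts failing regions from per-region results
# ("pass" or "fail") and prints "breach" when more than one region failed.
uptime_verdict() {
  failed=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failed=$((failed + 1))
    fi
  done
  if [ "$failed" -gt 1 ]; then
    echo "breach ($failed failing)"
  else
    echo "ok ($failed failing)"
  fi
}

uptime_verdict fail fail fail   # the Proveo scenario: all three regions fail
uptime_verdict pass fail pass   # a single failing region does not breach
```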

## IAM Note

No new IAM bindings were created. All setup used `gcloud monitoring` commands only. The existing `alai-cli-deployer` service account already held the Monitoring Admin role (`roles/monitoring.admin`).

## Tuning and Maintenance

### Adding or modifying an alert policy

```
# List all policies
gcloud monitoring policies list --project=tribal-sign-487920-k0

# Describe a specific policy (by ID)
gcloud monitoring policies describe POLICY_ID --project=tribal-sign-487920-k0

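# Update a threshold: export the policy, edit the file, then apply it
gcloud monitoring policies describe POLICY_ID --format=json --project=tribal-sign-487920-k0 > policy.json
gcloud monitoring policies update POLICY_ID --policy-from-file=policy.json --project=tribal-sign-487920-k0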

# Create a new policy from file
gcloud beta monitoring policies create --policy-from-file=new-policy.json --project=tribal-sign-487920-k0
```
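For the "create from file" case, the sketch below prints a minimal single-condition threshold policy in the Cloud Monitoring `AlertPolicy` JSON shape. It is an illustration under stated assumptions: the function name `render_threshold_policy`, the 300s duration and alignment period, and the `ALIGN_MEAN` aligner are placeholders and do not reflect the settings of any live Bilko policy.

```shell
# Hypothetical helper: prints a minimal threshold policy as JSON.
# Arguments: display name, monitoring filter, numeric threshold.
render_threshold_policy() {
  # Escape double quotes in the filter so it is valid inside a JSON string.
  filter=$(printf '%s' "$2" | sed 's/"/\\"/g')
  cat <<EOF
{
  "displayName": "$1",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "$1",
      "conditionThreshold": {
        "filter": "$filter",
        "comparison": "COMPARISON_GT",
        "thresholdValue": $3,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}

# Example (commented out so that running this block writes no files):
# render_threshold_policy "Example: SQL connections high" \
#   'metric.type="cloudsql.googleapis.com/database/postgresql/num_backends" AND resource.type="cloudsql_database"' \
#   20 > new-policy.json
```

Notification channels are omitted here; a policy created this way would still need channels attached before it notifies anyone.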

### Known threshold that may need raising

The `bilko-demo-db` SQL connections threshold (20 of 25) is 80% of the `db-f1-micro` limit of max\_connections=25. After a few weeks of baseline data, decide whether to move to a larger instance tier (which raises max\_connections) or to adjust this threshold. To check the current connection count:

```
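# Read the last 5 minutes of num_backends via the Monitoring API (timeSeries.list).
# The date syntax below is GNU date; adjust it on macOS/BSD.
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/tribal-sign-487920-k0/timeSeries" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/num_backends" AND resource.labels.database_id:"bilko-demo-db"' \
  --data-urlencode "interval.startTime=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"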
```
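The 80% rule behind the current threshold is simple arithmetic, sketched below so the threshold can be recomputed if max\_connections changes. The function name `connection_threshold` is hypothetical, and the second call uses an assumed limit of 100 purely for illustration; it is not the documented limit of any specific tier.

```shell
# Hypothetical helper: prints the alert threshold for a given
# max_connections and percentage, rounded down to a whole connection.
connection_threshold() {
  echo $(( $1 * $2 / 100 ))
}

connection_threshold 25 80    # current db-f1-micro setting: prints 20
connection_threshold 100 80   # assumed larger limit, for illustration: prints 80
```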

## Supersedes

This page supersedes `docs/infrastructure/MONITORING.md` v1.0 (2026-02-25), which described the Railway/Vercel/Express-era setup, in which Sentry and BetterStack were only planned. That file now carries a superseded header pointing here. See also [Bilko Sentinel — Tier-0 Self-Healing Agent 2026-06-10](/books/bilko-balkan-accounting-saas/page/bilko-sentinel-tier-0-self-healing-agent-2026-06-10) for the detection and diagnosis agent built on top of this observability layer. Discussion note: `docs/infrastructure/OBSERVABILITY-DISCUSSION-2026-06-09.md`.