Service Design Document

Project: {{PROJECT_NAME}}
Service: {{SERVICE_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |

1. Service Overview

| Property | Value |
| --- | --- |
| Service name | {{service-name}} |
| Bounded context | {{Domain / bounded context name}} |
| Repository | {{https://github.com/org/service-name}} |
| Owner team | {{Team Name}} |
| On-call | {{PagerDuty rotation / team contact}} |
| Runbook | {{https://wiki.domain.com/runbooks/service-name}} |
| Tech stack | {{Node.js 20 + NestJS + PostgreSQL + Redis}} |

Purpose:

TODO: 2-3 sentences. What does this service do? What business capability does it own? What is explicitly OUT of scope?

Bounded context: This service owns the {{DOMAIN}} domain. It is the single source of truth for {{entities owned}}. Other services must NOT directly access this service's database — they must call its API or subscribe to its events.


2. Service Responsibility & Ownership

This service IS responsible for:

  • {{Primary responsibility 1}}
  • {{Primary responsibility 2}}
  • {{Primary responsibility 3}}

This service is NOT responsible for:

  • {{Out-of-scope concern 1 — handled by service X}}
  • {{Out-of-scope concern 2}}

Data ownership:

  • Owns: {{users, user_profiles, user_preferences tables}}
  • Does NOT own: {{orders (belongs to order-service)}}

3. Interface Definition

3.1 REST API Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /{{service}}/health | Health check | None |
| GET | /{{service}}/{{resource}} | List {{resources}} | Bearer JWT |
| GET | /{{service}}/{{resource}}/:id | Get by ID | Bearer JWT |
| POST | /{{service}}/{{resource}} | Create | Bearer JWT |
| PATCH | /{{service}}/{{resource}}/:id | Update | Bearer JWT |
| DELETE | /{{service}}/{{resource}}/:id | Delete | Bearer JWT |

Internal endpoints (service-to-service only, no external access):

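| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /internal/{{resource}}/:id | Lookup by ID (a bulk lookup would need a separate path that accepts a list of IDs) | Service API key |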

Full API reference: See api-reference.md or {{OpenAPI URL}}


3.2 gRPC Service Definition (if applicable)

// proto/{{service_name}}.proto
syntax = "proto3";

package {{service_name}};

service {{ServiceName}}Service {
  rpc Get{{Resource}} (Get{{Resource}}Request) returns ({{Resource}});
  rpc List{{Resources}} (List{{Resources}}Request) returns (List{{Resources}}Response);
  rpc Create{{Resource}} (Create{{Resource}}Request) returns ({{Resource}});
}

message {{Resource}} {
  string id = 1;
  string name = 2;
  string created_at = 3;
}

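message Get{{Resource}}Request {
  string id = 1;
}

// The messages below are referenced by the service definition above.
// Their fields are placeholders; adapt them to the actual resource.
message List{{Resources}}Request {
  int32 page_size = 1;
  string page_token = 2;
}

message List{{Resources}}Response {
  repeated {{Resource}} items = 1;
  string next_page_token = 2;
}

message Create{{Resource}}Request {
  string name = 1;
}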

TODO: Remove or populate gRPC section based on actual communication protocol.


3.3 Events Published

| Event Type | Trigger | Topic / Queue | Consumer(s) |
| --- | --- | --- | --- |
| {{domain}}.{{entity}}.created | Entity created | {{topic-name}} | {{service-a, service-b}} |
| {{domain}}.{{entity}}.updated | Entity updated | {{topic-name}} | {{service-a}} |
| {{domain}}.{{entity}}.deleted | Soft delete | {{topic-name}} | {{service-b}} |

Example published event:

{
  "specversion": "1.0",
  "type": "{{domain}}.{{entity}}.created",
  "source": "{{service-name}}",
  "id": "evt_01HX7...",
  "time": "2024-01-15T10:30:00Z",
  "datacontenttype": "application/json",
  "data": {
    "id": "{{UUID}}",
    "{{field}}": "{{value}}"
  }
}

Full event schemas: See event-schema-documentation.md
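
The envelope in the example follows the CloudEvents 1.0 attribute names. A minimal sketch of a typed builder for it is shown here; the `CloudEvent` interface and `buildEvent` helper are hypothetical names for illustration, not part of any library, and the event ID is passed in because the ID scheme is not specified by this template.

```typescript
// Envelope shape mirroring the fields of the example published event.
interface CloudEvent<T> {
  specversion: "1.0";
  type: string;
  source: string;
  id: string;
  time: string; // ISO 8601 / RFC 3339 timestamp
  datacontenttype: "application/json";
  data: T;
}

// Hypothetical helper: the id and the clock are injected so the function
// stays deterministic and easy to test.
function buildEvent<T>(
  type: string,
  source: string,
  data: T,
  id: string,
  now: Date = new Date(),
): CloudEvent<T> {
  return {
    specversion: "1.0",
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: "application/json",
    data,
  };
}
```

Typing `data` per event type lets the compiler catch payload drift between the publisher and the documented schema.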


3.4 Events Consumed

| Event Type | Source Service | Handler Action |
| --- | --- | --- |
| {{domain}}.{{entity}}.created | {{source-service}} | {{Action this service takes}} |
| {{domain}}.{{entity}}.deleted | {{source-service}} | {{Action, e.g., cascade delete}} |

Consumer group: {{service-name}}-consumer

Idempotency: All handlers are idempotent (processing a duplicate event produces the same result as processing it once).
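
One common way to get idempotent handlers is to deduplicate on the event ID. The sketch below assumes that approach; `makeIdempotent` is a hypothetical wrapper, and the in-memory `Set` stands in for a durable store (for example a processed-events table) that a real consumer would need to survive restarts.

```typescript
type DomainEvent = { id: string; type: string; data: unknown };

// Wraps a handler so that each event id is applied at most once.
// Returns true when the event was processed, false when it was a duplicate.
function makeIdempotent(
  handler: (event: DomainEvent) => void,
  processed: Set<string> = new Set<string>(),
): (event: DomainEvent) => boolean {
  return (event: DomainEvent): boolean => {
    if (processed.has(event.id)) {
      return false; // duplicate delivery: skip side effects
    }
    handler(event);
    processed.add(event.id); // recorded only after the handler succeeded
    return true;
  };
}
```

Because the ID is recorded after the handler runs, a crash between the two steps leads to a retry rather than a lost event, so the wrapped handler should still tolerate being run twice.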


4. Database

4.1 Technology & Rationale

| Property | Value |
| --- | --- |
| Database | {{PostgreSQL 16}} |
| ORM | {{Prisma 5}} |
| Rationale | {{Why this DB was chosen}} |
| Hosting | {{AWS RDS / Supabase / Self-hosted}} |
| Replication | {{1 primary + 2 read replicas}} |
| Backup | {{Daily snapshot + WAL archiving}} |
| Encryption | At rest and in transit |

4.2 Schema Overview

erDiagram
    USERS {
        uuid id PK
        string email UK
        string name
        string role
        string status
        timestamp created_at
        timestamp updated_at
        timestamp deleted_at
    }

    USER_PROFILES {
        uuid id PK
        uuid user_id FK
        string avatar_url
        string bio
        jsonb settings
        timestamp updated_at
    }

    USER_SESSIONS {
        uuid id PK
        uuid user_id FK
        string refresh_token_hash
        string ip_address
        timestamp expires_at
        timestamp created_at
    }

    USERS ||--o| USER_PROFILES : has
    USERS ||--o{ USER_SESSIONS : has

TODO: Update schema to reflect actual tables. Add missing tables.


4.3 Data Ownership Boundaries

  • Read access: Any service may query via this service's API
  • Write access: ONLY this service writes to its tables
  • Direct DB access: FORBIDDEN for all other services

Cross-service data pattern:

Service A needs user name:
  → GET /users/:id via HTTP (NOT direct DB query)
  → Or subscribe to user.updated events and cache locally
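
The two options in the pattern can be combined: keep a local cache fed by `user.updated` events and fall back to the API on a miss. The following is a rough sketch under that assumption; `UserNameCache` and the injected `fetchUser` function are hypothetical, and the fetcher would wrap the `GET /users/:id` call in a real service.

```typescript
type User = { id: string; name: string };

class UserNameCache {
  private readonly names = new Map<string, string>();
  private readonly fetchUser: (id: string) => Promise<User>;

  constructor(fetchUser: (id: string) => Promise<User>) {
    this.fetchUser = fetchUser;
  }

  // Call this from the user.updated event handler.
  onUserUpdated(user: User): void {
    this.names.set(user.id, user.name);
  }

  // Synchronous cache lookup; undefined means "not cached yet".
  peek(id: string): string | undefined {
    return this.names.get(id);
  }

  // Serve from the cache; on a miss, ask the owning service and remember the answer.
  async getName(id: string): Promise<string> {
    const cached = this.names.get(id);
    if (cached !== undefined) {
      return cached;
    }
    const user = await this.fetchUser(id);
    this.names.set(user.id, user.name);
    return user.name;
  }
}
```

The cache is only eventually consistent with the owning service, so it suits display data such as names better than authorization decisions.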

5. Dependencies

5.1 Upstream Services (Services This Depends On)

| Service | Purpose | Criticality | Fallback |
| --- | --- | --- | --- |
| {{auth-service}} | JWT validation | Critical | Cache valid tokens 5 min |
| {{notification-service}} | Send emails | Non-critical | Queue for retry |
| {{EXTERNAL_API}} | {{Purpose}} | {{Critical/Non-critical}} | {{Fallback strategy}} |
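
The "cache valid tokens 5 min" fallback for the auth dependency amounts to a time-to-live cache. A minimal sketch follows; `TtlCache` is a hypothetical class, and the injectable clock exists only to make expiry testable.

```typescript
// Caches values for a fixed time-to-live; expired entries are dropped on read.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      return undefined;
    }
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: caller must re-validate upstream
      return undefined;
    }
    return entry.value;
  }
}

// Five minutes, matching the fallback column.
const TOKEN_CACHE_TTL_MS = 5 * 60 * 1000;
```

Two caveats worth recording alongside this fallback: a revoked token can remain accepted for up to the TTL, and a cached entry should not outlive the token's own expiry. Keying the cache by a hash of the token rather than the raw token avoids holding credentials in memory.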

5.2 Downstream Services (Services That Depend On This)

| Service | How it uses this service | Impact if this service is down |
| --- | --- | --- |
| {{order-service}} | Validate user exists before creating order | Cannot create orders |
| {{notification-service}} | Resolve user email for delivery | Cannot send user notifications |

5.3 External APIs & Third-Party

| Service | Purpose | Rate Limit | Credentials |
| --- | --- | --- | --- |
| {{SendGrid}} | Transactional email | 100 req/s | Vault: sendgrid/api-key |
| {{Stripe}} | Payment processing | {{Rate limit}} | Vault: stripe/secret-key |

5.4 Dependency Diagram

graph LR
    ThisService["{{service-name}}"]

    subgraph "Upstream (depends on)"
        AuthService["auth-service"]
        ExternalAPI["external-api"]
    end

    subgraph "Downstream (depended on by)"
        OrderService["order-service"]
        NotifService["notification-service"]
    end

    AuthService --> ThisService
    ExternalAPI --> ThisService
    ThisService --> OrderService
    ThisService --> NotifService

6. Deployment Configuration

| Property | Dev | Staging | Production |
| --- | --- | --- | --- |
| Replicas | 1 | 2 | {{min: 3, max: 10}} |
| CPU request | 100m | 250m | 500m |
| CPU limit | 500m | 1000m | 2000m |
| Memory request | 128Mi | 256Mi | 512Mi |
| Memory limit | 512Mi | 1Gi | 2Gi |
| Port | 4000 | 4000 | 4000 |

Kubernetes manifest location: {{k8s/service-name/}}
Helm chart: {{charts/service-name/}}
Docker image: {{registry.domain.com/service-name}}


7. Scaling Strategy

| Dimension | Strategy | Trigger |
| --- | --- | --- |
| Horizontal (replicas) | HPA (automatic) | CPU > 70% OR RPS > 1000 |
| Vertical (resources) | VPA recommendations, applied manually | Monthly review |
| Database | Read replicas for SELECT queries | Manual |
| Cache | Redis Cluster | Manual, when data set exceeds 10 GB RAM |

Stateless confirmation: This service stores NO session state in memory — safe to scale horizontally.
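
For reference, the Kubernetes Horizontal Pod Autoscaler computes a proposal per metric as `ceil(currentReplicas × currentValue / targetValue)` and uses the largest one, bounded by the configured minimum and maximum. The sketch below reproduces that arithmetic with the targets from the table; it assumes the RPS figure is a per-pod average and ignores the HPA's tolerance band and stabilization window. The function name is hypothetical.

```typescript
type MetricReading = { current: number; target: number };

// Replica count the HPA formula would propose, clamped to [min, max].
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  min: number,
  max: number,
): number {
  // One proposal per metric; the largest proposal wins.
  const proposals = metrics.map((m) =>
    Math.ceil(currentReplicas * (m.current / m.target)),
  );
  const desired = Math.max(min, ...proposals);
  return Math.min(max, desired);
}

// 3 replicas at 90% CPU (target 70%) and 1500 RPS per pod (target 1000):
// CPU proposes ceil(3 * 90/70) = 4, RPS proposes ceil(3 * 1.5) = 5.
const example = desiredReplicas(
  3,
  [
    { current: 90, target: 70 },
    { current: 1500, target: 1000 },
  ],
  3,
  10,
);
```

With the production bounds of 3 to 10 replicas, `example` evaluates to 5.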


8. Health Check & Readiness Probes

livenessProbe:
  httpGet:
    path: /health/live
    port: 4000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 4000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 4000
  failureThreshold: 30
  periodSeconds: 10
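
A quick way to sanity-check these values is to multiply `periodSeconds` by `failureThreshold`, which approximates how long each probe may keep failing before Kubernetes acts. The sketch below does that arithmetic for the three probes; `ProbeConfig` and `failureBudgetSeconds` are hypothetical names, and the estimate ignores `timeoutSeconds` and scheduling jitter.

```typescript
interface ProbeConfig {
  initialDelaySeconds?: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a probe can keep failing before Kubernetes reacts.
function failureBudgetSeconds(probe: ProbeConfig): number {
  return probe.periodSeconds * probe.failureThreshold;
}

// Values copied from the probe definitions in this section.
const startupProbe: ProbeConfig = { periodSeconds: 10, failureThreshold: 30 };
const livenessProbe: ProbeConfig = { initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3 };
const readinessProbe: ProbeConfig = { initialDelaySeconds: 10, periodSeconds: 5, failureThreshold: 3 };

const startupBudget = failureBudgetSeconds(startupProbe); // 300 s allowed for startup
const livenessBudget = failureBudgetSeconds(livenessProbe); // 30 s before a restart
const readinessBudget = failureBudgetSeconds(readinessProbe); // 15 s before traffic is removed
```

Since liveness and readiness checks do not begin until the startup probe succeeds, the 300-second startup budget is the figure to compare against the service's slowest cold start.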

9. SLA Commitments

| Metric | Target | Measurement Window |
| --- | --- | --- |
| Availability | 99.9% (about 8.76 h downtime/year, or about 43 min per 30 days) | Rolling 30 days |
| P50 response time | < 50ms | 1 hour |
| P95 response time | < 200ms | 1 hour |
| P99 response time | < 500ms | 1 hour |
| Error rate (5xx) | < 0.1% | 1 hour |

SLA breach escalation: Alert → PagerDuty {{on-call rotation}} → Incident declared when the SLA is at risk of being breached.
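
The availability target translates directly into an error budget: the allowed downtime is the window length multiplied by `1 - availability`. A small sketch of that calculation, with a hypothetical function name:

```typescript
// Allowed downtime, in minutes, for a given availability target and window.
function errorBudgetMinutes(availabilityPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - availabilityPercent / 100);
}

// 99.9% over a rolling 30 days: 43,200 min * 0.001 = about 43.2 minutes.
const monthlyBudget = errorBudgetMinutes(99.9, 30);

// 99.9% over 365 days: 525,600 min * 0.001 = about 525.6 minutes (8.76 hours).
const yearlyBudgetHours = errorBudgetMinutes(99.9, 365) / 60;
```

Because the measurement window is a rolling 30 days, the roughly 43-minute figure is the one that determines when the "at risk" escalation should fire.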


10. Monitoring & Alerting Rules

| Metric | Threshold | Alert Severity | Channel |
| --- | --- | --- | --- |
| Error rate (5xx) | > 1% for 5 min | P1 | PagerDuty |
| P99 latency | > 1s for 5 min | P2 | Slack #alerts |
| CPU utilization | > 85% for 10 min | P3 | Slack #alerts |
| Memory utilization | > 80% | P3 | Slack #alerts |
| DB connection pool | > 80% | P2 | PagerDuty |
| Queue depth | > 10,000 items | P2 | Slack #alerts |

Dashboard: {{https://monitoring.domain.com/dashboards/service-name}}
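
Rules of the form "> 1% for 5 min" fire only when the condition holds for the whole window, not on a single spike. In practice the monitoring system evaluates this; the sketch below only illustrates the semantics. `sustainedBreach` is a hypothetical function, and `minSamples` is an assumed guard against firing on sparse data.

```typescript
type Sample = { timestampMs: number; value: number };

// True when the trailing window holds at least minSamples samples
// and every one of them is above the threshold.
function sustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
  minSamples: number,
): boolean {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  return recent.length >= minSamples && recent.every((s) => s.value > threshold);
}
```

For the 5xx rule with one sample per minute, a call would look like `sustainedBreach(samples, 1, 5 * 60 * 1000, Date.now(), 5)`, where each sample value is the error rate in percent.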


11. Runbook Reference

Runbook location: {{https://wiki.domain.com/runbooks/service-name}}

Quick reference for common incidents:

| Incident | Initial Response |
| --- | --- |
| High error rate | Check logs → identify error pattern → scale up if OOM |
| High latency | Check DB slow query log → check Redis hit rate → check upstream dependency |
| Pod crash loop | Check OOMKilled → check logs → check health probe thresholds |
| DB connection exhaustion | Check pool config → check idle connections → force disconnect |

Approval

| Role | Name | Date | Signature |
| --- | --- | --- | --- |
| Author | | | |
| Service Owner | | | |
| Architect | | | |
| SRE Lead | | | |