Service Design Document

Project: {{PROJECT_NAME}}
Service: {{SERVICE_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Service Overview

Property	Value
Service name	`{{service-name}}`
Bounded context	`{{Domain / bounded context name}}`
Repository	`{{https://github.com/org/service-name}}`
Owner team	`{{Team Name}}`
On-call	`{{PagerDuty rotation / team contact}}`
Runbook	`{{https://wiki.domain.com/runbooks/service-name}}`
Tech stack	`{{Node.js 20 + NestJS + PostgreSQL + Redis}}`

Purpose:

TODO: 2-3 sentences. What does this service do? What business capability does it own? What is explicitly OUT of scope?

Bounded context: This service owns the {{DOMAIN}} domain. It is the single source of truth for {{entities owned}}. Other services must NOT directly access this service's database — they must call its API or subscribe to its events.

2. Service Responsibility & Ownership

This service IS responsible for:

{{Primary responsibility 1}}
{{Primary responsibility 2}}
{{Primary responsibility 3}}

This service is NOT responsible for:

{{Out-of-scope concern 1 — handled by service X}}
{{Out-of-scope concern 2}}

Data ownership:

Owns: {{users, user_profiles, user_preferences tables}}
Does NOT own: {{orders (belongs to order-service)}}

3. Interface Definition

3.1 REST API Endpoints

Method	Path	Description	Auth
`GET`	`/{{service}}/health`	Health check	None
`GET`	`/{{service}}/{{resource}}`	List `{{resources}}`	Bearer JWT
`GET`	`/{{service}}/{{resource}}/:id`	Get by ID	Bearer JWT
`POST`	`/{{service}}/{{resource}}`	Create	Bearer JWT
`PATCH`	`/{{service}}/{{resource}}/:id`	Update	Bearer JWT
`DELETE`	`/{{service}}/{{resource}}/:id`	Delete	Bearer JWT

Internal endpoints (service-to-service only, no external access):

Method	Path	Description	Auth
`GET`	`/internal/{{resource}}?ids={{id1,id2}}`	Bulk lookup by IDs	Service API key

Full API reference: See api-reference.md or {{OpenAPI URL}}

3.2 gRPC Service Definition (if applicable)

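// proto/{{service_name}}.proto
syntax = "proto3";

package {{service_name}};

service {{ServiceName}}Service {
  rpc Get{{Resource}} (Get{{Resource}}Request) returns ({{Resource}});
  rpc List{{Resources}} (List{{Resources}}Request) returns (List{{Resources}}Response);
  rpc Create{{Resource}} (Create{{Resource}}Request) returns ({{Resource}});
}

message {{Resource}} {
  string id = 1;
  string name = 2;
  string created_at = 3;  // timestamp as a string (format assumed: RFC 3339)
}

message Get{{Resource}}Request {
  string id = 1;
}

// Placeholder messages so every rpc above resolves; replace the fields with the real contract.
message List{{Resources}}Request {
  int32 page_size = 1;
  string page_token = 2;
}

message List{{Resources}}Response {
  repeated {{Resource}} items = 1;
  string next_page_token = 2;
}

message Create{{Resource}}Request {
  string name = 1;
}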

TODO: Remove or populate gRPC section based on actual communication protocol.

3.3 Events Published

Event Type	Trigger	Topic / Queue	Consumer(s)
`{{domain}}.{{entity}}.created`	Entity created	`{{topic-name}}`	`{{service-a, service-b}}`
`{{domain}}.{{entity}}.updated`	Entity updated	`{{topic-name}}`	`{{service-a}}`
`{{domain}}.{{entity}}.deleted`	Soft delete	`{{topic-name}}`	`{{service-b}}`

Example published event:

{
  "specversion": "1.0",
  "type": "{{domain}}.{{entity}}.created",
  "source": "{{service-name}}",
  "id": "evt_01HX7...",
  "time": "2024-01-15T10:30:00Z",
  "datacontenttype": "application/json",
  "data": {
    "id": "{{UUID}}",
    "{{field}}": "{{value}}"
  }
}

Full event schemas: See event-schema-documentation.md
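The envelope in the example above follows the CloudEvents 1.0 attribute names. A minimal sketch of a typed builder for it is shown below; the `CloudEvent` interface and the `buildEvent` helper are hypothetical names introduced for illustration, not part of any library, and the caller is assumed to supply a unique event ID.

```typescript
// Hypothetical type mirroring the fields of the example event above.
interface CloudEvent<T> {
  specversion: "1.0";
  type: string;
  source: string;
  id: string;
  time: string; // ISO 8601 / RFC 3339 timestamp
  datacontenttype: "application/json";
  data: T;
}

// Hypothetical helper: wraps a payload in the envelope.
// `id` is supplied by the caller (for example a UUID or ULID generator);
// `now` is injectable so the function is deterministic in tests.
function buildEvent<T>(
  type: string,
  source: string,
  data: T,
  id: string,
  now: Date = new Date(),
): CloudEvent<T> {
  return {
    specversion: "1.0",
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: "application/json",
    data,
  };
}
```

Keeping the envelope construction in one place makes it harder for publishers to drift from the documented schema.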

3.4 Events Consumed

Event Type	Source Service	Handler Action
`{{domain}}.{{entity}}.created`	`{{source-service}}`	`{{Action this service takes}}`
`{{domain}}.{{entity}}.deleted`	`{{source-service}}`	`{{Action — e.g., cascade delete}}`

Consumer group: {{service-name}}-consumer
Idempotency: All handlers are idempotent (processing a duplicate event produces the same result as processing it once).
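One common way to meet the idempotency requirement is to deduplicate on the event ID before running the handler. The sketch below assumes that approach; `IdempotentConsumer` and `IncomingEvent` are hypothetical names, and the in-memory `Set` stands in for what would normally be a durable store (for example a table with a unique constraint on the event ID).

```typescript
// Hypothetical shape of a consumed event; only `id` matters for deduplication.
interface IncomingEvent {
  id: string;
  type: string;
  data: unknown;
}

class IdempotentConsumer {
  // Assumption: a real service would persist processed IDs durably.
  private readonly processed = new Set<string>();
  private readonly handler: (event: IncomingEvent) => void;

  constructor(handler: (event: IncomingEvent) => void) {
    this.handler = handler;
  }

  // Returns true if the handler ran, false if the event was a duplicate.
  handle(event: IncomingEvent): boolean {
    if (this.processed.has(event.id)) {
      return false;
    }
    this.handler(event);
    // Mark as processed only after the handler returns without throwing,
    // so a failed attempt can be retried on redelivery.
    this.processed.add(event.id);
    return true;
  }
}
```

With a durable store, the ID insert and the handler's side effects would ideally share one transaction so a crash between them cannot cause a double apply.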

4. Database

4.1 Technology & Rationale

Property	Value
Database	`{{PostgreSQL 16}}`
ORM	`{{Prisma 5}}`
Rationale	`{{Why this DB was chosen}}`
Hosting	`{{AWS RDS / Supabase / Self-hosted}}`
Replication	`{{1 primary + 2 read replicas}}`
Backup	`{{Daily snapshot + WAL archiving}}`
Encryption	At rest and in transit

4.2 Schema Overview

erDiagram
    USERS {
        uuid id PK
        string email UK
        string name
        string role
        string status
        timestamp created_at
        timestamp updated_at
        timestamp deleted_at
    }

    USER_PROFILES {
        uuid id PK
        uuid user_id FK
        string avatar_url
        string bio
        jsonb settings
        timestamp updated_at
    }

    USER_SESSIONS {
        uuid id PK
        uuid user_id FK
        string refresh_token_hash
        string ip_address
        timestamp expires_at
        timestamp created_at
    }

    USERS ||--o| USER_PROFILES : has
    USERS ||--o{ USER_SESSIONS : has

TODO: Update schema to reflect actual tables. Add missing tables.

4.3 Data Ownership Boundaries

Read access: Any service may query via this service's API
Write access: ONLY this service writes to its tables
Direct DB access: FORBIDDEN for all other services

Cross-service data pattern:

Service A needs user name:
  → GET /users/:id via HTTP (NOT direct DB query)
  → Or subscribe to user.updated events and cache locally
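The second option above (subscribe and cache locally) can be sketched as a small read model kept up to date by event handlers. The class and event shapes below are assumptions for illustration; the field names are not taken from a real schema.

```typescript
// Hypothetical payload of a user.updated event.
interface UserUpdated {
  id: string;
  name: string;
}

// Local, read-only copy of user names maintained by the consuming service.
class UserNameCache {
  private readonly names = new Map<string, string>();

  // Called by the user.updated (and user.created) event handler.
  onUserUpdated(event: UserUpdated): void {
    this.names.set(event.id, event.name);
  }

  // Called by the user.deleted event handler.
  onUserDeleted(id: string): void {
    this.names.delete(id);
  }

  // Returns undefined on a cache miss; the caller would then fall back
  // to GET /users/:id rather than querying the owning service's database.
  get(id: string): string | undefined {
    return this.names.get(id);
  }
}
```

The cache is eventually consistent, so it suits display data such as names better than data that drives authorization decisions.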

5. Dependencies

5.1 Upstream Services (Services This Depends On)

Service	Purpose	Criticality	Fallback
`{{auth-service}}`	JWT validation	Critical	Cache valid tokens 5 min
`{{notification-service}}`	Send emails	Non-critical	Queue for retry
`{{EXTERNAL_API}}`	`{{Purpose}}`	`{{Critical/Non-critical}}`	`{{Fallback strategy}}`
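The "Cache valid tokens 5 min" fallback in the table above could be implemented with a small time-to-live cache. This is a sketch under assumptions: `TtlCache` is a hypothetical helper, the cache key would be a hash of the token rather than the raw token, and the clock is passed in so expiry can be tested.

```typescript
// Hypothetical in-memory cache whose entries expire after a fixed TTL.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, nowMs: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key: string, nowMs: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      return undefined;
    }
    if (nowMs >= entry.expiresAt) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.value;
  }
}

// Five minutes, matching the fallback window in the table.
const TOKEN_CACHE_TTL_MS = 5 * 60 * 1000;
```

The cached entry should not outlive the token's own expiry, and a revoked token can remain accepted for up to the TTL, which is the trade-off this fallback accepts.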

5.2 Downstream Services (Services That Depend On This)

Service	How it uses this service	Impact if this service is down
`{{order-service}}`	Validate user exists before creating order	Cannot create orders
`{{notification-service}}`	Resolve user email for delivery	Cannot send user notifications

5.3 External APIs & Third-Party

Service	Purpose	Rate Limit	Credentials
`{{SendGrid}}`	Transactional email	100 req/s	Vault: `sendgrid/api-key`
`{{Stripe}}`	Payment processing	—	Vault: `stripe/secret-key`

5.4 Dependency Diagram

graph LR
    %% Arrows point from the provider to the consumer of a capability
    ThisService["{{service-name}}"]

    subgraph "Upstream (this service depends on)"
        AuthService["auth-service"]
        ExternalAPI["external-api"]
    end

    subgraph "Downstream (depend on this service)"
        OrderService["order-service"]
        NotifService["notification-service"]
    end

    AuthService --> ThisService
    ExternalAPI --> ThisService
    ThisService --> OrderService
    ThisService --> NotifService

6. Deployment Configuration

Property	Dev	Staging	Production
Replicas	1	2	`{{min: 3, max: 10}}`
CPU request	100m	250m	500m
CPU limit	500m	1000m	2000m
Memory request	128Mi	256Mi	512Mi
Memory limit	512Mi	1Gi	2Gi
Port	4000	4000	4000

Kubernetes manifest location: {{k8s/service-name/}}
Helm chart: {{charts/service-name/}}
Docker image: {{registry.domain.com/service-name}}

7. Scaling Strategy

Dimension	Strategy and trigger	Mode
Horizontal (replicas)	HPA: CPU > 70% OR RPS > 1000	Automatic
Vertical (resources)	VPA recommendations reviewed monthly	Manual
Database	Read replicas for SELECT queries	Manual
Cache	Move to Redis Cluster when memory use exceeds 10 GB	Manual

Stateless confirmation: This service stores NO session state in memory — safe to scale horizontally.
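To reason about how the HPA rule reacts to load, the Kubernetes Horizontal Pod Autoscaler computes, per metric, `ceil(currentReplicas × currentValue / targetValue)` and takes the largest proposal. The sketch below reproduces that arithmetic under simplifying assumptions: it ignores the controller's tolerance band and stabilization window, and it treats the RPS target as a per-pod average. The function name and parameter shapes are hypothetical.

```typescript
// One observed metric and its target, in the same unit (e.g. CPU % or RPS per pod).
interface MetricReading {
  current: number;
  target: number;
}

// Simplified HPA arithmetic: the largest per-metric proposal wins,
// then the result is clamped to the configured replica bounds.
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  minReplicas: number,
  maxReplicas: number,
): number {
  const proposals = metrics.map((m) =>
    Math.ceil(currentReplicas * (m.current / m.target)),
  );
  // With no metrics, Math.max() yields -Infinity and the clamp returns minReplicas.
  const desired = Math.max(...proposals);
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}
```

For example, with 3 replicas at 90% CPU against a 70% target and 1,500 RPS per pod against a 1,000 target, the proposals are 4 and 5, so the result within bounds of 3 to 10 is 5.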

8. Health Check & Readiness Probes

livenessProbe:
  httpGet:
    path: /health/live
    port: 4000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 4000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 4000
  failureThreshold: 30
  periodSeconds: 10
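With the values above, the startup probe gives the container up to 30 × 10 s = 300 s to start before it is restarted. What each endpoint should check is a design decision; the sketch below shows one common split, where liveness checks only the process and readiness also checks dependencies. The `HealthState` fields and the `probeStatus` function are hypothetical.

```typescript
// Hypothetical snapshot of what the service knows about itself.
interface HealthState {
  started: boolean; // initialization (migrations, cache warm-up) finished
  dbReachable: boolean;
  cacheReachable: boolean;
}

// Maps a probe path to the HTTP status code the service would return.
function probeStatus(path: string, state: HealthState): number {
  switch (path) {
    case "/health/live":
      // Liveness: the process can answer at all. No dependency checks,
      // so a database outage does not trigger pod restarts.
      return 200;
    case "/health/startup":
      return state.started ? 200 : 503;
    case "/health/ready":
      // Readiness: only receive traffic when dependencies are usable.
      return state.started && state.dbReachable && state.cacheReachable
        ? 200
        : 503;
    default:
      return 404;
  }
}
```

Keeping dependency checks out of the liveness endpoint avoids restart loops when a shared dependency is down; the pod is simply removed from load balancing until readiness recovers.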

9. SLA Commitments

Metric	Target	Measurement Window
Availability	99.9% (about 43 min of downtime per 30 days, or 8.76 h/year)	Rolling 30 days
P50 response time	< 50ms	1 hour
P95 response time	< 200ms	1 hour
P99 response time	< 500ms	1 hour
Error rate (5xx)	< 0.1%	1 hour

SLA breach escalation: Alert → PagerDuty page to {{on-call rotation}} → incident declared as soon as an SLA target is at risk of being breached.
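The downtime an availability target allows follows directly from the measurement window: allowed downtime = window length × (1 − availability). A minimal sketch of that arithmetic, with a hypothetical function name:

```typescript
// Allowed downtime in minutes for a given availability target (in percent)
// over a measurement window expressed in days.
function allowedDowntimeMinutes(availabilityPct: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - availabilityPct / 100);
}
```

For the 99.9% target this gives roughly 43.2 minutes per rolling 30 days and roughly 525.6 minutes (8.76 hours) per 365 days.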

10. Monitoring & Alerting Rules

Metric	Threshold	Alert Severity	Channel
Error rate (5xx)	> 1% for 5 min	P1	PagerDuty
P99 latency	> 1s for 5 min	P2	Slack `#alerts`
CPU utilization	> 85% for 10 min	P3	Slack `#alerts`
Memory utilization	> 80%	P3	Slack `#alerts`
DB connection pool	> 80%	P2	PagerDuty
Queue depth	> 10,000 items	P2	Slack `#alerts`

Dashboard: {{https://monitoring.domain.com/dashboards/service-name}}
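Rules such as "> 1% for 5 min" fire only when the condition holds for the whole duration, not on a single spike. The sketch below illustrates that evaluation; it is not how any particular monitoring system implements it. It assumes samples are collected at a regular interval, and the `Sample` type and `breachedFor` function are hypothetical.

```typescript
// One metric observation.
interface Sample {
  timestampMs: number;
  value: number;
}

// True when every sample in the last `durationMs` exceeds `threshold`
// and the data reaches back far enough to cover the whole window.
function breachedFor(
  samples: Sample[],
  threshold: number,
  durationMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - durationMs;
  const inWindow = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
  );
  // Without a sample at or before the window start we cannot claim
  // the condition held for the full duration.
  const coversWindow = samples.some((s) => s.timestampMs <= windowStart);
  return (
    coversWindow &&
    inWindow.length > 0 &&
    inWindow.every((s) => s.value > threshold)
  );
}
```

Requiring the full window keeps brief spikes from paging anyone, at the cost of a detection delay equal to the rule's duration.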

11. Runbook Reference

Runbook location: {{https://wiki.domain.com/runbooks/service-name}}

Quick reference for common incidents:

Incident	Initial Response
High error rate	Check logs → identify error pattern → scale up if OOM
High latency	Check DB slow query log → check Redis hit rate → check upstream dependency
Pod crash loop	Check OOMKilled → check logs → check health probe thresholds
DB connection exhaustion	Check pool config → check idle connections → terminate idle or leaked connections

Approval

Role	Name	Date	Signature
Author
Service Owner
Architect
SRE Lead
