Service Design Document
- Project: {{PROJECT_NAME}}
- Service: {{SERVICE_NAME}}
- Version: {{VERSION}}
- Date: {{DATE}}
- Author: {{AUTHOR}}
- Status: Draft | In Review | Approved
- Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Service Overview
| Property | Value |
|---|---|
| Service name | {{service-name}} |
| Bounded context | {{Domain / bounded context name}} |
| Repository | {{https://github.com/org/service-name}} |
| Owner team | {{Team Name}} |
| On-call | {{PagerDuty rotation / team contact}} |
| Runbook | {{https://wiki.domain.com/runbooks/service-name}} |
| Tech stack | {{Node.js 20 + NestJS + PostgreSQL + Redis}} |
Purpose:
TODO: 2-3 sentences. What does this service do? What business capability does it own? What is explicitly OUT of scope?
Bounded context:
This service owns the {{DOMAIN}} domain. It is the single source of truth for {{entities owned}}. Other services must NOT directly access this service's database — they must call its API or subscribe to its events.
2. Service Responsibility & Ownership
This service IS responsible for:
- {{Primary responsibility 1}}
- {{Primary responsibility 2}}
- {{Primary responsibility 3}}
This service is NOT responsible for:
- {{Out-of-scope concern 1 (handled by service X)}}
- {{Out-of-scope concern 2}}
Data ownership:
- Owns: {{users, user_profiles, user_preferences tables}}
- Does NOT own: {{orders (belongs to order-service)}}
3. Interface Definition
3.1 REST API Endpoints
| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /{{service}}/health | Health check | None |
| GET | /{{service}}/{{resource}} | List {{resources}} | Bearer JWT |
| GET | /{{service}}/{{resource}}/:id | Get by ID | Bearer JWT |
| POST | /{{service}}/{{resource}} | Create | Bearer JWT |
| PATCH | /{{service}}/{{resource}}/:id | Update | Bearer JWT |
| DELETE | /{{service}}/{{resource}}/:id | Delete | Bearer JWT |
Internal endpoints (service-to-service only, no external access):
| Method | Path | Description | Auth |
|---|---|---|---|
| GET | /internal/{{resource}}/:id | Bulk lookup by IDs | Service API key |
Full API reference: See api-reference.md or {{OpenAPI URL}}
3.2 gRPC Service Definition (if applicable)
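// proto/{{service_name}}.proto
syntax = "proto3";

package {{service_name}};

service {{ServiceName}}Service {
  rpc Get{{Resource}} (Get{{Resource}}Request) returns ({{Resource}});
  rpc List{{Resources}} (List{{Resources}}Request) returns (List{{Resources}}Response);
  rpc Create{{Resource}} (Create{{Resource}}Request) returns ({{Resource}});
}

message {{Resource}} {
  string id = 1;
  string name = 2;
  string created_at = 3;
}

message Get{{Resource}}Request {
  string id = 1;
}

// The messages below are referenced by the service definition above.
// Their fields are placeholders; replace them with the real request/response shape.
message List{{Resources}}Request {
  int32 page_size = 1;
  string page_token = 2;
}

message List{{Resources}}Response {
  repeated {{Resource}} items = 1;
  string next_page_token = 2;
}

message Create{{Resource}}Request {
  string name = 1;
}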
TODO: Remove or populate gRPC section based on actual communication protocol.
3.3 Events Published
| Event Type | Trigger | Topic / Queue | Consumer(s) |
|---|---|---|---|
| {{domain}}.{{entity}}.created | Entity created | {{topic-name}} | {{service-a, service-b}} |
| {{domain}}.{{entity}}.updated | Entity updated | {{topic-name}} | {{service-a}} |
| {{domain}}.{{entity}}.deleted | Soft delete | {{topic-name}} | {{service-b}} |
Example published event:
{
  "specversion": "1.0",
  "type": "{{domain}}.{{entity}}.created",
  "source": "{{service-name}}",
  "id": "evt_01HX7...",
  "time": "2024-01-15T10:30:00Z",
  "datacontenttype": "application/json",
  "data": {
    "id": "{{UUID}}",
    "{{field}}": "{{value}}"
  }
}
Full event schemas: See event-schema-documentation.md
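
The envelope in the example above can be captured as a type so that publishers cannot omit an attribute. The following is a minimal sketch, not the service's actual publisher code: `DomainEvent` and `buildEvent` are hypothetical names, and the attribute set is simply copied from the example event.

```typescript
// Hypothetical envelope type mirroring the attributes of the example event above.
interface DomainEvent<T> {
  specversion: "1.0";
  type: string; // e.g. "{{domain}}.{{entity}}.created"
  source: string; // name of the publishing service
  id: string; // unique per event; consumers can use it for de-duplication
  time: string; // timestamp in ISO 8601 / RFC 3339 form
  datacontenttype: "application/json";
  data: T;
}

// Builds an envelope. `id` and `now` are passed in so the function stays deterministic in tests.
function buildEvent<T>(
  type: string,
  source: string,
  id: string,
  data: T,
  now: Date = new Date(),
): DomainEvent<T> {
  return {
    specversion: "1.0",
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: "application/json",
    data,
  };
}
```

Injecting the event ID rather than generating it inside the function leaves the choice of ID scheme (the `evt_...` prefix in the example) to the caller.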
3.4 Events Consumed
| Event Type | Source Service | Handler Action |
|---|---|---|
| {{domain}}.{{entity}}.created | {{source-service}} | {{Action this service takes}} |
| {{domain}}.{{entity}}.deleted | {{source-service}} | {{Action, e.g., cascade delete}} |
Consumer group: {{service-name}}-consumer
Idempotency: All handlers are idempotent; processing a duplicate event leaves the same state as processing it once.
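
One common way to meet the idempotency requirement is to record the ID of every processed event and skip any ID seen before. The sketch below assumes events carry a unique `id` (as in the example envelope in section 3.3); the in-memory sets and the name `handleEntityDeleted` are illustrative stand-ins for a real persistence layer.

```typescript
interface IncomingEvent {
  id: string; // unique event ID from the envelope
  type: string;
  data: { id: string }; // ID of the affected entity
}

// In-memory stand-ins. A real service would persist both, ideally in one transaction,
// so that a crash cannot record the event ID without applying the change (or vice versa).
const processedEventIds = new Set<string>();
const deletedEntityIds = new Set<string>();

// Returns true if the event was applied, false if it was a duplicate and skipped.
function handleEntityDeleted(event: IncomingEvent): boolean {
  if (processedEventIds.has(event.id)) {
    return false; // duplicate delivery: state is already correct
  }
  deletedEntityIds.add(event.data.id);
  processedEventIds.add(event.id);
  return true;
}
```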
4. Database
4.1 Technology & Rationale
| Property | Value |
|---|---|
| Database | {{PostgreSQL 16}} |
| ORM | {{Prisma 5}} |
| Rationale | {{Why this DB was chosen}} |
| Hosting | {{AWS RDS / Supabase / Self-hosted}} |
| Replication | {{1 primary + 2 read replicas}} |
| Backup | {{Daily snapshot + WAL archiving}} |
| Encryption | At rest and in transit |
4.2 Schema Overview
erDiagram
    USERS {
        uuid id PK
        string email UK
        string name
        string role
        string status
        timestamp created_at
        timestamp updated_at
        timestamp deleted_at
    }
    USER_PROFILES {
        uuid id PK
        uuid user_id FK
        string avatar_url
        string bio
        jsonb settings
        timestamp updated_at
    }
    USER_SESSIONS {
        uuid id PK
        uuid user_id FK
        string refresh_token_hash
        string ip_address
        timestamp expires_at
        timestamp created_at
    }
    USERS ||--o| USER_PROFILES : has
    USERS ||--o{ USER_SESSIONS : has
TODO: Update schema to reflect actual tables. Add missing tables.
4.3 Data Ownership Boundaries
- Read access: Any service may query via this service's API
- Write access: ONLY this service writes to its tables
- Direct DB access: FORBIDDEN for all other services
Cross-service data pattern:
Service A needs user name:
→ GET /users/:id via HTTP (NOT direct DB query)
→ Or subscribe to user.updated events and cache locally
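
The second option (subscribe and cache locally) could look roughly like this on the consuming side. It is a sketch under assumptions: the event payload shape (`id`, `name`) and the class name `UserNameCache` are hypothetical, and on a cache miss the caller is expected to fall back to `GET /users/:id`.

```typescript
// Assumed payload of a user.updated event; the real schema lives in the event documentation.
interface UserUpdatedEvent {
  type: "user.updated";
  data: { id: string; name: string };
}

// Local read model kept by Service A. It never touches the owning service's database.
class UserNameCache {
  private readonly names = new Map<string, string>();

  // Called from the event subscription for every user.updated event.
  apply(event: UserUpdatedEvent): void {
    this.names.set(event.data.id, event.data.name);
  }

  // Returns undefined on a miss; the caller then falls back to GET /users/:id.
  get(userId: string): string | undefined {
    return this.names.get(userId);
  }
}
```

The trade-off is staleness: the cached name is only as fresh as the last event received, which is usually acceptable for display data and not for authorization decisions.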
5. Dependencies
5.1 Upstream Services (Services This Depends On)
| Service | Purpose | Criticality | Fallback |
|---|---|---|---|
| {{auth-service}} | JWT validation | Critical | Cache validated tokens for 5 min |
| {{notification-service}} | Send emails | Non-critical | Queue for retry |
| {{EXTERNAL_API}} | {{Purpose}} | {{Critical/Non-critical}} | {{Fallback strategy}} |
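
The token-caching fallback in the first row can be sketched as a small time-to-live cache. This is an illustration only: `TtlCache` is a hypothetical helper, and the clock is passed in as a parameter so expiry can be tested without waiting.

```typescript
// Minimal TTL cache: entries expire ttlMs after they were stored.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, nowMs: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key: string, nowMs: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (nowMs >= entry.expiresAt) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.value;
  }
}

// 5 minutes, matching the fallback column above.
const FIVE_MINUTES_MS = 5 * 60 * 1000;
```

Note the cost of this fallback: a token revoked at the auth service may still be accepted here for up to five minutes, so the TTL should be chosen with that window in mind.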
5.2 Downstream Services (Services That Depend On This)
| Service | How it uses this service | Impact if this service is down |
|---|---|---|
| {{order-service}} | Validate user exists before creating order | Cannot create orders |
| {{notification-service}} | Resolve user email for delivery | Cannot send user notifications |
5.3 External APIs & Third-Party
| Service | Purpose | Rate Limit | Credentials |
|---|---|---|---|
| {{SendGrid}} | Transactional email | 100 req/s | Vault: sendgrid/api-key |
| {{Stripe}} | Payment processing | Not specified | Vault: stripe/secret-key |
5.4 Dependency Diagram
graph LR
    %% Arrows point from the provider to the service that depends on it.
    ThisService["{{service-name}}"]
    subgraph "Upstream (depends on)"
        AuthService["auth-service"]
        ExternalAPI["external-api"]
    end
    subgraph "Downstream (depended on by)"
        OrderService["order-service"]
        NotifService["notification-service"]
    end
    AuthService --> ThisService
    ExternalAPI --> ThisService
    ThisService --> OrderService
    ThisService --> NotifService
6. Deployment Configuration
| Property | Dev | Staging | Production |
|---|---|---|---|
| Replicas | 1 | 2 | {{min: 3, max: 10}} |
| CPU request | 100m | 250m | 500m |
| CPU limit | 500m | 1000m | 2000m |
| Memory request | 128Mi | 256Mi | 512Mi |
| Memory limit | 512Mi | 1Gi | 2Gi |
| Port | 4000 | 4000 | 4000 |
Kubernetes manifest location: {{k8s/service-name/}}
Helm chart: {{charts/service-name/}}
Docker image: {{registry.domain.com/service-name}}
7. Scaling Strategy
| Dimension | Strategy and trigger | Mode |
|---|---|---|
| Horizontal (replicas) | HPA: CPU > 70% OR RPS > 1000 | Automatic |
| Vertical (resources) | VPA recommendations reviewed monthly | Manual |
| Database | Read replicas for SELECT queries | Manual |
| Cache | Redis Cluster when memory use exceeds 10 GB | Manual |
Stateless confirmation: This service stores NO session state in memory — safe to scale horizontally.
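
To make the horizontal scaling row concrete, the sketch below applies the replica formula documented for the Kubernetes HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, takes the largest proposal across metrics (which is how "CPU OR RPS" behaves), and clamps the result to the replica bounds from section 6. It is a simplification: the real controller also applies a tolerance band and stabilization windows, and treating the RPS target as a per-pod average is an assumption.

```typescript
interface MetricReading {
  current: number; // observed average per pod
  target: number; // configured target per pod
}

// Simplified HPA calculation: one proposal per metric, largest proposal wins,
// then clamp to [minReplicas, maxReplicas].
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  minReplicas: number,
  maxReplicas: number,
): number {
  const proposals = metrics.map((m) => Math.ceil(currentReplicas * (m.current / m.target)));
  const wanted = Math.max(minReplicas, ...proposals);
  return Math.min(maxReplicas, wanted);
}
```

For example, with 3 replicas, CPU at 90% against a 70% target, and 1500 RPS against a 1000 RPS target, the CPU metric proposes 4 replicas and the RPS metric proposes 5, so the result is 5.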
8. Health Check & Readiness Probes
livenessProbe:
  httpGet:
    path: /health/live
    port: 4000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 4000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health/startup
    port: 4000
  failureThreshold: 30
  periodSeconds: 10
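
With these values the startup probe tolerates up to 30 × 10 s = 300 s of startup time, and once the service is running, three consecutive liveness failures (about 30 s) trigger a restart. On the application side, the three paths might map to checks like the following. This is a framework-independent sketch: `HealthState` and `probe` are hypothetical names, and which dependencies gate readiness is an assumption to adapt (the tech stack in section 1 suggests PostgreSQL and Redis).

```typescript
interface HealthState {
  started: boolean; // initialization (config, migrations, warm-up) has finished
  dbConnected: boolean;
  cacheConnected: boolean;
}

interface ProbeResult {
  status: number; // HTTP status code the endpoint should return
  body: { status: string };
}

const OK: ProbeResult = { status: 200, body: { status: "ok" } };
const UNAVAILABLE: ProbeResult = { status: 503, body: { status: "unavailable" } };

// Maps a probe path to a result. Wire this into a controller or HTTP handler.
function probe(path: string, state: HealthState): ProbeResult {
  switch (path) {
    case "/health/live":
      // Liveness only proves the process can answer; it must not check dependencies,
      // otherwise a database outage would restart every pod.
      return OK;
    case "/health/startup":
      return state.started ? OK : UNAVAILABLE;
    case "/health/ready":
      // Readiness gates traffic: fail while a required dependency is unreachable.
      return state.started && state.dbConnected && state.cacheConnected ? OK : UNAVAILABLE;
    default:
      return { status: 404, body: { status: "not found" } };
  }
}
```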
9. SLA Commitments
| Metric | Target | Measurement Window |
|---|---|---|
| Availability | 99.9% (about 43 min downtime per 30 days, or 8.76 h/year) | Rolling 30 days |
| P50 response time | < 50ms | 1 hour |
| P95 response time | < 200ms | 1 hour |
| P99 response time | < 500ms | 1 hour |
| Error rate (5xx) | < 0.1% | 1 hour |
SLA breach escalation: Alert → PagerDuty {{on-call rotation}} → incident declared when the SLA is at risk of breach.
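
The downtime allowance behind an availability target follows directly from the measurement window. A small helper such as the hypothetical `allowedDowntimeMinutes` below makes the arithmetic explicit and can be reused when the target or window changes.

```typescript
// Downtime budget, in minutes, for a given availability target over a window of days.
function allowedDowntimeMinutes(availabilityPct: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - availabilityPct / 100);
}

// Fraction of the budget still unspent after the downtime observed so far (never below 0).
function remainingBudgetFraction(budgetMinutes: number, downtimeSoFarMinutes: number): number {
  return Math.max(0, (budgetMinutes - downtimeSoFarMinutes) / budgetMinutes);
}
```

For the 99.9% target this gives roughly 43.2 minutes per rolling 30 days, and 525.6 minutes (8.76 hours) over 365 days.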
10. Monitoring & Alerting Rules
| Metric | Threshold | Alert Severity | Channel |
|---|---|---|---|
| Error rate (5xx) | > 1% for 5 min | P1 | PagerDuty |
| P99 latency | > 1s for 5 min | P2 | Slack #alerts |
| CPU utilization | > 85% for 10 min | P3 | Slack #alerts |
| Memory utilization | > 80% | P3 | Slack #alerts |
| DB connection pool | > 80% | P2 | PagerDuty |
| Queue depth | > 10,000 items | P2 | Slack #alerts |
Dashboard: {{https://monitoring.domain.com/dashboards/service-name}}
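
Thresholds of the form "> X for N min" fire only when the condition holds continuously for the whole duration, which filters out short spikes. The semantics can be sketched as a small state machine. `SustainedThreshold` is a hypothetical name, and in practice this evaluation is done by the monitoring system's alert rules rather than by application code.

```typescript
// Fires only after the value has stayed above the threshold for at least durationMs.
class SustainedThreshold {
  private breachStartedAtMs: number | null = null;
  private readonly threshold: number;
  private readonly durationMs: number;

  constructor(threshold: number, durationMs: number) {
    this.threshold = threshold;
    this.durationMs = durationMs;
  }

  // Feed one sample per evaluation interval; returns true while the alert should fire.
  observe(value: number, nowMs: number): boolean {
    if (value <= this.threshold) {
      this.breachStartedAtMs = null; // condition cleared: reset the timer
      return false;
    }
    if (this.breachStartedAtMs === null) {
      this.breachStartedAtMs = nowMs; // first sample above the threshold
    }
    return nowMs - this.breachStartedAtMs >= this.durationMs;
  }
}
```

A rule for the first row of the table would be constructed as `new SustainedThreshold(1, 5 * 60 * 1000)`, with the error rate expressed in percent.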
11. Runbook Reference
Runbook location: {{https://wiki.domain.com/runbooks/service-name}}
Quick reference for common incidents:
| Incident | Initial Response |
|---|---|
| High error rate | Check logs → identify error pattern → scale up if OOM |
| High latency | Check DB slow query log → check Redis hit rate → check upstream dependency |
| Pod crash loop | Check OOMKilled → check logs → check health probe thresholds |
| DB connection exhaustion | Check pool config → check idle connections → force disconnect |
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Service Owner | | | |
| Architect | | | |
| SRE Lead | | | |