Integration Design Document
Project: {{PROJECT_NAME}}
Integration: {{INTEGRATION_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Integration Overview & Context
Integration Name: {{INTEGRATION_NAME}}
Type: Synchronous (REST/gRPC) | Asynchronous (Events/Queue) | Bidirectional | File-based
Business Purpose: {{WHY_THIS_INTEGRATION_EXISTS}}
Criticality: Critical | High | Medium | Low
- Impact if down: {{BUSINESS_IMPACT_IF_UNAVAILABLE}}
- Acceptable downtime: {{RTO}} | Max data loss: {{RPO}}
Parties:
| Party | System | Team | Contact |
|---|---|---|---|
| Consumer (caller) | {{CONSUMER_SYSTEM}} | {{TEAM_A}} | {{CONTACT_A}} |
| Provider (server) | {{PROVIDER_SYSTEM}} | {{TEAM_B}} | {{CONTACT_B}} |
2. Integration Topology Diagram
flowchart LR
subgraph ConsumerSide["Consumer — {{CONSUMER_SYSTEM}}"]
C_SVC[{{ConsumerService}}]
C_CB[Circuit Breaker]
C_RETRY[Retry Handler]
end
subgraph Integration["Integration Layer"]
GW[API Gateway / Load Balancer]
Q[Message Queue\n{{QUEUE_NAME}}]
DLQ[Dead Letter Queue\n{{DLQ_NAME}}]
end
subgraph ProviderSide["Provider — {{PROVIDER_SYSTEM}}"]
P_SVC[{{ProviderService}}]
P_DB[(Provider DB)]
P_WORKER[Event Worker]
end
C_SVC --> C_CB
C_CB --> C_RETRY
C_RETRY -->|HTTPS REST| GW
GW --> P_SVC
P_SVC --> P_DB
P_WORKER -->|Publish| Q
Q -->|Consume| C_SVC
Q -->|Failed| DLQ
DLQ -->|Alert| AlertSystem[PagerDuty]
3. Service Contracts
3.1 Integration: {{INTEGRATION_NAME_1}}
Protocol: REST/HTTPS | gRPC | GraphQL | WebSocket | AMQP
Direction: {{CONSUMER}} → {{PROVIDER}}
Idempotency: YES (send the X-Idempotency-Key header) | NO
Authentication
| Method | Details |
|---|---|
| Type | Bearer JWT |
| Header | Authorization: Bearer {{TOKEN}} |
| Key rotation | Every {{ROTATION_PERIOD}} — coordinated via {{ROTATION_PROCESS}} |
| Token endpoint | {{AUTH_ENDPOINT}} (if OAuth2) |
Request Contract
Endpoint: {{HTTP_METHOD}} {{BASE_URL}}/{{PATH}}
Headers:
Authorization: Bearer {{JWT_OR_API_KEY}}
Content-Type: application/json
Accept: application/json
X-Request-ID: {{UUID}}
X-Idempotency-Key: {{IDEMPOTENCY_KEY}}
Request Body:
{
"{{field1}}": "{{type}} — {{description}}",
"{{field2}}": "{{type}} — {{description}}",
"metadata": {
"sourceSystem": "{{CONSUMER_SYSTEM_ID}}",
"timestamp": "ISO8601"
}
}
Successful Response 200 / 201:
{
"{{responseField1}}": "{{type}}",
"{{responseField2}}": "{{type}}",
"requestId": "echo of X-Request-ID"
}
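The request headers above can be assembled in one place so that every call carries them consistently. The following is a minimal sketch, not a prescribed implementation: the helper name `build_headers` is hypothetical, and it assumes a fresh X-Request-ID per HTTP attempt while the X-Idempotency-Key stays the same across retries of one logical operation.

```python
import uuid
from typing import Dict, Optional


def build_headers(token: str, idempotency_key: Optional[str] = None) -> Dict[str, str]:
    """Assemble the headers listed in the request contract (hypothetical helper)."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
        # Assumption: one new request ID per HTTP attempt, for tracing.
        "X-Request-ID": str(uuid.uuid4()),
        # Assumption: the caller reuses this key when retrying the same operation.
        "X-Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
```

A caller would generate the idempotency key once per business operation and pass it to every retry, so the provider can recognize a replay.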
Error Handling
| HTTP Status | Error Code | Consumer Action |
|---|---|---|
| 400 | VALIDATION_ERROR | Log error, do NOT retry; fix the request |
| 401 | UNAUTHORIZED | Refresh token, retry once |
| 403 | FORBIDDEN | Alert engineering, do NOT retry |
| 404 | NOT_FOUND | Log, do NOT retry; check resource ID |
| 409 | CONFLICT | Log and skip; the request was already applied (idempotent replay) |
| 422 | BUSINESS_RULE | Log error, do NOT retry; escalate |
| 429 | RATE_LIMITED | Back off per Retry-After header, then retry |
| 500 | INTERNAL_ERROR | Retry with exponential backoff |
| 502/503 | UNAVAILABLE | Retry with exponential backoff; failures count toward the circuit breaker, which fails fast once open |
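The status-to-action mapping in the table can be kept in a single lookup so the consumer does not scatter status checks through its code. This is a sketch under assumptions: the `Action` enum and `consumer_action` function are hypothetical names, 502/503 are mapped to retry (matching the retryable statuses listed in the retry policy, with the circuit breaker treated as a separate layer), and unlisted statuses fall back to "retry for 5xx, fail otherwise".

```python
from enum import Enum


class Action(Enum):
    FAIL = "fail"                                   # log or alert, do not retry
    REFRESH_AND_RETRY_ONCE = "refresh_and_retry_once"
    SKIP = "skip"                                   # already applied, nothing to do
    RETRY_AFTER = "retry_after"                     # honor the Retry-After header
    RETRY_BACKOFF = "retry_backoff"                 # exponential backoff with jitter


_ACTIONS = {
    400: Action.FAIL,
    401: Action.REFRESH_AND_RETRY_ONCE,
    403: Action.FAIL,
    404: Action.FAIL,
    409: Action.SKIP,
    422: Action.FAIL,
    429: Action.RETRY_AFTER,
    500: Action.RETRY_BACKOFF,
    502: Action.RETRY_BACKOFF,
    503: Action.RETRY_BACKOFF,
}


def consumer_action(status: int) -> Action:
    """Return the consumer action for an HTTP error status."""
    if status in _ACTIONS:
        return _ACTIONS[status]
    # Assumption: unlisted 5xx are transient, everything else is not retryable.
    return Action.RETRY_BACKOFF if 500 <= status < 600 else Action.FAIL
```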
Retry Policy
Max retries: {{MAX_RETRIES}} (retry only on 500, 502, 503, 429, network errors)
Strategy: Exponential backoff with jitter
Delays: [{{DELAY_1}}ms, {{DELAY_2}}ms, {{DELAY_3}}ms]
Timeout per attempt: {{TIMEOUT_MS}}ms
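One way to realize "exponential backoff with jitter" is the full-jitter variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The sketch below computes delays from a base value rather than using the fixed delay list above; `RetryableError`, `backoff_delays`, `call_with_retry`, and the default numbers are all illustrative assumptions.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Raised by the operation for 500/502/503/429 responses or network errors."""


def backoff_delays(base_ms: float, max_retries: int, cap_ms: float,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full jitter: delay i is uniform in [0, min(cap_ms, base_ms * 2**i))."""
    return [rng() * min(cap_ms, base_ms * 2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(op: Callable[[], T], max_retries: int = 3, base_ms: float = 100,
                    cap_ms: float = 5000,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying up to max_retries times on RetryableError."""
    delays = backoff_delays(base_ms, max_retries, cap_ms)
    for attempt in range(max_retries + 1):
        try:
            return op()
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt] / 1000.0)
    raise AssertionError("unreachable")
```

The jitter spreads retries from many consumers over time, which avoids synchronized retry bursts against a provider that is just recovering.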
Circuit Breaker Configuration
Failure threshold: {{FAILURE_PERCENT}}% failures in {{WINDOW_SECONDS}}s window
Open duration: {{OPEN_DURATION_SECONDS}}s
Half-open test: 1 request
Alert on: Circuit open for > {{ALERT_THRESHOLD_SECONDS}}s
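The configuration above maps onto a small state machine: closed (calls pass, failures are counted in a time window), open (calls fail fast), and half-open (a single probe request decides whether to close again). The following is a single-threaded sketch, not production code; the class name, the `min_calls` guard (which avoids tripping on a tiny sample), and the injected `clock` are assumptions added for illustration.

```python
import time
from collections import deque
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the provider while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_percent: float, window_seconds: float, open_seconds: float,
                 min_calls: int = 5, clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_percent = failure_percent
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.min_calls = min_calls
        self._clock = clock
        self._events: deque = deque()          # (timestamp, succeeded)
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    def call(self, op: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self.open_seconds or self._probe_in_flight:
                raise CircuitOpenError("circuit open; failing fast")
            self._probe_in_flight = True       # half-open: allow exactly one probe
        try:
            result = op()
        except Exception:
            self._on_failure(self._clock())
            raise
        self._on_success(self._clock())
        return result

    def _on_success(self, now: float) -> None:
        if self._opened_at is not None:        # probe succeeded: close the circuit
            self._opened_at = None
            self._probe_in_flight = False
            self._events.clear()
        self._events.append((now, True))

    def _on_failure(self, now: float) -> None:
        if self._opened_at is not None:        # probe failed: stay open, restart timer
            self._opened_at = now
            self._probe_in_flight = False
            return
        self._events.append((now, False))
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()             # drop results outside the window
        failures = sum(1 for _, ok in self._events if not ok)
        total = len(self._events)
        if total >= self.min_calls and 100.0 * failures / total >= self.failure_percent:
            self._opened_at = now
```

A multi-threaded consumer would need a lock around the state transitions, and the "alert on" condition would be driven by how long `_opened_at` has been set.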
Rate Limiting
| Limit | Value | Window | Action when exceeded |
|---|---|---|---|
| Requests per minute | {{RPM}} | 60s sliding | HTTP 429, Retry-After |
| Burst limit | {{BURST}} | 1s | HTTP 429 immediately |
| Daily quota | {{DAILY}} | 24h | HTTP 429, contact support |
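The "60s sliding" row can be illustrated with a sliding-window log: keep the timestamps of accepted requests and reject a new one when the window is already full, returning the number of seconds for a Retry-After header. This is a sketch of the general technique, not the provider's actual limiter; the class name and the injected `clock` are assumptions.

```python
import math
import time
from collections import deque
from typing import Callable, Tuple


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps: deque = deque()  # timestamps of accepted requests

    def try_acquire(self) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds); retry_after is 0 when allowed."""
        now = self._clock()
        # Forget requests that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True, 0
        # The oldest accepted request leaves the window at stamps[0] + window.
        retry_after = math.ceil(self._stamps[0] + self.window_seconds - now)
        return False, max(1, retry_after)
```

When `try_acquire` returns `(False, n)`, the server side would answer HTTP 429 with `Retry-After: n`.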
Timeout Configuration
| Timeout Type | Value | Notes |
|---|---|---|
| Connection timeout | {{CONN_TIMEOUT_MS}}ms | Time to establish connection |
| Read timeout | {{READ_TIMEOUT_MS}}ms | Max wait for response data (first byte and between reads) |
| Total request timeout | {{TOTAL_TIMEOUT_MS}}ms | End-to-end budget |
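When retries are in play, the total request timeout acts as a budget that the per-attempt connection and read timeouts must not exceed. The sketch below tracks a deadline and clamps each timeout to the time remaining; the `Deadline` class is a hypothetical helper, and the clamping is deliberately simple (each timeout is capped separately, so a slow connect followed by a slow read can still overshoot slightly).

```python
import time
from typing import Callable, Tuple


class Deadline:
    """Tracks the end-to-end budget for one logical request, in milliseconds."""

    def __init__(self, total_timeout_ms: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + total_timeout_ms / 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def attempt_timeouts(self, conn_timeout_ms: float,
                         read_timeout_ms: float) -> Tuple[float, float]:
        """Return (connect_ms, read_ms), each capped by the remaining budget."""
        left = self.remaining_ms()
        if left <= 0:
            raise TimeoutError("total request budget exhausted")
        return min(conn_timeout_ms, left), min(read_timeout_ms, left)
```

A retry loop would create one `Deadline` per logical request and call `attempt_timeouts` before each attempt, stopping as soon as it raises.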
3.2 Integration: {{INTEGRATION_NAME_2}} (if applicable)
Protocol: gRPC
Service definition:
syntax = "proto3";

import "google/protobuf/timestamp.proto";

service {{ServiceName}} {
rpc {{MethodName}} ({{RequestMessage}}) returns ({{ResponseMessage}});
rpc {{StreamMethodName}} ({{RequestMessage}}) returns (stream {{ResponseMessage}});
}
message {{RequestMessage}} {
string id = 1;
string tenant_id = 2;
{{FieldType}} {{field_name}} = 3;
}
message {{ResponseMessage}} {
string id = 1;
{{FieldType}} {{field_name}} = 2;
google.protobuf.Timestamp created_at = 3;
}
4. Event-Driven Integrations
4.1 Event Schemas (CloudEvents 1.0)
Event: {{entity}}.{{ACTION}}
Published by: {{PUBLISHER_SYSTEM}}
Consumed by: {{CONSUMER_SYSTEM_1}}, {{CONSUMER_SYSTEM_2}}
{
"specversion": "1.0",
"type": "{{REVERSE_DNS_EVENT_TYPE}}",
"source": "https://{{SYSTEM_DOMAIN}}/{{resource}}",
"id": "{{UUID}}",
"time": "2024-01-01T00:00:00Z",
"datacontenttype": "application/json",
"subject": "{{RESOURCE_ID}}",
"data": {
"entityId": "UUID of affected resource",
"tenantId": "UUID of tenant",
"actorId": "UUID of user who triggered event",
"{{DOMAIN_FIELD_1}}": "domain-specific data",
"{{DOMAIN_FIELD_2}}": "domain-specific data",
"previousState": null,
"newState": "{{STATE}}"
}
}
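A publisher can build and check the envelope above with a few lines of code. CloudEvents 1.0 requires the `specversion`, `id`, `source`, and `type` attributes; the other fields in this sketch follow the template. The function names and the example event type used in the usage note are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict

# Attributes that CloudEvents 1.0 marks as REQUIRED.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_event(event_type: str, source: str, subject: str,
               data: Dict[str, Any]) -> Dict[str, Any]:
    """Build a CloudEvents 1.0 envelope in structured JSON form."""
    return {
        "specversion": "1.0",
        "type": event_type,
        "source": source,
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "datacontenttype": "application/json",
        "subject": subject,
        "data": data,
    }


def validate_event(event: Dict[str, Any]) -> None:
    """Raise ValueError if a required attribute is missing or the version is wrong."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    if event["specversion"] != "1.0":
        raise ValueError(f"unsupported specversion: {event['specversion']}")
```

For example, `make_event("com.example.order.created", "https://example.com/orders", "123", {"entityId": "123", "previousState": None, "newState": "CREATED"})` yields a dictionary that serializes with `json.dumps` to the shape shown above.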
4.2 Topics / Queues
| Topic/Queue | Partitions | Retention | Consumers | Producer |
|---|---|---|---|---|
| {{TOPIC_NAME_1}} | {{N}} | {{RETENTION}} | {{CONSUMER_GROUPS}} | {{PRODUCER_SERVICE}} |
| {{TOPIC_NAME_2}} | {{N}} | {{RETENTION}} | {{CONSUMER_GROUPS}} | {{PRODUCER_SERVICE}} |
4.3 Ordering Guarantees
| Integration | Ordering | Scope | Notes |
|---|---|---|---|
| {{INTEGRATION_1}} | Strict order | Per tenantId | Kafka partition by tenantId |
| {{INTEGRATION_2}} | Best-effort | Global | Standard (non-FIFO) queue; no strict ordering |
| {{INTEGRATION_3}} | No ordering | N/A | Independent events |
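Per-tenant ordering works because every event with the same key lands on the same partition, and a partition is consumed in order. The sketch below shows the idea with a stable hash. It is an illustration only: Kafka's default Java partitioner uses a murmur2 hash of the key, so this function will not reproduce Kafka's actual partition assignments, and `partition_for` is a hypothetical name.

```python
import hashlib


def partition_for(tenant_id: str, num_partitions: int) -> int:
    """Map a tenant ID to a partition deterministically (illustrative hash, not murmur2)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note that changing the partition count remaps keys, so ordering across a repartitioning needs separate handling.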
4.4 Idempotency Strategy
For each consumed event:
1. Check the processed_events table: SELECT 1 FROM processed_events WHERE event_id = $1 AND consumer_group = $2
2. If found: log "Duplicate event skipped" and ACK (do not reprocess)
3. If not found: process event
4. On success: INSERT INTO processed_events (event_id, consumer_group, processed_at) VALUES ($1, $2, now()), ideally in the same transaction as the processing side effects
5. ACK message
Deduplication window: {{DEDUP_WINDOW}} (keep processed_events for this duration)
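The steps above can be sketched with SQLite from the Python standard library. This is an illustration under assumptions: the table layout (a composite primary key on event_id and consumer_group) and the `handle_once` helper are hypothetical, and a real consumer would run this against its own database and ACK through its broker client.

```python
import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_events (
    event_id       TEXT NOT NULL,
    consumer_group TEXT NOT NULL,
    processed_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (event_id, consumer_group)
)
"""


def handle_once(conn: sqlite3.Connection, event_id: str, consumer_group: str,
                process: Callable[[], None]) -> bool:
    """Process the event unless already recorded. Returns False for a duplicate.

    The caller ACKs the message in both cases; if process() raises, nothing is
    recorded and the message should be redelivered.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        seen = conn.execute(
            "SELECT 1 FROM processed_events WHERE event_id = ? AND consumer_group = ?",
            (event_id, consumer_group),
        ).fetchone()
        if seen:
            return False
        process()
        conn.execute(
            "INSERT INTO processed_events (event_id, consumer_group) VALUES (?, ?)",
            (event_id, consumer_group),
        )
    return True
```

The primary key also guards against two workers racing on the same event: the second INSERT fails with an integrity error instead of recording the event twice.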
5. Data Consistency Patterns
5.1 Consistency Model
Model: Strong | Eventual | Causal
Acceptable lag: {{MAX_LAG_SECONDS}}s
5.2 Saga Pattern (if used for distributed transactions)
sequenceDiagram
autonumber
participant Client
participant O as Orchestrator
participant S1 as {{SERVICE_1}}
participant S2 as {{SERVICE_2}}
participant S3 as {{SERVICE_3}}
O->>S1: Execute Step 1
S1-->>O: Step 1 succeeded {result1}
O->>S2: Execute Step 2 (with result1)
S2-->>O: Step 2 succeeded {result2}
O->>S3: Execute Step 3 (with result2)
S3-->>O: Step 3 FAILED
Note over O: Compensating transactions (reverse order)
O->>S2: Compensate Step 2
S2-->>O: Compensated
O->>S1: Compensate Step 1
S1-->>O: Compensated
O-->>Client: Transaction rolled back
Compensation strategies:
| Step | Compensation | Notes |
|---|---|---|
| {{STEP_1}} | {{COMPENSATION_1}} | {{NOTES}} |
| {{STEP_2}} | {{COMPENSATION_2}} | {{NOTES}} |
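The orchestration shown in the sequence diagram reduces to a loop that remembers which steps succeeded and, on failure, runs their compensations in reverse order. The sketch below is a simplified in-process version: `run_saga` and the step tuple layout are hypothetical, it does not pass results between steps, and it assumes compensations themselves do not fail.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"{name} failed")
            # Undo only the steps that actually succeeded, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name} compensated")
            return False
        completed.append((name, compensate))
        log.append(f"{name} succeeded")
    return True
```

A durable orchestrator would also persist the saga state after each step so that compensation can resume after a crash.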
6. Integration Testing Strategy
6.1 Contract Testing (Pact)
- Consumer-driven contracts: Consumer writes tests defining expected provider behavior
- Provider verification: Provider CI runs consumer contracts on every build
- Pact Broker: {{PACT_BROKER_URL}}
6.2 Integration Test Environments
| Environment | Purpose | Trigger |
|---|---|---|
| Local | Dev testing with mocked provider | Manual |
| Staging | Full integration with staging provider | Every PR merge |
| Production | Synthetic monitoring | Every 5 minutes |
6.3 Test Scenarios
Happy path:
- {{SCENARIO_1}} — expected outcome
- {{SCENARIO_2}} — expected outcome
Error scenarios:
- Provider returns 500 — circuit breaker opens after threshold
- Provider times out — retry policy kicks in
- Auth token expired — token refresh flow works
- Rate limit exceeded — 429 handled, backoff applied
- Duplicate event consumed — idempotency key prevents double-processing
7. Monitoring & Alerting
7.1 Key Metrics
| Metric | Type | Alert Condition | Severity |
|---|---|---|---|
| integration_{{name}}_requests_total | Counter | N/A | N/A |
| integration_{{name}}_error_rate | Gauge | > {{THRESHOLD}}% for 5m | HIGH |
| integration_{{name}}_latency_p99_ms | Histogram | > {{THRESHOLD}}ms for 5m | MEDIUM |
| integration_{{name}}_circuit_open | Gauge | == 1 | CRITICAL |
| integration_{{name}}_dlq_depth | Gauge | > 0 | HIGH |
| integration_{{name}}_consumer_lag | Gauge | > {{LAG_THRESHOLD}} | HIGH |
7.2 Distributed Tracing
- Trace ID propagation: X-Request-ID and traceparent headers forwarded
- Sampling rate: {{SAMPLE_RATE}}% in production, 100% in staging
- Tracing tool: {{TRACING_TOOL}} — dashboard: {{DASHBOARD_URL}}
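Forwarding `traceparent` means keeping the incoming trace ID while minting a new parent (span) ID for the outbound call. The W3C Trace Context format is `version-traceid-parentid-flags`, with a 32-hex-digit trace ID and a 16-hex-digit parent ID. The sketch below handles version `00` only; the function names are hypothetical, and in practice a tracing SDK would normally do this.

```python
import re
import secrets
from typing import Optional

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and parent IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: Optional[str]) -> str:
    """Keep the incoming trace ID and flags, generate a new parent ID.

    Falls back to a new trace if the header is absent or invalid
    (wrong shape, or an all-zero trace or parent ID).
    """
    match = _TRACEPARENT.match(incoming or "")
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()
    return f"00-{match.group(1)}-{secrets.token_hex(8)}-{match.group(3)}"
```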
7.3 Alert Routing
| Condition | Alert Channel | Escalation |
|---|---|---|
| Circuit breaker open | PagerDuty {{TEAM_A}} + Slack #{{CHANNEL}} | On-call engineer |
| DLQ depth > 0 | Slack #{{CHANNEL}} | Investigate within 1h |
| Error rate > {{THRESHOLD}}% | PagerDuty | On-call engineer |
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Consumer Team Lead | | | |
| Provider Team Lead | | | |
| Platform/Infra | | | |
| Approver | | | |