Integration Design Document

Project: {{PROJECT_NAME}}
Integration: {{INTEGRATION_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |

1. Integration Overview & Context

Integration Name: {{INTEGRATION_NAME}}
Type: Synchronous (REST/gRPC) | Asynchronous (Events/Queue) | Bidirectional | File-based

Business Purpose: {{WHY_THIS_INTEGRATION_EXISTS}}

Criticality: Critical | High | Medium | Low

  • Impact if down: {{BUSINESS_IMPACT_IF_UNAVAILABLE}}
  • Acceptable downtime: {{RTO}} | Max data loss: {{RPO}}

Parties:

| Party | System | Team | Contact |
|-------|--------|------|---------|
| Consumer (caller) | {{CONSUMER_SYSTEM}} | {{TEAM_A}} | {{CONTACT_A}} |
| Provider (server) | {{PROVIDER_SYSTEM}} | {{TEAM_B}} | {{CONTACT_B}} |

2. Integration Topology Diagram

flowchart LR
    subgraph ConsumerSide["Consumer — {{CONSUMER_SYSTEM}}"]
        C_SVC[{{ConsumerService}}]
        C_CB[Circuit Breaker]
        C_RETRY[Retry Handler]
    end

    subgraph Integration["Integration Layer"]
        GW[API Gateway / Load Balancer]
        Q[Message Queue\n{{QUEUE_NAME}}]
        DLQ[Dead Letter Queue\n{{DLQ_NAME}}]
    end

    subgraph ProviderSide["Provider — {{PROVIDER_SYSTEM}}"]
        P_SVC[{{ProviderService}}]
        P_DB[(Provider DB)]
        P_WORKER[Event Worker]
    end

    C_SVC --> C_CB
    C_CB --> C_RETRY
    C_RETRY -->|HTTPS REST| GW
    GW --> P_SVC
    P_SVC --> P_DB

    P_WORKER -->|Publish| Q
    Q -->|Consume| C_SVC
    Q -->|Failed| DLQ
    DLQ -->|Alert| AlertSystem[PagerDuty]

3. Service Contracts

3.1 Integration: {{INTEGRATION_NAME_1}}

Protocol: REST/HTTPS | gRPC | GraphQL | WebSocket | AMQP
Direction: {{CONSUMER}} → {{PROVIDER}}
Idempotency: YES (send the X-Idempotency-Key header) | NO

Authentication

| Attribute | Details |
|-----------|---------|
| Type | Bearer JWT |
| Header | Authorization: Bearer {{TOKEN}} |
| Key rotation | Every {{ROTATION_PERIOD}}, coordinated via {{ROTATION_PROCESS}} |
| Token endpoint | {{AUTH_ENDPOINT}} (if OAuth2) |

Request Contract

Endpoint: {{HTTP_METHOD}} {{BASE_URL}}/{{PATH}}

Headers:

Authorization: Bearer {{JWT_OR_API_KEY}}
Content-Type: application/json
Accept: application/json
X-Request-ID: {{UUID}}
X-Idempotency-Key: {{IDEMPOTENCY_KEY}}
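
As an illustration of the header set above, here is a minimal Python sketch of a header builder. The function name `build_headers` and its parameters are hypothetical, and the choice of UUIDs for both IDs is an assumption. The caller should reuse the same idempotency key when retrying the same logical operation.

```python
import uuid
from typing import Dict, Optional


def build_headers(token: str, idempotency_key: Optional[str] = None) -> Dict[str, str]:
    """Build the request headers listed in the contract (hypothetical helper)."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
        # A fresh ID per HTTP attempt, used for log correlation.
        "X-Request-ID": str(uuid.uuid4()),
        # Must stay the same across retries of one logical operation.
        "X-Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
```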

Request Body:

{
  "{{field1}}": "{{type}} — {{description}}",
  "{{field2}}": "{{type}} — {{description}}",
  "metadata": {
    "sourceSystem": "{{CONSUMER_SYSTEM_ID}}",
    "timestamp": "ISO8601"
  }
}

Successful Response 200 / 201:

{
  "{{responseField1}}": "{{type}}",
  "{{responseField2}}": "{{type}}",
  "requestId": "echo of X-Request-ID"
}

Error Handling

| HTTP Status | Error Code | Consumer Action |
|-------------|------------|-----------------|
| 400 | VALIDATION_ERROR | Log error, do NOT retry; fix the request |
| 401 | UNAUTHORIZED | Refresh token, retry once |
| 403 | FORBIDDEN | Alert engineering, do NOT retry |
| 404 | NOT_FOUND | Log, do NOT retry; check resource ID |
| 409 | CONFLICT | Log and skip (already processed; idempotent) |
| 422 | BUSINESS_RULE | Log error, do NOT retry; escalate |
| 429 | RATE_LIMITED | Back off per Retry-After header |
| 500 | INTERNAL_ERROR | Retry with exponential backoff |
| 502/503 | UNAVAILABLE | Retry with exponential backoff; once the circuit breaker opens, fail fast |
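
The error table can be encoded as a small lookup so that every caller treats status codes the same way. This is a sketch only: the `Action` enum and `classify` function are hypothetical names. It follows the retry policy by treating 502/503 as retryable, and leaves the fail-fast decision to the circuit breaker.

```python
from enum import Enum


class Action(Enum):
    FAIL = "fail"                            # log, do not retry
    REFRESH_AND_RETRY = "refresh_and_retry"  # refresh token, retry once
    SKIP = "skip"                            # already processed
    BACKOFF = "backoff"                      # wait per Retry-After
    RETRY = "retry"                          # exponential backoff


_ACTIONS = {
    400: Action.FAIL,
    401: Action.REFRESH_AND_RETRY,
    403: Action.FAIL,
    404: Action.FAIL,
    409: Action.SKIP,
    422: Action.FAIL,
    429: Action.BACKOFF,
    500: Action.RETRY,
    502: Action.RETRY,
    503: Action.RETRY,
}


def classify(status: int) -> Action:
    """Map an HTTP status to a consumer action; unknown codes are not retried."""
    return _ACTIONS.get(status, Action.FAIL)
```

Defaulting unknown codes to `FAIL` is an assumption chosen to avoid retry storms on unexpected responses.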

Retry Policy

Max retries: {{MAX_RETRIES}} (retry only on 500, 502, 503, 429, network errors)
Strategy: Exponential backoff with jitter
Delays: [{{DELAY_1}}ms, {{DELAY_2}}ms, {{DELAY_3}}ms]
Timeout per attempt: {{TIMEOUT_MS}}ms
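
A minimal sketch of this policy, assuming the "full jitter" variant of exponential backoff (each delay is drawn uniformly between zero and the exponential ceiling). The names `send`, `call_with_retry`, and `backoff_delays_ms`, and the default values, are hypothetical. It does not read the Retry-After header on 429, which a real client should prefer over the computed delay.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503}


def backoff_delays_ms(base_ms, max_retries, cap_ms=30_000, rng=random.random):
    """Full jitter: the n-th delay is uniform in [0, min(cap_ms, base_ms * 2**n)]."""
    return [rng() * min(cap_ms, base_ms * 2 ** n) for n in range(max_retries)]


def call_with_retry(send, max_retries=3, base_ms=200, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    send() is a hypothetical callable that returns an HTTP status code or
    raises ConnectionError on a network failure. Returns the last status,
    or None if the final attempt failed at the network level.
    """
    delays = backoff_delays_ms(base_ms, max_retries)
    status = None
    for attempt in range(max_retries + 1):
        try:
            status = send()
        except ConnectionError:
            status = None
        if status is not None and status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_retries:
            sleep(delays[attempt] / 1000)
    return status
```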

Circuit Breaker Configuration

Failure threshold: {{FAILURE_PERCENT}}% failures in {{WINDOW_SECONDS}}s window
Open duration: {{OPEN_DURATION_SECONDS}}s
Half-open test: 1 request
Alert on: Circuit open for > {{ALERT_THRESHOLD_SECONDS}}s
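
The configuration above maps onto a three-state breaker (closed, open, half-open). The sketch below is one possible shape, not a reference implementation. The class and parameter names are hypothetical, and `min_calls` is an added assumption that stops a single early failure from opening the circuit.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_percent=50.0, window_s=10.0, open_s=30.0,
                 min_calls=5, clock=time.monotonic):
        self.failure_percent = failure_percent
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls
        self.clock = clock
        self.state = "closed"
        self.calls = deque()          # (timestamp, succeeded) within the window
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a request may be sent now."""
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.open_s:
                return False          # fail fast while open
            self.state = "half_open"
            self.probe_in_flight = False
        if self.state == "half_open":
            if self.probe_in_flight:
                return False          # only one test request at a time
            self.probe_in_flight = True
        return True

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that allow() permitted."""
        now = self.clock()
        if self.state == "half_open":
            self.probe_in_flight = False
            if ok:
                self.state = "closed"
                self.calls.clear()
            else:
                self.state, self.opened_at = "open", now
            return
        self.calls.append((now, ok))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, good in self.calls if not good)
        if (len(self.calls) >= self.min_calls
                and 100.0 * failures / len(self.calls) >= self.failure_percent):
            self.state, self.opened_at = "open", now
```

Callers check `allow()` before each request and report the result with `record()`. The alert on prolonged open state would be driven by a metric derived from `state`.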

Rate Limiting

| Limit | Value | Window | Action when exceeded |
|-------|-------|--------|----------------------|
| Requests per minute | {{RPM}} | 60s sliding | HTTP 429 with Retry-After |
| Burst limit | {{BURST}} | 1s | HTTP 429 immediately |
| Daily quota | {{DAILY}} | 24h | HTTP 429; contact support |
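
On the consumer side, handling a 429 means reading Retry-After, which HTTP allows as either a number of seconds or an HTTP date. The sketch below shows one way to normalize both forms to seconds. The function name is hypothetical and only the Python standard library is used.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value to a non-negative wait in seconds."""
    value = value.strip()
    if value.isdigit():
        return float(value)           # delta-seconds form, e.g. "120"
    when = parsedate_to_datetime(value)  # HTTP-date form
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```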

Timeout Configuration

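| Timeout Type | Value | Notes |
|--------------|-------|-------|
| Connection timeout | {{CONN_TIMEOUT_MS}}ms | Time to establish the connection |
| Read timeout | {{READ_TIMEOUT_MS}}ms | Max wait for response data (first byte, or between reads) |
| Total request timeout | {{TOTAL_TIMEOUT_MS}}ms | End-to-end budget, including retries |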
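
One way to enforce the end-to-end budget across retries is to cap each attempt's timeout at whatever remains of the total. The `Deadline` class below is a hypothetical helper that sketches that idea. It is not tied to any particular HTTP client.

```python
import time


class Deadline:
    """Tracks the remaining end-to-end budget for one logical request."""

    def __init__(self, total_ms: float, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + total_ms / 1000

    def remaining_ms(self) -> float:
        return max(0.0, (self._expires - self._clock()) * 1000)

    def attempt_timeout_ms(self, per_attempt_ms: float) -> float:
        """Timeout for the next attempt, never exceeding the remaining budget."""
        return min(per_attempt_ms, self.remaining_ms())
```

When `attempt_timeout_ms` returns 0, the budget is spent and the caller should stop retrying.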

3.2 Integration: {{INTEGRATION_NAME_2}} (if applicable)

Protocol: gRPC
Service definition:

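syntax = "proto3";

// Required for google.protobuf.Timestamp used in the response message.
import "google/protobuf/timestamp.proto";

service {{ServiceName}} {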
  rpc {{MethodName}} ({{RequestMessage}}) returns ({{ResponseMessage}});
  rpc {{StreamMethodName}} ({{RequestMessage}}) returns (stream {{ResponseMessage}});
}

message {{RequestMessage}} {
  string id = 1;
  string tenant_id = 2;
  {{FieldType}} {{field_name}} = 3;
}

message {{ResponseMessage}} {
  string id = 1;
  {{FieldType}} {{field_name}} = 2;
  google.protobuf.Timestamp created_at = 3;
}

4. Event-Driven Integrations

4.1 Event Schemas (CloudEvents 1.0)

Event: {{entity}}.{{ACTION}}

Published by: {{PUBLISHER_SYSTEM}}
Consumed by: {{CONSUMER_SYSTEM_1}}, {{CONSUMER_SYSTEM_2}}

{
  "specversion": "1.0",
  "type": "{{REVERSE_DNS_EVENT_TYPE}}",
  "source": "https://{{SYSTEM_DOMAIN}}/{{resource}}",
  "id": "{{UUID}}",
  "time": "2024-01-01T00:00:00Z",
  "datacontenttype": "application/json",
  "subject": "{{RESOURCE_ID}}",
  "data": {
    "entityId": "UUID of affected resource",
    "tenantId": "UUID of tenant",
    "actorId": "UUID of user who triggered event",
    "{{DOMAIN_FIELD_1}}": "domain-specific data",
    "{{DOMAIN_FIELD_2}}": "domain-specific data",
    "previousState": null,
    "newState": "{{STATE}}"
  }
}
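
A short sketch of producing and checking this envelope, using only the standard library rather than a CloudEvents SDK. The function names are hypothetical. The required attribute list (id, source, specversion, type) comes from the CloudEvents 1.0 specification. The other fields mirror the example above.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def make_event(event_type: str, source: str, subject: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 envelope in structured JSON form."""
    return {
        "specversion": "1.0",
        "type": event_type,
        "source": source,
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "datacontenttype": "application/json",
        "subject": subject,
        "data": data,
    }


def missing_required(event: dict) -> list:
    """Return the required context attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


# Example with placeholder values (hypothetical event type and source).
event = make_event("com.example.order.created", "https://example.com/orders",
                   "order-123", {"entityId": "order-123", "newState": "CREATED"})
payload = json.dumps(event)
```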

4.2 Topics / Queues

| Topic/Queue | Partitions | Retention | Consumers | Producer |
|-------------|------------|-----------|-----------|----------|
| {{TOPIC_NAME_1}} | {{N}} | {{RETENTION}} | {{CONSUMER_GROUPS}} | {{PRODUCER_SERVICE}} |
| {{TOPIC_NAME_2}} | {{N}} | {{RETENTION}} | {{CONSUMER_GROUPS}} | {{PRODUCER_SERVICE}} |

4.3 Ordering Guarantees

| Integration | Ordering | Scope | Notes |
|-------------|----------|-------|-------|
| {{INTEGRATION_1}} | Strict order | Per tenantId | Kafka partition by tenantId |
| {{INTEGRATION_2}} | Best-effort | Global | Approximately FIFO; strict ordering not guaranteed |
| {{INTEGRATION_3}} | No ordering | N/A | Independent events |
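
Per-tenant ordering relies on every event for a tenant landing on the same partition. The sketch below illustrates the idea with a stable hash of the tenantId. It is an illustration only: Kafka's default partitioner uses its own hash (murmur2) when the record key is set, so in practice setting tenantId as the key is usually enough. The function name is hypothetical.

```python
import hashlib


def partition_for(tenant_id: str, num_partitions: int) -> int:
    """Map a tenant to a partition deterministically (illustrative hash, not Kafka's)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Changing the partition count remaps tenants, so ordering across a resize needs separate handling.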

4.4 Idempotency Strategy

For each consumed event:
1. Check processed_events table: SELECT 1 FROM processed_events WHERE event_id = $1 AND consumer_group = $2
2. If found: log "Duplicate event skipped" and ACK (do not reprocess)
3. If not found: process event
4. On success: INSERT INTO processed_events (event_id, consumer_group, processed_at), ideally in the same transaction as step 3
5. ACK message

Deduplication window: {{DEDUP_WINDOW}} (keep processed_events for this duration)

5. Data Consistency Patterns

5.1 Consistency Model

Model: Strong | Eventual | Causal
Acceptable lag: {{MAX_LAG_SECONDS}}s

5.2 Saga Pattern (if used for distributed transactions)

sequenceDiagram
    autonumber
    participant O as Orchestrator
    participant S1 as {{SERVICE_1}}
    participant S2 as {{SERVICE_2}}
    participant S3 as {{SERVICE_3}}

    O->>S1: Execute Step 1
    S1-->>O: Step 1 succeeded {result1}
    O->>S2: Execute Step 2 (with result1)
    S2-->>O: Step 2 succeeded {result2}
    O->>S3: Execute Step 3 (with result2)
    S3-->>O: Step 3 FAILED

    Note over O: Compensating transactions (reverse order)
    O->>S2: Compensate Step 2
    S2-->>O: Compensated
    O->>S1: Compensate Step 1
    S1-->>O: Compensated
    O-->>Client: Transaction rolled back

Compensation strategies:

| Step | Compensation | Notes |
|------|--------------|-------|
| {{STEP_1}} | {{COMPENSATION_1}} | {{NOTES}} |
| {{STEP_2}} | {{COMPENSATION_2}} | {{NOTES}} |
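
The orchestration flow in the diagram reduces to a small loop: run the steps in order, remember the compensation for each completed step, and on failure run those compensations in reverse. The sketch below is in-memory only and uses hypothetical names. A production orchestrator would also persist saga state and retry failed compensations.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs; roll back completed steps on failure.

    Returns True if every action succeeded, False if the saga was compensated.
    """
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo only the steps that finished, most recent first.
        for compensate in reversed(completed):
            compensate()
        return False
    return True
```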

6. Integration Testing Strategy

6.1 Contract Testing (Pact)

  • Consumer-driven contracts: Consumer writes tests defining expected provider behavior
  • Provider verification: Provider CI runs consumer contracts on every build
  • Pact Broker: {{PACT_BROKER_URL}}

6.2 Integration Test Environments

| Environment | Purpose | Trigger |
|-------------|---------|---------|
| Local | Dev testing with mocked provider | Manual |
| Staging | Full integration with staging provider | Every PR merge |
| Production | Synthetic monitoring | Every 5 minutes |

6.3 Test Scenarios

Happy path:

  • {{SCENARIO_1}} — expected outcome
  • {{SCENARIO_2}} — expected outcome

Error scenarios:

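  • Provider returns 500: requests are retried, and the circuit breaker opens once the failure threshold is reached
  • Provider times out: the attempt is aborted at the configured timeout and the retry policy applies
  • Auth token expired: a 401 triggers a token refresh and a single retry
  • Rate limit exceeded: 429 is handled and backoff follows the Retry-After header
  • Duplicate event consumed: the processed_events check prevents double-processing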

7. Monitoring & Alerting

7.1 Key Metrics

| Metric | Type | Alert Condition | Severity |
|--------|------|-----------------|----------|
| integration_{{name}}_requests_total | Counter | | |
| integration_{{name}}_error_rate | Gauge | > {{THRESHOLD}}% for 5m | HIGH |
| integration_{{name}}_latency_p99_ms | Histogram | > {{THRESHOLD}}ms for 5m | MEDIUM |
| integration_{{name}}_circuit_open | Gauge | == 1 | CRITICAL |
| integration_{{name}}_dlq_depth | Gauge | > 0 | HIGH |
| integration_{{name}}_consumer_lag | Gauge | > {{LAG_THRESHOLD}} | HIGH |

7.2 Distributed Tracing

  • Trace ID propagation: X-Request-ID and traceparent headers forwarded
  • Sampling rate: {{SAMPLE_RATE}}% in production, 100% in staging
  • Tracing tool: {{TRACING_TOOL}} — dashboard: {{DASHBOARD_URL}}

7.3 Alert Routing

| Condition | Alert Channel | Escalation |
|-----------|---------------|------------|
| Circuit breaker open | PagerDuty {{TEAM_A}} + Slack #{{CHANNEL}} | On-call engineer |
| DLQ depth > 0 | Slack #{{CHANNEL}} | Investigate within 1h |
| Error rate > {{THRESHOLD}}% | PagerDuty | On-call engineer |

Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Consumer Team Lead | | | |
| Provider Team Lead | | | |
| Platform/Infra | | | |
| Approver | | | |