# Data Flow Document

> **Project:** {{PROJECT_NAME}}
> **Version:** {{VERSION}}
> **Date:** {{DATE}}
> **Author:** {{AUTHOR}}
> **Status:** Draft | In Review | Approved
> **Reviewers:** {{REVIEWERS}}
> **Classification:** Public | Internal | Confidential | Restricted

## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | {{DATE}} | {{AUTHOR}} | Initial draft |

---

## 1. Data Flow Overview

<!-- GUIDANCE: Provide a high-level overview of how data moves through the system: what enters, how it is transformed, where it is stored, and where it exits. This document is a key input to GDPR Data Protection Impact Assessments (DPIAs) and security reviews. -->

**System:** {{SYSTEM_NAME}}
**Data Owner:** {{DATA_OWNER_ROLE}}
**DPO Contact:** {{DPO_EMAIL}}

**Overview:** {{DESCRIBE_WHAT_DATA_FLOWS_THROUGH_SYSTEM}}

### High-Level Data Flow

```mermaid
flowchart LR
    subgraph Inputs["Data Sources / Ingestion"]
        U[Users — Web/App]
        API_IN[External API]
        IMPORT[Bulk Import]
        WEBHOOK[Webhooks]
    end

    subgraph Processing["Processing Layer"]
        VAL[Validation & Sanitization]
        TRANS[Business Logic / Transformation]
        ENRICH[Data Enrichment]
    end

    subgraph Storage["Storage Layer"]
        DB[("Primary DB<br/>PostgreSQL")]
        CACHE[("Cache<br/>Redis")]
        BLOB["Object Storage<br/>S3/Blob"]
        SEARCH["Search Index<br/>Elasticsearch"]
        DW["Data Warehouse<br/>{{DW_TECH}}"]
    end

    subgraph Outputs["Data Consumers / Egress"]
        API_OUT[REST API]
        REPORTS[Reports / Analytics]
        EXPORT[Data Export]
        THIRD[Third-party Integrations]
        EMAIL[Email / Notifications]
    end

    U & API_IN & IMPORT & WEBHOOK --> VAL
    VAL --> TRANS
    TRANS --> ENRICH
    ENRICH --> DB & BLOB
    DB --> CACHE & SEARCH
    DB --> DW
    DB --> API_OUT & REPORTS & EXPORT
    DW --> REPORTS
    ENRICH --> THIRD
    DB --> EMAIL
```

---

## 2. Data Sources & Ingestion

<!-- GUIDANCE: List every source of data entering the system. Include volume estimates and ingestion method. -->

| Source | Type | Protocol | Volume (est.) | Format | PII? | Validation |
|--------|------|----------|--------------|--------|------|-----------|
| Web application users | Real-time | HTTPS POST | {{REQ_PER_DAY}} req/day | JSON | YES | Schema + business rules |
| Mobile app users | Real-time | HTTPS POST | {{REQ_PER_DAY}} req/day | JSON | YES | Schema + business rules |
| `{{EXTERNAL_SYSTEM}}` API | Real-time | HTTPS POST (webhook) | {{EVENTS_PER_DAY}} events/day | JSON | {{YES/NO}} | HMAC signature + schema |
| CSV bulk import | Batch | File upload | {{IMPORTS_PER_DAY}} files/day | CSV | {{YES/NO}} | Column mapping + row validation |
| `{{THIRD_PARTY_API}}` | Polling | REST/HTTPS | {{CALLS_PER_HOUR}}/hour | JSON | {{YES/NO}} | Response schema validation |
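
The HMAC check listed for webhook sources can be sketched as follows. This is a minimal illustration, assuming the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header. The function names and the way the shared secret is passed in are hypothetical, and real providers differ in header name, encoding, and replay protection.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare the received hex signature with the expected one."""
    expected = sign_payload(secret, body)
    received = signature_header.strip().lower()
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Verification should run on the raw bytes as received, before any JSON parsing or re-serialization, because re-encoding can change the byte sequence and invalidate the signature.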

### Ingestion Error Handling

| Error Type | Action | Notification |
|-----------|--------|-------------|
| Schema validation failure | Reject with error details | Return 400 to caller |
| Duplicate record | Upsert (existing values win on conflict) or reject | Log; return 409 if rejected |
| PII fields contain unexpected data | Quarantine + alert | Slack #{{CHANNEL}} |
| Import file corrupted | Reject entire file | Email uploader + error report |
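
One way to keep this table enforceable is to encode it as data that the ingestion layer looks up. The sketch below is illustrative only: the error-type keys, the `Outcome` tuple, and the `resolve` helper are hypothetical names, the duplicate entry assumes the "reject" branch, and the quarantine fallback for unknown error types is an assumption rather than something the table specifies.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    action: str
    http_status: Optional[int]  # None when no synchronous caller is waiting
    notify: str


# Mirrors the rows of the error-handling table.
ERROR_POLICY = {
    "schema_validation_failure": Outcome("reject", 400, "caller"),
    "duplicate_record": Outcome("reject", 409, "log"),
    "unexpected_pii": Outcome("quarantine", None, "slack"),
    "corrupted_import_file": Outcome("reject_file", None, "email_uploader"),
}

# Assumption: unknown error types are quarantined and alerted on rather
# than silently accepted.
DEFAULT_OUTCOME = Outcome("quarantine", None, "slack")


def resolve(error_type: str) -> Outcome:
    """Look up the handling policy for an ingestion error type."""
    return ERROR_POLICY.get(error_type, DEFAULT_OUTCOME)
```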

---

## 3. Data Transformations

<!-- GUIDANCE: Document every transformation applied to data between ingestion and storage. ETL = Extract, Transform, Load. ELT = Extract, Load, Transform (in the data warehouse). -->

### 3.1 Ingestion Transformations (before storage)

| Step | Input | Transformation | Output | Notes |
|------|-------|---------------|--------|-------|
| 1. Sanitization | Raw user input | Strip HTML, trim whitespace | Clean strings | Reduces stored-XSS risk; output encoding is still required |
| 2. Normalization | `{{FIELD}}` | Lowercase + trim | Normalized `{{FIELD}}` | e.g., email normalization |
| 3. Enrichment | User IP | GeoIP lookup | `{country, region, city}` | Third-party API call |
| 4. PII masking | `{{PII_FIELD}}` | Hash / mask for logs | Masked value | Never log raw PII |
| 5. Encryption | Sensitive fields | AES-256-GCM | Encrypted blob | At application layer |
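
Steps 1, 2, and 4 can be sketched with the standard library alone. The helper names below are hypothetical, and the keyed hash used for log tokens is one possible reading of "Hash / mask for logs", not a prescribed method. Step 5 (AES-256-GCM) is left out because it needs a third-party cryptography library.

```python
import hashlib
import hmac
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only, so every tag is dropped."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def sanitize(raw: str) -> str:
    """Step 1: strip HTML tags and trim surrounding whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts).strip()


def normalize_email(value: str) -> str:
    """Step 2: lowercase and trim."""
    return value.strip().lower()


def mask_email(email: str) -> str:
    """Step 4 (masking): keep the first character and the domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def log_token(value: str, key: bytes) -> str:
    """Step 4 (hashing): a keyed hash so logs can correlate without raw PII."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

Tag stripping removes markup but keeps the text inside elements (including script bodies), so it complements output encoding rather than replacing it.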

### 3.2 ETL Pipeline (to Data Warehouse)

```mermaid
flowchart LR
    subgraph Extract["Extract"]
        PGLOG[PostgreSQL WAL / CDC]
        SCHED[Scheduled SQL Export]
    end

    subgraph Transform["Transform ({{TRANSFORM_TOOL}})"]
        CLEAN[Data Cleaning]
        JOIN[Joins & Aggregations]
        DEDUP[Deduplication]
        ANON[PII Anonymization]
    end

    subgraph Load["Load"]
        DW[("{{DATA_WAREHOUSE}}")]
    end

    PGLOG --> CLEAN
    SCHED --> CLEAN
    CLEAN --> JOIN
    JOIN --> DEDUP
    DEDUP --> ANON
    ANON --> DW
```

**Pipeline schedule:** {{PIPELINE_SCHEDULE}} (e.g., hourly incremental, daily full)
**Latency:** Source to DW within {{MAX_LATENCY}}
**Tool:** {{ETL_TOOL}} (e.g., dbt, Airbyte, custom)
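
The transform stages in the diagram can be sketched as plain functions over rows. This is an illustration under stated assumptions: rows are dictionaries with an `id` and an ISO 8601 `updated_at` in a single format, the join stage is omitted, and "anonymization" is reduced to dropping PII columns. Dropping columns alone does not guarantee anonymity in the GDPR sense, so treat that stage as a placeholder.

```python
def clean(rows):
    """Drop rows without an id and trim string values."""
    for row in rows:
        if row.get("id") is None:
            continue
        yield {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def deduplicate(rows):
    """Keep the latest version of each id, judged by updated_at."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return list(latest.values())


def anonymize(rows, pii_fields):
    """Remove the listed PII columns before load."""
    return [{k: v for k, v in row.items() if k not in pii_fields} for row in rows]


def run_transform(rows, pii_fields):
    """Chain the stages in the order shown in the diagram (joins omitted)."""
    return anonymize(deduplicate(clean(rows)), pii_fields)
```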

---

## 4. Data Storage

<!-- GUIDANCE: Where is data stored at rest? Include all storage systems — primary DB, cache, object storage, backups, analytics. -->

| Storage System | Technology | Purpose | Data Classification | Encryption at Rest |
|---------------|-----------|---------|--------------------|--------------------|
| Primary Database | {{DB_TECH}} {{DB_VERSION}} | Transactional data | Confidential | AES-256 ({{KEY_MGMT}}) |
| Cache | Redis {{REDIS_VERSION}} | Hot data, sessions | Internal | AES-256 (volume or managed-service level) |
| Object Storage | {{S3_COMPATIBLE}} | Files, documents, media | {{CLASSIFICATION}} | SSE-S3 / SSE-KMS |
| Search Index | Elasticsearch | Full-text search | Internal | At-rest encryption; TLS in transit |
| Data Warehouse | {{DW}} | Analytics, reporting | Internal (anonymized) | {{DW_ENCRYPTION}} |
| Backup Storage | {{BACKUP_TECH}} | Disaster recovery | Restricted | AES-256 |
| Audit Logs | {{LOG_STORAGE}} | Compliance / audit trail | Restricted | Immutable, encrypted |

---

## 5. Data Access Patterns

<!-- GUIDANCE: Who accesses what data, how, and how often? This informs index design, caching, and access control. -->

### 5.1 Read Patterns

| Consumer | Data Accessed | Frequency | Access Method | Caching |
|---------|--------------|-----------|--------------|--------|
| Web application | User profile, settings | Per request | REST API | Redis 5min TTL |
| Web application | {{ENTITY}} list | Per page load | REST API (paginated) | CDN + Redis |
| Reporting service | Aggregated metrics | Every 1h | DW query | Materialized views |
| Admin dashboard | Raw records | On demand | REST API (admin) | No cache |
| External partner | {{SUBSET_OF_DATA}} | {{FREQUENCY}} | REST API (scoped JWT) | {{CACHING}} |
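
The "Redis 5min TTL" entries describe a cache-aside read: serve from cache while the entry is fresh, otherwise load from the source of truth and store the result with an expiry. The class below is an in-memory stand-in for Redis, written only to show the pattern; `TTLCache` and its methods are hypothetical names, and a real deployment would rely on Redis key expiry instead.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Minimal cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit
        value = loader()  # miss or expired: read from the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads the value."""
        self._entries.pop(key, None)
```

Invalidating on write keeps the cached profile from lagging behind the database by up to a full TTL.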

### 5.2 Write Patterns

| Writer | Data Written | Frequency | Write Method | Consistency |
|--------|-------------|-----------|-------------|------------|
| User actions (web) | CRUD operations | Per user action | REST API | Strong (synchronous) |
| Background worker | Aggregates, computed fields | Every {{INTERVAL}} | Direct DB write | Eventual |
| Import process | Bulk records | {{FREQUENCY}} | Batch insert | Strong (per batch) |
| Event consumer | Denormalized cache | On event | Direct DB write | Eventual |

---

## 6. Data Retention & Archival

<!-- GUIDANCE: Define a retention period for each data type. GDPR's storage-limitation principle (Art. 5(1)(e)) requires that personal data be kept no longer than necessary, so every data category needs a documented retention period and a deletion or archival process. -->

| Data Category | Retention Period | Legal Basis | Action at Expiry | Automated? |
|--------------|-----------------|-------------|-----------------|-----------|
| User account data | Duration of relationship + {{N}} years | Contract | Soft delete → anonymize | Yes (nightly job) |
| Transaction records | {{N}} years | Legal obligation ({{REGULATION}}) | Archive to cold storage | Yes |
| Audit logs | {{N}} years | Legitimate interest (security) | Delete | Yes |
| Session tokens | {{N}} hours/days | Legitimate interest (technical necessity) | Auto-expire via TTL | Yes (Redis TTL) |
| Marketing consent | Until withdrawn | Consent | Delete within {{N}} days of withdrawal | Partial (manual + automated) |
| Analytics data | {{N}} years (anonymized) | Legitimate interest | Delete | Yes |
| Backup files | {{N}} days | Legitimate interest (business continuity) | Overwrite (rolling) | Yes |
| Error logs | {{N}} days | Legitimate interest | Delete | Yes |

**Retention schedule job:** `retention-policy.job.ts` — runs daily at {{TIME}} UTC
**Archival target:** {{COLD_STORAGE_LOCATION}}
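
The core decision a retention job makes can be sketched as below. This is an illustration in Python, not a rendering of `retention-policy.job.ts`: the `RetentionPolicy` type, the record shape, and the choice to raise on an unknown category are all assumptions. Retention periods are passed in rather than hard-coded because the table leaves them as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetentionPolicy:
    category: str
    retention: timedelta
    action_at_expiry: str  # e.g. "anonymize", "archive", "delete"


def due_actions(records, policies, now):
    """Yield (record_id, action) for every record past its retention period."""
    for record in records:
        policy = policies.get(record["category"])
        if policy is None:
            # Assumption: an unmapped category is a configuration error,
            # not a reason to keep data indefinitely.
            raise KeyError(f"no retention policy for {record['category']!r}")
        if now >= record["retention_start"] + policy.retention:
            yield record["id"], policy.action_at_expiry
```

Here `retention_start` stands for whatever event starts the clock for a category, such as account closure for user data or creation time for logs.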

---

## 7. Data Quality Rules

<!-- GUIDANCE: Define the quality checks that run on data. Poor data quality causes bad decisions and compliance issues. -->

### 7.1 Validation Rules

| Field | Rule | Error Action | Severity |
|-------|------|-------------|---------|
| `email` | Valid RFC 5322 format | Reject | CRITICAL |
| `phone` | E.164 format | Reject | HIGH |
| `{{DATE_FIELD}}` | Not in future | Reject | HIGH |
| `{{AMOUNT_FIELD}}` | >= 0 | Reject | CRITICAL |
| `{{FK_FIELD}}` | References existing record | Reject | CRITICAL |
| `{{TEXT_FIELD}}` | Max {{N}} characters | Reject | MEDIUM |
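
A few of these rules can be sketched as a single validation function. The field names `created_on`, `amount`, and `notes` are hypothetical stand-ins for the placeholders above, the email pattern is a simplified shape check rather than a full RFC 5322 parser, and the foreign-key rule is omitted because it needs a database lookup.

```python
import re
from datetime import date

# Simplified shape check; full RFC 5322 parsing is out of scope here.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
# E.164: "+" followed by 2 to 15 digits with no leading zero.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")


def validate_record(record, today, max_text_len=255):
    """Return (field, severity) pairs for each failed rule; empty means valid."""
    errors = []
    if not EMAIL_RE.fullmatch(str(record.get("email") or "")):
        errors.append(("email", "CRITICAL"))
    if not E164_RE.fullmatch(str(record.get("phone") or "")):
        errors.append(("phone", "HIGH"))
    created = record.get("created_on")  # expected to be a datetime.date
    if not isinstance(created, date) or created > today:
        errors.append(("created_on", "HIGH"))
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append(("amount", "CRITICAL"))
    if len(str(record.get("notes") or "")) > max_text_len:
        errors.append(("notes", "MEDIUM"))
    return errors
```

Returning every failure at once, rather than stopping at the first, lets the caller report all problems in a single 400 response.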

### 7.2 Data Quality Metrics

| Metric | Target | Current | Alert Threshold |
|--------|--------|---------|----------------|
| Null rate on required fields | 0% | {{CURRENT}} | > 0.1% |
| Duplicate rate | < 0.01% | {{CURRENT}} | > 0.1% |
| Schema validation pass rate | > 99.9% | {{CURRENT}} | < 99% |
| ETL pipeline success rate | > 99.5% | {{CURRENT}} | < 98% |
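
The first two metrics can be computed directly from a batch of rows and compared with the alert thresholds in the table. The function names and the choice of key fields for duplicate detection are assumptions made for this sketch.

```python
# Alert thresholds from the table above, as fractions (0.1% = 0.001).
ALERT_THRESHOLDS = {"null_rate": 0.001, "duplicate_rate": 0.001}


def null_rate(rows, required_fields):
    """Share of required cells that are missing or None."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 0.0
    missing = sum(1 for row in rows for f in required_fields if row.get(f) is None)
    return missing / total


def duplicate_rate(rows, key_fields):
    """Share of rows that repeat an earlier row's key."""
    if not rows:
        return 0.0
    keys = [tuple(row.get(f) for f in key_fields) for row in rows]
    return (len(keys) - len(set(keys))) / len(keys)


def breached(metrics, thresholds=ALERT_THRESHOLDS):
    """Return the names of metrics that exceed their alert threshold."""
    return sorted(name for name, value in metrics.items() if value > thresholds[name])
```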

---

## 8. PII Data Flow Mapping

<!-- GUIDANCE: Critical for GDPR compliance. Map exactly where PII is stored, processed, and shared. This feeds directly into the DPIA. -->

### 8.1 PII Inventory

| PII Category | Fields | Storage Location | Encrypted? | Access Controls | Lawful Basis |
|-------------|--------|-----------------|-----------|----------------|-------------|
| Contact info | `email`, `phone`, `address` | Primary DB, Email system | Yes | Role-based (user self + admin) | Contract |
| Identity | `full_name`, `date_of_birth` | Primary DB | Yes (field-level) | Role-based | Contract |
| Financial | `{{PAYMENT_FIELD}}` | {{PAYMENT_PROVIDER}} (tokenized) | Tokenized | PCI scope only | Contract |
| Behavioral | `login_history`, `click_events` | Analytics DB | No (anonymized) | Admin only | Legitimate interest |
| Location | `ip_address` (→ geo) | Logs (masked) | N/A | Admin only | Legitimate interest |
| Device | `user_agent`, `device_id` | Analytics DB | No | Admin only | Legitimate interest |

### 8.2 PII Flow Diagram

```mermaid
flowchart TD
    USER([Data Subject]) -->|Provides| INGESTION[Ingestion Layer]
    INGESTION -->|Validates & encrypts| DB[("Primary DB<br/>PII encrypted at rest")]
    DB -->|Pseudonymized| DW[("Data Warehouse<br/>No direct PII")]
    DB -->|Masked in logs| LOGS[Log Aggregator]
    DB -->|Tokenized| PAYMENT["Payment Provider<br/>PCI scope"]
    DB -->|Explicit consent| EMAIL["Email Provider<br/>Email + name only"]
    DB -->|Right to erasure| DELETION[Anonymization Service]
    DELETION -->|Anonymized| DB
    DB -->|Audit trail| AUDIT["Audit Log<br/>Restricted access"]

    style DB fill:#ffcccc
    style DW fill:#ccffcc
    style LOGS fill:#ffffcc
    style AUDIT fill:#ffcccc
```

---

## 9. Cross-Border Data Transfer

<!-- GUIDANCE: GDPR Chapter V restricts transfers of personal data to countries outside the EEA that lack an adequacy decision. Document every transfer and the legal mechanism that covers it. -->

| Transfer | From | To | Data Category | Mechanism | DPA/SCCs? |
|---------|------|-----|--------------|-----------|----------|
| {{TRANSFER_1}} | EU ({{COUNTRY}}) | US ({{PROVIDER}}) | {{DATA_CATEGORY}} | Standard Contractual Clauses (SCCs) | Yes, signed {{SCC_SIGNED_DATE}} |
| {{TRANSFER_2}} | EU | {{COUNTRY}} | {{DATA_CATEGORY}} | Adequacy decision | N/A |
| {{TRANSFER_3}} | EU | {{COUNTRY}} | {{DATA_CATEGORY}} | Binding Corporate Rules | {{YES/NO}} |

**Third-party processors with data access:**

| Processor | Service | Data Accessed | DPA Signed | Location |
|---------|---------|--------------|-----------|---------|
| {{PROCESSOR_1}} | {{SERVICE}} | {{DATA}} | Yes | {{LOCATION}} |
| {{PROCESSOR_2}} | {{SERVICE}} | {{DATA}} | Yes | {{LOCATION}} |

---

## 10. Data Lineage Tracking

<!-- GUIDANCE: For compliance and debugging, you need to trace where data came from and where it went. -->

**Lineage tool:** {{LINEAGE_TOOL}} (e.g., Apache Atlas, DataHub, custom)
**Coverage:** Primary DB + DW

### Lineage Events Captured

```json
{
  "eventType": "DATA_WRITE",
  "timestamp": "ISO8601",
  "actor": "system/user-id",
  "action": "CREATE | UPDATE | DELETE | EXPORT | IMPORT",
  "resource": {
    "type": "{{ENTITY}}",
    "id": "UUID"
  },
  "fieldsModified": ["{{field1}}", "{{field2}}"],
  "sourceSystem": "{{SOURCE}}",
  "traceId": "UUID"
}
```
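
Once events of this shape are collected, tracing a single record is a filter and a sort. The sketch below assumes events are already parsed into dictionaries and that timestamps are ISO 8601 strings in one consistent UTC format, so they sort correctly as text; `history_for` and `summarize` are hypothetical helper names.

```python
def history_for(events, resource_type, resource_id):
    """Return the events that touched one resource, oldest first."""
    matching = [
        e for e in events
        if e["resource"]["type"] == resource_type
        and e["resource"]["id"] == resource_id
    ]
    return sorted(matching, key=lambda e: e["timestamp"])


def summarize(history):
    """Render each event as one human-readable line."""
    return [
        f'{e["timestamp"]} {e["action"]} via {e["sourceSystem"]} by {e["actor"]}'
        for e in history
    ]
```

The `traceId` field can then tie each line back to the request or pipeline run that produced it.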

---

## 11. Backup & Recovery for Data

<!-- GUIDANCE: Define the backup strategy per storage system. Include recovery time objective (RTO) and recovery point objective (RPO) targets. -->

| Storage | Backup Method | Frequency | Retention | RTO | RPO | Test Frequency |
|---------|--------------|-----------|-----------|-----|-----|----------------|
| Primary DB | Continuous WAL archiving + snapshots | Continuous / Daily | 30 days | 1h | 5min | Monthly |
| Object Storage | Cross-region replication | Continuous | 30 days | 4h | 1h | Quarterly |
| Data Warehouse | Snapshot | Daily | 14 days | 8h | 24h | Quarterly |
| Redis Cache | RDB snapshots | Every 15min | 24h | 15min | 15min | Monthly |

**Last backup test:** {{LAST_BACKUP_TEST_DATE}} (result: {{PASS/FAIL}})
**Recovery runbook:** {{LINK_TO_RUNBOOK}}
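
The RPO column can be monitored by comparing the age of each system's newest recovery point with its target. This is a minimal sketch: the system keys and the `rpo_violations` helper are hypothetical, the targets are copied from the table above, and a missing recovery point is treated as a violation by assumption.

```python
from datetime import datetime, timedelta, timezone

# RPO targets taken from the backup table.
RPO_TARGETS = {
    "primary_db": timedelta(minutes=5),
    "object_storage": timedelta(hours=1),
    "data_warehouse": timedelta(hours=24),
    "redis_cache": timedelta(minutes=15),
}


def rpo_violations(last_recovery_points, now):
    """Map each violating system to the age of its newest recovery point.

    A system with no recorded recovery point maps to None.
    """
    violations = {}
    for system, target in RPO_TARGETS.items():
        last = last_recovery_points.get(system)
        if last is None:
            violations[system] = None
        elif now - last > target:
            violations[system] = now - last
    return violations
```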

---

## Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Data Owner | | | |
| DPO / Privacy | | | |
| Security | | | |
| Tech Lead | | | |