Data Flow Document
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Classification: Public | Internal | Confidential | Restricted
Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
1. Data Flow Overview
System: {{SYSTEM_NAME}}
Data Owner: {{DATA_OWNER_ROLE}}
DPO Contact: {{DPO_EMAIL}}
Overview: {{DESCRIBE_WHAT_DATA_FLOWS_THROUGH_SYSTEM}}
High-Level Data Flow
flowchart LR
subgraph Inputs["Data Sources / Ingestion"]
U[Users — Web/App]
API_IN[External API]
IMPORT[Bulk Import]
WEBHOOK[Webhooks]
end
subgraph Processing["Processing Layer"]
VAL[Validation & Sanitization]
TRANS[Business Logic / Transformation]
ENRICH[Data Enrichment]
end
subgraph Storage["Storage Layer"]
DB[(Primary DB\nPostgreSQL)]
CACHE[(Cache\nRedis)]
BLOB[Object Storage\nS3/Blob]
SEARCH[Search Index\nElasticsearch]
DW[Data Warehouse\n{{DW_TECH}}]
end
subgraph Outputs["Data Consumers / Egress"]
API_OUT[REST API]
REPORTS[Reports / Analytics]
EXPORT[Data Export]
THIRD[Third-party Integrations]
EMAIL[Email / Notifications]
end
U & API_IN & IMPORT & WEBHOOK --> VAL
VAL --> TRANS
TRANS --> ENRICH
ENRICH --> DB & BLOB
DB --> CACHE & SEARCH
DB --> DW
DB --> API_OUT & REPORTS & EXPORT
DW --> REPORTS
ENRICH --> THIRD
DB --> EMAIL
2. Data Sources & Ingestion
| Source | Type | Protocol | Volume (est.) | Format | PII? | Validation |
|--------|------|----------|---------------|--------|------|------------|
| Web application users | Real-time | HTTPS POST | {{REQ_PER_DAY}} req/day | JSON | YES | Schema + business rules |
| Mobile app users | Real-time | HTTPS POST | {{REQ_PER_DAY}} req/day | JSON | YES | Schema + business rules |
| {{EXTERNAL_SYSTEM}} API | Real-time | Webhooks | {{EVENTS_PER_DAY}} events/day | JSON | {{YES/NO}} | HMAC signature + schema |
| CSV bulk import | Batch | File upload | {{IMPORTS_PER_DAY}} files/day | CSV | {{YES/NO}} | Column mapping + row validation |
| {{THIRD_PARTY_API}} | Polling | REST/HTTPS | {{CALLS_PER_HOUR}}/hour | JSON | {{YES/NO}} | Response schema validation |
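The "HMAC signature + schema" validation listed for webhook sources could look roughly like the sketch below. It assumes an HMAC-SHA256 signature delivered as a hex string and a hypothetical two-field payload (`id`, `type`); the real algorithm, header name, and schema depend on {{EXTERNAL_SYSTEM}} and are not specified in this document.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape; the real schema is defined per source system.
interface WebhookEvent {
  id: string;
  type: string;
}

// Recompute the HMAC-SHA256 of the raw body and compare it in constant time.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Minimal structural check standing in for full schema validation.
function isWebhookEvent(value: unknown): value is WebhookEvent {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["id"] === "string" && typeof candidate["type"] === "string";
}

// Returns the parsed event, or null when the signature or the schema check fails.
function acceptWebhook(rawBody: string, signatureHex: string, secret: string): WebhookEvent | null {
  if (!verifySignature(rawBody, signatureHex, secret)) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return null;
  }
  return isWebhookEvent(parsed) ? parsed : null;
}
```

The signature is checked against the raw request body before parsing, so a tampered or re-serialized payload is rejected without ever reaching the schema check.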
Ingestion Error Handling
| Error Type | Action | Notification |
|------------|--------|--------------|
| Schema validation failure | Reject with error details | Return 400 to caller |
| Duplicate record | Upsert (prefer existing) or reject | Log, return 409 |
| PII fields contain unexpected data | Quarantine + alert | Slack #{{CHANNEL}} |
| Import file corrupted | Reject entire file | Email uploader + error report |
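One way to keep this table enforceable is to encode it as a lookup that the ingestion layer consults. The sketch below is illustrative only: the error codes, action names, and notification labels are hypothetical identifiers chosen to mirror the rows above, and HTTP status codes appear only where the table names one.

```typescript
// Hypothetical error codes, one per row of the error handling table.
type IngestionError =
  | "SCHEMA_VALIDATION_FAILURE"
  | "DUPLICATE_RECORD"
  | "UNEXPECTED_PII_CONTENT"
  | "CORRUPTED_IMPORT_FILE";

interface ErrorPolicy {
  action: "reject" | "upsert-or-reject" | "quarantine" | "reject-file";
  notify: "caller" | "log" | "slack" | "email-uploader";
  httpStatus?: number; // set only where the table names a status code
}

const INGESTION_ERROR_POLICY: Record<IngestionError, ErrorPolicy> = {
  SCHEMA_VALIDATION_FAILURE: { action: "reject", notify: "caller", httpStatus: 400 },
  DUPLICATE_RECORD: { action: "upsert-or-reject", notify: "log", httpStatus: 409 },
  UNEXPECTED_PII_CONTENT: { action: "quarantine", notify: "slack" },
  CORRUPTED_IMPORT_FILE: { action: "reject-file", notify: "email-uploader" },
};

// Look up the documented handling for a given ingestion error.
function policyFor(error: IngestionError): ErrorPolicy {
  return INGESTION_ERROR_POLICY[error];
}
```

Because the record type is keyed by the full union, adding a new error code without a policy fails at compile time.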
3.1 Ingestion Transformations (before storage)
3.2 ETL Pipeline (to Data Warehouse)
flowchart LR
subgraph Extract["Extract"]
PGLOG[PostgreSQL WAL / CDC]
SCHED[Scheduled SQL Export]
end
subgraph Transform["Transform ({{TRANSFORM_TOOL}})"]
CLEAN[Data Cleaning]
JOIN[Joins & Aggregations]
DEDUP[Deduplication]
ANON[PII Anonymization]
end
subgraph Load["Load"]
DW[({{DATA_WAREHOUSE}})]
end
PGLOG --> CLEAN
SCHED --> CLEAN
CLEAN --> JOIN
JOIN --> DEDUP
DEDUP --> ANON
ANON --> DW
Pipeline schedule: {{PIPELINE_SCHEDULE}} (e.g., hourly incremental, daily full)
Latency: Source to DW within {{MAX_LATENCY}}
Tool: {{ETL_TOOL}} (e.g., dbt, Airbyte, custom)
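As a rough illustration of the transform stages in the diagram, the sketch below chains cleaning, deduplication, and anonymization as plain functions (the join and aggregation stage is omitted). The row shape and the specific rules (normalizing a country code, keeping the latest row per `id`, dropping `email`) are assumptions for the example, not the behavior of {{TRANSFORM_TOOL}}.

```typescript
// Hypothetical source row; real columns depend on the system being documented.
interface SourceRow {
  id: string;
  email: string | null;
  country: string | null;
  updatedAt: string; // ISO 8601 timestamp
}

interface WarehouseRow {
  id: string;
  country: string;
  updatedAt: string;
}

// Clean: trim and upper-case the country code, drop rows without one.
function clean(rows: SourceRow[]): SourceRow[] {
  return rows
    .map((row) => ({ ...row, country: row.country?.trim().toUpperCase() ?? null }))
    .filter((row) => row.country !== null && row.country !== "");
}

// Deduplicate: keep the most recently updated row per id.
function dedup(rows: SourceRow[]): SourceRow[] {
  const latest = new Map<string, SourceRow>();
  for (const row of rows) {
    const seen = latest.get(row.id);
    if (!seen || Date.parse(row.updatedAt) > Date.parse(seen.updatedAt)) {
      latest.set(row.id, row);
    }
  }
  return Array.from(latest.values());
}

// Anonymize: drop the direct identifier (email) before load.
// Keeping `id` as a join key is, strictly speaking, pseudonymization.
function anonymize(rows: SourceRow[]): WarehouseRow[] {
  return rows.map(({ id, country, updatedAt }) => ({ id, country: country ?? "", updatedAt }));
}

// Stage order follows the diagram: CLEAN, then DEDUP, then ANON.
function transform(rows: SourceRow[]): WarehouseRow[] {
  return anonymize(dedup(clean(rows)));
}
```

Running anonymization last means no stage after it ever sees direct identifiers, which matches the "Anonymized" classification given to the warehouse.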
4. Data Storage
| Storage System | Technology | Purpose | Data Classification | Encryption at Rest |
|----------------|------------|---------|---------------------|--------------------|
| Primary Database | {{DB_TECH}} {{VERSION}} | Transactional data | Confidential | AES-256 ({{KEY_MGMT}}) |
| Cache | Redis {{VERSION}} | Hot data, sessions | Internal | AES-256 |
| Object Storage | {{S3_COMPATIBLE}} | Files, documents, media | {{CLASSIFICATION}} | SSE-S3 / SSE-KMS |
| Search Index | Elasticsearch | Full-text search | Internal | TLS + at-rest encryption |
| Data Warehouse | {{DW}} | Analytics, reporting | Anonymized | {{DW_ENCRYPTION}} |
| Backup Storage | {{BACKUP_TECH}} | Disaster recovery | Restricted | AES-256 |
| Audit Logs | {{LOG_STORAGE}} | Compliance / audit trail | Restricted | Immutable, encrypted |
5. Data Access Patterns
5.1 Read Patterns
| Consumer | Data Accessed | Frequency | Access Method | Caching |
|----------|---------------|-----------|---------------|---------|
| Web application | User profile, settings | Per request | REST API | Redis 5min TTL |
| Web application | {{ENTITY}} list | Per page load | REST API (paginated) | CDN + Redis |
| Reporting service | Aggregated metrics | Every 1h | DW query | Materialized views |
| Admin dashboard | Raw records | On demand | REST API (admin) | No cache |
| External partner | {{SUBSET_OF_DATA}} | {{FREQUENCY}} | REST API (scoped JWT) | {{CACHING}} |
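The "Redis 5min TTL" entry for user profile reads describes a read-through (cache-aside) pattern. The sketch below shows the idea with an in-memory map standing in for Redis and a synchronous loader standing in for the database call; both are simplifications for illustration, and a real implementation would be asynchronous.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for a TTL cache such as Redis.
class TtlCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Serve from cache when possible; otherwise load from the source and cache the result.
function readThrough<T>(cache: TtlCache<T>, key: string, load: () => T): T {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = load();
  cache.set(key, fresh);
  return fresh;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
```

The clock is injectable so expiry can be tested deterministically; with `FIVE_MINUTES_MS` as the TTL, a profile is reloaded from the database at most once every five minutes per key.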
5.2 Write Patterns
| Writer | Data Written | Frequency | Write Method | Consistency |
|--------|--------------|-----------|--------------|-------------|
| User actions (web) | CRUD operations | Per user action | REST API | Strong (synchronous) |
| Background worker | Aggregates, computed fields | Every {{INTERVAL}} | Direct DB write | Eventual |
| Import process | Bulk records | {{FREQUENCY}} | Batch insert | Strong (per batch) |
| Event consumer | Denormalized cache | On event | Direct DB write | Eventual |
6. Data Retention & Archival
| Data Category | Retention Period | Legal Basis | Action at Expiry | Automated? |
|---------------|------------------|-------------|------------------|------------|
| User account data | Duration of relationship + {{N}} years | Contract | Soft delete → anonymize | Automated (nightly job) |
| Transaction records | {{N}} years | Legal obligation ({{REGULATION}}) | Archive to cold storage | Automated |
| Audit logs | {{N}} years | Legitimate interest (security) | Delete | Automated |
| Session tokens | {{N}} hours/days | Technical necessity | Auto-expire via TTL | Yes (Redis TTL) |
| Marketing consent | Until withdrawn | Consent | Delete within {{N}} days of withdrawal | Manual + automated |
| Analytics data | {{N}} years (anonymized) | Legitimate interest | Delete | Automated |
| Backup files | {{N}} days | Business continuity | Overwrite (rolling) | Automated |
| Error logs | {{N}} days | Legitimate interest | Delete | Automated |
Retention schedule job: retention-policy.job.ts — runs daily at {{TIME}} UTC
Archival target: {{COLD_STORAGE_LOCATION}}
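The core decision inside a retention job such as retention-policy.job.ts can be sketched as a pure function that maps records to the action due at expiry. This is a minimal sketch, not the contents of that file: the rule and record shapes are hypothetical, and retention periods are passed in because the table leaves them as {{N}} placeholders.

```typescript
type ExpiryAction = "anonymize" | "archive" | "delete";

interface RetentionRule {
  category: string;
  retentionDays: number; // placeholder values; the table leaves these as {{N}}
  action: ExpiryAction;
}

interface StoredRecord {
  id: string;
  category: string;
  retentionStart: string; // ISO 8601 timestamp from which retention is counted
}

interface DueAction {
  id: string;
  action: ExpiryAction;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the action to take for every record whose retention period has elapsed.
function dueActions(records: StoredRecord[], rules: RetentionRule[], now: Date): DueAction[] {
  const byCategory = new Map<string, RetentionRule>();
  for (const rule of rules) byCategory.set(rule.category, rule);

  const due: DueAction[] = [];
  for (const record of records) {
    const rule = byCategory.get(record.category);
    if (!rule) continue; // no rule for this category: leave the record untouched
    const expiresAt = Date.parse(record.retentionStart) + rule.retentionDays * DAY_MS;
    if (expiresAt <= now.getTime()) due.push({ id: record.id, action: rule.action });
  }
  return due;
}
```

Separating the decision from the side effects (delete, archive, anonymize) keeps the policy testable and lets the job log what it intends to do before acting.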
7. Data Quality Rules
7.1 Validation Rules
| Field | Rule | Error Action | Severity |
|-------|------|--------------|----------|
| email | Valid RFC 5322 format | Reject | CRITICAL |
| phone | E.164 format | Reject | HIGH |
| {{DATE_FIELD}} | Not in future | Reject | HIGH |
| {{AMOUNT_FIELD}} | >= 0 | Reject | CRITICAL |
| {{FK_FIELD}} | References existing record | Reject | CRITICAL |
| {{TEXT_FIELD}} | Max {{N}} characters | Reject | MEDIUM |
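The rules above could be expressed as a single validation function along the following lines. The field names `eventDate`, `amount`, and `note` are hypothetical stand-ins for {{DATE_FIELD}}, {{AMOUNT_FIELD}}, and {{TEXT_FIELD}}; the email pattern is a deliberately simplified check, much looser than full RFC 5322; and the foreign-key rule is omitted because it needs a database lookup.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM";

interface Violation {
  field: string;
  severity: Severity;
  message: string;
}

// Hypothetical record shape for illustration.
interface Submission {
  email: string;
  phone: string;
  eventDate: string; // ISO 8601
  amount: number;
  note: string;
}

// Simplified email check: one "@", no whitespace, a dot in the domain part.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
// E.164: "+" followed by 2 to 15 digits, the first of which is non-zero.
const E164_RE = /^\+[1-9]\d{1,14}$/;

function validate(input: Submission, now: Date, maxNoteLength: number): Violation[] {
  const violations: Violation[] = [];
  if (!EMAIL_RE.test(input.email)) {
    violations.push({ field: "email", severity: "CRITICAL", message: "invalid email format" });
  }
  if (!E164_RE.test(input.phone)) {
    violations.push({ field: "phone", severity: "HIGH", message: "not in E.164 format" });
  }
  const eventTime = Date.parse(input.eventDate);
  if (Number.isNaN(eventTime) || eventTime > now.getTime()) {
    violations.push({ field: "eventDate", severity: "HIGH", message: "unparseable or in the future" });
  }
  if (!(input.amount >= 0)) {
    // Written as a negated comparison so that NaN is rejected too.
    violations.push({ field: "amount", severity: "CRITICAL", message: "must be >= 0" });
  }
  if (input.note.length > maxNoteLength) {
    violations.push({ field: "note", severity: "MEDIUM", message: "exceeds maximum length" });
  }
  return violations;
}
```

Every rule in the table has the error action "Reject", so a caller would reject the record whenever the returned list is non-empty and could use the severities for alerting.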
7.2 Data Quality Metrics
| Metric | Target | Current | Alert Threshold |
|--------|--------|---------|-----------------|
| Null rate on required fields | 0% | {{CURRENT}} | > 0.1% |
| Duplicate rate | < 0.01% | {{CURRENT}} | > 0.1% |
| Schema validation pass rate | > 99.9% | {{CURRENT}} | < 99% |
| ETL pipeline success rate | > 99.5% | {{CURRENT}} | < 98% |
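The alert thresholds can be checked mechanically once current values are available. In the sketch below the metric keys are hypothetical identifiers for the four rows above; the thresholds and their direction (alert above or alert below) are taken from the table.

```typescript
interface MetricRule {
  name: string; // hypothetical identifier for a table row
  alertWhen: "above" | "below";
  thresholdPct: number;
}

const METRIC_RULES: MetricRule[] = [
  { name: "null_rate_required_fields", alertWhen: "above", thresholdPct: 0.1 },
  { name: "duplicate_rate", alertWhen: "above", thresholdPct: 0.1 },
  { name: "schema_validation_pass_rate", alertWhen: "below", thresholdPct: 99 },
  { name: "etl_pipeline_success_rate", alertWhen: "below", thresholdPct: 98 },
];

// Return the names of metrics whose current value crosses its alert threshold.
// Metrics with no reported value are skipped rather than treated as breaches.
function breachedMetrics(currentPct: Record<string, number | undefined>): string[] {
  return METRIC_RULES.filter((rule) => {
    const value = currentPct[rule.name];
    if (value === undefined) return false;
    return rule.alertWhen === "above" ? value > rule.thresholdPct : value < rule.thresholdPct;
  }).map((rule) => rule.name);
}
```

Note the gap between target and alert threshold in the table: a duplicate rate of 0.05% misses the target of < 0.01% but does not trigger an alert, which only fires above 0.1%.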
8. PII Data Flow Mapping
8.1 PII Inventory
| PII Category | Fields | Storage Location | Encrypted? | Access Controls | Lawful Basis |
|--------------|--------|------------------|------------|-----------------|--------------|
| Contact info | email, phone, address | Primary DB, Email system | Yes | Role-based (user self + admin) | Contract |
| Identity | full_name, date_of_birth | Primary DB | Yes (field-level) | Role-based | Contract |
| Financial | {{PAYMENT_FIELD}} | {{PAYMENT_PROVIDER}} (tokenized) | Tokenized | PCI scope only | Contract |
| Behavioral | login_history, click_events | Analytics DB | No (anonymized) | Admin only | Legitimate interest |
| Location | ip_address (→ geo) | Logs (masked) | N/A | Admin only | Legitimate interest |
| Device | user_agent, device_id | Analytics DB | No | Admin only | Legitimate interest |
8.2 PII Flow Diagram
flowchart TD
USER([Data Subject]) -->|Provides| INGESTION[Ingestion Layer]
INGESTION -->|Validates & encrypts| DB[(Primary DB\nPII encrypted at rest)]
DB -->|Pseudonymized| DW[(Data Warehouse\nNo direct PII)]
DB -->|Masked in logs| LOGS[Log Aggregator]
DB -->|Tokenized| PAYMENT[Payment Provider\nPCI scope]
DB -->|Explicit consent| EMAIL[Email Provider\nEmail + name only]
DB -->|Right to erasure| DELETION[Anonymization Service]
DELETION -->|Anonymized| DB
DB -->|Audit trail| AUDIT[Audit Log\nRestricted access]
style DB fill:#ffcccc
style DW fill:#ccffcc
style LOGS fill:#ffffcc
style AUDIT fill:#ffcccc
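The "Masked in logs" edge in the diagram could be implemented with small masking helpers applied before a log entry leaves the service. The masking rules below (zeroing the last IPv4 octet, keeping only the first character of the email local part) are assumptions for illustration; the document does not prescribe a specific scheme, and IPv6 addresses are simply redacted here.

```typescript
// Zero the last octet of a dotted-quad IPv4 address; redact anything else entirely.
function maskIpv4(ip: string): string {
  const parts = ip.split(".");
  const valid = parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  return valid ? `${parts[0]}.${parts[1]}.${parts[2]}.0` : "[redacted]";
}

// Keep the first character of the local part and the domain; hide the rest.
function maskEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0) return "[redacted]";
  return `${email.charAt(0)}***${email.slice(at)}`;
}

// Mask known PII keys in a flat log entry; other fields pass through unchanged.
function maskLogFields(entry: Record<string, string>): Record<string, string> {
  const masked: Record<string, string> = {};
  for (const key of Object.keys(entry)) {
    const value = entry[key] ?? "";
    if (key === "ip_address") masked[key] = maskIpv4(value);
    else if (key === "email") masked[key] = maskEmail(value);
    else masked[key] = value;
  }
  return masked;
}
```

For example, `maskLogFields({ ip_address: "203.0.113.42", path: "/login" })` keeps the path and reduces the address to `203.0.113.0`.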
9. Cross-Border Data Transfer
| Transfer | From | To | Data Category | Mechanism | DPA/SCCs? |
|----------|------|----|---------------|-----------|-----------|
| {{TRANSFER_1}} | EU ({{COUNTRY}}) | US ({{PROVIDER}}) | {{DATA_CATEGORY}} | Standard Contractual Clauses (SCCs) | Yes, signed {{DATE}} |
| {{TRANSFER_2}} | EU | {{COUNTRY}} | {{DATA_CATEGORY}} | Adequacy decision | N/A |
| {{TRANSFER_3}} | EU | {{COUNTRY}} | {{DATA_CATEGORY}} | Binding Corporate Rules | {{YES/NO}} |
Third-party processors with data access:
| Processor | Service | Data Accessed | DPA Signed | Location |
|-----------|---------|---------------|------------|----------|
| {{PROCESSOR_1}} | {{SERVICE}} | {{DATA}} | Yes | {{LOCATION}} |
| {{PROCESSOR_2}} | {{SERVICE}} | {{DATA}} | Yes | {{LOCATION}} |
10. Data Lineage Tracking
Lineage tool: {{LINEAGE_TOOL}} (e.g., Apache Atlas, DataHub, custom)
Coverage: Primary DB + DW
Lineage Events Captured
{
"eventType": "DATA_WRITE",
"timestamp": "ISO8601",
"actor": "system/user-id",
"action": "CREATE | UPDATE | DELETE | EXPORT | IMPORT",
"resource": {
"type": "{{ENTITY}}",
"id": "UUID"
},
"fields_modified": ["{{field1}}", "{{field2}}"],
"sourceSystem": "{{SOURCE}}",
"traceId": "UUID"
}
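A consumer of these events may want to check them against the shape above before storing them. The sketch below mirrors the sample's field names exactly (including the snake_case `fields_modified`) and returns the names of fields that fail a check; the UUID pattern and the decision to validate with `Date.parse` are assumptions, not part of any lineage tool's API.

```typescript
const LINEAGE_ACTIONS: readonly string[] = ["CREATE", "UPDATE", "DELETE", "EXPORT", "IMPORT"];
// Generic 8-4-4-4-12 hexadecimal UUID layout, without checking the version bits.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.length > 0;
}

function isUuid(v: unknown): boolean {
  return typeof v === "string" && UUID_RE.test(v);
}

function isResource(v: unknown): boolean {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return isNonEmptyString(r["type"]) && isUuid(r["id"]);
}

// Return the names of fields that are missing or malformed; an empty list means valid.
function lineageEventProblems(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["event"];
  const e = value as Record<string, unknown>;
  const timestamp = e["timestamp"];
  const action = e["action"];
  const fields = e["fields_modified"];

  const problems: string[] = [];
  if (!isNonEmptyString(e["eventType"])) problems.push("eventType");
  if (typeof timestamp !== "string" || Number.isNaN(Date.parse(timestamp))) problems.push("timestamp");
  if (!isNonEmptyString(e["actor"])) problems.push("actor");
  if (typeof action !== "string" || LINEAGE_ACTIONS.indexOf(action) === -1) problems.push("action");
  if (!isResource(e["resource"])) problems.push("resource");
  if (!Array.isArray(fields) || !fields.every((f) => typeof f === "string")) problems.push("fields_modified");
  if (!isNonEmptyString(e["sourceSystem"])) problems.push("sourceSystem");
  if (!isUuid(e["traceId"])) problems.push("traceId");
  return problems;
}
```

Returning field names rather than a boolean makes rejected events easier to debug when they are logged or quarantined.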
11. Backup & Recovery for Data
| Storage | Backup Method | Frequency | Retention | RTO | RPO | Test Frequency |
|---------|---------------|-----------|-----------|-----|-----|----------------|
| Primary DB | Continuous WAL archiving + snapshots | Continuous / Daily | 30 days | 1h | 5min | Monthly |
| Object Storage | Cross-region replication | Continuous | 30 days | 4h | 1h | Quarterly |
| Data Warehouse | Snapshot | Daily | 14 days | 8h | 24h | Quarterly |
| Redis Cache | RDB snapshots | Every 15min | 24h | 15min | 15min | Monthly |
Last backup test: {{DATE}} — Result: {{PASS/FAIL}}
Recovery runbook: {{LINK_TO_RUNBOOK}}
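The RPO column can be monitored by comparing the age of the most recent backup (or replicated write) against the stated objective. The sketch below hard-codes the RPO values from the table; how the "last backup" timestamps are obtained is left open, and the input map is a hypothetical shape for illustration.

```typescript
interface BackupTarget {
  storage: string;
  rpoMinutes: number; // from the RPO column of the backup table
}

const BACKUP_TARGETS: BackupTarget[] = [
  { storage: "Primary DB", rpoMinutes: 5 },
  { storage: "Object Storage", rpoMinutes: 60 },
  { storage: "Data Warehouse", rpoMinutes: 24 * 60 },
  { storage: "Redis Cache", rpoMinutes: 15 },
];

// Return the storage systems whose latest backup is older than their RPO.
// A missing or unparseable timestamp counts as a violation.
function rpoViolations(lastBackupAt: Record<string, string | undefined>, now: Date): string[] {
  return BACKUP_TARGETS.filter((target) => {
    const last = lastBackupAt[target.storage];
    if (last === undefined) return true;
    const ageMinutes = (now.getTime() - Date.parse(last)) / 60000;
    // Negated comparison so that NaN (unparseable timestamp) is flagged.
    return !(ageMinutes <= target.rpoMinutes);
  }).map((target) => target.storage);
}
```

A check like this complements, but does not replace, the restore tests in the Test Frequency column: it confirms backups are recent, not that they can be restored within the RTO.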
Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Data Owner | | | |
| DPO / Privacy | | | |
| Security | | | |
| Tech Lead | | | |