System Design for Customer Data Platforms
Shifting gears from cryptography to system design: in this post, I’ll explore how Customer Data Platforms (CDPs) are built, covering architecture, data flow, and real-world insights.
Project source code: https://github.com/quang-ng/cdp
Project Description
A comprehensive Customer Data Platform that enables businesses to unify, enrich, and activate their customer data for better insights and engagement.
Key Features
Unified Customer Profiles - Quickly and easily unify all of your customer data into clean, enriched customer profiles
Data Enrichment - Fill the gaps in each profile with intelligent data matching and enhancement
Predictive Analytics - Predict customer behavior using advanced machine learning models
AI-Powered Analysis - AI-driven insights and automated recommendations that make analysis easy
Secure Data Capture - Enterprise-grade security ensuring safe and compliant data handling
System Diagram
┌──────────────┐ ┌─────────────┐
│ Webhooks │ │ Schedulers │
│ │ │ │
│ • Real-time │ │ • BuildKite │
│ Events │ │ • Cron Jobs │
│ • User │ │ • ECS Tasks │
│ Actions │ │ │
└──────┬───────┘ └──────┬──────┘
│ │
│ │
│ ┌──────▼──────┐
│ │ECS Fargate │
│ │ │
│ │ • Singer Tap│
│ │ • Docker │
│ │ Container │
│ └──────┬──────┘
┌───────▼─────┐ │
│ API Gateway │ │
│ │ │
│ • Webhook │ │
│ Endpoints │ │
└──────┬──────┘ │
│ │
┌──────▼──────┐ │
│ Lambda │ │
│ │ │
│ • Validate │ │
│ • Process │ │
└──────┬──────┘ │
│ │
┌──────▼──────┐ │
│ SQS │ │
│ │ │
│ • Buffer │ │
│ Events │ │
└──────┬──────┘ │
│ │
┌──────▼──────┐ │
│Batch Proc. │ │
│ (Lambda) │ │
│ │ │
│ • Aggregate │ │
│ • 20k Batch │ │
└──────┬──────┘ │
│ │
├────────────────────┘
│
┌─────────▼─────────┐
│ S3 Raw │
│ (Integration) │
│ │
│ • NDJSON Files │
└─────────┬─────────┘
│
│
┌─────────▼─────────┐
│ PySpark │
│ Transformation │
│ │
│ • Data Cleaning │
│ • Normalization │
│ • Validation │
│ • Data Enrichment │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ S3 Processed │
│ (Curated Zone) │
│ │
│ • Clean Data │
│ • Partitioned │
│ • Schema Aligned │
│ • Enriched Data │
└─────────┬─────────┘
│
│
┌─────────│────────┐
│ │
┌─────────▼──────┐ ┌──────▼─────────┐
│ Database │ │ Elasticsearch │
│ (PostgreSQL) │ │ (Search & │
│ │ │ Analytics) │
│ • Customer │ │ │
│ Tables │ │ │
│ • Event Tables │ │ │
│ • Product │ │ │
│ Tables │ │ │
└──────┬─────────┘ └────────────────┘
│ │
│ │
┌──────▼──────┐ │
│ API │ │
│ (Enrichment │ │
│ & Query) │ │
└─────────────┘ │
│
│
┌───────────────────────────────▼──────────┐
│ Segmentation Analytics Reporting │
│ │
│ • Behavioral Segments │
│ • Demographic Groups │
│ • ML Models │
│ • Customer Insights │
│ • Cohort Analysis │
│ • AI-Powered Insights │
│ • Dashboards │
│ • KPIs │
│ • Business Intelligence │
│ • Real-time Metrics │
└─────────┬────────────────────────────────┘
│
▼
┌───────────────────────┐
│ Activation │
│ │
│ • Email Service │
│ • Facebook Ads │
│ • Google Ads │
│ • Other Marketing │
│ Platforms │
└───────────────────────┘
Deep Dive: System Components
S3 Raw (Integration Layer)
The S3 Raw bucket serves as the initial landing zone for all incoming customer data from various sources.
Purpose
Data Ingestion Hub - Central repository for raw, unprocessed data
Data Preservation - Maintains original data format for audit and compliance
Decoupling - Separates data collection from processing workflows
Data Organization
s3://cdp-raw-bucket/
├── acme_corp/
│ ├── salesforce/
│ │ ├── year=2025/
│ │ │ ├── month=01/
│ │ │ │ ├── day=15/
│ │ │ │ │ ├── order_20250115T010000.ndjson.gzip (5k records)
│ │ │ │ │ ├── product_20250115T010000.ndjson.gzip (5k records)
│ │ │ │ │ ├── order_20250115T020000.ndjson.gzip (4.5k records)
│ │ │ │ │ ├── product_20250115T020000.ndjson.gzip (480 records)
│ │ │ │ └── day=16/
│ │ │ │ ├── order_20250116T010000.ndjson.gzip
│ │ │ │ └── product_20250116T010000.ndjson.gzip
│ ├── shopify/
│ │ ├── year=2025/
│ │ │ ├── month=01/
│ │ │ │ ├── day=15/
│ │ │ │ │ ├── order_20250115T030000.ndjson.gzip
│ │ │ │ │ ├── product_20250115T030000.ndjson.gzip
│ │ │ │ │ └── customer_batch_002.ndjson.gzip
File Formats
NDJSON.GZIP - Standardized, compressed, line-by-line processable format
Data Sources
CRM Systems - Customer profiles, contact information, sales data
E-commerce Platforms - Orders, products, transaction history
Marketing Tools - Campaign data, email metrics, ad performance
Real-time Events - User interactions, page views, clicks
Third-party APIs - Social media data, enrichment services
Retention Policy
Raw Data - 7 years for compliance and historical analysis
Partitioning - By year/month/day for efficient querying
Lifecycle Management - Automatic transition to cheaper storage classes
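The lifecycle rules can be expressed in a few lines with boto3. Below is a minimal sketch, assuming the bucket is named cdp-raw-bucket; the transition windows (90 days to Infrequent Access, 1 year to Glacier) are illustrative defaults rather than values from the actual project.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical lifecycle policy: step raw files down to cheaper storage
# classes over time and expire them after roughly 7 years (2555 days).
s3.put_bucket_lifecycle_configuration(
    Bucket="cdp-raw-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-data-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 2555},  # ~7 years, per the retention policy
            }
        ]
    },
)
```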
How It Works
Data Extraction Pipeline:
Singer.io Taps - Custom-built data extractors for each provider
Implements Singer specification for standardized data extraction
Handles API rate limiting and incremental syncs
Batch Processing - Pushes data in 20k record batches to S3
Creates multiple smaller files per day instead of a single large file
Outputs data in Singer format (JSON records with schema)
BuildKite Scheduling - Orchestrates data extraction jobs
Cron-based scheduling for regular data pulls
Parallel execution for multiple customers/providers
Retry logic and failure notifications
Build pipeline triggers based on data availability
Docker Containerization - Packaging and deployment
Each tap packaged as lightweight Docker container
Consistent runtime environment across all extractors
Easy versioning and rollback capabilities
Isolated execution prevents interference between jobs
ECS Fargate Deployment - Serverless container execution
Auto-scaling based on scheduled workloads
No server management overhead
Cost-effective pay-per-use model
Built-in load balancing and health checks
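To make the batching step concrete, here is a minimal sketch of how a tap's record stream could be chunked into 20k-record NDJSON.gzip files and pushed to the raw bucket. The helper names and the hard-coded bucket are illustrative, not taken from the repo.

```python
import gzip
import io
import json
from datetime import datetime, timezone

import boto3

BATCH_SIZE = 20_000
s3 = boto3.client("s3")


def upload_batch(records, customer, provider, stream):
    """Compress one batch as NDJSON.gzip and push it to the raw bucket."""
    now = datetime.now(timezone.utc)
    key = (
        f"{customer}/{provider}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{stream}_{now:%Y%m%dT%H%M%S}.ndjson.gzip"
    )
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    s3.put_object(Bucket="cdp-raw-bucket", Key=key, Body=buf.getvalue())


def sink(record_iter, customer, provider, stream):
    """Consume a tap's record stream and flush every BATCH_SIZE records."""
    batch = []
    for record in record_iter:
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            upload_batch(batch, customer, provider, stream)
            batch = []
    if batch:  # flush the trailing partial batch
        upload_batch(batch, customer, provider, stream)
```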
Error Handling & Recovery Strategy
Key Features:
30-minute ECS timeouts with resume capability
Exponential backoff for rate limits (429) and server errors (5xx)
Batch-level recovery - failed batches retry independently
Singer state files enable resumable operations
Error Flow: Rate Limit → Backoff → Retry | Server Error → 3x Retry → Dead Letter Queue
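A minimal sketch of that retry behavior for a requests-based tap; the retry count and backoff base are illustrative.

```python
import time

import requests


def fetch_with_backoff(url, headers=None, max_retries=3, base_delay=2.0):
    """Retry on 429s and 5xx responses with exponential backoff.

    After max_retries failed attempts, the caller is expected to route
    the batch to the dead letter queue.
    """
    for attempt in range(max_retries + 1):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 429 or response.status_code >= 500:
            if attempt == max_retries:
                raise RuntimeError(f"Giving up on {url}: HTTP {response.status_code}")
            # Respect Retry-After when the provider sends it, else back off exponentially.
            delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
            time.sleep(delay)
            continue
        response.raise_for_status()
        return response.json()
```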
Performance & Scalability
Design:
20k Record Batches - Optimal for memory and parallel processing
Auto-scaling - ECS Fargate scales with workload demand
Incremental Loading - Singer state tracks progress
Targets:
1M+ records/hour per provider
2-hour data availability SLA
70-80% compression with gzip
Limits:
1000 concurrent ECS tasks per region
3500+ S3 PUT requests/second per prefix
Data Validation & Schema Management
Schema Validation:
JSON Schema Validation - Each batch validated against expected schema before S3 upload
Field Type Checking - Validate data types (string, integer, datetime) per field
Required Field Validation - Ensure mandatory fields are present and non-null
Schema Evolution Handling:
Schema Registry - Centralized store for provider schema versions
Backward Compatibility - New fields allowed, existing fields cannot be removed
Schema Drift Detection - Alert when incoming data doesn't match expected schema
Validation Process:
Singer Tap Extract → Schema Validation → Pass: Upload to S3 | Fail: Dead Letter Queue
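A sketch of the validation step using the jsonschema library; the order schema below is a made-up example, and routing the invalid list to the DLQ is left to the caller.

```python
from jsonschema import Draft7Validator

# Illustrative schema for an "order" stream.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "customer_id", "created_at", "total"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "created_at": {"type": "string", "format": "date-time"},
        "total": {"type": "number"},
    },
}

validator = Draft7Validator(ORDER_SCHEMA)


def split_valid_invalid(records):
    """Partition a batch into records that pass schema validation and records
    that should be routed to the dead letter queue with their error details."""
    valid, invalid = [], []
    for record in records:
        errors = [e.message for e in validator.iter_errors(record)]
        if errors:
            invalid.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, invalid
```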
Dead Letter Queue (DLQ) Usage:
Failed Extractions - Batches that fail after 3 retry attempts
Schema Validation Failures - Records that don't match expected schema
Data Quality Issues - Records with invalid data types or missing required fields
Rate Limit Exhaustion - Batches that repeatedly hit API rate limits
Authentication Failures - Records from providers with expired/invalid credentials
DLQ Processing:
Storage Location - s3://cdp-dlq-bucket/{customer}/{provider}/{error-type}/
Error Metadata - Each failed record includes error reason, timestamp, retry count
Manual Review - Engineering team reviews DLQ daily for pattern analysis
Reprocessing - Fixed records can be manually resubmitted to main pipeline
Alerting - Slack notifications when DLQ volume exceeds threshold
How Reprocessing Works:
Investigate Issue - Engineer reviews DLQ records and error metadata
Fix Root Cause - Update schema, fix credentials, or resolve data quality issues
Prepare Records - Clean/transform failed records to match expected format
Manual Resubmission - Use BuildKite pipeline to reprocess specific DLQ batches
Validation - Reprocessed records go through same validation pipeline
Monitor Results - Track success rate of reprocessed records
Validation Failures:
Invalid records sent to dead letter queue with error details
Alerts sent to engineering team for schema mismatches
Daily reports on validation success rates per provider
Webhook Handling
Webhook Architecture:
API Gateway - Public endpoint for receiving webhook events
Lambda Functions - Process incoming webhooks and validate payloads
SQS Queues - Buffer webhook events for reliable processing
Batch Processing - Aggregate events into 20k record batches for S3
Webhook Flow:
Provider Webhook → API Gateway → Lambda → SQS → Batch Processor → S3 Raw
Step-by-Step Process:
Provider Webhook → API Gateway
Shopify/Stripe/Salesforce sends real-time event (order created, payment completed, etc.)
API Gateway receives the HTTP POST request at public endpoint
API Gateway → Lambda
API Gateway forwards the webhook payload to Lambda function
Lambda function starts processing the webhook
Lambda Processing:
Auth Check - Validates webhook signature (HMAC/JWT) to ensure it's from real provider
Validate - Checks payload format and required fields
Extract Metadata - Determines customer, provider, event type from headers/payload
Lambda → SQS
Lambda sends validated webhook event to SQS queue
SQS acts as a buffer - holds events until ready for batch processing
SQS → Batch Processor
Separate Lambda function reads events from SQS
Aggregate - Collects events until reaching 20k records OR timeout (5 minutes)
Batch Processor → S3 Raw
Store NDJSON - Converts batch to NDJSON.gzip format
Uploads to S3 using same folder structure as Singer taps
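Putting the Lambda steps together, here is a minimal handler sketch covering the auth check, validation, and hand-off to SQS. The signature header name, secret lookup, and queue URL are assumptions; real providers each have their own signing scheme (Shopify base64-encodes the HMAC, Stripe uses a timestamped scheme, and so on).

```python
import hashlib
import hmac
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["WEBHOOK_QUEUE_URL"]             # assumed env var
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"].encode()  # assumed shared secret


def handler(event, context):
    body = event["body"]
    signature = event["headers"].get("x-webhook-signature", "")  # header name varies by provider

    # Auth check: verify the HMAC signature against the shared secret.
    expected = hmac.new(WEBHOOK_SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return {"statusCode": 401, "body": "invalid signature"}

    # Validate: payload must be JSON with the fields we route on.
    payload = json.loads(body)
    if "event_type" not in payload:
        return {"statusCode": 400, "body": "missing event_type"}

    # Buffer the event in SQS for the batch processor.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "provider": payload.get("provider", "unknown"),
            "event_type": payload["event_type"],
            "payload": payload,
        }),
    )
    # Immediate 200 OK so the provider does not retry.
    return {"statusCode": 200, "body": "ok"}
```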
Why This Design?
Immediate Response - Webhook providers get instant 200 OK response
Reliable Processing - SQS ensures no events are lost if Lambda fails
Batch Efficiency - Groups individual webhook events into efficient 20k record files
Consistent Format - Webhook data stored same way as batch extractions
Scalable - Can handle thousands of webhook events per second
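And the other half of the flow: a sketch of the SQS-triggered batch processor that gzips the accumulated events as NDJSON and writes them to S3 Raw. In practice the 20k-record / 5-minute aggregation is governed by the SQS trigger's batch size and batching window; the bucket and key layout below mirror the Singer-tap structure but are illustrative.

```python
import gzip
import io
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "cdp-raw-bucket"  # assumed bucket name


def handler(event, context):
    """SQS-triggered Lambda: convert a batch of webhook events to NDJSON.gzip."""
    records = [json.loads(r["body"]) for r in event["Records"]]
    if not records:
        return

    now = datetime.now(timezone.utc)
    provider = records[0].get("provider", "unknown")
    key = (
        f"acme_corp/{provider}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"webhook_{now:%Y%m%dT%H%M%S}.ndjson.gzip"
    )

    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))

    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=buf.getvalue())
```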
Webhook Security:
Signature Validation - Verify webhook signatures from providers (HMAC, JWT)
Rate Limiting - API Gateway throttling to prevent abuse
IP Whitelisting - Restrict webhook sources to known provider IPs
Authentication - Custom headers or tokens for customer identification
Webhook Providers:
Shopify - Order events, customer updates, inventory changes
Stripe - Payment events, subscription changes, dispute notifications
Salesforce - Lead updates, opportunity changes, account modifications
Custom APIs - User actions, application events, system notifications
Error Handling:
Validation Failures - Invalid webhooks sent to DLQ with error details
Processing Errors - Failed batch processing retried with exponential backoff
Dead Letter Queue - Persistent failures stored for manual review
Monitoring - CloudWatch metrics for webhook success rates and latency
Benefits:
Real-time Data - Near-instant data availability (vs daily batch jobs)
Event-driven - Capture user actions as they happen
Scalable - SQS buffering handles traffic spikes
Consistent Format - Same NDJSON.gzip format as batch extractions
Reliable - SQS ensures no webhook events are lost
PySpark Transformation (Data Processing Engine)
The PySpark Transformation layer serves as the core data processing engine that converts raw, unstructured data from S3 Raw into clean, enriched, and analytics-ready datasets.
Purpose
Data Quality - Cleanse, validate, and standardize incoming data
Data Enrichment - Enhance records with additional context and insights
Schema Alignment - Normalize data formats across different sources
Performance Optimization - Partition and optimize data for downstream analytics
Architecture Overview
Core Components:
AWS Glue/EMR - Managed Spark execution environment
PySpark Jobs - Data transformation logic and workflows
Apache Iceberg - Modern table format for ACID transactions
Data Catalog - Centralized metadata and schema registry
S3 Processed - Curated data lake for analytics workloads
Processing Modes
The transformation engine supports two distinct processing modes to handle different data scenarios and business requirements.
1. Incremental Mode (Default)
Purpose: Process only new or changed data since the last transformation run.
Characteristics:
Daily Processing - Runs every night after data extraction
Delta Detection - Identifies new files in S3 Raw since last run
Fast Processing - Processes only changed data (typically last 1-7 days)
Low Latency - Data available in 2-4 hours after ingestion
Cost Efficient - Processes minimal data volumes
How It Works:
S3 Raw → Delta Detection → PySpark Transform → S3 Processed → Database Update
↓ ↓ ↓ ↓ ↓
New Files Last Run Process Delta Merge/Upsert Incremental
Since T-1 Timestamp Data Only Tables Refresh
Incremental Processing Logic:
State Tracking - Read last successful run timestamp from state file
Delta Detection - Scan S3 Raw for files newer than last run
Partition Processing - Process only affected date partitions
Merge Strategy - Upsert new/changed records into existing tables
State Update - Update state file with current run timestamp
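A rough PySpark sketch of that loop, assuming the curated customer table is an Iceberg table (with the Iceberg Spark SQL extensions enabled) and that run state lives in a small JSON file on S3; the bucket, catalog, and table names are placeholders.

```python
import json
from datetime import datetime, timezone

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdp-incremental").getOrCreate()
s3 = boto3.client("s3")

STATE_BUCKET = "cdp-state-bucket"            # assumed state bucket
STATE_KEY = "state/acme_corp/last_run.json"  # assumed state file

# 1. State tracking: read the last successful run timestamp.
state = json.loads(s3.get_object(Bucket=STATE_BUCKET, Key=STATE_KEY)["Body"].read())
last_run = state["last_run"]  # e.g. "2025-01-15"

# 2-3. Delta detection: build the list of day partitions newer than last_run
#      (a single hard-coded day is shown here for brevity).
new_paths = ["s3://cdp-raw-bucket/acme_corp/*/year=2025/month=01/day=16/"]
staged = spark.read.json(new_paths)
staged.createOrReplaceTempView("staged_customers")

# 4. Merge strategy: upsert new/changed records into the curated Iceberg table.
spark.sql("""
    MERGE INTO catalog.curated.customers AS t
    USING staged_customers AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# 5. State update: record this run's timestamp for the next delta scan.
s3.put_object(
    Bucket=STATE_BUCKET,
    Key=STATE_KEY,
    Body=json.dumps({"last_run": datetime.now(timezone.utc).isoformat()}),
)
```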
Benefits:
Speed - 10-100x faster than full reprocessing
Cost - Processes only changed data, reducing compute costs
Freshness - Near real-time data availability
Resources - Minimal cluster size and runtime
Use Cases:
Daily Operations - Regular nightly data processing
Real-time Analytics - Fresh data for dashboards and reports
Event Processing - User actions, transactions, behavioral data
Operational Reporting - Daily/weekly business metrics
2. Historic Mode (On-Demand)
Purpose: Reprocess entire historical dataset for backfills, migrations, or major schema changes.
Characteristics:
Full Reprocessing - Processes all available historical data
Complete Rebuild - Recreates entire curated dataset from scratch
High Throughput - Uses large Spark clusters for parallel processing
Long Running - Can take 6-24 hours depending on data volume
Resource Intensive - Requires significant compute and memory
How It Works:
S3 Raw → Full Scan → Parallel Processing → Complete Rebuild → Database Reload
↓ ↓ ↓ ↓ ↓
All Files Years of Large Cluster Replace All Full Table
History Data Processing Tables Refresh
Historic Processing Logic:
Full Scan - Read all files from S3 Raw (all dates/customers)
Cluster Scaling - Provision large EMR cluster (50-200 nodes)
Parallel Processing - Process multiple customers/dates simultaneously
Complete Rebuild - Recreate all tables from scratch
Atomic Replacement - Replace production tables atomically
Benefits:
Completeness - Ensures all historical data is properly processed
Consistency - Uniform application of latest business rules
Recovery - Rebuilds data after corruption or major errors
Migration - Supports major schema or logic changes
Use Cases:
Initial Setup - First-time customer onboarding
Schema Migration - Major changes to data structure
Data Recovery - Recovering from data corruption
Algorithm Updates - Applying new enrichment logic to all data
Compliance - Reprocessing for regulatory requirements
Transformation Pipeline
Data Processing Steps:
Data Ingestion
Schema Validation
Data Cleaning
Data Normalization
Data Enrichment
Quality Validation
Output Partitioning - Partitioned Parquet files written to S3 Processed
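A condensed PySpark sketch of these steps; the column names, quality rules, and firmographics join are illustrative rather than the project's actual logic.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdp-transform").getOrCreate()

# Data ingestion: NDJSON is read natively by the JSON reader.
# (Spark infers gzip from a .gz extension, so .ndjson.gzip files may need a
#  codec alias or renaming before this works out of the box.)
raw = spark.read.json("s3://cdp-raw-bucket/acme_corp/shopify/year=2025/month=01/day=15/")

# Schema validation & cleaning: drop records missing required keys, dedupe on the natural key.
clean = raw.dropna(subset=["customer_id", "email"]).dropDuplicates(["customer_id"])

# Normalization: standardize casing and timestamps across sources.
normalized = (
    clean.withColumn("email", F.lower(F.trim("email")))
         .withColumn("created_at", F.to_timestamp("created_at"))
)

# Enrichment: join a small reference table (broadcast keeps it in memory on every executor).
firmographics = spark.read.parquet("s3://cdp-processed/reference/firmographics/")
enriched = normalized.join(F.broadcast(firmographics), on="company_domain", how="left")

# Quality validation and partitioned Parquet output to the curated zone.
valid = (
    enriched.filter(F.col("email").rlike(r"^[^@]+@[^@]+\.[^@]+$"))
            .withColumn("year", F.year("created_at"))
            .withColumn("month", F.month("created_at"))
)
valid.write.mode("overwrite").partitionBy("year", "month").parquet(
    "s3://cdp-processed/acme_corp/customers/"
)
```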
Infrastructure Setup for PySpark on ECS Fargate
Running PySpark on ECS Fargate provides greater control, cost efficiency, and seamless integration with the existing CDP infrastructure.
Why ECS Fargate for PySpark?
Advantages:
Consistency - Same infrastructure as data extraction pipeline
Cost Control - Pay only for actual compute time, no cluster overhead
Flexibility - Custom Spark configurations and Python dependencies
Integration - Native integration with existing ECS services and monitoring
Scalability - Independent scaling per transformation job
Architecture Overview:
BuildKite Scheduler → ECS Task Definition → Fargate Container → PySpark Job → S3 Processed
(Cron) (Job Config) (Spark Runtime) (Transform) (Output)
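A sketch of how the scheduler side could launch one of these jobs with boto3; the cluster, task definition, container name, subnets, and command-line flags are all placeholders.

```python
import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="cdp-processing",                # assumed cluster name
    launchType="FARGATE",
    taskDefinition="cdp-pyspark-transform",  # assumed task definition
    overrides={
        "containerOverrides": [
            {
                "name": "pyspark-job",
                "command": ["python", "transform.py", "--mode", "incremental", "--customer", "acme_corp"],
            }
        ]
    },
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
print(response["tasks"][0]["taskArn"])
```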
Performance Optimization
Partitioning Strategy:
s3://cdp-processed/acme_corp/customers/
├── year=2025/
│ ├── month=01/
│ │ ├── segment=enterprise/
│ │ ├── segment=mid_market/
│ │ └── segment=smb/
│ └── month=02/
│ ├── segment=enterprise/
│ ├── segment=mid_market/
│ └── segment=smb/
File Optimization:
File Size - Target 128-512 MB per file for optimal S3 performance
Compression - Snappy compression for balance of speed and size
Format - Parquet for columnar analytics workloads
Compaction - Regular file compaction to prevent small file problems
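A small compaction sketch: size a partition by listing its objects, then rewrite it as a handful of Snappy-compressed Parquet files near the target size. The output goes to a staging prefix, since overwriting the prefix being read is unsafe; an Iceberg table would use its rewrite_data_files procedure instead. The bucket, prefix, and target size are illustrative.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdp-compaction").getOrCreate()
s3 = boto3.client("s3")

BUCKET = "cdp-processed"
PREFIX = "acme_corp/customers/year=2025/month=01/"
TARGET_FILE_BYTES = 256 * 1024 * 1024  # aim for ~256 MB output files

# Size the partition by listing its objects, then pick an output file count.
paginator = s3.get_paginator("list_objects_v2")
total_bytes = sum(
    obj["Size"]
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
)
num_files = max(1, total_bytes // TARGET_FILE_BYTES)

# Rewrite the partition as a few right-sized, Snappy-compressed Parquet files.
df = spark.read.parquet(f"s3://{BUCKET}/{PREFIX}")
(
    df.coalesce(num_files)
      .write.mode("overwrite")
      .option("compression", "snappy")
      .parquet(f"s3://{BUCKET}/_compacted/{PREFIX}")
)
```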
Caching Strategy:
Broadcast Joins - Cache small dimension tables in memory
Checkpointing - Persist intermediate results for complex transformations
Dynamic Allocation - Adjust executor count based on workload
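The caching knobs in code form (broadcast joins were already shown in the transformation sketch above), assuming a managed Spark environment such as EMR or Glue where executor counts can actually scale; the settings and paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cdp-transform")
    # Dynamic allocation: let Spark grow and shrink executors with the workload.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()
)

# Checkpointing: persist the result of an expensive aggregation so later joins
# don't recompute its full lineage.
spark.sparkContext.setCheckpointDir("s3://cdp-processed/_checkpoints/")
event_counts = (
    spark.read.parquet("s3://cdp-processed/acme_corp/events/")
         .groupBy("customer_id")
         .count()
         .checkpoint()
)
```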
S3 Processed → Database: Data Loading Flow
The following diagram illustrates how curated data moves from S3 Processed to the Database in both Incremental and Historic modes:
Incremental Mode (Default) Historic Mode (Full Reload)
┌───────────────────────────────┐ ┌───────────────────────────────┐
│ S3 Processed (Curated) │ │ S3 Processed (Curated) │
│ (New/Changed Partitions) │ │ (All Partitions) │
└──────────────┬────────────────┘ └──────────────┬────────────────┘
│ │
│ │
Delta Detection & State Tracking Full Scan of All Data
│ │
▼ ▼
Data Extraction Data Extraction
│ │
▼ ▼
Transformation/Mapping (if needed) Transformation/Mapping (if needed)
│ │
▼ ▼
Upsert/Merge into DB Truncate & Bulk Load
│ │
▼ ▼
Update State for Next Run Integrity Checks & State Reset
Incremental Mode:
Detects and loads only new or changed data since the last run.
Uses upsert/merge logic to keep database in sync with minimal lag.
Fast, cost-efficient, and suitable for daily operations.
Historic Mode:
Loads all curated data, typically after schema changes or for backfills.
Truncates and reloads entire tables for full consistency.
Resource-intensive, used for migrations or recovery.
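A minimal sketch of the incremental upsert path into PostgreSQL with psycopg2, assuming a customers table keyed on customer_id; the column list, sample rows, and connection string are placeholders.

```python
import psycopg2
from psycopg2.extras import execute_values

# Rows extracted from the curated Parquet files (customer_id, email, segment, updated_at).
rows = [
    ("c-1001", "ada@example.com", "enterprise", "2025-01-16T02:00:00Z"),
    ("c-1002", "grace@example.com", "smb", "2025-01-16T02:05:00Z"),
]

conn = psycopg2.connect("dbname=cdp user=cdp_loader")  # assumed connection string
with conn, conn.cursor() as cur:
    # Upsert/merge: insert new customers, refresh existing ones in place.
    execute_values(
        cur,
        """
        INSERT INTO customers (customer_id, email, segment, updated_at)
        VALUES %s
        ON CONFLICT (customer_id) DO UPDATE
        SET email = EXCLUDED.email,
            segment = EXCLUDED.segment,
            updated_at = EXCLUDED.updated_at
        """,
        rows,
    )
conn.close()
```

In Historic mode the same loader would instead truncate the table and bulk load (for example with COPY) before running integrity checks.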
Activation Pipeline: Simple Data Flow
Segment Selection
Query the database or data warehouse for users matching your criteria (e.g., customers who have spent more than $50).
Always export the results to S3 as Parquet files (columnar format), regardless of segment size.
Audience Export
The list of users (emails, IDs) is always written to S3 as Parquet files.
For large segments, split into multiple files for efficient processing.
Channel Integration
The activation service reads the audience from S3 or receives it via API.
Upload the list to your email or ad platform using their bulk import features.
Personalization
Join user details (like name) in batch before export, so all personalization data is included in the Parquet file. Avoid per-user database queries (no N+1 problem).
Use templates stored in your email/ad platform.
Send & Monitor
The platform sends the campaign to users.
Track delivery, opens, clicks, and conversions.
Export these results back to your CDP for analytics.
Feedback Loop
Store campaign results in your analytics tables or data warehouse.
Use this data for future targeting and reporting.
For all segments:
Always use S3 for storage and process in batches if needed.
Store all segment and audience data in Parquet files for efficient analytics and processing.
Use the bulk import and reporting features of your marketing platforms.
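A sketch of segment selection and audience export, assuming psycopg2 for the query and pyarrow for the Parquet write; the segment criteria (more than $50 lifetime spend), bucket, and column set are illustrative.

```python
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# 1. Segment selection: customers whose lifetime spend exceeds $50.
conn = psycopg2.connect("dbname=cdp user=cdp_activation")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT c.customer_id, c.email, c.first_name
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.email, c.first_name
        HAVING SUM(o.total) > 50
        """
    )
    rows = cur.fetchall()
conn.close()

# 2. Audience export: write the audience (with personalization fields already
#    joined in batch, so no per-user lookups later) to S3 as Parquet.
table = pa.table({
    "customer_id": [r[0] for r in rows],
    "email": [r[1] for r in rows],
    "first_name": [r[2] for r in rows],
})
s3 = fs.S3FileSystem()
pq.write_table(table, "cdp-activation/acme_corp/high_value_audience.parquet", filesystem=s3)
```

From here the activation service picks up the Parquet file and pushes the audience to the email or ad platform's bulk import API.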
Summary
The Customer Data Platform (CDP) is a unified system that enables businesses to collect, process, enrich, and activate customer data from multiple sources. It features:
Centralized data ingestion and storage (S3 Raw)
Automated data extraction, validation, and enrichment pipelines
Scalable, secure, and compliant architecture leveraging AWS (ECS Fargate, Lambda, S3, EMR/Glue)
Real-time and batch data processing for analytics and activation
Support for predictive analytics, segmentation, and marketing activation
This document provides a detailed system design, including data flows, error handling, and performance strategies.
Head over to the GitHub page to see how all of this is actually implemented.

