Data Warehouse Design Patterns

Data warehouse design is a critical aspect of building analytics infrastructure. The right design pattern can make the difference between a warehouse that scales efficiently and one that becomes a maintenance nightmare. This guide explores the fundamental patterns and best practices for designing modern data warehouses.

Understanding Dimensional Modeling

Dimensional modeling, pioneered by Ralph Kimball, is the foundation of most data warehouse designs. It organizes data into facts (measurements) and dimensions (context), making it intuitive for business users and optimized for analytical queries.

Core Concepts

Fact Tables: Store quantitative measurements (sales, clicks, transactions)
Dimension Tables: Store descriptive attributes (customers, products, time)
Grain: The level of detail stored in a fact table
Measures: Numeric values that can be aggregated

Star Schema: The Foundation

The star schema is the simplest and most widely used dimensional model. It consists of a central fact table surrounded by dimension tables, resembling a star when visualized.

-- Example Star Schema: Sales Data Warehouse

-- Fact Table
CREATE TABLE fact_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT NOT NULL,
    customer_key INT NOT NULL,
    product_key INT NOT NULL,
    store_key INT NOT NULL,
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
    FOREIGN KEY (store_key) REFERENCES dim_store(store_key)
);

-- Dimension Tables
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE NOT NULL,
    day_of_week VARCHAR(10),
    day_of_month INT,
    month INT,
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    segment VARCHAR(50),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    product_name VARCHAR(200),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(10,2)
);

CREATE TABLE dim_store (
    store_key INT PRIMARY KEY,
    store_id VARCHAR(50) NOT NULL,
    store_name VARCHAR(200),
    store_type VARCHAR(50),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50)
);

✨ Benefits of Star Schema

Simple and intuitive structure
Excellent query performance
Easy to understand for business users
Optimized for BI tools
Minimal joins required

Snowflake Schema: Normalized Dimensions

The snowflake schema extends the star schema by normalizing dimension tables into multiple related tables. This reduces data redundancy but increases query complexity.

-- Snowflake Schema Example: Normalized Product Dimension

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    product_name VARCHAR(200),
    subcategory_key INT NOT NULL,
    brand_key INT NOT NULL,
    FOREIGN KEY (subcategory_key) REFERENCES dim_subcategory(subcategory_key),
    FOREIGN KEY (brand_key) REFERENCES dim_brand(brand_key)
);

CREATE TABLE dim_subcategory (
    subcategory_key INT PRIMARY KEY,
    subcategory_name VARCHAR(100),
    category_key INT NOT NULL,
    FOREIGN KEY (category_key) REFERENCES dim_category(category_key)
);

CREATE TABLE dim_category (
    category_key INT PRIMARY KEY,
    category_name VARCHAR(100),
    department_key INT NOT NULL,
    FOREIGN KEY (department_key) REFERENCES dim_department(department_key)
);

CREATE TABLE dim_brand (
    brand_key INT PRIMARY KEY,
    brand_name VARCHAR(100),
    manufacturer VARCHAR(200)
);

Star vs Snowflake: When to Use Each

Aspect	Star Schema	Snowflake Schema
Query Performance	Faster (fewer joins)	Slower (more joins)
Storage	More redundancy	Less redundancy
Maintenance	Simpler updates	More complex updates
Query Complexity	Simpler queries	More complex queries
Best For	Most use cases	Storage-constrained environments

Slowly Changing Dimensions (SCD)

Handling changes in dimension data over time is crucial. There are several strategies, known as Slowly Changing Dimensions (SCD) types.

Type 1: Overwrite

Simply overwrite old values with new ones. No history is maintained.

-- SCD Type 1: Update customer address
UPDATE dim_customer
SET 
    address = 'New Address',
    city = 'New City',
    state = 'New State',
    updated_at = CURRENT_TIMESTAMP
WHERE customer_key = 12345;

Type 2: Add New Row

Create a new row for each change, maintaining full history.

-- SCD Type 2: Customer dimension with history
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    segment VARCHAR(50),
    effective_date DATE NOT NULL,
    expiration_date DATE,
    is_current BOOLEAN DEFAULT TRUE,
    version INT DEFAULT 1
);

-- Insert new version when customer changes
INSERT INTO dim_customer (
    customer_key, customer_id, customer_name, 
    email, segment, effective_date, is_current, version
)
SELECT 
    (SELECT MAX(customer_key) + 1 FROM dim_customer),
    customer_id,
    customer_name,
    'new_email@example.com',
    'Premium',
    CURRENT_DATE,
    TRUE,
    version + 1
FROM dim_customer
WHERE customer_id = 'CUST123' AND is_current = TRUE;

-- Mark old version as not current
UPDATE dim_customer
SET 
    is_current = FALSE,
    expiration_date = CURRENT_DATE - INTERVAL '1 day'
WHERE customer_id = 'CUST123' 
  AND is_current = TRUE
  AND customer_key != (SELECT MAX(customer_key) FROM dim_customer WHERE customer_id = 'CUST123');

Type 3: Add New Column

Add columns to track limited history (e.g., current and previous values).

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    current_segment VARCHAR(50),
    previous_segment VARCHAR(50),
    segment_change_date DATE
);

Key Takeaways

Star schema is the most common and performant pattern for most use cases
Snowflake schema trades query performance for storage efficiency
Choose SCD type based on history requirements and query patterns
Surrogate keys provide flexibility and performance benefits
Design for query patterns, not just data structure
Consider cloud warehouse capabilities when designing

Modern Cloud Data Warehouse Considerations

Cloud data warehouses like Snowflake, BigQuery, and Redshift have changed some traditional design considerations:

Denormalization is Less Critical

Modern cloud warehouses handle joins efficiently, making snowflake schemas more viable than in traditional systems.

Columnar Storage

Cloud warehouses use columnar storage, which changes optimization strategies. Wide tables are less problematic.

Separation of Storage and Compute

Storage is cheap, compute is expensive. This shifts the cost-benefit analysis of normalization.

Best Practices

Start with Star Schema: It's simple, performant, and meets most needs
Use Surrogate Keys: Integer keys for better performance and flexibility
Design for Queries: Understand your query patterns before designing
Document Grain: Clearly define the grain of each fact table
Implement SCD Appropriately: Choose the right type for each dimension
Consider Partitioning: Partition large fact tables by date
Test Performance: Validate design with realistic data volumes

Conclusion

Data warehouse design patterns have evolved, but the fundamentals remain important. Star schema continues to be the gold standard for most analytical workloads, while modern cloud platforms offer new possibilities for optimization. Choose patterns that align with your specific requirements, query patterns, and platform capabilities.

Understanding Dimensional Modeling

Core Concepts

Star Schema: The Foundation

Snowflake Schema: Normalized Dimensions

Star vs Snowflake: When to Use Each

Slowly Changing Dimensions (SCD)

Type 1: Overwrite

Type 2: Add New Row

Type 3: Add New Column

Key Takeaways

Modern Cloud Data Warehouse Considerations

Denormalization is Less Critical

Columnar Storage

Separation of Storage and Compute

Best Practices

Conclusion

Related Articles

Building Scalable Data Pipelines

Modern Data Stack Overview

Implementing a Data Quality Framework