Data warehouse design is a critical aspect of building analytics infrastructure. The right design pattern can make the difference between a warehouse that scales efficiently and one that becomes a maintenance nightmare. This guide explores the fundamental patterns and best practices for designing modern data warehouses.
Understanding Dimensional Modeling
Dimensional modeling, pioneered by Ralph Kimball, is the foundation of most data warehouse designs. It organizes data into facts (measurements) and dimensions (context), making it intuitive for business users and optimized for analytical queries.
Core Concepts
- Fact Tables: Store quantitative measurements (sales, clicks, transactions)
- Dimension Tables: Store descriptive attributes (customers, products, time)
- Grain: The level of detail stored in a fact table
- Measures: Numeric values that can be aggregated
Star Schema: The Foundation
The star schema is the simplest and most widely used dimensional model. It consists of a central fact table surrounded by dimension tables, resembling a star when visualized.
-- Example Star Schema: Sales Data Warehouse
-- Fact Table
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT NOT NULL,
customer_key INT NOT NULL,
product_key INT NOT NULL,
store_key INT NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
total_amount DECIMAL(10,2) NOT NULL,
discount_amount DECIMAL(10,2),
FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
FOREIGN KEY (store_key) REFERENCES dim_store(store_key)
);
-- Dimension Tables
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE NOT NULL,
day_of_week VARCHAR(10),
day_of_month INT,
month INT,
quarter INT,
year INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
customer_name VARCHAR(200),
email VARCHAR(200),
segment VARCHAR(50),
city VARCHAR(100),
state VARCHAR(50),
country VARCHAR(50)
);
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(50) NOT NULL,
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(10,2)
);
CREATE TABLE dim_store (
store_key INT PRIMARY KEY,
store_id VARCHAR(50) NOT NULL,
store_name VARCHAR(200),
store_type VARCHAR(50),
city VARCHAR(100),
state VARCHAR(50),
region VARCHAR(50)
);
- Simple and intuitive structure
- Excellent query performance
- Easy to understand for business users
- Optimized for BI tools
- Minimal joins required
Snowflake Schema: Normalized Dimensions
The snowflake schema extends the star schema by normalizing dimension tables into multiple related tables. This reduces data redundancy but increases query complexity.
-- Snowflake Schema Example: Normalized Product Dimension
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(50) NOT NULL,
product_name VARCHAR(200),
subcategory_key INT NOT NULL,
brand_key INT NOT NULL,
FOREIGN KEY (subcategory_key) REFERENCES dim_subcategory(subcategory_key),
FOREIGN KEY (brand_key) REFERENCES dim_brand(brand_key)
);
CREATE TABLE dim_subcategory (
subcategory_key INT PRIMARY KEY,
subcategory_name VARCHAR(100),
category_key INT NOT NULL,
FOREIGN KEY (category_key) REFERENCES dim_category(category_key)
);
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category_name VARCHAR(100),
department_key INT NOT NULL,
FOREIGN KEY (department_key) REFERENCES dim_department(department_key)
);
CREATE TABLE dim_brand (
brand_key INT PRIMARY KEY,
brand_name VARCHAR(100),
manufacturer VARCHAR(200)
);
Star vs Snowflake: When to Use Each
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Query Performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More redundancy | Less redundancy |
| Maintenance | Simpler updates | More complex updates |
| Query Complexity | Simpler queries | More complex queries |
| Best For | Most use cases | Storage-constrained environments |
Slowly Changing Dimensions (SCD)
Handling changes in dimension data over time is crucial. There are several strategies, known as Slowly Changing Dimensions (SCD) types.
Type 1: Overwrite
Simply overwrite old values with new ones. No history is maintained.
-- SCD Type 1: Update customer address
UPDATE dim_customer
SET
address = 'New Address',
city = 'New City',
state = 'New State',
updated_at = CURRENT_TIMESTAMP
WHERE customer_key = 12345;
Type 2: Add New Row
Create a new row for each change, maintaining full history.
-- SCD Type 2: Customer dimension with history
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
customer_name VARCHAR(200),
email VARCHAR(200),
segment VARCHAR(50),
effective_date DATE NOT NULL,
expiration_date DATE,
is_current BOOLEAN DEFAULT TRUE,
version INT DEFAULT 1
);
-- Insert new version when customer changes
INSERT INTO dim_customer (
customer_key, customer_id, customer_name,
email, segment, effective_date, is_current, version
)
SELECT
(SELECT MAX(customer_key) + 1 FROM dim_customer),
customer_id,
customer_name,
'new_email@example.com',
'Premium',
CURRENT_DATE,
TRUE,
version + 1
FROM dim_customer
WHERE customer_id = 'CUST123' AND is_current = TRUE;
-- Mark old version as not current
UPDATE dim_customer
SET
is_current = FALSE,
expiration_date = CURRENT_DATE - INTERVAL '1 day'
WHERE customer_id = 'CUST123'
AND is_current = TRUE
AND customer_key != (SELECT MAX(customer_key) FROM dim_customer WHERE customer_id = 'CUST123');
Type 3: Add New Column
Add columns to track limited history (e.g., current and previous values).
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
customer_name VARCHAR(200),
current_segment VARCHAR(50),
previous_segment VARCHAR(50),
segment_change_date DATE
);
Key Takeaways
- Star schema is the most common and performant pattern for most use cases
- Snowflake schema trades query performance for storage efficiency
- Choose SCD type based on history requirements and query patterns
- Surrogate keys provide flexibility and performance benefits
- Design for query patterns, not just data structure
- Consider cloud warehouse capabilities when designing
Modern Cloud Data Warehouse Considerations
Cloud data warehouses like Snowflake, BigQuery, and Redshift have changed some traditional design considerations:
Denormalization is Less Critical
Modern cloud warehouses handle joins efficiently, making snowflake schemas more viable than in traditional systems.
Columnar Storage
Cloud warehouses use columnar storage, which changes optimization strategies. Wide tables are less problematic.
Separation of Storage and Compute
Storage is cheap, compute is expensive. This shifts the cost-benefit analysis of normalization.
Best Practices
- Start with Star Schema: It's simple, performant, and meets most needs
- Use Surrogate Keys: Integer keys for better performance and flexibility
- Design for Queries: Understand your query patterns before designing
- Document Grain: Clearly define the grain of each fact table
- Implement SCD Appropriately: Choose the right type for each dimension
- Consider Partitioning: Partition large fact tables by date
- Test Performance: Validate design with realistic data volumes
Conclusion
Data warehouse design patterns have evolved, but the fundamentals remain important. Star schema continues to be the gold standard for most analytical workloads, while modern cloud platforms offer new possibilities for optimization. Choose patterns that align with your specific requirements, query patterns, and platform capabilities.