Chapter 7: Cloud Data Warehousing with Redshift

Learning Objectives

Cloud data warehouses are the muscle of modern analytics platforms. Amazon Redshift is one of the longest-running and most widely deployed cloud warehouses, and its architecture demonstrates many of the design ideas you will encounter in Snowflake, BigQuery, and Databricks SQL. Master Redshift's slices, distribution keys, and Spectrum dispatch — and the rest of the cloud-warehouse universe becomes much easier to navigate.

Chapter Map

Before diving into each section, here is how the three partitions of this chapter relate.

flowchart LR A[MPP Architecture
Leader + Compute + Slices + RA3] --> B[Schema Design
Distribution + Sort Keys] B --> C[Spectrum + Serverless + WLM
Lakehouse + Operations] A -.->|enables parallelism| B B -.->|sets data layout| C

Part 1: Redshift MPP Architecture

Pre-Reading Check — Part 1

1. Which statement best describes the division of labor between the leader node and the compute nodes in a Redshift cluster?

The leader node stores user data while the compute nodes only relay queries. The leader node parses, optimizes, and dispatches queries; the compute nodes scan data and execute the compiled plan in parallel. Both node types execute scans, but only compute nodes communicate with clients. The leader node executes single-row queries while compute nodes only handle aggregations.

2. A Redshift cluster has 4 RA3 nodes and each node has 8 slices. A query scans a 4 billion-row fact table that is evenly distributed. Roughly how many rows does each slice scan?

125 million rows. 500 million rows. 1 billion rows. 4 billion rows (each slice gets a full copy).

3. Which capability of RA3 nodes is fundamentally enabled by Redshift Managed Storage (RMS)?

Storing all blocks exclusively on local NVMe SSDs. Eliminating the leader node from the cluster. Sizing compute and storage independently, with cold blocks tiered to S3 and cluster relocation across AZs at zero RPO. Replacing columnar storage with row-oriented storage.

Leader Node and Compute Nodes

A Redshift cluster is a tightly coordinated team of specialists. The leader node is the conductor: when a SQL statement arrives, it parses the query, optimizes it, generates compiled C++ code, and ships that code along with execution instructions to the worker nodes. Critically, the leader node does not store user data and does not scan rows itself. Think of it as the air-traffic controller in a busy airport: it doesn't fly the planes, but nothing lands or takes off without its instructions.

The compute nodes are the planes. Each compute node is a server with its own CPU cores, memory, high-bandwidth network interface, and (for RA3 node types) local SSD storage. Compute nodes execute the compiled query code in parallel, scanning data, evaluating predicates, computing aggregates, and shipping intermediate results back to the leader for final assembly.

ComponentStores Data?Executes Queries?Talks to Clients?
Leader nodeNoPlans onlyYes (single endpoint)
Compute nodeYes (via slices)Yes (in parallel)No (internal network)

When a client application connects, it always connects through the leader node's endpoint. The leader is therefore both the brain and the front door, while the compute nodes form the engine room.

Slices and Massively Parallel Processing

The unit of parallelism inside Redshift is not the node — it is the slice. Each compute node is partitioned into a fixed number of slices based on its instance type. For example, an ra3.xlplus node has 2 slices, while larger node types have more. Each slice receives a portion of the node's memory and disk and processes its share of the data in parallel with every other slice in the cluster.

This is the essence of Massively Parallel Processing (MPP): rather than a single server churning through a query, dozens or hundreds of slices work concurrently on disjoint partitions. If a query needs to scan a 10 billion-row fact table on a cluster with 32 slices, each slice scans roughly 312 million rows simultaneously. The cluster's total parallel capacity is nodes × slices_per_node; doubling node count doubles parallelism.

Animation 7.1: Query routing — Leader fans out to compute slices, then aggregates
A SQL query enters via the Leader Node. The compiled plan fans out to compute nodes, then to slices, which process in parallel and stream partial results back up.
SELECT … Leader Node parse | optimize | compile Compute Node 1 Compute Node 2 Compute Node 3 Slice 1312M rows Slice 2312M rows Slice 1312M rows Slice 2312M rows Slice 1312M rows Slice 2312M rows Final result → client

A useful analogy is sorting mail in a giant post office. A single clerk (a traditional database) sorts every envelope sequentially. An MPP system hires 32 clerks (slices), gives each a bin labeled with a range of zip codes, and lets them all sort simultaneously. The supervisor (leader node) hands out the work and assembles the final stacks.

flowchart TD Client[Client / BI Tool] -->|SQL via single endpoint| Leader[Leader Node
parse, optimize, compile] Leader -->|compiled code + plan| N1[Compute Node 1] Leader -->|compiled code + plan| N2[Compute Node 2] Leader -->|compiled code + plan| N3[Compute Node N] N1 --> S1A[Slice 1] & S1B[Slice 2] N2 --> S2A[Slice 1] & S2B[Slice 2] N3 --> S3A[Slice 1] & S3B[Slice 2] S1A & S1B & S2A & S2B & S3A & S3B -->|partial results| Leader Leader -->|final result set| Client

RA3 Nodes and Managed Storage

The original Redshift node families (DS2, DC2) coupled compute and storage tightly: disks lived inside compute nodes, and growing the warehouse meant adding nodes you didn't otherwise need. The RA3 family (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouples the two by introducing Redshift Managed Storage (RMS).

TierLocationRole
Tier 1 (hot)Local NVMe SSDs on each RA3 nodeFrequently accessed blocks; queried at SSD speed
Tier 2 (cold)Amazon S3 (managed by Redshift)Cold blocks, automatically offloaded; petabyte-scale

Redshift continuously analyzes block temperature, age, and workload patterns, then prefetches hot blocks to local SSD before queries need them. From the user's perspective, storage looks like one big disk, but in reality the hot working set lives next to the CPU while cold history sleeps in S3. RA3 also supports cluster relocation — moving a cluster between Availability Zones with the same endpoint and zero RPO — because the persistent data lives in regional S3.

flowchart LR Q[Incoming Query] --> CL[Compute Layer
RA3 Nodes vCPU + RAM] CL -->|hot blocks| T1[Tier 1: Local NVMe SSD] CL -->|cold blocks fetched on demand| T2[Tier 2: Amazon S3 RMS] T2 -.->|prefetch by block temperature| T1 T1 -.->|evict cold blocks| T2 T2 --> AZ[Cross-AZ Durability
cluster relocation, zero RPO]

Key Takeaways — MPP Architecture

Post-Reading Check — Part 1

1. Which statement best describes the division of labor between the leader node and the compute nodes in a Redshift cluster?

The leader node stores user data while the compute nodes only relay queries. The leader node parses, optimizes, and dispatches queries; the compute nodes scan data and execute the compiled plan in parallel. Both node types execute scans, but only compute nodes communicate with clients. The leader node executes single-row queries while compute nodes only handle aggregations.

2. A Redshift cluster has 4 RA3 nodes and each node has 8 slices. A query scans a 4 billion-row fact table that is evenly distributed. Roughly how many rows does each slice scan?

125 million rows. 500 million rows. 1 billion rows. 4 billion rows (each slice gets a full copy).

3. Which capability of RA3 nodes is fundamentally enabled by Redshift Managed Storage (RMS)?

Storing all blocks exclusively on local NVMe SSDs. Eliminating the leader node from the cluster. Sizing compute and storage independently, with cold blocks tiered to S3 and cluster relocation across AZs at zero RPO. Replacing columnar storage with row-oriented storage.

Part 2: Schema Design — Distribution and Sort Keys

Pre-Reading Check — Part 2

1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?

It physically sorts the rows on disk in customer_id order. Matching customer_id values land on the same slice on both sides, so the join is collocated and no rows have to move across the network. It tells Redshift to compress customer_id with AZ64. It causes both tables to be replicated to every node.

2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?

DISTSTYLE EVEN — cost is increased redistribution at join time. DISTKEY (country_code) — cost is hash skew if country_code is not unique. DISTSTYLE ALL — cost is multiplied storage (one full copy on every node). DISTSTYLE AUTO — cost is unpredictable cluster restarts.

3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?

WHERE sale_date = '2026-05-07' WHERE sale_date = '2026-05-07' AND region = 'us-east' WHERE sale_date BETWEEN '2026-01-01' AND '2026-06-30' WHERE state = 'CA' with no sale_date or region predicate

4. How do sort keys actually reduce I/O on filtered queries?

Redshift caches the entire sorted column in leader memory. Each 1 MB block has a zone map of min/max values; blocks whose range cannot match the predicate are skipped entirely. Sorted columns are stored row-wise, which is faster than columnar. Sort keys force Redshift to use a B-tree index for every query.

Schema design is the single biggest lever for Redshift performance. Two tables with identical columns can have query times that differ by 100x simply because of distribution and sort key choices. Distribution styles decide which slice each row lives on; sort keys decide where on disk the rows go within a slice.

Distribution Styles: KEY, ALL, EVEN, AUTO

When you load a row into Redshift, it has to land on exactly one slice. The distribution style tells Redshift how to choose.

StyleBest ForStorage CostJoin Cost
KEYLarge fact-dimension joins on a stable key1xFree if collocated
ALLSmall dimension tables joined many waysN nodes × 1xAlways free
EVENTables with no good join key1xRedistribution required
AUTOUnknown access pattern; let Redshift decideAdaptsAdapts
Animation 7.2: Distribution styles — where rows land on slices
Watch 8 sample rows distribute differently across 4 slices depending on the chosen DISTSTYLE. Click a tab to switch styles.
Slice 1 Slice 2 Slice 3 Slice 4 8 rows: customer_id 101, 102, 103, 104, 105, 106, 107, 108 Mode: KEY (hash on customer_id) 101 102 103 104 105 106 107 108

A canonical pattern is the fact + dimension schema. Suppose you have a sales fact table with 10 billion rows and a customer dimension with 50 million rows. They join on customer_id:

CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT NOT NULL,
    sale_date    DATE,
    amount       NUMERIC(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

CREATE TABLE customer (
    customer_id  BIGINT NOT NULL,
    name         VARCHAR(200),
    region       VARCHAR(50)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

Both tables now use customer_id as the DISTKEY, so all sales for customer 12345 sit on the same slice as customer 12345's row in the dimension table. The join is collocated. For a tiny dimension like country (200 rows), use DISTSTYLE ALL instead.

flowchart TD Start[New Table] --> Q1{Small and slowly
changing?} Q1 -->|Yes| ALL[DISTSTYLE ALL
full copy on every node] Q1 -->|No| Q2{Stable JOIN column?} Q2 -->|Yes| KEY[DISTKEY column
hash route to slice] Q2 -->|No| Q3{Need uniform spread
without join hint?} Q3 -->|Yes| EVEN[DISTSTYLE EVEN
round-robin] Q3 -->|Unknown / mixed| AUTO[DISTSTYLE AUTO
adapts over time]

Sort Keys: Compound vs Interleaved

Once a row is on its slice, the sort key decides where on disk it lands. Redshift stores data in 1 MB blocks and keeps a zone map for each block recording the min/max of every column. When a query has a WHERE predicate on a sorted column, the optimizer consults zone maps and skips entire blocks whose range cannot match. On a 10 TB table, a good sort key turns a full scan into a few hundred-millisecond seek.

Compound sort key (the default) sorts by the first column, then by the second within ties, like a phone book sorted by last name, first name, middle initial. It delivers dramatic speedups for predicates on the leading column or columns. A (sale_date, region, state) compound key shines for WHERE sale_date = '2026-05-07', still helps for WHERE sale_date = '2026-05-07' AND region = 'us-east', but provides no benefit for WHERE state = 'CA' alone.

Interleaved sort key gives equal weight to each column via a space-filling curve. It helps when queries filter on different subsets of the key columns over time. Trade-offs: load and VACUUM REINDEX are significantly slower; avoid on monotonically increasing columns.

AspectCompoundInterleaved
Filter patternLeading-column predicatesAny subset of key columns
Load costLowHigher
VACUUM costStandardSignificant (REINDEX)
Best onTime-series, ordered keysMulti-dimensional cubes
Avoid on(works broadly)Monotonic IDs, dates alone

Compression Encodings

Every column in Redshift is stored in a columnar format with a compression encoding. Common encodings:

Use COPY … COMPUPDATE ON or run ANALYZE COMPRESSION and apply the recommendations.

Key Takeaways — Schema Design

Post-Reading Check — Part 2

1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?

It physically sorts the rows on disk in customer_id order. Matching customer_id values land on the same slice on both sides, so the join is collocated and no rows have to move across the network. It tells Redshift to compress customer_id with AZ64. It causes both tables to be replicated to every node.

2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?

DISTSTYLE EVEN — cost is increased redistribution at join time. DISTKEY (country_code) — cost is hash skew if country_code is not unique. DISTSTYLE ALL — cost is multiplied storage (one full copy on every node). DISTSTYLE AUTO — cost is unpredictable cluster restarts.

3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?

WHERE sale_date = '2026-05-07' WHERE sale_date = '2026-05-07' AND region = 'us-east' WHERE sale_date BETWEEN '2026-01-01' AND '2026-06-30' WHERE state = 'CA' with no sale_date or region predicate

4. How do sort keys actually reduce I/O on filtered queries?

Redshift caches the entire sorted column in leader memory. Each 1 MB block has a zone map of min/max values; blocks whose range cannot match the predicate are skipped entirely. Sorted columns are stored row-wise, which is faster than columnar. Sort keys force Redshift to use a B-tree index for every query.

Part 3: Spectrum, Serverless, and WLM

Pre-Reading Check — Part 3

1. Which set of optimizations does Redshift Spectrum apply when scanning S3 files, and why do they matter for cost?

It compresses files in flight; matters because S3 charges egress fees. Predicate pushdown, projection pushdown, and partition pruning — matters because Spectrum is priced per terabyte scanned. It rebuilds B-tree indexes on S3 objects; matters for query latency. It always loads the data into Redshift first; matters because loaded queries are faster.

2. A 10 TB CSV-format S3 lake fully scanned costs roughly $50 per query. Storing the same data as Parquet partitioned by year and region, queried with WHERE year = 2026 AND region = 'us-east', scans only ~100 GB. Why is this such a big difference?

Parquet files are physically smaller than CSV by exactly 100x. Spectrum prunes partitions outside the predicate and reads only needed columns from the columnar Parquet, drastically cutting bytes scanned. Parquet automatically caches in the leader node so repeat queries are free. Redshift forwards CSV scans to RDS for cheaper processing.

3. For an analytics platform with bursty, unpredictable workloads and long idle periods, which Redshift deployment model is best, and why?

Provisioned cluster, because the cluster-hour price is always cheaper. Redshift Serverless, because RPU-based auto-scaling charges only for capacity actually used and there is no idle cost. A single-node DC2 cluster, because it scales to zero automatically. Spectrum-only, because it never uses any compute capacity.

4. With Automatic WLM enabled, an executive dashboard query is tagged HIGH priority while an ad-hoc analyst query is tagged NORMAL. What does Redshift do differently for the two?

It rejects the analyst query until the dashboard finishes. It dynamically allocates more memory and earlier scheduling to the HIGH-priority query, while still giving the NORMAL query a fair share. It runs both queries on identical resources but logs the priority for billing. It moves the analyst query to a separate cluster every time.

5. When does Redshift's optimizer use a materialized view to answer a query that does not literally name the view in its FROM clause?

Never — users must always reference the view explicitly. When the view's columns and predicates match the query, the automatic query rewrite feature transparently redirects the query to the precomputed view. Only for queries from the leader node's superuser account. Whenever AUTO REFRESH NO is set on the view.

Querying S3 Data with Spectrum

Redshift Spectrum lets a Redshift cluster run SQL directly against files in S3 — Parquet, ORC, CSV, JSON, Avro — without first loading them into Redshift tables. The mental model is a federation layer: Redshift sees the S3 data as external tables living in an external schema backed by the AWS Glue Data Catalog (or a Hive Metastore).

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
REGION 'us-east-1';

After this single command, every table cataloged in the analytics_lake Glue database appears in Redshift as spectrum_schema.<table_name>. The most powerful pattern is mixing local and external data in one query — recent hot data in Redshift fact tables joined with archived S3 history:

SELECT
    c.region,
    SUM(s.amount)         AS hot_amount,
    SUM(h.amount)         AS cold_amount
FROM   public.customer            c
JOIN   public.sales               s ON s.customer_id = c.customer_id
LEFT JOIN spectrum_schema.archived_sales h
       ON h.customer_id = c.customer_id
WHERE  s.sale_date >= DATEADD(month, -3, CURRENT_DATE)
GROUP BY c.region;

Spectrum runs on a separate, massively scaled fleet of workers. They perform predicate pushdown, projection pushdown, and partition pruning using Hive-style partition layouts like s3://lake/sales/year=2026/region=us-east/. Because Spectrum is priced per terabyte scanned, partition pruning and columnar formats translate directly to dollars saved.

Concretely: a 10 TB CSV-format data lake fully scanned costs roughly 10 × $5 = $50 per query. The same data as Parquet with year/region partitions, filtered to one partition, scans only ~100 GB — about $0.50 per query. Same answer, 100x cheaper.

Animation 7.3: Spectrum query path — Redshift dispatches scan to S3 via Glue
Watch a SQL query travel from the client through the leader node, resolve metadata at the Glue Data Catalog, dispatch to the Spectrum worker fleet, scan S3, and return filtered rows to compute nodes for join.
ClientSQL with WHERE Leader Nodeplan + pushdown Glue Catalogpaths + partitions Spectrum Fleetscan workers Amazon S3Parquet partitions Compute Nodesjoin + aggregate 1: SQL 2: resolve metadata 3: dispatch 4: scan 5: filtered rows 6: results Cost lever: pushdown + partitions 10 TB CSV full scan: ~$50/query   →   100 GB Parquet partition: ~$0.50/query Same answer. 100x cheaper. Same SQL.
sequenceDiagram participant C as Client participant L as Leader Node participant G as Glue Data Catalog participant SF as Spectrum Fleet participant S3 as S3 (Parquet) participant CN as Compute Nodes C->>L: SQL: spectrum_schema.archived_sales JOIN public.sales L->>G: Resolve external schema + metadata G-->>L: File paths, partitions, stats L->>SF: Dispatch scan with predicates SF->>S3: Read matching partitions/columns S3-->>SF: Filtered Parquet row groups SF-->>CN: Stream filtered rows CN-->>L: Partial aggregates L-->>C: Final result set

Redshift Serverless Capacity Model

For ad-hoc analytics, data-app backends, and sporadic dashboards, an always-on cluster is overkill. Redshift Serverless replaces the cluster abstraction with a capacity unit called the RPU (Redshift Processing Unit). You set a base capacity (in RPUs) and an optional max, and Redshift auto-scales between them as workloads demand.

DimensionProvisionedServerless
Pricing unitCluster-hourRPU-hour
ScalingManual (resize)Automatic
Cold startNone (always on)~30 seconds first query
Idle costYes (always billed)None (pause when idle)
WLMManual queuesAutomatic
Best forSteady, predictable loadBursty, unpredictable load

Serverless still supports Spectrum identically, so a common pattern is Serverless + Spectrum: a Glue Catalog over an S3 lakehouse, a Serverless workgroup, paying only for RPU-seconds that actual queries consume.

Data Sharing Across Clusters

Redshift data sharing lets one producer cluster expose specific schemas/tables to one or more consumer clusters — even across AWS accounts and regions — without copying data. Use it to separate an ETL/ELT producer cluster from BI consumer clusters, or to share curated data with another business unit, while keeping a single source of truth in RMS.

Workload Management (WLM)

Even on a perfectly designed schema, performance falls apart when ETL jobs, executive dashboards, and ad-hoc queries all hit the cluster simultaneously. WLM organizes queries into queues; each queue gets a slice of memory and a max concurrency.

WorkloadQueue / PriorityRationale
Critical ETL loadsHIGHESTMust finish in batch window
Executive dashboardsHIGHUser-facing latency
Analyst ad-hocNORMALMany users, tolerate seconds
Long-running reportsLOWBackground, can wait
Experimental queriesLOWESTDon't impact others

Concurrency Scaling, Result Caching, and Materialized Views

Concurrency scaling automatically spins up transient secondary clusters when read-only queries queue up; results return through the original cluster. Each main-cluster running day earns one free hour of concurrency scaling, then bills per-second.

Result caching stores recent query results in the leader's memory; identical queries against unchanged data return in milliseconds, with no configuration.

Materialized views precompute and persist expensive joins/aggregations:

CREATE MATERIALIZED VIEW mv_daily_sales_by_region
AUTO REFRESH YES
AS
SELECT  c.region,
        s.sale_date,
        SUM(s.amount) AS daily_amount,
        COUNT(*)      AS order_count
FROM    sales    s
JOIN    customer c USING (customer_id)
GROUP BY c.region, s.sale_date;

With AUTO REFRESH YES, Redshift incrementally maintains the view. The optimizer's automatic query rewrite feature can transparently redirect a user's query against sales/customer to the materialized view when columns and predicates match — analysts get the speedup without changing SQL.

Key Takeaways — Spectrum, Serverless, and WLM

Chapter Summary

Redshift is a textbook example of how a cloud data warehouse trades complexity for performance. The leader node plans queries; compute nodes execute them in parallel; slices are the unit of parallelism; MPP ties them together. RA3 + RMS decouples compute from storage. Schema design — distribution styles and sort keys — is the largest performance lever. Spectrum federates SQL over S3; Serverless removes cluster management; data sharing avoids copies. WLM, concurrency scaling, result caching, and materialized views deliver predictable performance under mixed workloads. Master these levers and Redshift becomes the foundation for streaming, lakehouse, and orchestration patterns in later chapters.

Post-Reading Check — Part 3

1. Which set of optimizations does Redshift Spectrum apply when scanning S3 files, and why do they matter for cost?

It compresses files in flight; matters because S3 charges egress fees. Predicate pushdown, projection pushdown, and partition pruning — matters because Spectrum is priced per terabyte scanned. It rebuilds B-tree indexes on S3 objects; matters for query latency. It always loads the data into Redshift first; matters because loaded queries are faster.

2. A 10 TB CSV-format S3 lake fully scanned costs roughly $50 per query. Storing the same data as Parquet partitioned by year and region, queried with WHERE year = 2026 AND region = 'us-east', scans only ~100 GB. Why is this such a big difference?

Parquet files are physically smaller than CSV by exactly 100x. Spectrum prunes partitions outside the predicate and reads only needed columns from the columnar Parquet, drastically cutting bytes scanned. Parquet automatically caches in the leader node so repeat queries are free. Redshift forwards CSV scans to RDS for cheaper processing.

3. For an analytics platform with bursty, unpredictable workloads and long idle periods, which Redshift deployment model is best, and why?

Provisioned cluster, because the cluster-hour price is always cheaper. Redshift Serverless, because RPU-based auto-scaling charges only for capacity actually used and there is no idle cost. A single-node DC2 cluster, because it scales to zero automatically. Spectrum-only, because it never uses any compute capacity.

4. With Automatic WLM enabled, an executive dashboard query is tagged HIGH priority while an ad-hoc analyst query is tagged NORMAL. What does Redshift do differently for the two?

It rejects the analyst query until the dashboard finishes. It dynamically allocates more memory and earlier scheduling to the HIGH-priority query, while still giving the NORMAL query a fair share. It runs both queries on identical resources but logs the priority for billing. It moves the analyst query to a separate cluster every time.

5. When does Redshift's optimizer use a materialized view to answer a query that does not literally name the view in its FROM clause?

Never — users must always reference the view explicitly. When the view's columns and predicates match the query, the automatic query rewrite feature transparently redirects the query to the precomputed view. Only for queries from the leader node's superuser account. Whenever AUTO REFRESH NO is set on the view.

Your Progress

Answer Explanations