Study Guide: Cloud Data Warehousing with Redshift

Cloud data warehouses are the muscle of modern analytics platforms. Amazon Redshift is one of the longest-running and most widely deployed cloud warehouses, and its architecture demonstrates many of the design ideas you will encounter in Snowflake, BigQuery, and Databricks SQL. Master Redshift's slices, distribution keys, and Spectrum dispatch — and the rest of the cloud-warehouse universe becomes much easier to navigate.

Chapter Map

Before diving into each section, here is how the three partitions of this chapter relate.

Part 1: Redshift MPP Architecture

Leader Node and Compute Nodes

A Redshift cluster is a tightly coordinated team of specialists. The leader node is the conductor: when a SQL statement arrives, it parses the query, optimizes it, generates compiled C++ code, and ships that code along with execution instructions to the worker nodes. Critically, the leader node does not store user data and does not scan rows itself. Think of it as the air-traffic controller in a busy airport: it doesn't fly the planes, but nothing lands or takes off without its instructions.

The compute nodes are the planes. Each compute node is a server with its own CPU cores, memory, high-bandwidth network interface, and (for RA3 node types) local SSD storage. Compute nodes execute the compiled query code in parallel, scanning data, evaluating predicates, computing aggregates, and shipping intermediate results back to the leader for final assembly.

Component	Stores Data?	Executes Queries?	Talks to Clients?
Leader node	No	Plans only	Yes (single endpoint)
Compute node	Yes (via slices)	Yes (in parallel)	No (internal network)

When a client application connects, it always connects through the leader node's endpoint. The leader is therefore both the brain and the front door, while the compute nodes form the engine room.

Slices and Massively Parallel Processing

The unit of parallelism inside Redshift is not the node — it is the slice. Each compute node is partitioned into a fixed number of slices based on its instance type. For example, an ra3.xlplus node has 2 slices, while larger node types have more. Each slice receives a portion of the node's memory and disk and processes its share of the data in parallel with every other slice in the cluster.

This is the essence of Massively Parallel Processing (MPP): rather than a single server churning through a query, dozens or hundreds of slices work concurrently on disjoint partitions. If a query needs to scan a 10 billion-row fact table on a cluster with 32 slices, each slice scans roughly 312 million rows simultaneously. The cluster's total parallel capacity is nodes × slices_per_node; doubling node count doubles parallelism.

Animation 7.1: Query routing — Leader fans out to compute slices, then aggregates

A SQL query enters via the Leader Node. The compiled plan fans out to compute nodes, then to slices, which process in parallel and stream partial results back up.

A useful analogy is sorting mail in a giant post office. A single clerk (a traditional database) sorts every envelope sequentially. An MPP system hires 32 clerks (slices), gives each a bin labeled with a range of zip codes, and lets them all sort simultaneously. The supervisor (leader node) hands out the work and assembles the final stacks.

RA3 Nodes and Managed Storage

The original Redshift node families (DS2, DC2) coupled compute and storage tightly: disks lived inside compute nodes, and growing the warehouse meant adding nodes you didn't otherwise need. The RA3 family (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouples the two by introducing Redshift Managed Storage (RMS).

Tier	Location	Role
Tier 1 (hot)	Local NVMe SSDs on each RA3 node	Frequently accessed blocks; queried at SSD speed
Tier 2 (cold)	Amazon S3 (managed by Redshift)	Cold blocks, automatically offloaded; petabyte-scale

Redshift continuously analyzes block temperature, age, and workload patterns, then prefetches hot blocks to local SSD before queries need them. From the user's perspective, storage looks like one big disk, but in reality the hot working set lives next to the CPU while cold history sleeps in S3. RA3 also supports cluster relocation — moving a cluster between Availability Zones with the same endpoint and zero RPO — because the persistent data lives in regional S3.

Key Takeaways — MPP Architecture

The leader node plans queries and returns results; compute nodes execute compiled code in parallel.
Slices, not nodes, are the unit of parallel work; total parallelism = nodes × slices_per_node.
A query is fast when data is spread evenly across slices; one overloaded slice = the whole query waits.
RA3 + RMS decouples compute from storage: hot blocks on NVMe, cold blocks tiered to S3, sized independently.
Cluster relocation across AZs is possible because persistent data already lives in regional S3.

Part 2: Schema Design — Distribution and Sort Keys

Pre-Reading Check — Part 2

1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?

It physically sorts the rows on disk in customer_id order. Matching customer_id values land on the same slice on both sides, so the join is collocated and no rows have to move across the network. It tells Redshift to compress customer_id with AZ64. It causes both tables to be replicated to every node.

2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?

DISTSTYLE EVEN — cost is increased redistribution at join time. DISTKEY (country_code) — cost is hash skew if country_code is not unique. DISTSTYLE ALL — cost is multiplied storage (one full copy on every node). DISTSTYLE AUTO — cost is unpredictable cluster restarts.

3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?

WHERE sale_date = '2026-05-07' WHERE sale_date = '2026-05-07' AND region = 'us-east' WHERE sale_date BETWEEN '2026-01-01' AND '2026-06-30' WHERE state = 'CA' with no sale_date or region predicate

4. How do sort keys actually reduce I/O on filtered queries?

Redshift caches the entire sorted column in leader memory. Each 1 MB block has a zone map of min/max values; blocks whose range cannot match the predicate are skipped entirely. Sorted columns are stored row-wise, which is faster than columnar. Sort keys force Redshift to use a B-tree index for every query.

Schema design is the single biggest lever for Redshift performance. Two tables with identical columns can have query times that differ by 100x simply because of distribution and sort key choices. Distribution styles decide which slice each row lives on; sort keys decide where on disk the rows go within a slice.

Distribution Styles: KEY, ALL, EVEN, AUTO

When you load a row into Redshift, it has to land on exactly one slice. The distribution style tells Redshift how to choose.

KEY distribution picks one column as the DISTKEY. Redshift hashes that column's value and routes the row to the corresponding slice. Two tables with the same DISTKEY have matching key values landing on the same slice, so joins are collocated — no network shuffle.
ALL distribution stores a complete copy on the first slice of every node. Joins against an ALL table are always free, regardless of the other side. The cost: storage is multiplied by the node count.
EVEN distribution uses round-robin. Uniform spread, no help for joins (any join needs redistribution).
AUTO distribution is the default. Redshift starts with ALL for small tables, switches to EVEN as they grow, and may convert to KEY based on observed query patterns.

Style	Best For	Storage Cost	Join Cost
KEY	Large fact-dimension joins on a stable key	1x	Free if collocated
ALL	Small dimension tables joined many ways	N nodes × 1x	Always free
EVEN	Tables with no good join key	1x	Redistribution required
AUTO	Unknown access pattern; let Redshift decide	Adapts	Adapts

Animation 7.2: Distribution styles — where rows land on slices

Watch 8 sample rows distribute differently across 4 slices depending on the chosen DISTSTYLE. Click a tab to switch styles.

A canonical pattern is the fact + dimension schema. Suppose you have a sales fact table with 10 billion rows and a customer dimension with 50 million rows. They join on customer_id:

CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT NOT NULL,
    sale_date    DATE,
    amount       NUMERIC(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

CREATE TABLE customer (
    customer_id  BIGINT NOT NULL,
    name         VARCHAR(200),
    region       VARCHAR(50)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

Both tables now use customer_id as the DISTKEY, so all sales for customer 12345 sit on the same slice as customer 12345's row in the dimension table. The join is collocated. For a tiny dimension like country (200 rows), use DISTSTYLE ALL instead.

flowchart TD Start[New Table] --> Q1{Small and slowly
changing?} Q1 -->|Yes| ALL[DISTSTYLE ALL
full copy on every node] Q1 -->|No| Q2{Stable JOIN column?} Q2 -->|Yes| KEY[DISTKEY column
hash route to slice] Q2 -->|No| Q3{Need uniform spread
without join hint?} Q3 -->|Yes| EVEN[DISTSTYLE EVEN
round-robin] Q3 -->|Unknown / mixed| AUTO[DISTSTYLE AUTO
adapts over time]

Sort Keys: Compound vs Interleaved

Once a row is on its slice, the sort key decides where on disk it lands. Redshift stores data in 1 MB blocks and keeps a zone map for each block recording the min/max of every column. When a query has a WHERE predicate on a sorted column, the optimizer consults zone maps and skips entire blocks whose range cannot match. On a 10 TB table, a good sort key turns a full scan into a few hundred-millisecond seek.

Compound sort key (the default) sorts by the first column, then by the second within ties, like a phone book sorted by last name, first name, middle initial. It delivers dramatic speedups for predicates on the leading column or columns. A (sale_date, region, state) compound key shines for WHERE sale_date = '2026-05-07', still helps for WHERE sale_date = '2026-05-07' AND region = 'us-east', but provides no benefit for WHERE state = 'CA' alone.

Interleaved sort key gives equal weight to each column via a space-filling curve. It helps when queries filter on different subsets of the key columns over time. Trade-offs: load and VACUUM REINDEX are significantly slower; avoid on monotonically increasing columns.

Aspect	Compound	Interleaved
Filter pattern	Leading-column predicates	Any subset of key columns
Load cost	Low	Higher
VACUUM cost	Standard	Significant (REINDEX)
Best on	Time-series, ordered keys	Multi-dimensional cubes
Avoid on	(works broadly)	Monotonic IDs, dates alone

Compression Encodings

Every column in Redshift is stored in a columnar format with a compression encoding. Common encodings:

AZ64 — Amazon's proprietary encoding for numeric/date types; the modern default.
ZSTD — General-purpose, high-ratio; strong default for VARCHAR/CHAR.
BYTEDICT, DELTA, RUNLENGTH — Specialized for low-cardinality, sequential, or sparse data.
RAW — No compression; used for sort key columns.

Use COPY … COMPUPDATE ON or run ANALYZE COMPRESSION and apply the recommendations.

Key Takeaways — Schema Design

Choose DISTKEY by JOIN pattern: co-locate the fact table and its primary dimension on the shared foreign key.
Use DISTSTYLE ALL for small, slowly-changing dimensions joined many ways.
Compound sort keys are best for predictable, leading-column filters (often a date).
Interleaved sort keys are for genuinely multi-dimensional ad-hoc filtering — with much higher VACUUM cost.
Sort keys reduce I/O via zone maps over 1 MB blocks — a good sort key skips entire blocks.
Default to AZ64 for numerics and ZSTD for strings unless ANALYZE COMPRESSION says otherwise.

Post-Reading Check — Part 2

1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?

2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?

3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?

4. How do sort keys actually reduce I/O on filtered queries?

Part 3: Spectrum, Serverless, and WLM

Pre-Reading Check — Part 3

1. Which set of optimizations does Redshift Spectrum apply when scanning S3 files, and why do they matter for cost?

It compresses files in flight; matters because S3 charges egress fees. Predicate pushdown, projection pushdown, and partition pruning — matters because Spectrum is priced per terabyte scanned. It rebuilds B-tree indexes on S3 objects; matters for query latency. It always loads the data into Redshift first; matters because loaded queries are faster.

2. A 10 TB CSV-format S3 lake fully scanned costs roughly $50 per query. Storing the same data as Parquet partitioned by year and region, queried with WHERE year = 2026 AND region = 'us-east', scans only ~100 GB. Why is this such a big difference?

Parquet files are physically smaller than CSV by exactly 100x. Spectrum prunes partitions outside the predicate and reads only needed columns from the columnar Parquet, drastically cutting bytes scanned. Parquet automatically caches in the leader node so repeat queries are free. Redshift forwards CSV scans to RDS for cheaper processing.

3. For an analytics platform with bursty, unpredictable workloads and long idle periods, which Redshift deployment model is best, and why?

Provisioned cluster, because the cluster-hour price is always cheaper. Redshift Serverless, because RPU-based auto-scaling charges only for capacity actually used and there is no idle cost. A single-node DC2 cluster, because it scales to zero automatically. Spectrum-only, because it never uses any compute capacity.

4. With Automatic WLM enabled, an executive dashboard query is tagged HIGH priority while an ad-hoc analyst query is tagged NORMAL. What does Redshift do differently for the two?

It rejects the analyst query until the dashboard finishes. It dynamically allocates more memory and earlier scheduling to the HIGH-priority query, while still giving the NORMAL query a fair share. It runs both queries on identical resources but logs the priority for billing. It moves the analyst query to a separate cluster every time.

5. When does Redshift's optimizer use a materialized view to answer a query that does not literally name the view in its FROM clause?

Never — users must always reference the view explicitly. When the view's columns and predicates match the query, the automatic query rewrite feature transparently redirects the query to the precomputed view. Only for queries from the leader node's superuser account. Whenever AUTO REFRESH NO is set on the view.

Querying S3 Data with Spectrum

Redshift Spectrum lets a Redshift cluster run SQL directly against files in S3 — Parquet, ORC, CSV, JSON, Avro — without first loading them into Redshift tables. The mental model is a federation layer: Redshift sees the S3 data as external tables living in an external schema backed by the AWS Glue Data Catalog (or a Hive Metastore).

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
REGION 'us-east-1';

After this single command, every table cataloged in the analytics_lake Glue database appears in Redshift as spectrum_schema.<table_name>. The most powerful pattern is mixing local and external data in one query — recent hot data in Redshift fact tables joined with archived S3 history:

SELECT
    c.region,
    SUM(s.amount)         AS hot_amount,
    SUM(h.amount)         AS cold_amount
FROM   public.customer            c
JOIN   public.sales               s ON s.customer_id = c.customer_id
LEFT JOIN spectrum_schema.archived_sales h
       ON h.customer_id = c.customer_id
WHERE  s.sale_date >= DATEADD(month, -3, CURRENT_DATE)
GROUP BY c.region;

Spectrum runs on a separate, massively scaled fleet of workers. They perform predicate pushdown, projection pushdown, and partition pruning using Hive-style partition layouts like s3://lake/sales/year=2026/region=us-east/. Because Spectrum is priced per terabyte scanned, partition pruning and columnar formats translate directly to dollars saved.

Concretely: a 10 TB CSV-format data lake fully scanned costs roughly 10 × $5 = $50 per query. The same data as Parquet with year/region partitions, filtered to one partition, scans only ~100 GB — about $0.50 per query. Same answer, 100x cheaper.

Animation 7.3: Spectrum query path — Redshift dispatches scan to S3 via Glue

Watch a SQL query travel from the client through the leader node, resolve metadata at the Glue Data Catalog, dispatch to the Spectrum worker fleet, scan S3, and return filtered rows to compute nodes for join.

sequenceDiagram participant C as Client participant L as Leader Node participant G as Glue Data Catalog participant SF as Spectrum Fleet participant S3 as S3 (Parquet) participant CN as Compute Nodes C->>L: SQL: spectrum_schema.archived_sales JOIN public.sales L->>G: Resolve external schema + metadata G-->>L: File paths, partitions, stats L->>SF: Dispatch scan with predicates SF->>S3: Read matching partitions/columns S3-->>SF: Filtered Parquet row groups SF-->>CN: Stream filtered rows CN-->>L: Partial aggregates L-->>C: Final result set

Redshift Serverless Capacity Model

For ad-hoc analytics, data-app backends, and sporadic dashboards, an always-on cluster is overkill. Redshift Serverless replaces the cluster abstraction with a capacity unit called the RPU (Redshift Processing Unit). You set a base capacity (in RPUs) and an optional max, and Redshift auto-scales between them as workloads demand.

Dimension	Provisioned	Serverless
Pricing unit	Cluster-hour	RPU-hour
Scaling	Manual (resize)	Automatic
Cold start	None (always on)	~30 seconds first query
Idle cost	Yes (always billed)	None (pause when idle)
WLM	Manual queues	Automatic
Best for	Steady, predictable load	Bursty, unpredictable load

Serverless still supports Spectrum identically, so a common pattern is Serverless + Spectrum: a Glue Catalog over an S3 lakehouse, a Serverless workgroup, paying only for RPU-seconds that actual queries consume.

Data Sharing Across Clusters

Redshift data sharing lets one producer cluster expose specific schemas/tables to one or more consumer clusters — even across AWS accounts and regions — without copying data. Use it to separate an ETL/ELT producer cluster from BI consumer clusters, or to share curated data with another business unit, while keeping a single source of truth in RMS.

Workload Management (WLM)

Even on a perfectly designed schema, performance falls apart when ETL jobs, executive dashboards, and ad-hoc queries all hit the cluster simultaneously. WLM organizes queries into queues; each queue gets a slice of memory and a max concurrency.

Manual WLM: define queues by hand — "ETL: 50% memory, concurrency 4; BI: 30% memory, concurrency 8."
Automatic WLM (recommended): Redshift learns from query history and dynamically allocates memory/concurrency. You only specify query priorities (LOWEST, LOW, NORMAL, HIGH, HIGHEST).

Workload	Queue / Priority	Rationale
Critical ETL loads	HIGHEST	Must finish in batch window
Executive dashboards	HIGH	User-facing latency
Analyst ad-hoc	NORMAL	Many users, tolerate seconds
Long-running reports	LOW	Background, can wait
Experimental queries	LOWEST	Don't impact others

Concurrency Scaling, Result Caching, and Materialized Views

Concurrency scaling automatically spins up transient secondary clusters when read-only queries queue up; results return through the original cluster. Each main-cluster running day earns one free hour of concurrency scaling, then bills per-second.

Result caching stores recent query results in the leader's memory; identical queries against unchanged data return in milliseconds, with no configuration.

Materialized views precompute and persist expensive joins/aggregations:

CREATE MATERIALIZED VIEW mv_daily_sales_by_region
AUTO REFRESH YES
AS
SELECT  c.region,
        s.sale_date,
        SUM(s.amount) AS daily_amount,
        COUNT(*)      AS order_count
FROM    sales    s
JOIN    customer c USING (customer_id)
GROUP BY c.region, s.sale_date;

With AUTO REFRESH YES, Redshift incrementally maintains the view. The optimizer's automatic query rewrite feature can transparently redirect a user's query against sales/customer to the materialized view when columns and predicates match — analysts get the speedup without changing SQL.

Key Takeaways — Spectrum, Serverless, and WLM

Spectrum extends Redshift SQL to the S3 data lake without ETL, billed per TB scanned. Use Parquet, partitions, and predicate pushdown to keep cost low.
Serverless trades cluster management for RPU-based auto-scaling, with no idle cost — ideal for bursty workloads.
Data sharing lets multiple clusters read one canonical dataset without copies.
Automatic WLM with query priorities is the recommended path for mixed workloads; concurrency scaling absorbs bursts.
Result caching and materialized views with auto-rewrite collapse repeat-query cost to near zero.

Chapter Summary

Redshift is a textbook example of how a cloud data warehouse trades complexity for performance. The leader node plans queries; compute nodes execute them in parallel; slices are the unit of parallelism; MPP ties them together. RA3 + RMS decouples compute from storage. Schema design — distribution styles and sort keys — is the largest performance lever. Spectrum federates SQL over S3; Serverless removes cluster management; data sharing avoids copies. WLM, concurrency scaling, result caching, and materialized views deliver predictable performance under mixed workloads. Master these levers and Redshift becomes the foundation for streaming, lakehouse, and orchestration patterns in later chapters.

Post-Reading Check — Part 3