Describe the Redshift architecture, including the responsibilities of the leader and compute nodes, the role of slices in parallel execution, and how RA3 managed storage decouples compute from storage.
Choose appropriate distribution and sort keys to optimize query performance for fact, dimension, and analytical workloads.
Use Redshift Spectrum and Redshift Serverless to query data lake files in Amazon S3 without first loading them into the warehouse.
Apply Workload Management (WLM), concurrency scaling, result caching, and materialized views to deliver predictable performance under mixed workloads.
Cloud data warehouses are the muscle of modern analytics platforms. Amazon Redshift is one of the longest-running and most widely deployed cloud warehouses, and its architecture demonstrates many of the design ideas you will encounter in Snowflake, BigQuery, and Databricks SQL. Master Redshift's slices, distribution keys, and Spectrum dispatch — and the rest of the cloud-warehouse universe becomes much easier to navigate.
Chapter Map
Before diving into each section, here is how the three partitions of this chapter relate.
flowchart LR
A[MPP Architecture Leader + Compute + Slices + RA3] --> B[Schema Design Distribution + Sort Keys]
B --> C[Spectrum + Serverless + WLM Lakehouse + Operations]
A -.->|enables parallelism| B
B -.->|sets data layout| C
Part 1: Redshift MPP Architecture
Pre-Reading Check — Part 1
1. Which statement best describes the division of labor between the leader node and the compute nodes in a Redshift cluster?
The leader node stores user data while the compute nodes only relay queries.The leader node parses, optimizes, and dispatches queries; the compute nodes scan data and execute the compiled plan in parallel.Both node types execute scans, but only compute nodes communicate with clients.The leader node executes single-row queries while compute nodes only handle aggregations.
2. A Redshift cluster has 4 RA3 nodes and each node has 8 slices. A query scans a 4 billion-row fact table that is evenly distributed. Roughly how many rows does each slice scan?
125 million rows.500 million rows.1 billion rows.4 billion rows (each slice gets a full copy).
3. Which capability of RA3 nodes is fundamentally enabled by Redshift Managed Storage (RMS)?
Storing all blocks exclusively on local NVMe SSDs.Eliminating the leader node from the cluster.Sizing compute and storage independently, with cold blocks tiered to S3 and cluster relocation across AZs at zero RPO.Replacing columnar storage with row-oriented storage.
Leader Node and Compute Nodes
A Redshift cluster is a tightly coordinated team of specialists. The leader node is the conductor: when a SQL statement arrives, it parses the query, optimizes it, generates compiled C++ code, and ships that code along with execution instructions to the worker nodes. Critically, the leader node does not store user data and does not scan rows itself. Think of it as the air-traffic controller in a busy airport: it doesn't fly the planes, but nothing lands or takes off without its instructions.
The compute nodes are the planes. Each compute node is a server with its own CPU cores, memory, high-bandwidth network interface, and (for RA3 node types) local SSD storage. Compute nodes execute the compiled query code in parallel, scanning data, evaluating predicates, computing aggregates, and shipping intermediate results back to the leader for final assembly.
Component
Stores Data?
Executes Queries?
Talks to Clients?
Leader node
No
Plans only
Yes (single endpoint)
Compute node
Yes (via slices)
Yes (in parallel)
No (internal network)
When a client application connects, it always connects through the leader node's endpoint. The leader is therefore both the brain and the front door, while the compute nodes form the engine room.
Slices and Massively Parallel Processing
The unit of parallelism inside Redshift is not the node — it is the slice. Each compute node is partitioned into a fixed number of slices based on its instance type. For example, an ra3.xlplus node has 2 slices, while larger node types have more. Each slice receives a portion of the node's memory and disk and processes its share of the data in parallel with every other slice in the cluster.
This is the essence of Massively Parallel Processing (MPP): rather than a single server churning through a query, dozens or hundreds of slices work concurrently on disjoint partitions. If a query needs to scan a 10 billion-row fact table on a cluster with 32 slices, each slice scans roughly 312 million rows simultaneously. The cluster's total parallel capacity is nodes × slices_per_node; doubling node count doubles parallelism.
Animation 7.1: Query routing — Leader fans out to compute slices, then aggregates
A SQL query enters via the Leader Node. The compiled plan fans out to compute nodes, then to slices, which process in parallel and stream partial results back up.
A useful analogy is sorting mail in a giant post office. A single clerk (a traditional database) sorts every envelope sequentially. An MPP system hires 32 clerks (slices), gives each a bin labeled with a range of zip codes, and lets them all sort simultaneously. The supervisor (leader node) hands out the work and assembles the final stacks.
The original Redshift node families (DS2, DC2) coupled compute and storage tightly: disks lived inside compute nodes, and growing the warehouse meant adding nodes you didn't otherwise need. The RA3 family (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouples the two by introducing Redshift Managed Storage (RMS).
Redshift continuously analyzes block temperature, age, and workload patterns, then prefetches hot blocks to local SSD before queries need them. From the user's perspective, storage looks like one big disk, but in reality the hot working set lives next to the CPU while cold history sleeps in S3. RA3 also supports cluster relocation — moving a cluster between Availability Zones with the same endpoint and zero RPO — because the persistent data lives in regional S3.
flowchart LR
Q[Incoming Query] --> CL[Compute Layer RA3 Nodes vCPU + RAM]
CL -->|hot blocks| T1[Tier 1: Local NVMe SSD]
CL -->|cold blocks fetched on demand| T2[Tier 2: Amazon S3 RMS]
T2 -.->|prefetch by block temperature| T1
T1 -.->|evict cold blocks| T2
T2 --> AZ[Cross-AZ Durability cluster relocation, zero RPO]
Key Takeaways — MPP Architecture
The leader node plans queries and returns results; compute nodes execute compiled code in parallel.
Slices, not nodes, are the unit of parallel work; total parallelism = nodes × slices_per_node.
A query is fast when data is spread evenly across slices; one overloaded slice = the whole query waits.
RA3 + RMS decouples compute from storage: hot blocks on NVMe, cold blocks tiered to S3, sized independently.
Cluster relocation across AZs is possible because persistent data already lives in regional S3.
Post-Reading Check — Part 1
1. Which statement best describes the division of labor between the leader node and the compute nodes in a Redshift cluster?
The leader node stores user data while the compute nodes only relay queries.The leader node parses, optimizes, and dispatches queries; the compute nodes scan data and execute the compiled plan in parallel.Both node types execute scans, but only compute nodes communicate with clients.The leader node executes single-row queries while compute nodes only handle aggregations.
2. A Redshift cluster has 4 RA3 nodes and each node has 8 slices. A query scans a 4 billion-row fact table that is evenly distributed. Roughly how many rows does each slice scan?
125 million rows.500 million rows.1 billion rows.4 billion rows (each slice gets a full copy).
3. Which capability of RA3 nodes is fundamentally enabled by Redshift Managed Storage (RMS)?
Storing all blocks exclusively on local NVMe SSDs.Eliminating the leader node from the cluster.Sizing compute and storage independently, with cold blocks tiered to S3 and cluster relocation across AZs at zero RPO.Replacing columnar storage with row-oriented storage.
Part 2: Schema Design — Distribution and Sort Keys
Pre-Reading Check — Part 2
1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?
It physically sorts the rows on disk in customer_id order.Matching customer_id values land on the same slice on both sides, so the join is collocated and no rows have to move across the network.It tells Redshift to compress customer_id with AZ64.It causes both tables to be replicated to every node.
2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?
DISTSTYLE EVEN — cost is increased redistribution at join time.DISTKEY (country_code) — cost is hash skew if country_code is not unique.DISTSTYLE ALL — cost is multiplied storage (one full copy on every node).DISTSTYLE AUTO — cost is unpredictable cluster restarts.
3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?
WHERE sale_date = '2026-05-07'WHERE sale_date = '2026-05-07' AND region = 'us-east'WHERE sale_date BETWEEN '2026-01-01' AND '2026-06-30'WHERE state = 'CA' with no sale_date or region predicate
4. How do sort keys actually reduce I/O on filtered queries?
Redshift caches the entire sorted column in leader memory.Each 1 MB block has a zone map of min/max values; blocks whose range cannot match the predicate are skipped entirely.Sorted columns are stored row-wise, which is faster than columnar.Sort keys force Redshift to use a B-tree index for every query.
Schema design is the single biggest lever for Redshift performance. Two tables with identical columns can have query times that differ by 100x simply because of distribution and sort key choices. Distribution styles decide which slice each row lives on; sort keys decide where on disk the rows go within a slice.
Distribution Styles: KEY, ALL, EVEN, AUTO
When you load a row into Redshift, it has to land on exactly one slice. The distribution style tells Redshift how to choose.
KEY distribution picks one column as the DISTKEY. Redshift hashes that column's value and routes the row to the corresponding slice. Two tables with the same DISTKEY have matching key values landing on the same slice, so joins are collocated — no network shuffle.
ALL distribution stores a complete copy on the first slice of every node. Joins against an ALL table are always free, regardless of the other side. The cost: storage is multiplied by the node count.
EVEN distribution uses round-robin. Uniform spread, no help for joins (any join needs redistribution).
AUTO distribution is the default. Redshift starts with ALL for small tables, switches to EVEN as they grow, and may convert to KEY based on observed query patterns.
Style
Best For
Storage Cost
Join Cost
KEY
Large fact-dimension joins on a stable key
1x
Free if collocated
ALL
Small dimension tables joined many ways
N nodes × 1x
Always free
EVEN
Tables with no good join key
1x
Redistribution required
AUTO
Unknown access pattern; let Redshift decide
Adapts
Adapts
Animation 7.2: Distribution styles — where rows land on slices
Watch 8 sample rows distribute differently across 4 slices depending on the chosen DISTSTYLE. Click a tab to switch styles.
A canonical pattern is the fact + dimension schema. Suppose you have a sales fact table with 10 billion rows and a customer dimension with 50 million rows. They join on customer_id:
CREATE TABLE sales (
sale_id BIGINT,
customer_id BIGINT NOT NULL,
sale_date DATE,
amount NUMERIC(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);
CREATE TABLE customer (
customer_id BIGINT NOT NULL,
name VARCHAR(200),
region VARCHAR(50)
)
DISTKEY (customer_id)
SORTKEY (customer_id);
Both tables now use customer_id as the DISTKEY, so all sales for customer 12345 sit on the same slice as customer 12345's row in the dimension table. The join is collocated. For a tiny dimension like country (200 rows), use DISTSTYLE ALL instead.
flowchart TD
Start[New Table] --> Q1{Small and slowly changing?}
Q1 -->|Yes| ALL[DISTSTYLE ALL full copy on every node]
Q1 -->|No| Q2{Stable JOIN column?}
Q2 -->|Yes| KEY[DISTKEY column hash route to slice]
Q2 -->|No| Q3{Need uniform spread without join hint?}
Q3 -->|Yes| EVEN[DISTSTYLE EVEN round-robin]
Q3 -->|Unknown / mixed| AUTO[DISTSTYLE AUTO adapts over time]
Sort Keys: Compound vs Interleaved
Once a row is on its slice, the sort key decides where on disk it lands. Redshift stores data in 1 MB blocks and keeps a zone map for each block recording the min/max of every column. When a query has a WHERE predicate on a sorted column, the optimizer consults zone maps and skips entire blocks whose range cannot match. On a 10 TB table, a good sort key turns a full scan into a few hundred-millisecond seek.
Compound sort key (the default) sorts by the first column, then by the second within ties, like a phone book sorted by last name, first name, middle initial. It delivers dramatic speedups for predicates on the leading column or columns. A (sale_date, region, state) compound key shines for WHERE sale_date = '2026-05-07', still helps for WHERE sale_date = '2026-05-07' AND region = 'us-east', but provides no benefit for WHERE state = 'CA' alone.
Interleaved sort key gives equal weight to each column via a space-filling curve. It helps when queries filter on different subsets of the key columns over time. Trade-offs: load and VACUUM REINDEX are significantly slower; avoid on monotonically increasing columns.
Aspect
Compound
Interleaved
Filter pattern
Leading-column predicates
Any subset of key columns
Load cost
Low
Higher
VACUUM cost
Standard
Significant (REINDEX)
Best on
Time-series, ordered keys
Multi-dimensional cubes
Avoid on
(works broadly)
Monotonic IDs, dates alone
Compression Encodings
Every column in Redshift is stored in a columnar format with a compression encoding. Common encodings:
AZ64 — Amazon's proprietary encoding for numeric/date types; the modern default.
ZSTD — General-purpose, high-ratio; strong default for VARCHAR/CHAR.
BYTEDICT, DELTA, RUNLENGTH — Specialized for low-cardinality, sequential, or sparse data.
RAW — No compression; used for sort key columns.
Use COPY … COMPUPDATE ON or run ANALYZE COMPRESSION and apply the recommendations.
Key Takeaways — Schema Design
Choose DISTKEY by JOIN pattern: co-locate the fact table and its primary dimension on the shared foreign key.
Use DISTSTYLE ALL for small, slowly-changing dimensions joined many ways.
Compound sort keys are best for predictable, leading-column filters (often a date).
Interleaved sort keys are for genuinely multi-dimensional ad-hoc filtering — with much higher VACUUM cost.
Sort keys reduce I/O via zone maps over 1 MB blocks — a good sort key skips entire blocks.
Default to AZ64 for numerics and ZSTD for strings unless ANALYZE COMPRESSION says otherwise.
Post-Reading Check — Part 2
1. Why does setting customer_id as the DISTKEY on both the sales fact table and the customer dimension table speed up the join between them?
It physically sorts the rows on disk in customer_id order.Matching customer_id values land on the same slice on both sides, so the join is collocated and no rows have to move across the network.It tells Redshift to compress customer_id with AZ64.It causes both tables to be replicated to every node.
2. You have a 200-row country dimension joined many ways from large fact tables on a 10-node cluster. Which distribution style is the best fit, and what is its main cost?
DISTSTYLE EVEN — cost is increased redistribution at join time.DISTKEY (country_code) — cost is hash skew if country_code is not unique.DISTSTYLE ALL — cost is multiplied storage (one full copy on every node).DISTSTYLE AUTO — cost is unpredictable cluster restarts.
3. A table has a compound sort key on (sale_date, region, state). Which query benefits the LEAST from this sort key?
WHERE sale_date = '2026-05-07'WHERE sale_date = '2026-05-07' AND region = 'us-east'WHERE sale_date BETWEEN '2026-01-01' AND '2026-06-30'WHERE state = 'CA' with no sale_date or region predicate
4. How do sort keys actually reduce I/O on filtered queries?
Redshift caches the entire sorted column in leader memory.Each 1 MB block has a zone map of min/max values; blocks whose range cannot match the predicate are skipped entirely.Sorted columns are stored row-wise, which is faster than columnar.Sort keys force Redshift to use a B-tree index for every query.
Part 3: Spectrum, Serverless, and WLM
Pre-Reading Check — Part 3
1. Which set of optimizations does Redshift Spectrum apply when scanning S3 files, and why do they matter for cost?
It compresses files in flight; matters because S3 charges egress fees.Predicate pushdown, projection pushdown, and partition pruning — matters because Spectrum is priced per terabyte scanned.It rebuilds B-tree indexes on S3 objects; matters for query latency.It always loads the data into Redshift first; matters because loaded queries are faster.
2. A 10 TB CSV-format S3 lake fully scanned costs roughly $50 per query. Storing the same data as Parquet partitioned by year and region, queried with WHERE year = 2026 AND region = 'us-east', scans only ~100 GB. Why is this such a big difference?
Parquet files are physically smaller than CSV by exactly 100x.Spectrum prunes partitions outside the predicate and reads only needed columns from the columnar Parquet, drastically cutting bytes scanned.Parquet automatically caches in the leader node so repeat queries are free.Redshift forwards CSV scans to RDS for cheaper processing.
3. For an analytics platform with bursty, unpredictable workloads and long idle periods, which Redshift deployment model is best, and why?
Provisioned cluster, because the cluster-hour price is always cheaper.Redshift Serverless, because RPU-based auto-scaling charges only for capacity actually used and there is no idle cost.A single-node DC2 cluster, because it scales to zero automatically.Spectrum-only, because it never uses any compute capacity.
4. With Automatic WLM enabled, an executive dashboard query is tagged HIGH priority while an ad-hoc analyst query is tagged NORMAL. What does Redshift do differently for the two?
It rejects the analyst query until the dashboard finishes.It dynamically allocates more memory and earlier scheduling to the HIGH-priority query, while still giving the NORMAL query a fair share.It runs both queries on identical resources but logs the priority for billing.It moves the analyst query to a separate cluster every time.
5. When does Redshift's optimizer use a materialized view to answer a query that does not literally name the view in its FROM clause?
Never — users must always reference the view explicitly.When the view's columns and predicates match the query, the automatic query rewrite feature transparently redirects the query to the precomputed view.Only for queries from the leader node's superuser account.Whenever AUTO REFRESH NO is set on the view.
Querying S3 Data with Spectrum
Redshift Spectrum lets a Redshift cluster run SQL directly against files in S3 — Parquet, ORC, CSV, JSON, Avro — without first loading them into Redshift tables. The mental model is a federation layer: Redshift sees the S3 data as external tables living in an external schema backed by the AWS Glue Data Catalog (or a Hive Metastore).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
REGION 'us-east-1';
After this single command, every table cataloged in the analytics_lake Glue database appears in Redshift as spectrum_schema.<table_name>. The most powerful pattern is mixing local and external data in one query — recent hot data in Redshift fact tables joined with archived S3 history:
SELECT
c.region,
SUM(s.amount) AS hot_amount,
SUM(h.amount) AS cold_amount
FROM public.customer c
JOIN public.sales s ON s.customer_id = c.customer_id
LEFT JOIN spectrum_schema.archived_sales h
ON h.customer_id = c.customer_id
WHERE s.sale_date >= DATEADD(month, -3, CURRENT_DATE)
GROUP BY c.region;
Spectrum runs on a separate, massively scaled fleet of workers. They perform predicate pushdown, projection pushdown, and partition pruning using Hive-style partition layouts like s3://lake/sales/year=2026/region=us-east/. Because Spectrum is priced per terabyte scanned, partition pruning and columnar formats translate directly to dollars saved.
Concretely: a 10 TB CSV-format data lake fully scanned costs roughly 10 × $5 = $50 per query. The same data as Parquet with year/region partitions, filtered to one partition, scans only ~100 GB — about $0.50 per query. Same answer, 100x cheaper.
Animation 7.3: Spectrum query path — Redshift dispatches scan to S3 via Glue
Watch a SQL query travel from the client through the leader node, resolve metadata at the Glue Data Catalog, dispatch to the Spectrum worker fleet, scan S3, and return filtered rows to compute nodes for join.
sequenceDiagram
participant C as Client
participant L as Leader Node
participant G as Glue Data Catalog
participant SF as Spectrum Fleet
participant S3 as S3 (Parquet)
participant CN as Compute Nodes
C->>L: SQL: spectrum_schema.archived_sales JOIN public.sales
L->>G: Resolve external schema + metadata
G-->>L: File paths, partitions, stats
L->>SF: Dispatch scan with predicates
SF->>S3: Read matching partitions/columns
S3-->>SF: Filtered Parquet row groups
SF-->>CN: Stream filtered rows
CN-->>L: Partial aggregates
L-->>C: Final result set
Redshift Serverless Capacity Model
For ad-hoc analytics, data-app backends, and sporadic dashboards, an always-on cluster is overkill. Redshift Serverless replaces the cluster abstraction with a capacity unit called the RPU (Redshift Processing Unit). You set a base capacity (in RPUs) and an optional max, and Redshift auto-scales between them as workloads demand.
Dimension
Provisioned
Serverless
Pricing unit
Cluster-hour
RPU-hour
Scaling
Manual (resize)
Automatic
Cold start
None (always on)
~30 seconds first query
Idle cost
Yes (always billed)
None (pause when idle)
WLM
Manual queues
Automatic
Best for
Steady, predictable load
Bursty, unpredictable load
Serverless still supports Spectrum identically, so a common pattern is Serverless + Spectrum: a Glue Catalog over an S3 lakehouse, a Serverless workgroup, paying only for RPU-seconds that actual queries consume.
Data Sharing Across Clusters
Redshift data sharing lets one producer cluster expose specific schemas/tables to one or more consumer clusters — even across AWS accounts and regions — without copying data. Use it to separate an ETL/ELT producer cluster from BI consumer clusters, or to share curated data with another business unit, while keeping a single source of truth in RMS.
Workload Management (WLM)
Even on a perfectly designed schema, performance falls apart when ETL jobs, executive dashboards, and ad-hoc queries all hit the cluster simultaneously. WLM organizes queries into queues; each queue gets a slice of memory and a max concurrency.
Manual WLM: define queues by hand — "ETL: 50% memory, concurrency 4; BI: 30% memory, concurrency 8."
Automatic WLM (recommended): Redshift learns from query history and dynamically allocates memory/concurrency. You only specify query priorities (LOWEST, LOW, NORMAL, HIGH, HIGHEST).
Workload
Queue / Priority
Rationale
Critical ETL loads
HIGHEST
Must finish in batch window
Executive dashboards
HIGH
User-facing latency
Analyst ad-hoc
NORMAL
Many users, tolerate seconds
Long-running reports
LOW
Background, can wait
Experimental queries
LOWEST
Don't impact others
Concurrency Scaling, Result Caching, and Materialized Views
Concurrency scaling automatically spins up transient secondary clusters when read-only queries queue up; results return through the original cluster. Each main-cluster running day earns one free hour of concurrency scaling, then bills per-second.
Result caching stores recent query results in the leader's memory; identical queries against unchanged data return in milliseconds, with no configuration.
Materialized views precompute and persist expensive joins/aggregations:
CREATE MATERIALIZED VIEW mv_daily_sales_by_region
AUTO REFRESH YES
AS
SELECT c.region,
s.sale_date,
SUM(s.amount) AS daily_amount,
COUNT(*) AS order_count
FROM sales s
JOIN customer c USING (customer_id)
GROUP BY c.region, s.sale_date;
With AUTO REFRESH YES, Redshift incrementally maintains the view. The optimizer's automatic query rewrite feature can transparently redirect a user's query against sales/customer to the materialized view when columns and predicates match — analysts get the speedup without changing SQL.
Key Takeaways — Spectrum, Serverless, and WLM
Spectrum extends Redshift SQL to the S3 data lake without ETL, billed per TB scanned. Use Parquet, partitions, and predicate pushdown to keep cost low.
Serverless trades cluster management for RPU-based auto-scaling, with no idle cost — ideal for bursty workloads.
Data sharing lets multiple clusters read one canonical dataset without copies.
Automatic WLM with query priorities is the recommended path for mixed workloads; concurrency scaling absorbs bursts.
Result caching and materialized views with auto-rewrite collapse repeat-query cost to near zero.
Chapter Summary
Redshift is a textbook example of how a cloud data warehouse trades complexity for performance. The leader node plans queries; compute nodes execute them in parallel; slices are the unit of parallelism; MPP ties them together. RA3 + RMS decouples compute from storage. Schema design — distribution styles and sort keys — is the largest performance lever. Spectrum federates SQL over S3; Serverless removes cluster management; data sharing avoids copies. WLM, concurrency scaling, result caching, and materialized views deliver predictable performance under mixed workloads. Master these levers and Redshift becomes the foundation for streaming, lakehouse, and orchestration patterns in later chapters.
Post-Reading Check — Part 3
1. Which set of optimizations does Redshift Spectrum apply when scanning S3 files, and why do they matter for cost?
It compresses files in flight; matters because S3 charges egress fees.Predicate pushdown, projection pushdown, and partition pruning — matters because Spectrum is priced per terabyte scanned.It rebuilds B-tree indexes on S3 objects; matters for query latency.It always loads the data into Redshift first; matters because loaded queries are faster.
2. A 10 TB CSV-format S3 lake fully scanned costs roughly $50 per query. Storing the same data as Parquet partitioned by year and region, queried with WHERE year = 2026 AND region = 'us-east', scans only ~100 GB. Why is this such a big difference?
Parquet files are physically smaller than CSV by exactly 100x.Spectrum prunes partitions outside the predicate and reads only needed columns from the columnar Parquet, drastically cutting bytes scanned.Parquet automatically caches in the leader node so repeat queries are free.Redshift forwards CSV scans to RDS for cheaper processing.
3. For an analytics platform with bursty, unpredictable workloads and long idle periods, which Redshift deployment model is best, and why?
Provisioned cluster, because the cluster-hour price is always cheaper.Redshift Serverless, because RPU-based auto-scaling charges only for capacity actually used and there is no idle cost.A single-node DC2 cluster, because it scales to zero automatically.Spectrum-only, because it never uses any compute capacity.
4. With Automatic WLM enabled, an executive dashboard query is tagged HIGH priority while an ad-hoc analyst query is tagged NORMAL. What does Redshift do differently for the two?
It rejects the analyst query until the dashboard finishes.It dynamically allocates more memory and earlier scheduling to the HIGH-priority query, while still giving the NORMAL query a fair share.It runs both queries on identical resources but logs the priority for billing.It moves the analyst query to a separate cluster every time.
5. When does Redshift's optimizer use a materialized view to answer a query that does not literally name the view in its FROM clause?
Never — users must always reference the view explicitly.When the view's columns and predicates match the query, the automatic query rewrite feature transparently redirects the query to the precomputed view.Only for queries from the leader node's superuser account.Whenever AUTO REFRESH NO is set on the view.