Data Engineering and Warehousing with Modern Pipelines
A comprehensive intermediate-level textbook covering modern data engineering, lakehouse architecture, streaming and batch pipelines, cloud warehousing, governance, and analytics integration patterns drawn from AWS analytics services.
Table of Contents
- Chapter 1: Foundations of Modern Data Engineering
- Chapter 2: The Lakehouse Paradigm
- Chapter 3: Storage Foundations and Open Table Formats
- Chapter 4: Batch Ingestion and ETL/ELT Pipelines
- Chapter 5: Streaming Ingestion and Real-Time Pipelines
- Chapter 6: Distributed Processing with Spark and EMR
- Chapter 7: Cloud Data Warehousing with Redshift
- Chapter 8: Interactive Querying and Federated Analytics
- Chapter 9: Workflow Orchestration with Airflow and MWAA
- Chapter 10: Zero-ETL and SaaS Integration
- Chapter 11: Data Governance, Catalog, and Access Control
- Chapter 12: Search, Logs, and Observability with OpenSearch
- Chapter 13: BI, ML, and Cost Optimization in Production
Chapter 1: Foundations of Modern Data Engineering
Learning Objectives
- Define the core responsibilities of a data engineer and how the discipline differs from analytics engineering and ML engineering
- Compare batch, micro-batch, and streaming processing models with appropriate use cases
- Trace the historical evolution from on-premises ETL to modern cloud-native data platforms
- Identify the key drivers of modern pipeline architectures: latency, scale, cost, and governance
What Modern Data Engineering Means
Data engineering in 2025 is the practice of building the systems that move data from where it is born — application logs, transactional databases, sensors, APIs — to the places where it can be queried, modeled, and acted upon. If software engineering builds the products, data engineering builds the circulatory system that keeps the rest of the organization alive. Pipelines, not reports, are the core deliverable.
Definition and Scope of Data Engineering in 2025
A modern data engineer owns the full path from ingestion to storage and is judged primarily on the reliability, scalability, and cost-efficiency of that path. Their daily work spans building ETL/ELT pipelines, scheduling jobs through orchestrators, partitioning storage, handling errors and retries, and optimizing infrastructure costs [Source: https://www.elevano.com/blog/types-of-data-engineers/]. The toolkit reflects this infrastructure orientation: Apache Airflow for orchestration, Apache Kafka for event streaming, Apache Spark (often via Databricks) for distributed compute, and a range of ingestion tools like NiFi or Talend [Source: https://www.refontelearning.com/blog/data-engineering-tools-2025].
Think of a data engineer as a city’s water utility crew. They do not decide what citizens drink or cook — that’s the analytics engineer’s domain — but they ensure the water arrives clean, on time, at sufficient pressure, and without leaks. When pipes burst at 3 a.m., they get the page.
The biggest shift over the last decade has been from batch to streaming. Traditional pipelines ran nightly: an extraction job pulled yesterday’s transactions, a transformation job reshaped them, and a load job pushed them into a warehouse before the morning standup. That cadence is no longer good enough for fraud detection, personalization, or real-time dashboards [Source: https://pub.aimind.so/modern-data-engineering-from-traditional-etl-to-real-time-pipelines-5b51ed637ffc]. A modern data engineer is therefore as comfortable with streaming semantics — exactly-once delivery, watermarking, late-arriving data — as they are with cron schedules.
Key Takeaway: A data engineer owns the end-to-end pipeline from raw source to query-ready storage, with infrastructure (orchestration, streaming, scaling) as the primary concern. The defining shift of the last decade is the move from nightly batch jobs to continuous streaming pipelines that feed real-time products.
Roles: Data Engineer vs Analytics Engineer vs Platform Engineer
The “data team” of 2015 was usually one or two generalists. Today the work has fractured into specialized roles, and confusing them in a job description is a common cause of mis-hires. Three roles dominate the modern stack.
The data engineer (sometimes called the pipeline engineer or ETL/ELT engineer) builds the highways from source systems into the warehouse or lakehouse. Their tools are infrastructure-heavy: Airflow, Kafka, Spark, cloud-native services [Source: https://www.elevano.com/blog/types-of-data-engineers/]. Their output is reliable raw or lightly-cleaned data that downstream teams can trust.
The analytics engineer picks up where the data engineer drops off. After data lands in the warehouse, the analytics engineer writes SQL transformations — typically in dbt — that turn raw tables into business-friendly models: a dim_customer table, an orders_enriched view, a revenue metric definition. They version-control their models in Git, write tests, and produce documentation aimed at analysts and BI dashboards [Source: https://www.elevano.com/blog/types-of-data-engineers/] [Source: https://blog.dataengineerthings.org/mastering-data-engineering-key-insights-for-ai-ml-vs-analytics-workflows-1f0489425e0b]. The role only became viable around 2016, when dbt and cloud warehouses (Snowflake, BigQuery) made post-load SQL transformation cheap enough to be the default.
The platform engineer (or data platform engineer) builds and operates the shared infrastructure both of the above teams depend on: the Kubernetes clusters running Airflow, the Terraform that provisions Snowflake accounts, the IAM policies, the CI/CD for pipeline code, the observability stack [Source: https://www.refontelearning.com/blog/data-engineering-tools-2025]. They are software/DevOps engineers who happen to specialize in data tooling.
A fourth role increasingly appears alongside these: the ML data engineer, who builds feature pipelines that feed model training and online inference. They blend Python-heavy data engineering with feature stores (Feast, Tecton), MLflow, and tools like TFX [Source: https://blog.dataengineerthings.org/mastering-data-engineering-key-insights-for-ai-ml-vs-analytics-workflows-1f0489425e0b].
Figure 1.1: Modern data team roles across the pipeline stack
flowchart LR
Sources["Source Systems<br/>(APIs, DBs, Logs)"]
DE["Data Engineer<br/>Airflow / Kafka / Spark"]
WH["Warehouse / Lakehouse<br/>Snowflake / Databricks"]
AE["Analytics Engineer<br/>dbt / SQL / Git"]
BI["BI Dashboards<br/>Looker / Hex"]
MLE["ML Data Engineer<br/>Feast / MLflow"]
FS["Feature Store"]
Models["ML Models"]
PE["Platform Engineer: Terraform, Kubernetes, CI/CD, IAM, Observability"]
Sources --> DE --> WH --> AE --> BI
WH --> MLE --> FS --> Models
PE -.supports.-> DE
PE -.supports.-> WH
PE -.supports.-> AE
PE -.supports.-> MLE
| Role | Primary Concern | Tooling | Output Consumer | Production Cadence |
|---|---|---|---|---|
| Data Engineer (Pipeline) | Ingest -> store reliability | Airflow, Kafka, Spark | Warehouse/lakehouse tables | High (24/7 monitoring) |
| Analytics Engineer | Business-logic modeling | dbt, SQL, Git | Analysts, dashboards | Medium (weekly releases) |
| Platform Engineer | Shared infra & DevOps | Terraform, K8s, CI/CD | Other data teams | Continuous |
| ML Data Engineer | Feature pipelines | Python, MLflow, feature stores | Data scientists, models | High (experiments) |
A useful mental model: if data is electricity, the platform engineer builds the grid, the data engineer wires the houses, the analytics engineer designs the appliances, and the ML data engineer powers a specialized factory next door. In leaner organizations, “full-stack data engineers” absorb several of these roles at once [Source: https://pub.aimind.so/modern-data-engineering-from-traditional-etl-to-real-time-pipelines-5b51ed637ffc], but as data volume and team count grow, specialization tends to reassert itself.
Key Takeaway: The modern data team specializes by layer of the stack: data engineers move data, analytics engineers model it, platform engineers run the shared infra, and ML data engineers feed models. A single “full-stack data engineer” can wear all hats in a small org, but at scale these become distinct disciplines with different tools and audiences.
The Shift from Monolithic ETL to Composable Pipelines
The original data architecture, dominant from the 1990s to roughly the early 2010s, was monolithic ETL: a single vendor tool (Informatica, Ab Initio, IBM DataStage) extracted data, transformed it inside a proprietary engine, and loaded the results into an on-premises warehouse like Teradata or Oracle. The transformation step had to happen before loading, because warehouse storage was expensive and warehouse compute was scarce — you could not afford to land messy raw data and clean it later.
The modern pattern inverts this. ELT (Extract-Load-Transform) loads raw data into a cloud warehouse or lakehouse first, then transforms it in place using SQL [Source: https://pub.aimind.so/modern-data-engineering-from-traditional-etl-to-real-time-pipelines-5b51ed637ffc]. This works because cloud storage is cheap (object storage costs pennies per GB-month) and cloud compute is elastic (you can spin up a thousand-core warehouse for ten minutes and pay only for those minutes). The economic constraint that forced ETL evaporated.
Composability is the second pillar of the shift. Instead of one vendor doing ingest+transform+orchestrate+serve, modern stacks bolt together best-of-breed components: Fivetran or Airbyte for ingestion, Snowflake or BigQuery for storage and compute, dbt for transformation, Airflow or Dagster for orchestration, Looker or Hex for BI [Source: https://www.brainforge.ai/blog/13-essential-data-engineering-tools-that-will-transform-your-analytics-stack-in-2025]. Each layer is replaceable.
A worked example makes the contrast concrete. A retailer wants daily sales reporting:
- Monolithic ETL (2005): an Informatica job extracts from Oracle nightly, transforms inside Informatica’s engine on a dedicated server, loads aggregated tables into Teradata. One vendor, one license, one bottleneck.
- Composable ELT (2025): Fivetran replicates raw Postgres tables into Snowflake every 15 minutes. dbt models in a Git repo transform raw tables into fct_orders and dim_product with tests and lineage. Airflow triggers dbt runs after Fivetran lands new data. Looker dashboards read the dbt models.
The composable approach is more flexible but operationally more complex — five integration points instead of one vendor — which is precisely why the platform engineer role exists.
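The load-then-transform ordering can be sketched in a few lines. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the cloud warehouse; the table raw_orders, its columns, and the sample rows are hypothetical, and the real stack would use an ingestion tool and dbt rather than hand-written functions.

```python
import sqlite3

# Hypothetical source rows, standing in for what an ingestion tool would replicate.
# Prices are stored as integer cents to keep the arithmetic exact.
RAW_ORDERS = [
    (1, "widget", 2, 999),
    (2, "gadget", 1, 2450),
    (3, "widget", 5, 999),
]


def load_raw(conn, rows):
    """Extract + Load: land source rows unchanged in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, product TEXT, qty INTEGER, unit_price_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """Transform: build a modeled table from the raw table, in SQL, in place."""
    conn.execute(
        """
        CREATE TABLE fct_orders AS
        SELECT order_id, product, qty, qty * unit_price_cents AS revenue_cents
        FROM raw_orders
        """
    )


def run_pipeline():
    """Orchestration: run the load step, then the transform step that depends on it."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_ORDERS)
    transform(conn)
    return conn


conn = run_pipeline()
revenue_by_product = dict(
    conn.execute(
        "SELECT product, SUM(revenue_cents) FROM fct_orders GROUP BY product"
    ).fetchall()
)
print(revenue_by_product)
```

Because the raw table is kept, the transform can be rewritten and re-run later without going back to the source system, which is the practical payoff of loading first.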
Figure 1.2: Monolithic ETL versus composable ELT
flowchart TB
subgraph Monolithic["Monolithic ETL (2005)"]
S1["Oracle Source"] --> I1["Informatica<br/>Extract + Transform"]
I1 --> T1["Teradata Warehouse"]
end
subgraph Composable["Composable ELT (2025)"]
S2["Postgres Source"] --> F["Fivetran<br/>(Extract + Load)"]
F --> SF["Snowflake<br/>(Raw Tables)"]
SF --> D["dbt<br/>(Transform in SQL)"]
D --> SF2["Snowflake<br/>(fct_orders, dim_product)"]
A["Airflow<br/>(Orchestration)"] -.triggers.-> F
A -.triggers.-> D
SF2 --> L["Looker Dashboards"]
end
Key Takeaway: The industry moved from monolithic ETL (one vendor, transform-before-load, on-prem warehouse) to composable ELT (best-of-breed tools, load-then-transform in cloud warehouse). Cheap cloud storage and elastic compute made the inversion economically viable, while open interfaces between layers made it operationally desirable.
Processing Models
Once you decide what data to move, you have to decide how often to move it. The answer is determined less by technology than by the latency requirements of the downstream product, and getting it wrong is one of the most expensive mistakes a data team can make.
Batch vs Micro-Batch vs Streaming
There are three canonical processing models, and they exist on a spectrum of latency-vs-complexity.
Batch processing runs on a schedule — typically hourly, daily, or weekly — and processes a bounded chunk of data at a time. A nightly job that aggregates yesterday’s web logs into a daily_pageviews table is batch. Batch is simple, well-understood, easy to debug (re-running yesterday’s job produces the same output), and cheap, because you can shut down compute between runs [Source: https://pub.aimind.so/modern-data-engineering-from-traditional-etl-to-real-time-pipelines-5b51ed637ffc]. Its weakness is latency: the freshest data in your dashboard is as old as your batch interval.
Micro-batch processing runs the same logic on much smaller windows — every few seconds to a few minutes. Spark Structured Streaming’s default mode is micro-batch: it collects events into small batches, processes each, and emits results. Micro-batch trades a small latency penalty (seconds to a minute) for the operational simplicity of batch semantics, and is often the pragmatic sweet spot.
Streaming processing treats data as an unbounded, continuous flow. Each event is processed as it arrives, with sub-second latency. Apache Flink, Kafka Streams, and Apache Beam in streaming mode are the dominant engines [Source: https://www.flexera.com/blog/finops/lambda-architecture/]. Streaming is the most powerful model and the most operationally demanding: you must reason about windowing, watermarks, late-arriving events, exactly-once semantics, and stateful operators.
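To make watermarks and late-arriving events less abstract, here is a toy sketch of event-time tumbling windows. It is not how Flink or any other engine implements them; the window size, the allowed lateness, and the rule that the watermark trails the largest event time seen so far are all assumptions chosen for illustration.

```python
WINDOW_SECONDS = 60    # tumbling window size (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the max event time (assumed)


def process(events):
    """Aggregate (event_time, value) pairs given in ARRIVAL order.

    Returns (emitted, late, still_open):
      emitted    - window_start -> sum, for windows the watermark has closed
      late       - events that arrived after their window was closed
      still_open - windows that have not been closed yet
    """
    open_windows = {}
    emitted = {}
    late = []
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS

        # An event is late if the watermark has already passed its window's end.
        if window_start + WINDOW_SECONDS <= watermark:
            late.append((event_time, value))
            continue

        open_windows[window_start] = open_windows.get(window_start, 0) + value

        # Advance the watermark, then close every window it has passed.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                emitted[start] = open_windows.pop(start)

    return emitted, late, open_windows


# The event at t=40 arrives out of order but within the allowed lateness;
# the event at t=20 arrives after its window has already been emitted.
arrivals = [(5, 1), (50, 1), (65, 1), (40, 1), (95, 1), (20, 1), (130, 1)]
emitted, late, still_open = process(arrivals)
print(emitted, late, still_open)
```

The design choice worth noticing is the trade-off inside ALLOWED_LATENESS: a larger value tolerates more disorder but delays every result, while a smaller value emits sooner and drops (or must side-route) more events.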
Analogy: batch is like sending a daily mail truck — efficient, predictable, but slow to deliver any single letter. Micro-batch is a courier that does a circuit every five minutes. Streaming is a pneumatic tube — every letter shoots through the moment it’s dropped in.
| Model | Typical Latency | Complexity | Cost Profile | Example Use Case |
|---|---|---|---|---|
| Batch | Hours to days | Low | Lowest (idle between runs) | Nightly financial reconciliation |
| Micro-batch | Seconds to minutes | Medium | Medium (continuous compute) | Near-real-time dashboards |
| Streaming | Sub-second | High | Highest (always-on, stateful) | Fraud detection, live personalization |
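The three models can be contrasted by running the same aggregation, a running total, over the same input. This is a deliberately simplified sketch in plain Python with made-up event values; it shows only when results become visible, not the state management or fault tolerance that real engines provide.

```python
from itertools import islice


def batch(events):
    """Process the whole bounded dataset in one run; one result at the end."""
    return [sum(events)]


def micro_batch(events, size):
    """Process small fixed-size chunks; one updated total per chunk."""
    results, total = [], 0
    iterator = iter(events)
    while chunk := list(islice(iterator, size)):
        total += sum(chunk)
        results.append(total)
    return results


def streaming(events):
    """Process each event as it arrives; the total is updated per event."""
    total = 0
    for event in events:
        total += event
        yield total


events = [3, 1, 4, 1, 5, 9]  # hypothetical event values
print(batch(events))           # one result, after all data is in
print(micro_batch(events, 2))  # one result per chunk of two events
print(list(streaming(events))) # one result per event
```

All three converge on the same final answer; what differs is how many intermediate results a consumer sees and how soon, which is the latency column of the table above.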
[Diagram suggestion: A horizontal latency-vs-complexity chart with Batch on the left, Micro-batch in the middle, Streaming on the right, and arrows showing “freshness increases” and “operational burden increases” both pointing right.]
Lambda and Kappa Architectures
When you need both the accuracy of batch and the freshness of streaming, you arrive at one of the two named architectures that have defined large-scale data engineering since the early 2010s.
Lambda architecture (Nathan Marz, ~2011) splits the pipeline into three layers [Source: https://en.wikipedia.org/wiki/Lambda_architecture] [Source: https://www.flexera.com/blog/finops/lambda-architecture/]:
- Batch layer — processes the immutable historical dataset on a schedule, producing comprehensive and accurate views. Typical tech: Hadoop, Spark, BigQuery, Redshift [Source: https://www.databricks.com/blog/what-is-lambda-architecture].
- Speed layer — processes the most recent data in a streaming fashion, filling the gap between “now” and “the last completed batch run.” Typical tech: Storm, Flink, Kafka Streams [Source: https://www.bugfree.ai/knowledge-hub/lambda-architecture-batch-speed-layer-explained].
- Serving layer — merges outputs from both layers into queryable views. Typical tech: Druid, Pinot, Cassandra, Elasticsearch [Source: https://www.flexera.com/blog/finops/lambda-architecture/].
New events feed both the batch and speed layers simultaneously. A query against the serving layer transparently combines the accurate historical view (from batch) with the approximate recent view (from speed) [Source: https://aws.amazon.com/blogs/big-data/build-a-big-data-lambda-architecture-for-batch-and-real-time-analytics-using-amazon-redshift/].
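The merge performed by the serving layer can be sketched as follows. This is a toy model under stated assumptions: events are (timestamp, key) pairs, the metric is a simple count per key, and batch_cutoff is a hypothetical marker for the last completed batch run. Real serving layers such as Druid or Pinot work very differently internally.

```python
def batch_view(events, batch_cutoff):
    """Batch layer: counts per key over all events up to the last batch run."""
    view = {}
    for timestamp, key in events:
        if timestamp <= batch_cutoff:
            view[key] = view.get(key, 0) + 1
    return view


def speed_view(events, batch_cutoff):
    """Speed layer: counts per key for events newer than the last batch run.

    Note that the counting logic is written a second time here.
    """
    view = {}
    for timestamp, key in events:
        if timestamp > batch_cutoff:
            view[key] = view.get(key, 0) + 1
    return view


def serve(batch, speed):
    """Serving layer: combine the historical view with the recent view."""
    merged = dict(batch)
    for key, count in speed.items():
        merged[key] = merged.get(key, 0) + count
    return merged


events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "c")]  # hypothetical
batch_cutoff = 10  # the last batch run covered everything up to t=10

merged = serve(batch_view(events, batch_cutoff), speed_view(events, batch_cutoff))
print(merged)
```

Even in this small sketch the counting rule exists in two places, so any change to the metric has to be made in both functions and kept consistent.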
The fatal weakness of Lambda is the two-armies problem: you have to implement the same business logic twice — once in your batch framework, once in your stream framework — and keep them in sync forever. Bug fixes, schema changes, and metric definitions all double in cost.
Kappa architecture (Jay Kreps, ~2014) eliminates the batch layer entirely [Source: https://www.flexera.com/blog/finops/lambda-architecture/]. All data flows through a single streaming pipeline backed by a durable, replayable log (typically Apache Kafka) and processed by a unified engine (typically Apache Flink). Historical reprocessing is achieved by replaying the log from an earlier offset — there is only one codebase, only one set of semantics.
| Aspect | Lambda | Kappa |
|---|---|---|
| Layers | Batch + Speed + Serving | Stream + Serving |
| Codebases | Two (batch and stream) | One (stream only) |
| Complexity | High | Low |
| Historical reprocessing | Re-run batch job | Replay Kafka log from offset |
| Best when | Batch accuracy is critical and stream is approximate | Stream pipeline can deliver required accuracy |
| Fault tolerance | Batch recomputes everything | Replay durable log |
[Source: https://www.flexera.com/blog/finops/lambda-architecture/] [Source: https://en.wikipedia.org/wiki/Lambda_architecture]
Kappa is enabled by two technologies maturing together: Kafka’s durable, partitioned, replayable log provides reliable storage of every event within its retention window, and Flink (along with Spark Structured Streaming) provides a unified engine that handles both bounded (historical replay) and unbounded (live stream) data with exactly-once semantics. One cited estimate is that migrating from Lambda to a Kappa-style unified pipeline can simplify operations by 50-70% [Source: https://www.flexera.com/blog/finops/lambda-architecture/].
A worked example: Netflix-style viewing telemetry. Lambda would run nightly Spark jobs over yesterday’s view events to compute accurate per-show totals, while a Flink job ran on Kafka to update “trending now” counters in real time. Kappa instead runs a single Flink job over the Kafka log; if the metric definition changes, the team replays the log from the relevant retention point [Source: https://www.flexera.com/blog/finops/lambda-architecture/].
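The replay idea can be sketched with a toy append-only log. The Log class below is a hypothetical in-memory stand-in for a Kafka topic partition, not the Kafka client API, and the view records and the 30-second threshold are invented for illustration.

```python
class Log:
    """Toy append-only log standing in for a durable, replayable topic."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Yield (offset, record) pairs starting at the given offset."""
        for index in range(offset, len(self._records)):
            yield index, self._records[index]


def count_views(log, offset=0, min_seconds=0):
    """The single processing job: count views per show, replaying from offset."""
    totals = {}
    for _, (show, seconds_watched) in log.read_from(offset):
        if seconds_watched >= min_seconds:
            totals[show] = totals.get(show, 0) + 1
    return totals


log = Log()
for record in [("show_a", 120), ("show_b", 5), ("show_a", 45), ("show_b", 300)]:
    log.append(record)

# Original metric definition: every view event counts.
v1 = count_views(log)

# The definition changes: only views of at least 30 seconds count.
# Reprocessing is a replay of the same log through the same code path.
v2 = count_views(log, offset=0, min_seconds=30)
print(v1, v2)
```

There is no second codebase to update: the changed definition is a parameter to the one job, and history is recomputed by reading the log again from an earlier offset.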
Figure 1.3: Lambda versus Kappa architectures
flowchart LR
subgraph Lambda["Lambda Architecture"]
E1["Event Source"] --> B["Batch Layer<br/>(Spark / Hadoop)"]
E1 --> SP["Speed Layer<br/>(Flink / Kafka Streams)"]
B --> SV1["Serving Layer<br/>(Druid / Pinot)"]
SP --> SV1
SV1 --> Q1["Query / Dashboard"]
end
subgraph Kappa["Kappa Architecture"]
E2["Event Source"] --> K["Kafka<br/>(Durable Replayable Log)"]
K --> FL["Flink<br/>(Unified Stream Engine)"]
FL --> SV2["Serving Layer"]
SV2 --> Q2["Query / Dashboard"]
K -.replay from offset.-> FL
end
When Latency Matters: SLA-Driven Design
The right processing model is determined by the service level agreement (SLA) of the consumer, not by the preferences of the data team. Three rules of thumb help.
First, work backwards from the user. If an executive looks at a revenue dashboard once a morning, you do not need streaming — a 6 a.m. batch is fine. If a fraud-detection model must score a credit card swipe before the merchant’s terminal times out, you have ~200 milliseconds end to end and streaming is mandatory.
Second, latency compounds across the pipeline. A streaming source that lands in a warehouse refreshed every hour still gives consumers data that can be up to an hour old. The slowest hop sets the floor.
Third, freshness has a cost curve that bends sharply upward. Going from daily to hourly batch is usually cheap. Going from hourly to one-minute micro-batch typically requires re-architecting toward streaming infrastructure. Going from one minute to one second often requires changing your storage layer (introducing Druid, Pinot, or a key-value store like Cassandra) [Source: https://www.flexera.com/blog/finops/lambda-architecture/]. Every order-of-magnitude latency improvement is a step-function increase in operational complexity.
Worked example, a fintech company: a customer-facing balance display must update within 1 second (streaming, Kafka + Flink); fraud scoring must complete within 200 ms during card authorization (streaming with stateful operators); daily P&L close must be accurate but can run overnight (batch); regulatory reports run monthly (batch). A common anti-pattern is reaching for streaming because it sounds modern when batch would meet the SLA at a fraction of the cost.
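The rules of thumb above can be expressed as a small helper. This is a sketch, not a standard: the two threshold constants are assumptions picked to match the latency ranges used in this chapter, and the consumer names are hypothetical labels for the fintech example.

```python
# Illustrative thresholds (assumptions, not industry standards).
STREAMING_MAX_SECONDS = 1
MICRO_BATCH_MAX_SECONDS = 600


def choose_model(latency_sla_seconds):
    """Map a consumer's latency SLA to the simplest model that can meet it."""
    if latency_sla_seconds <= STREAMING_MAX_SECONDS:
        return "streaming"
    if latency_sla_seconds <= MICRO_BATCH_MAX_SECONDS:
        return "micro-batch"
    return "batch"


def freshness_floor(hop_latencies_seconds):
    """The slowest hop is a lower bound on end-to-end freshness."""
    return max(hop_latencies_seconds)


# Hypothetical consumers modeled on the fintech example.
consumers = {
    "balance_display": 1,
    "fraud_scoring": 0.2,
    "daily_pnl_close": 24 * 3600,
    "regulatory_report": 30 * 24 * 3600,
}
plan = {name: choose_model(sla) for name, sla in consumers.items()}
print(plan)

# A sub-second stream feeding an hourly warehouse refresh is still an hourly pipeline.
print(freshness_floor([0.5, 3600]))
```

Starting from the SLA and returning the cheapest adequate model encodes the bias this section argues for: streaming is chosen only when nothing simpler meets the requirement.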
Figure 1.4: Decision tree for choosing a processing model from consumer SLA
flowchart TD
Start["What is the consumer's<br/>latency SLA?"]
Start --> Q1{"Sub-second<br/>required?"}
Q1 -->|Yes| Stream["Streaming<br/>(Flink, Kafka Streams)<br/>+ stateful operators"]
Q1 -->|No| Q2{"Seconds to<br/>minutes?"}
Q2 -->|Yes| Micro["Micro-batch<br/>(Spark Structured Streaming)"]
Q2 -->|No| Q3{"Hours acceptable?"}
Q3 -->|Yes| Batch["Batch<br/>(Airflow + Spark / SQL)"]
Q3 -->|No| Q4{"Need both batch<br/>accuracy and<br/>stream freshness?"}
Q4 -->|Yes, two codebases OK| Lambda["Lambda Architecture"]
Q4 -->|Prefer single codebase| Kappa["Kappa Architecture"]
Key Takeaway: Pick the processing model that matches the consumer’s latency SLA, not the team’s enthusiasm. Lambda combines batch accuracy with stream freshness at the price of duplicated codebases; Kappa unifies on streaming with a replayable log to gain operational simplicity. Each step toward lower latency is a step-function increase in cost and complexity, so reserve streaming for the use cases that genuinely require it.
The Modern Data Stack
The phrase “modern data stack” is a marketing term that nonetheless captures a real architectural shift. Three structural changes define it: storage and compute are separated, data lives in open table formats accessible to many engines, and most teams buy managed services rather than self-hosting.
Decoupled Storage and Compute
In a traditional warehouse like Teradata, storage and compute lived on the same physical nodes. To store more data you bought more nodes — which also gave you more compute whether you wanted it or not, and vice versa. The “tightly-coupled” model forced you to scale every axis at once.
Decoupled storage and compute is the defining architectural shift of the modern cloud data stack. Cheap, durable object storage (Amazon S3, Google Colossus, Azure ADLS) holds a single copy of the data, while independently scalable compute clusters or serverless slot pools execute queries on demand [Source: https://www.moonfire.com/stories/the-lakehouse-era/] [Source: https://xenoss.io/blog/snowflake-bigquery-databricks].
Each major platform implements the pattern slightly differently:
- Snowflake stores data in compressed columnar files on S3 and processes queries through independent Virtual Warehouses — separate compute clusters that can be sized and started/stopped per workload, all reading the same single copy of data [Source: https://xenoss.io/blog/snowflake-bigquery-databricks] [Source: https://www.flexera.com/blog/finops/snowflake-vs-databricks/]. Marketing runs a small warehouse; the data science team runs a huge one for an experiment; neither blocks the other.
- BigQuery is fully serverless. Storage sits in Google’s Colossus filesystem; compute is a massive multi-tenant pool of “slots” that BigQuery automatically allocates per query. There are no clusters to size [Source: https://xenoss.io/blog/snowflake-bigquery-databricks] [Source: https://clickhouse.com/resources/engineering/top-5-cloud-data-warehouses].
- Databricks runs Delta Lake (a transaction-log layer over Parquet files on S3) as storage, and Apache Spark with the Photon vectorized engine as compute [Source: https://www.flexera.com/blog/finops/snowflake-vs-databricks/] [Source: https://www.datumo.io/blog/snowflake-vs-databricks-vs-bigquery]. The same data copy supports both batch and streaming workloads with ACID guarantees.
Analogy: a traditional warehouse is a hotel where every guest must rent the entire floor including the kitchen and the gym whether they use them or not. Decoupled compute/storage is a public library — the books (data) sit on shared shelves (S3), and any reader (compute cluster) can show up, check out what they need, and leave without the others noticing.
| Benefit | Why It Matters |
|---|---|
| Elasticity | Scale compute up or down independently of data volume; release resources when idle |
| Cost efficiency | Pay only for compute you actively use; storage is cheap and constant |
| Workload isolation | Different teams run on separate clusters against the same data; no contention |
| Engine flexibility | Pick the best engine per job (SQL, Spark, ML) without copying data |
[Source: https://www.fivetran.com/blog/what-is-a-data-lakehouse]
Figure 1.5: Decoupled storage and compute with multiple independent engines
flowchart TD
subgraph Compute["Independent Compute Engines"]
VW1["Snowflake VW<br/>(BI Workload)"]
VW2["Snowflake VW<br/>(Data Science)"]
DB["Databricks Cluster<br/>(ETL with Photon)"]
BQ["BigQuery Slots<br/>(Ad-hoc Queries)"]
TR["Trino<br/>(Federated SQL)"]
end
Storage["Object Storage Layer<br/>S3 / Google Colossus / Azure ADLS<br/>(single copy of data, Parquet + Iceberg / Delta)"]
VW1 --> Storage
VW2 --> Storage
DB --> Storage
BQ --> Storage
TR --> Storage
Key Takeaway: Decoupled storage and compute means a single copy of data lives on cheap object storage while many independent compute engines read from it. This unlocks elastic scaling, workload isolation, and pay-per-use economics that make the rest of the modern stack possible.
Open Table Formats and Interoperable Layers
Object storage by itself only gives you files. To get warehouse-like behavior — ACID transactions, schema enforcement, time travel, efficient updates — you need a table format layered on top of those files. Three open formats now dominate: Apache Iceberg, Delta Lake, and Apache Hudi.
A table format is essentially a metadata protocol. It maintains a transaction log describing which Parquet files are part of which version of which table. A query engine that speaks the protocol can read consistent snapshots, perform updates and deletes, and roll back to earlier versions [Source: https://www.fivetran.com/blog/what-is-a-data-lakehouse] [Source: https://www.ovaledge.com/blog/data-lakehouse].
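A toy version of such a metadata log shows how snapshots and time travel fall out of the design. This sketch does not follow the on-disk layout of Iceberg, Delta Lake, or Hudi; the TableLog class, its methods, and the file names are hypothetical, and only the idea of versioned commits that add and remove data files is taken from the text.

```python
class TableLog:
    """Toy transaction log: each commit adds and/or removes data files."""

    def __init__(self):
        self._commits = []  # list of (added_files, removed_files)

    def commit(self, add=(), remove=()):
        """Record one atomic change and return the new table version."""
        self._commits.append((frozenset(add), frozenset(remove)))
        return len(self._commits)

    def snapshot(self, version=None):
        """Return the set of data files visible at a version (latest by default)."""
        if version is None:
            version = len(self._commits)
        files = set()
        for added, removed in self._commits[:version]:
            files -= removed
            files |= added
        return files


table = TableLog()

# Version 1: an initial load writes two files.
v1 = table.commit(add=["part-0.parquet", "part-1.parquet"])

# Version 2: an update rewrites part-1 as part-2. The old file is not edited
# in place; the log simply stops referencing it.
v2 = table.commit(add=["part-2.parquet"], remove=["part-1.parquet"])

print(sorted(table.snapshot()))    # current snapshot
print(sorted(table.snapshot(v1)))  # time travel to version 1
```

Because readers resolve a version to a fixed set of immutable files, a query that started at version 1 keeps seeing a consistent table even while version 2 is being committed.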
The lakehouse is the architectural pattern that combines open table formats over object storage with warehouse-style query engines. It promises the cheap, scalable storage of a data lake with the ACID guarantees and query performance of a warehouse [Source: https://www.fivetran.com/blog/what-is-a-data-lakehouse] [Source: https://www.moonfire.com/stories/the-lakehouse-era/]. The strategic appeal is interoperability: if your data is in Iceberg on S3, you can query it from Snowflake, Databricks, Trino, Athena, Spark, or Flink — without copying it or paying egress fees per engine.
This interoperability has been so compelling that even closed-platform vendors are opening up. Snowflake announced Iceberg table support; BigQuery, Amazon Athena, and Azure Synapse have all added open-format compatibility, partly to reduce customer fears of vendor lock-in [Source: https://www.moonfire.com/stories/the-lakehouse-era/].
There is a real trade-off, however. Specialized warehouses with proprietary storage formats and tightly integrated local caching layers (e.g., Snowflake’s native tables) often outperform open-format query engines (e.g., Trino on Iceberg) because the latter must repeatedly fetch from remote object storage. Snowflake’s query acceleration service and Databricks’ Photon engine help close the gap, but the gap is real [Source: https://www.moonfire.com/stories/the-lakehouse-era/].
Key Takeaway: Open table formats like Iceberg and Delta Lake turn object storage into transactional, governed table layers that any compatible engine can read. The resulting lakehouse pattern is the modern compromise between lake economics and warehouse semantics, and it is rapidly eroding vendor lock-in across the industry.
Managed Services vs Self-Hosted Trade-offs
Every modern data stack decision sits on a spectrum from “fully managed SaaS” (Snowflake, BigQuery, Databricks SaaS, Fivetran, dbt Cloud) to “self-hosted open source” (Spark on EKS, Airflow on K8s, self-run Trino, OSS dbt Core, raw Kafka clusters).
Managed services trade money for engineering time. Snowflake’s bill is large, but you do not employ people to upgrade query engines, patch CVEs, or tune memory allocators. Self-hosting trades engineering time for cost flexibility and control: you can squeeze unit economics on spot instances, customize behavior, and avoid vendor lock-in — but you also own pager rotations and upgrade cycles.
A heuristic that holds for most teams under ~50 engineers: buy managed for the layers that aren’t your differentiator; self-host only where you have specific reasons (cost at scale, regulatory, deep customization). A fintech doing fraud detection might self-host Flink because their stream processing logic is core IP; the same fintech almost certainly buys Snowflake because nothing about their warehouse is worth running themselves.
| Dimension | Managed (e.g., Snowflake) | Self-Hosted (e.g., Spark on K8s) |
|---|---|---|
| Time to first value | Days | Weeks to months |
| Ops burden | Vendor handles | You handle (upgrades, scaling, on-call) |
| Cost at small scale | Often cheaper (no idle infra) | Often more expensive |
| Cost at large scale | Premium per unit | Can be much cheaper if optimized |
| Customization | Limited | Full |
| Lock-in risk | Higher | Lower |
Key Takeaway: The modern stack is a portfolio decision: buy managed services for commodity layers to save engineering time, and self-host only the layers where you have a specific reason — cost at scale, regulatory constraint, or genuine differentiation.
Drivers of Architecture Choice
Architecture choices are not made in a vacuum. Four forces — latency, scale, cost, and governance — push every modern data platform toward particular shapes, and the relative weight of each force in your organization determines the design you should land on.
Cost-Performance Trade-offs
Decoupled storage and compute fundamentally reframed the cost conversation. In an on-prem world, you sized for peak load and paid for it 24/7. In the cloud, you pay for what you use, but only if your architecture lets you scale down. The fastest way to lose money in the modern stack is to provision an “always-on” Snowflake warehouse or Databricks cluster and never turn it off [Source: https://www.flexera.com/blog/finops/snowflake-vs-databricks/].
Cost discipline in modern data engineering means three things:
- Right-size compute to workload. A nightly ETL job that parallelizes well is often cheaper on a large warehouse for 30 minutes than on a medium warehouse for 4 hours.
- Use auto-suspend and auto-scale aggressively. Idle compute is the single biggest source of waste.
- Push transformations into cheaper layers. Materializing a heavily-used aggregate once is cheaper than scanning the raw table thousands of times. Conversely, materializing data nobody queries wastes storage and pipeline runtime.
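The arithmetic behind right-sizing and auto-suspend is simple enough to sketch. The rates below are assumptions in arbitrary cost units per hour, chosen so that the larger size costs twice as much per hour as the smaller one; they are not any vendor's price list, and the run durations are taken at face value from the first bullet.

```python
# Assumed relative hourly rates in arbitrary cost units (not a vendor price list).
RATE_PER_HOUR = {"medium": 4, "large": 8}


def run_cost(size, hours):
    """Cost of keeping a warehouse of the given size running for some hours."""
    return RATE_PER_HOUR[size] * hours


def daily_cost(size, busy_hours, auto_suspend):
    """Daily cost with or without suspending the warehouse when it is idle."""
    billed_hours = busy_hours if auto_suspend else 24
    return run_cost(size, billed_hours)


# Right-sizing: twice the hourly rate, but one eighth of the runtime.
large_job = run_cost("large", 0.5)
medium_job = run_cost("medium", 4)
print(large_job, medium_job)

# Auto-suspend: the same 30-minute job on an always-on versus a suspended warehouse.
always_on = daily_cost("large", 0.5, auto_suspend=False)
suspended = daily_cost("large", 0.5, auto_suspend=True)
print(always_on, suspended)
```

The comparison only favors the larger size when the job actually speeds up enough to offset the higher rate, so the runtime on each size should be measured rather than assumed.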
Pricing models matter. BigQuery’s bytes-scanned model rewards columnar projection and partition pruning; Snowflake’s per-second compute billing rewards short bursts on right-sized warehouses; Databricks’ DBU model rewards spot instances and job clusters over interactive clusters [Source: https://xenoss.io/blog/snowflake-bigquery-databricks]. Architecting without understanding the bill produces surprises.
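Why a bytes-scanned model rewards column projection and partition pruning can be shown with a toy columnar table. The layout, partition names, column names, and sizes below are all hypothetical, and the function is a simplified illustration rather than any platform's actual billing formula.

```python
# Hypothetical table: partition -> column -> stored size in GB.
TABLE = {
    "2025-01-01": {"order_id": 10, "payload": 80, "amount": 10},
    "2025-01-02": {"order_id": 10, "payload": 80, "amount": 10},
    "2025-01-03": {"order_id": 10, "payload": 80, "amount": 10},
}


def gb_scanned(table, columns, partitions=None):
    """Sum the sizes of only the columns and partitions a query has to read."""
    total = 0
    for partition, column_sizes in table.items():
        if partitions is not None and partition not in partitions:
            continue  # partition pruning: skip partitions the filter excludes
        for column, size in column_sizes.items():
            if column in columns:  # column projection: skip unselected columns
                total += size
    return total


all_columns = {"order_id", "payload", "amount"}

full_scan = gb_scanned(TABLE, all_columns)                    # SELECT * with no filter
projected = gb_scanned(TABLE, {"amount"})                     # one column, all days
pruned = gb_scanned(TABLE, {"amount"}, {"2025-01-03"})        # one column, one day
print(full_scan, projected, pruned)
```

Under a pay-per-byte model the three queries differ in cost by the same ratios as these totals, which is why selecting only needed columns and filtering on the partition key are the first habits to build.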
Key Takeaway: Cost in the modern stack is a function of architecture, not just usage. Right-sized compute, aggressive auto-suspend, and matching workload patterns to the platform’s pricing model are the levers that separate efficient teams from expensive ones.
Governance and Compliance Pressure
A decade ago, data governance meant a wiki page describing the warehouse schema. Today it means lineage, access control, classification, retention, and audit — often under regulatory pressure (GDPR, CCPA, HIPAA, SOX, the EU AI Act).
The modern stack responds to governance pressure in two ways. First, open table formats with ACID semantics (Iceberg, Delta Lake) make schema enforcement, time travel, and audit trails first-class features rather than bolt-ons [Source: https://www.ovaledge.com/blog/data-lakehouse]. Second, catalog and access-control layers (Unity Catalog on Databricks, Snowflake’s built-in RBAC, AWS Lake Formation, Apache Polaris) sit above the storage layer and centralize permissions, masking, and lineage across engines.
The lakehouse architecture is particularly attractive from a governance perspective because it provides a single governed layer that both BI and ML workloads consume — eliminating the common pattern of ML teams copying production data into a side bucket where governance does not apply [Source: https://www.fivetran.com/blog/what-is-a-data-lakehouse].
Example: a HIPAA-regulated healthcare company can put raw patient records in S3 with Iceberg tables and Unity Catalog for fine-grained access, so analysts, ML engineers, and auditors share one governed view. The same data scattered across a warehouse, a feature store, and an analyst’s notebook would be a compliance failure waiting to happen.
Key Takeaway: Governance is no longer an afterthought; it shapes architecture. Open table formats, centralized catalogs, and the single-copy lakehouse pattern exist in large part because they make compliance and access control tractable across BI and ML simultaneously.
Multi-Cloud and Hybrid Realities
Few large organizations are purely on one cloud. Acquisitions, regulatory data-residency requirements, vendor pricing leverage, and pre-existing investments produce multi-cloud and hybrid (cloud + on-prem) realities whether the architecture team wants them or not.
Three patterns recur. Cloud-portable platforms like Databricks and Snowflake run on AWS, Azure, and GCP, letting an enterprise use the same data platform across clouds. Open table formats let data physically live in any object store while remaining queryable from engines elsewhere — Iceberg tables in S3 can be read by Trino running in GCP. Federated query engines like Trino or Starburst Galaxy let a single SQL query span multiple underlying systems without moving data.
The trade-off is operational. Multi-cloud means duplicating IAM, networking, observability, CI/CD, and on-call across providers. Most pragmatic teams pick a primary cloud and treat secondary clouds as exceptions, using open formats and portable platforms to keep the door open without paying full multi-cloud overhead [Source: https://www.moonfire.com/stories/the-lakehouse-era/].
Key Takeaway: Multi-cloud and hybrid are operational realities, not aspirations, in most large organizations. Open table formats, cloud-portable platforms like Databricks and Snowflake, and federated query engines are the tools that keep architectures flexible without forcing the cost of running everything everywhere.
Chapter Summary
Modern data engineering is the discipline of building reliable, scalable, cost-aware pipelines that move data from operational systems into the storage and compute layers where it can be queried and modeled. The role has specialized: data engineers own ingest-to-storage infrastructure, analytics engineers transform loaded data into business models with dbt, platform engineers run the shared infrastructure, and ML data engineers feed feature pipelines into models. The defining shift of the last decade is the move from monolithic on-prem ETL to composable cloud ELT, where cheap object storage and elastic compute let teams load raw data first and transform it in place using SQL.
Processing models span batch (cheap, simple, hours of latency) through micro-batch to streaming (expensive, complex, sub-second latency). Lambda combines batch and streaming for accuracy plus freshness at the cost of two codebases; Kappa unifies on streaming with a replayable log (Kafka) and a unified engine (Flink) to eliminate that duplication. The right choice is determined by the consumer’s SLA — every order-of-magnitude latency improvement is a step-function increase in cost.
The structural shifts of the modern stack are decoupled storage and compute, open table formats (Iceberg, Delta Lake) creating interoperable lakehouses, and a default toward managed services for commodity layers. Four drivers shape every architecture: latency pushes toward streaming, scale pushes toward decoupled compute, cost pushes toward elastic right-sized resources matched to pricing models, and governance pushes toward open formats and centralized catalogs that work across BI and ML. The rest of this textbook builds on these foundations.
Key Terms
| Term | Definition |
|---|---|
| ETL | Extract-Transform-Load. The traditional pipeline pattern where data is transformed before being loaded into the destination warehouse. Dominant in the on-prem era when warehouse storage was expensive and transformation had to happen on a separate compute tier. |
| ELT | Extract-Load-Transform. The modern pipeline pattern where raw data is loaded into a cloud warehouse or lakehouse first, then transformed in place using the warehouse’s own SQL engine. Enabled by cheap cloud storage and elastic compute, and the foundation for tools like dbt. |
| data pipeline | An orchestrated sequence of data movement and transformation steps that takes data from source systems through to query-ready storage. May be batch, micro-batch, or streaming, and is typically managed by an orchestrator like Airflow or Dagster. |
| lakehouse | An architectural pattern that combines open table formats (Iceberg, Delta Lake, Hudi) over cheap object storage with warehouse-style query engines, providing ACID transactions, schema enforcement, and multi-engine access on a single copy of data. |
| Lambda architecture | A data processing pattern with three layers — a batch layer for accurate historical views, a speed layer for low-latency recent data, and a serving layer that merges them — used when both batch accuracy and stream freshness are required. Suffers from the “two armies” problem of duplicated batch and stream codebases. |
| Kappa architecture | A simplification of Lambda that eliminates the batch layer and treats all data as streams flowing through a single pipeline backed by a durable replayable log (Kafka) and a unified engine (Flink). Historical reprocessing is done by replaying the log. |
| decoupled compute | An architectural pattern in which compute resources are scaled, scheduled, and billed independently of the storage layer. Implemented as Snowflake Virtual Warehouses, BigQuery slots, or Databricks clusters reading from shared object storage. |
| data platform | A managed end-to-end environment that combines storage, compute, orchestration, governance, and access control into a coherent product (e.g., Snowflake, Databricks, BigQuery). Distinct from a single tool: a platform spans the layers a data team needs to operate. |
Chapter 2: The Lakehouse Paradigm
Learning Objectives
- Contrast data warehouses, data lakes, and lakehouses across schema, cost, and workload axes
- Explain how SageMaker Lakehouse and similar platforms unify access across S3 and Redshift
- Identify when a lakehouse is the right choice versus a pure warehouse or pure lake
- Describe the medallion (bronze/silver/gold) layering pattern
From Warehouse to Lake to Lakehouse
For three decades, organizations that wanted to analyze data had to pick a side. They could pour their information into a data warehouse and accept rigid schemas in exchange for fast, reliable queries — or they could dump everything into a data lake and trade reliability for flexibility and cheap storage. The lakehouse paradigm refuses that choice. By layering transactional table formats on top of cheap object storage, it tries to give you the warehouse’s discipline at the lake’s price [Source: https://www.ibm.com/think/topics/data-warehouse-vs-data-lake-vs-data-lakehouse]. Understanding why that compromise was necessary requires looking back at where each pattern came from.
OLTP vs OLAP: why warehouses exist
Operational systems — the databases behind your e-commerce checkout, your hospital’s patient records, your bank’s ledger — are designed for OLTP (Online Transaction Processing). Each transaction touches a few rows, completes in milliseconds, and must satisfy ACID guarantees so two customers don’t book the same airline seat. These systems are tuned for write throughput and row-by-row consistency.
Analytics is the opposite workload. A finance executive asking “what was revenue by region last quarter?” needs to scan millions of rows but only a few columns, and the query can take seconds without anyone caring. Running that scan on the OLTP database would lock tables, blow out caches, and degrade the customer experience. So organizations built the data warehouse — a separate, OLAP-optimized (Online Analytical Processing) store with proprietary columnar formats, aggressive indexing, and pre-built star or snowflake schemas tuned for business intelligence [Source: https://www.adaltas.com/en/2022/05/17/data-warehouse-lake-lakehouse-comparison/].
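The difference between the two workloads shows up in how much data a query has to touch. The toy sketch below stores the same four orders row-wise and column-wise in plain Python, then counts how many values each layout reads to answer "revenue by region". The data and function names are invented for illustration; real engines add compression, vectorization, and indexes on top of this basic idea.

```python
from collections import defaultdict

# Row-oriented layout: each record is stored together (typical of OLTP).
rows = [
    {"order_id": 1, "customer": "a", "region": "EU", "amount": 20.0, "status": "paid"},
    {"order_id": 2, "customer": "b", "region": "US", "amount": 35.0, "status": "paid"},
    {"order_id": 3, "customer": "c", "region": "EU", "amount": 5.0, "status": "refunded"},
    {"order_id": 4, "customer": "d", "region": "US", "amount": 40.0, "status": "paid"},
]

# Column-oriented layout: each column is stored together (typical of OLAP).
columns = {name: [row[name] for row in rows] for name in rows[0]}


def revenue_rowwise(rows):
    totals, touched = defaultdict(float), 0
    for row in rows:
        touched += len(row)  # a row store reads the whole record
        totals[row["region"]] += row["amount"]
    return dict(totals), touched


def revenue_columnar(columns):
    regions, amounts = columns["region"], columns["amount"]
    touched = len(regions) + len(amounts)  # only the two needed columns
    totals = defaultdict(float)
    for region, amount in zip(regions, amounts):
        totals[region] += amount
    return dict(totals), touched


row_totals, row_touched = revenue_rowwise(rows)
col_totals, col_touched = revenue_columnar(columns)
print(row_totals, row_touched)
print(col_totals, col_touched)
```

Both layouts return the same totals, but the columnar version reads two of the five columns. On a table with hundreds of columns and millions of rows, that ratio is the difference between an interactive query and a batch job.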
Warehouses excel at structured, repeatable reporting. Their proprietary storage and query engines deliver predictable performance and full SQL semantics. But that excellence comes at three costs: dollars (high-performance proprietary storage is expensive), rigidity (every column must be defined before you load a row — schema-on-write), and narrowness (warehouses traditionally handle structured tabular data only, not images, JSON blobs, or sensor logs) [Source: https://www.flexera.com/blog/finops/data-warehouse-vs-data-lake-vs-data-lakehouse/].
[Diagram suggestion: a side-by-side box diagram showing an OLTP database with many small concurrent reads/writes flowing into a row of customer transactions, versus an OLAP warehouse with a single analyst issuing a long-running aggregation across millions of rows organized by column.]
Key Takeaway: Warehouses exist because OLTP systems can’t safely host analytical scans. Their reliability and SQL power come bundled with proprietary storage costs and schema rigidity that fit structured reporting but exclude unstructured and exploratory work.
Data lake origin story and the schema-on-read approach
Around 2010, three forces collided: storage got dramatically cheaper as Amazon S3 and similar cloud object stores commoditized durable bytes; new data sources exploded, with mobile clickstreams, IoT telemetry, and JSON event logs that didn’t fit neatly into rows and columns; and machine learning practitioners needed raw, unsampled data for training rather than the pre-aggregated cubes warehouses preferred.
The data lake was the answer. Dump everything — structured CSVs, semi-structured JSON, unstructured video — into S3, ADLS, or GCS, and worry about structure later. This approach is called schema-on-read: you don’t define a table when you write the file, you define one when you query it. A new use case can apply a different schema to the same raw bytes without rewriting them [Source: https://www.striim.com/blog/data-warehouse-vs-data-lake-vs-data-lakehouse-an-overview/].
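Schema-on-read is easiest to see with the same raw bytes read two ways. In this minimal sketch, two JSON lines stand in for files in the lake, and each consumer supplies its own schema (a mapping from field name to a casting function) at query time. The sample events, field names, and `read_with_schema` helper are hypothetical.

```python
import json

# Raw events as they landed in the lake; nothing was validated on write.
RAW_LINES = [
    '{"user": "u1", "event": "click", "ts": "2024-05-01T10:00:00", "page": "/home"}',
    '{"user": "u2", "event": "purchase", "ts": "2024-05-01T10:05:00", "amount": "19.99"}',
]


def read_with_schema(lines, schema):
    """Project and cast each record at read time; absent fields become None."""
    result = []
    for line in lines:
        record = json.loads(line)
        result.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return result


# Two consumers, two schemas, one copy of the raw bytes.
clickstream_schema = {"user": str, "page": str}
revenue_schema = {"user": str, "amount": float}

clicks = read_with_schema(RAW_LINES, clickstream_schema)
revenue = read_with_schema(RAW_LINES, revenue_schema)
print(clicks)
print(revenue)
```

Note what the flexibility costs: the second event has no `page` and the first has no `amount`, and nothing complained when they were written. Each reader discovers the gaps on its own.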
The economics were transformative. Object storage costs pennies per gigabyte per month versus dollars in proprietary warehouse storage. Data scientists could land first and ask questions later. Streaming pipelines could write directly to the lake without an ETL gate.
But lakes had a structural weakness: no ACID guarantees. If two pipelines wrote to the same Parquet directory at once, you could end up with partial files, duplicate rows, or corrupt partitions. There was no notion of a transaction, no atomic update, no easy way to delete a single user’s records for GDPR compliance. Schema drift went undetected until a downstream query exploded. The industry coined a grim term — data swamp — for lakes that had accumulated so much undocumented, inconsistent data that nobody could trust the answers anymore [Source: https://www.montecarlodata.com/blog-data-warehouse-vs-data-lake-vs-data-lakehouse-definitions-similarities-and-differences/].
A useful analogy: a warehouse is a bonded archive where every box is labeled, weighed, and indexed before it enters; a lake is a self-storage unit where you can throw anything in any order, but you might not find what you need a year later — and the unit next door might have leaked into yours overnight.
Key Takeaway: Data lakes solved the cost and flexibility problems of warehouses by separating storage from compute and applying schema only at read time, but the absence of ACID transactions and schema enforcement caused many lakes to degrade into untrustworthy data swamps.
Why lakehouses emerged: reconciling flexibility and reliability
By the late 2010s, large organizations were running two parallel stacks: a lake for ML and exploration, and a warehouse for BI and reporting, with brittle ETL pipelines copying data from one to the other. This duplicated storage, drifted definitions (“revenue” meant different things in each system), and forced teams to choose which copy was authoritative.
The lakehouse insight was simple but powerful: keep the cheap object-storage substrate, keep the open file formats (Parquet most commonly), but add a transaction layer on top that gives you ACID semantics, schema enforcement, time travel, and fast row-level updates. Three open table formats now dominate this layer: Delta Lake (originated at Databricks), Apache Iceberg (originated at Netflix, now used heavily by Snowflake and AWS), and Apache Hudi (originated at Uber). All three provide optimistic concurrency control, snapshot isolation, and engine interoperability — meaning multiple compute engines can safely read and write the same physical files [Source: https://www.databricks.com/blog/databricks-lakehouse-data-modeling-myths-truths-and-best-practices].
The result is that warehouse-grade reliability now runs on lake-grade economics. The same Parquet file that costs pennies to store can be MERGE-updated, queried with full SQL, and rolled back to yesterday’s snapshot. For many workloads, a separate traditional warehouse layered on top of these systems is increasingly redundant [Source: https://www.adaltas.com/en/2022/05/17/data-warehouse-lake-lakehouse-comparison/].
Figure 2.1: Evolution from warehouse to lake to lakehouse
timeline
title Evolution of Analytical Data Architectures
1990s : Data Warehouse
: Proprietary columnar storage
: Schema-on-write, ACID, BI only
2010s : Data Lake
: Cheap object storage (S3/ADLS/GCS)
: Schema-on-read, ML-friendly
: No ACID, risk of data swamps
2020s : Data Lakehouse
: Object storage + transaction layer
: Delta Lake / Iceberg / Hudi
: ACID + open formats + plural engines
The following comparison consolidates the three architectures across the dimensions an architect actually has to weigh:
| Dimension | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage substrate | Proprietary, high-performance, expensive | Cloud object storage (S3, ADLS, GCS), cheap | Cloud object storage + transaction layer |
| Data formats | Structured tabular only (star/snowflake) | Any format (JSON, CSV, Parquet, video, logs) | Any format; tabular wrapped in open table format |
| Schema model | Schema-on-write (strict) | Schema-on-read (loose) | Schema-on-write enforced via metastore-driven tables |
| ACID transactions | Native, full | Minimal or none | Full ACID via Delta / Iceberg / Hudi |
| Update / delete | Full DML | Complex, often rewrite entire partition | Fast MERGE / UPDATE / DELETE |
| Workloads | OLAP, BI, reporting | Exploration, ML, staging | BI + SQL + ML + streaming, unified |
| Governance | Strong RBAC, auditing | Limited, requires external tooling | Enterprise governance via Unity Catalog, Glue Catalog, Nessie |
| Versioning | Built-in | External tools required | Git-like time travel built in |
| Cost profile | High (storage + compute coupled) | Lowest (storage only) | Low storage, optimized compute |
[Source: https://www.adaltas.com/en/2022/05/17/data-warehouse-lake-lakehouse-comparison/] [Source: https://www.striim.com/blog/data-warehouse-vs-data-lake-vs-data-lakehouse-an-overview/] [Source: https://www.flexera.com/blog/finops/data-warehouse-vs-data-lake-vs-data-lakehouse/]
Key Takeaway: Lakehouses emerged to eliminate the parallel-stack tax — Delta Lake, Apache Iceberg, and Apache Hudi add ACID transactions and schema enforcement directly to Parquet on object storage, giving organizations warehouse reliability at lake economics without maintaining two systems.
Lakehouse Architecture Components
Despite vendor differences, every modern lakehouse decomposes into the same three architectural layers: an object-storage foundation, a metastore that gives those raw files a tabular identity, and one or more query engines that read and write through the metastore. Understanding these layers as separable building blocks is essential, because vendors mix and match them — and so will you when designing a real system.
Object storage as the foundation
The bottom of every lakehouse is a cloud object store: Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. These services provide effectively unlimited, highly durable, pay-per-byte storage that is fully decoupled from compute (S3, for example, is designed for eleven nines of durability). You can write data once and have a thousand different engines read it without paying any of them to be online.
Files in this layer are typically Apache Parquet — a columnar format that compresses well, supports predicate pushdown, and is readable by virtually every analytic engine. Parquet’s columnar layout is what gives lakehouses warehouse-class scan performance: a query that reads only three columns from a thousand-column table only pays I/O for those three.
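Predicate pushdown works because Parquet files carry min/max statistics for each row group, so a reader can skip chunks that cannot contain a match. The following sketch imitates that mechanism in plain Python rather than calling a Parquet library; the `RowGroup` class, the group size, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_amount: int
    max_amount: int
    amounts: list


def build_row_groups(values, group_size):
    """Split values into fixed-size chunks and record min/max per chunk."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many row groups had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_amount <= threshold:
            continue  # statistics prove no value in this group can match
        groups_read += 1
        matches.extend(v for v in group.amounts if v > threshold)
    return matches, groups_read


# Sorting (or clustering) on the filter column makes the statistics selective.
values = sorted([5, 12, 7, 3, 48, 51, 9, 95, 88, 2, 60, 15])
groups = build_row_groups(values, group_size=4)
matches, groups_read = scan_greater_than(groups, threshold=50)
print(matches, f"read {groups_read} of {len(groups)} row groups")
```

The filter is answered by reading one of three row groups. The same idea explains why data layout matters: if the values were not clustered, every group's range would straddle the threshold and nothing could be skipped.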
Real-world analogy: object storage is like a city’s foundational water supply. Every restaurant, factory, and apartment in the city draws from the same pipes, but each has its own kitchen or treatment process. The water utility doesn’t care what cuisine you cook; it just guarantees clean water at scale and low cost.
Metastore and catalog layer
Raw Parquet files in S3 are just files. The metastore (also called the catalog) is the layer that turns a directory of Parquet files into something engines can recognize as a table — a thing with a name, a schema, partitions, statistics, and access control [Source: https://www.databricks.com/blog/databricks-lakehouse-data-modeling-myths-truths-and-best-practices].
The metastore works hand in hand with the open table formats. A Delta or Iceberg table tracks the current schema, the list of files that constitute the current snapshot, the history of past snapshots (enabling time travel), and the partitioning scheme. The format keeps this metadata in files alongside the data, and the catalog records which version is current. A commit publishes a new snapshot atomically: in Iceberg, for example, the catalog swaps a pointer to the new metadata file, and that atomic pointer swap is what gives lakehouses their ACID guarantees on top of immutable object-store files.
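The atomic commit can be sketched as a compare-and-swap on a version number: a writer states which snapshot it started from, and the commit succeeds only if that snapshot is still current. The `ToyCatalog` class below is a teaching model in plain Python, not the API of Iceberg, Delta Lake, or any real catalog, and the file names are placeholders.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = [frozenset()]  # version 0: empty table

    def current_version(self):
        return len(self._snapshots) - 1

    def files(self, version=None):
        """Files in a snapshot; pass an older version for time travel."""
        if version is None:
            version = self.current_version()
        return self._snapshots[version]

    def commit(self, expected_version, added_files):
        """Publish a new snapshot only if nobody else committed in between."""
        with self._lock:
            if expected_version != self.current_version():
                raise CommitConflict("table changed since this write began")
            self._snapshots.append(self._snapshots[-1] | frozenset(added_files))
            return self.current_version()


catalog = ToyCatalog()
base = catalog.current_version()          # both writers start from version 0
v1 = catalog.commit(base, ["part-0001.parquet"])

try:
    catalog.commit(base, ["part-0002.parquet"])   # stale base: rejected
    conflict = False
except CommitConflict:
    conflict = True
    # Optimistic concurrency: re-read the current version and retry.
    v2 = catalog.commit(catalog.current_version(), ["part-0002.parquet"])

print(v1, conflict, sorted(catalog.files()), sorted(catalog.files(version=1)))
```

Readers never see a half-written table, because data files become visible only when the version advances. Old snapshots stay addressable, which is all that time travel requires.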
Production lakehouses use one of:
- Unity Catalog (Databricks) — governance across clusters, jobs, and tables, with cross-catalog transactions
- AWS Glue Data Catalog — the unified metastore for AWS analytics services and SageMaker Lakehouse [Source: https://aws.amazon.com/blogs/aws/simplify-analytics-and-aiml-with-new-amazon-sagemaker-lakehouse/]
- Apache Nessie — open-source, Git-like branching and versioning for data
- Hive Metastore — the older, still-widely-deployed standard
Without a central catalog, every engine maintains its own picture of the data and they drift apart. With one, multiple engines can read and write the same physical files consistently.
Query engines accessing shared data
The top layer is whatever compute you bring. Because the data lives in open formats and the catalog is engine-neutral, you can point many engines at the same lakehouse simultaneously:
- Apache Spark for large-scale ETL and ML feature engineering
- Trino / Presto for interactive SQL across heterogeneous sources
- Amazon Athena for serverless ad-hoc SQL over S3
- Amazon Redshift for warehouse-grade BI queries
- Amazon EMR / AWS Glue for managed Spark jobs (EMR also runs Flink)
- Databricks Photon for vectorized SQL inside the Databricks platform
- DuckDB for local or single-node analytics
This is the essential architectural shift the lakehouse enables: storage is shared, compute is plural and disposable. You spin up a Spark cluster, finish a job, shut it down, and the data persists. An analyst opens Athena, queries the same tables, pays only for bytes scanned. A data scientist trains a model in SageMaker against the same bytes. No copies, no ETL between engines.
Figure 2.2: Three-layer lakehouse architecture
flowchart TD
subgraph Engines["Query Engines (plural, disposable)"]
Spark["Apache Spark"]
Trino["Trino / Presto"]
Athena["Amazon Athena"]
Redshift["Amazon Redshift"]
EMR["EMR / Glue"]
Duck["DuckDB"]
end
subgraph Catalog["Metastore / Catalog Layer"]
Meta["Unity Catalog / Glue Data Catalog / Nessie<br/>schemas, snapshots, partitions, ACLs"]
end
subgraph Storage["Object Storage Foundation"]
S3["Amazon S3 / ADLS / GCS<br/>Parquet files (open, columnar)"]
end
Spark --> Meta
Trino --> Meta
Athena --> Meta
Redshift --> Meta
EMR --> Meta
Duck --> Meta
Meta --> S3
Key Takeaway: A lakehouse is the deliberate decomposition of three layers — object storage, metastore, and query engines — held together by an open table format. The metastore is the keystone: it converts directories of Parquet files into transactional, governed tables that any compatible engine can safely query.
AWS SageMaker Lakehouse
Announced at AWS re:Invent 2024 on December 3, 2024, Amazon SageMaker Lakehouse is AWS’s flagship implementation of the lakehouse pattern and is now generally available [Source: https://aws.amazon.com/about-aws/whats-new/2024/12/aws-announces-amazon-sagemaker-lakehouse/]. It is the canonical AWS example used throughout this textbook because it concretely shows how the abstract layers of the previous section map onto a production cloud platform.
Unified access across S3 and Redshift
The headline capability is unification. Historically, AWS customers ran S3 data lakes (queryable by Athena, EMR, Glue) and Redshift data warehouses (queryable only by Redshift) as two separate worlds, with Glue jobs shoveling data between them. SageMaker Lakehouse erases that boundary: it presents a single copy of data that spans S3 lakes, Redshift Managed Storage, and federated sources such as DynamoDB, Snowflake, Salesforce, SAP, ServiceNow, and Zendesk [Source: https://aws.amazon.com/blogs/aws/simplify-analytics-and-aiml-with-new-amazon-sagemaker-lakehouse/].
Concretely, this means a Spark job in EMR can read a Redshift table in place — without unloading it to S3 first — and a Redshift query can join that warehouse table with an S3-resident Iceberg table in the same SQL statement. Zero-ETL integrations pull from SaaS sources directly into the lakehouse, eliminating custom Kinesis or Glue pipelines for many common feeds [Source: https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai/].
Figure 2.3: SageMaker Lakehouse unified access across S3, Redshift, and SaaS sources
flowchart LR
subgraph Sources["Data Sources"]
S3["S3 Data Lake<br/>(Iceberg / Parquet)"]
RMS["Redshift Managed Storage<br/>(warehouse tables)"]
SaaS["SaaS / Federated<br/>Salesforce, SAP,<br/>ServiceNow, Zendesk,<br/>DynamoDB, Snowflake"]
end
Glue["AWS Glue Data Catalog<br/>(unified metastore)"]
subgraph Studio["SageMaker Unified Studio"]
EMR["EMR / Spark"]
Athena["Athena"]
RS["Redshift Query Editor"]
SM["SageMaker Studio (ML)"]
end
S3 -- "in-place" --> Glue
RMS -- "in-place" --> Glue
SaaS -- "zero-ETL" --> Glue
Glue --> EMR
Glue --> Athena
Glue --> RS
Glue --> SM
Worked example: a retail company has order history in Redshift (the existing warehouse), clickstream JSON in S3 (the existing lake), and customer master data in Salesforce (a SaaS source). Pre-lakehouse, building a “customer lifetime value” feature for an ML model required three pipelines and at least one duplicated copy of each source. With SageMaker Lakehouse, a single Spark job in SageMaker Unified Studio joins all three tables in place, writes the result back as an Iceberg table, and that table becomes immediately queryable by Athena (for the analyst) and Redshift (for the BI dashboard) without further movement.
Iceberg-compatible tables
SageMaker Lakehouse is built on Apache Iceberg as its open table format [Source: https://aws.amazon.com/about-aws/whats-new/2024/12/aws-announces-amazon-sagemaker-lakehouse/]. This choice has strategic consequences. Because Iceberg is open and engine-neutral, the same tables are queryable in place by Amazon EMR, AWS Glue, Redshift, Apache Spark, Athena, Trino, and any other Iceberg-aware engine. Customers are not locked into a single compute engine. Iceberg also provides the standard lakehouse capabilities: ACID transactions, time travel via snapshots, schema evolution, and fast MERGE / UPDATE / DELETE operations.
The unified metastore is the AWS Glue Data Catalog, which acts as the single source of truth for table definitions across S3 and Redshift — including for data physically stored in Redshift Managed Storage [Source: https://mactores.com/blog/amazon-sagemaker-lakehouse-unified-data-access-for-ml-and-analytics]. This is what makes “single copy of data” technically possible: the catalog gives every engine the same view of every table.
Identity-based fine-grained access
Centralizing data only helps if security keeps up. SageMaker Lakehouse enforces fine-grained permissions consistently across all attached engines — same user, same row-level and column-level rules whether they query via Athena, Redshift, or EMR. The governance layer is SageMaker Data and AI Governance, built on Amazon DataZone, providing data discovery, lineage, and policy management across the whole estate [Source: https://aws.amazon.com/blogs/aws/simplify-analytics-and-aiml-with-new-amazon-sagemaker-lakehouse/].
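The idea of one policy enforced for every engine can be modeled as a single function that all readers must pass through. The sketch below is a conceptual illustration in plain Python, not the Lake Formation or SageMaker API; the principals, columns, and `authorized_view` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    allowed_columns: frozenset          # column-level rule
    row_filter: Callable[[dict], bool]  # row-level rule


# Policies are keyed by identity, not by engine.
POLICIES = {
    "analyst_eu": Policy(frozenset({"order_id", "region", "amount"}),
                         lambda row: row["region"] == "EU"),
    "auditor": Policy(frozenset({"order_id", "region", "amount", "email"}),
                      lambda row: True),
}

ORDERS = [
    {"order_id": 1, "region": "EU", "amount": 20.0, "email": "a@example.com"},
    {"order_id": 2, "region": "US", "amount": 35.0, "email": "b@example.com"},
]


def authorized_view(principal, rows):
    """Apply the principal's row filter, then drop columns they may not see."""
    policy = POLICIES[principal]
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Any engine that reads through authorized_view gets the same answer.
print(authorized_view("analyst_eu", ORDERS))
print(authorized_view("auditor", ORDERS))
```

Because the rule lives with the identity and the table rather than inside each engine, adding a new query engine does not create a new place for permissions to drift.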
Access happens through SageMaker Unified Studio (initially in preview at launch), a single workspace that bundles EMR Studio, Glue Studio, the Redshift Query Editor, SageMaker Studio, and Bedrock tools — turning the lakehouse into the single access surface for analytics and AI/ML alike [Source: https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai/].
The following matrix shows how SageMaker Lakehouse maps to the generic three-layer architecture from the previous section:
| Generic Layer | SageMaker Lakehouse |
|---|---|
| Object storage | Amazon S3 + Redshift Managed Storage |
| Open table format | Apache Iceberg |
| Metastore / catalog | AWS Glue Data Catalog |
| Query engines | EMR, Glue, Redshift, Athena, Spark |
| Governance | SageMaker Data and AI Governance (on DataZone) |
| User workspace | SageMaker Unified Studio |
Key Takeaway: SageMaker Lakehouse operationalizes the lakehouse pattern on AWS by using Apache Iceberg as the open table format and Glue Data Catalog as the unified metastore, letting every compatible engine query a single copy of data spanning S3, Redshift, and federated SaaS sources without ETL.
Medallion Architecture
A lakehouse gives you the machinery to store and query data reliably, but it doesn’t tell you how to organize that data so that analysts trust it and engineers can debug it. The medallion architecture, originated and championed by Databricks, is the canonical lakehouse design pattern. It organizes data into three progressively refined quality tiers — Bronze, Silver, and Gold — typically implemented as Delta or Iceberg tables, with each layer rebuildable from the layer beneath it [Source: https://www.databricks.com/blog/what-is-medallion-architecture] [Source: https://docs.databricks.com/aws/en/lakehouse/medallion].
The naming evokes Olympic medals — bronze is the entry point, silver is refined, gold is the prize — and the pattern applies whether you’re organizing data for BI dashboards, ML training, or both.
Bronze: raw landing zone
The Bronze layer is the single source of truth for ingested data. Tables here are append-only, immutable, and as close to the source format as possible. Schema is applied loosely — schema-on-read style — to keep ingestion robust against upstream changes [Source: https://docs.databricks.com/aws/en/lakehouse/medallion].
Bronze tables typically capture:
- The full payload from the source (every column, even ones nobody currently queries)
- An ingestion timestamp
- A source-system identifier
- Optional metadata such as the source file path or Kafka offset
What Bronze deliberately does not do: deduplicate, join, drop columns, or transform business meaning. The Delta Lake or Iceberg format preserves auditability, and because everything is append-only and timestamped, you can replay any downstream pipeline against any historical snapshot of the source. If a regulator asks “what did our data look like on March 14?”, Bronze can answer.
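A Bronze write can be sketched as wrapping each raw payload in a small envelope and appending it, with no parsing or cleanup. The example below uses a Python list as a stand-in for an append-only Delta or Iceberg table; the field names, source identifiers, and file path are illustrative assumptions.

```python
from datetime import datetime, timezone

bronze_orders_raw = []  # stand-in for an append-only Bronze table


def to_bronze_record(raw_payload, source_system, source_path=None, ingested_at=None):
    """Wrap a payload with ingestion metadata; the payload itself is untouched."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return {
        "payload": raw_payload,            # stored verbatim, not parsed
        "ingested_at": ingested_at.isoformat(),
        "source_system": source_system,
        "source_path": source_path,        # optional lineage metadata
    }


def ingest(raw_payload, source_system, source_path=None, ingested_at=None):
    """Append only: no deduplication, no validation, no updates in place."""
    record = to_bronze_record(raw_payload, source_system, source_path, ingested_at)
    bronze_orders_raw.append(record)
    return record


# A client retry delivers the same payload twice; Bronze keeps both copies.
payload = '{"order_id": "o1", "amount": "20.00", "currency": "usd"}'
ingest(payload, "orders-api", "s3://example-bucket/orders/2024-05-01/batch-1.json")
ingest(payload, "orders-api", "s3://example-bucket/orders/2024-05-01/batch-1.json")
ingest("not even valid json", "orders-api")

print(len(bronze_orders_raw), "records landed")
```

Keeping the duplicate and the malformed payload is deliberate. Deciding what counts as a duplicate or as invalid is business logic, and it belongs in Silver where it can be changed and replayed.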
Real-world analogy: Bronze is a hospital’s intake records. You write down everything the patient says — symptoms, history, contradictions and all — without filtering. Diagnosis happens later; the intake is the immutable record of what was reported.
Silver: cleaned and conformed
The Silver layer is where data becomes trustworthy. Transformations applied here include cleansing (removing or repairing malformed rows), deduplication (one record per business event), normalization (standard units, standard date formats, standard country codes), enrichment via joins with reference data, and schema enforcement (every Silver table has a strict, documented schema and writes that violate it fail) [Source: https://docs.databricks.com/aws/en/lakehouse/medallion] [Source: https://delta.io/blog/delta-lake-medallion-architecture/].
Crucially, Silver tables are still non-aggregated record-level views. Each row corresponds to a real-world event or entity — an order, a session, a sensor reading. This makes Silver the natural home for data scientists and analytics engineers, who need fine-grained data to compute features and build models but no longer want to deal with the source’s inconsistencies.
Worked example: an e-commerce company’s Bronze orders_raw table contains every API payload, including duplicates from client retries, orders with malformed addresses, and currency codes in mixed cases (“USD”, “Usd”, “usd”). The Silver orders table is the result of: deduplicating on order ID + timestamp; rejecting rows with invalid postal codes (writing them to a quarantine table for review); upper-casing all currency codes; joining customer ID to the customer dimension to add region; and enforcing a strict schema with non-nullable order ID and amount.
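The steps of this worked example can be sketched in a few lines of plain Python. The sample rows, the customer-to-region lookup, and the five-digit postal code rule are assumptions made for illustration; a production pipeline would express the same logic in Spark or SQL against real tables.

```python
import re

# Hypothetical reference data and validation rule, for illustration only.
CUSTOMER_REGION = {"c1": "EMEA", "c2": "AMER"}
POSTAL_CODE = re.compile(r"\d{5}")  # assumes five-digit postal codes

bronze_orders = [
    {"order_id": "o1", "order_ts": "2024-05-01T10:00:00", "amount": "20.00",
     "currency": "usd", "postal_code": "94107", "customer_id": "c2"},
    # Client retry: same order ID and timestamp as the row above.
    {"order_id": "o1", "order_ts": "2024-05-01T10:00:00", "amount": "20.00",
     "currency": "usd", "postal_code": "94107", "customer_id": "c2"},
    {"order_id": "o2", "order_ts": "2024-05-01T11:30:00", "amount": "35.50",
     "currency": "Usd", "postal_code": "9410", "customer_id": "c1"},
    {"order_id": "o3", "order_ts": "2024-05-02T09:15:00", "amount": "12.25",
     "currency": "USD", "postal_code": "10001", "customer_id": "c1"},
]


def build_silver(bronze_rows):
    silver, quarantine, seen = [], [], set()
    for row in bronze_rows:
        key = (row.get("order_id"), row.get("order_ts"))
        if key in seen:
            continue  # deduplicate on order ID + timestamp
        seen.add(key)
        # Strict schema: order ID and amount must be present.
        if row.get("order_id") is None or row.get("amount") is None:
            quarantine.append(row)
            continue
        # Invalid postal codes go to quarantine for review, not to Silver.
        if not POSTAL_CODE.fullmatch(str(row.get("postal_code", ""))):
            quarantine.append(row)
            continue
        silver.append({
            "order_id": row["order_id"],
            "order_ts": row["order_ts"],
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),                # normalize case
            "customer_id": row["customer_id"],
            "region": CUSTOMER_REGION.get(row["customer_id"]),  # enrich via join
        })
    return silver, quarantine


silver_orders, quarantined = build_silver(bronze_orders)
print(len(silver_orders), "silver rows;", len(quarantined), "quarantined")
```

Rejected rows are routed to a quarantine list instead of being discarded, so a fix to the validation rule can be replayed against them later.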
Gold: business-ready aggregates
The Gold layer materializes business meaning. Tables here are aggregated, dimensional, or feature-engineered for direct consumption by BI dashboards, executive reports, and production ML models [Source: https://docs.databricks.com/aws/en/lakehouse/medallion]. They are optimized for query performance — partitioned, clustered, and often pre-joined.
Typical Gold tables include daily or hourly aggregates (“revenue by region by day”), dimensional models (star schemas with fact and dimension tables tuned for BI), KPIs and metrics aligned to business definitions, and ML feature tables ready for model training and inference [Source: https://www.databricks.com/blog/what-is-medallion-architecture].
Continuing the e-commerce example, a Gold daily_revenue_by_region table is built by aggregating the Silver orders table grouped by region and order date. The Gold table powers the executive dashboard, refreshes nightly, and is rebuildable end-to-end from Bronze if a definition changes.
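As a rough sketch of that aggregation, the snippet below groups a handful of Silver rows by region and order date. The sample rows and the `daily_revenue_by_region` function name are assumptions for illustration; in practice this would usually be a SQL `GROUP BY` or a Spark job.

```python
from collections import defaultdict

# Hypothetical Silver rows (already cleaned and enriched with region).
silver_orders = [
    {"order_id": "A1", "ts": "2026-05-07T10:00:00", "region": "us-west", "amount": 20.0},
    {"order_id": "A3", "ts": "2026-05-07T12:15:00", "region": "eu-central", "amount": 12.0},
    {"order_id": "A4", "ts": "2026-05-07T18:40:00", "region": "us-west", "amount": 5.5},
    {"order_id": "A5", "ts": "2026-05-08T09:05:00", "region": "us-west", "amount": 7.0},
]


def daily_revenue_by_region(rows):
    """Aggregate record-level orders into one row per (region, order date)."""
    totals = defaultdict(float)
    for row in rows:
        order_date = row["ts"][:10]  # ISO timestamp -> YYYY-MM-DD
        totals[(row["region"], order_date)] += row["amount"]
    return [
        {"region": region, "order_date": day, "revenue": round(total, 2)}
        for (region, day), total in sorted(totals.items())
    ]


gold = daily_revenue_by_region(silver_orders)
```

Because the function reads only Silver rows, rerunning it after a definition change rebuilds the Gold table without touching Bronze.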
The data flows in one direction — Bronze to Silver to Gold — and each layer is rebuildable from Bronze. That last property is the medallion pattern’s superpower: when the definition of “active customer” changes, you don’t have to find every downstream table by hand. You change the Silver or Gold logic, rerun, and get a consistent rebuild [Source: https://delta.io/blog/delta-lake-medallion-architecture/].
The medallion layers compare as follows:
| Layer | Purpose | Schema | Transformations | Typical Consumers |
|---|---|---|---|---|
| Bronze | Raw single source of truth | Schema-on-read, loose | Ingestion only; append-only | Pipeline operators, auditors, replay jobs |
| Silver | Cleaned, validated record-level data | Schema-enforced, strict | Cleanse, dedupe, normalize, join, validate | Data scientists, analytics engineers |
| Gold | Business-ready aggregates and models | Schema-enforced, dimensional | Aggregate, curate, feature-engineer | BI dashboards, executives, ML inference |
[Source: https://docs.databricks.com/aws/en/lakehouse/medallion] [Source: https://weld.app/blog/medallion-layers]
Figure 2.4: Medallion architecture — Bronze, Silver, Gold flow
flowchart LR
Source["Source Systems<br/>APIs, Kafka, files, CDC"]
subgraph Bronze["Bronze Layer"]
B["Raw, append-only<br/>Schema-on-read<br/>Full payload + ingest timestamp"]
end
subgraph Silver["Silver Layer"]
S["Cleansed, deduped, normalized<br/>Schema-enforced<br/>Record-level events"]
end
subgraph Gold["Gold Layer"]
G["Aggregates, dimensional models<br/>ML feature tables<br/>Business definitions"]
end
Source --> B
B --> S
S --> G
B -. "auditors / replay" .-> Auditor(["Auditor"])
S -. "data scientists" .-> DS(["Data Scientist"])
G -. "BI / executives / ML" .-> BI(["Dashboard / ML"])
G -. "rebuildable from Bronze" .-> B
A common deployment shape, especially in larger organizations, is a hub-and-spoke arrangement where a central Data Hub owns canonical Bronze/Silver/Gold tables for cross-organization concepts (customer, employee, product), and each domain (Sales, Marketing, Finance) maintains its own Bronze/Silver/Gold extensions that join to the hub [Source: https://docs.databricks.com/aws/en/lakehouse/medallion]. This avoids both extreme central bottlenecks and uncoordinated domain duplication.
Key Takeaway: The medallion architecture is a quality-tiering discipline that turns lakehouse machinery into a maintainable system: Bronze preserves raw truth for replay and audit, Silver delivers cleaned record-level data for analytics engineers, and Gold serves business-ready aggregates to dashboards and ML — each layer rebuildable from Bronze when definitions evolve.
Chapter Summary
The lakehouse paradigm is a deliberate response to an industry that spent a decade running parallel data warehouse and data lake stacks. Warehouses gave reliability, ACID transactions, and SQL performance, but at proprietary-storage prices and only for structured data. Lakes gave cheap object-store economics and flexibility for ML and unstructured workloads, but lost ACID guarantees and frequently degraded into data swamps. The lakehouse keeps the lake’s storage substrate and open file formats while adding a transaction layer — Delta Lake, Apache Iceberg, or Apache Hudi — that restores warehouse-grade guarantees on top of Parquet files in S3.
Architecturally, every lakehouse decomposes into three layers: object storage at the bottom, a metastore (Unity Catalog, AWS Glue Data Catalog, Apache Nessie) in the middle, and a plurality of compute engines on top (Spark, Trino, Athena, Redshift, EMR, Photon). The open table format is the contract that lets engines safely share storage. Amazon SageMaker Lakehouse, announced at re:Invent 2024 and now generally available, instantiates this pattern with S3 + Redshift Managed Storage as the storage substrate, Apache Iceberg as the open table format, AWS Glue Data Catalog as the unified metastore, and a single Unified Studio workspace that brings analytics and AI/ML onto one access surface, including zero-ETL pulls from Salesforce, SAP, ServiceNow, and Zendesk into the same governed copy of data.
A lakehouse is the right choice when an organization needs unified BI, ML, and streaming on a single copy of data, wants to retire parallel warehouse and lake stacks, and is ready to standardize on an open table format and central catalog. A pure warehouse remains a fine choice when workloads are exclusively structured BI and you value managed simplicity over flexibility. A pure lake is rarely the right answer in 2026 — the marginal cost of adding a transaction layer is small, and the cost of running without one (corruption, drift, swamp) is large.
Once the lakehouse infrastructure is in place, the medallion architecture provides the organizational discipline. Bronze captures raw, append-only source data for replay and audit. Silver cleans, deduplicates, normalizes, and enforces schema to produce trustworthy record-level data. Gold aggregates and curates for BI dashboards and ML feature tables. Each layer is rebuildable from Bronze, which is the property that lets the system evolve without rewriting history. Together, these patterns — open table formats on object storage, unified catalogs, plural engines, and medallion tiering — define what modern data engineering looks like and form the foundation that subsequent chapters build on.
Key Terms
| Term | Definition |
|---|---|
| lakehouse | A data architecture that adds a transaction layer (Delta Lake, Apache Iceberg, or Apache Hudi) on top of cloud object storage to combine the cost and flexibility of a data lake with the reliability, ACID transactions, and schema enforcement of a data warehouse. |
| data warehouse | An OLAP-optimized analytical store using proprietary, high-performance columnar storage and strict schema-on-write, designed for structured BI and reporting workloads with full ACID guarantees but at high storage cost and limited flexibility. |
| data lake | A repository on cheap cloud object storage (S3, ADLS, GCS) that accepts data in any format with minimal upfront structure, applying schema only at read time; flexible and cost-efficient but historically lacking ACID guarantees, leading to “data swamps.” |
| schema-on-read | An approach in which data is stored in its raw form and a schema is applied only when the data is queried, allowing different consumers to interpret the same bytes differently; characteristic of data lakes and Bronze layer tables. |
| schema-on-write | An approach in which a schema is defined and enforced at the moment data is written, rejecting non-conforming records; characteristic of data warehouses and lakehouse Silver/Gold layer tables. |
| medallion architecture | Databricks’ canonical lakehouse design pattern that organizes data into three progressively refined quality tiers — Bronze (raw), Silver (cleaned and validated), Gold (business-ready aggregates) — each rebuildable from Bronze. |
| metastore | The catalog layer (e.g., Unity Catalog, AWS Glue Data Catalog, Apache Nessie, Hive Metastore) that stores table metadata, schemas, partitioning, snapshots, and permissions, allowing multiple compute engines to read and write the same physical files consistently. |
| OLAP | Online Analytical Processing — workloads that scan large volumes of data across many rows but few columns to answer aggregate analytical questions, optimized for read throughput rather than transactional latency; the workload class warehouses and lakehouses are built for. |
Chapter 3: Storage Foundations and Open Table Formats
Learning Objectives
- Compare row-oriented, columnar, and hybrid storage formats by use case
- Explain how Apache Iceberg, Delta Lake, and Hudi implement ACID on object storage
- Use partitioning, clustering, and Z-ordering to accelerate analytical queries
- Configure S3 storage classes and S3 Tables for cost-effective lake storage
Storage is where every data engineering decision eventually lands. You can pick the most elegant orchestrator and the fastest query engine, but if your bytes are laid out poorly on disk, queries will be slow, costs will be high, and producers will fight consumers over schema changes. This chapter goes deep on the physical and logical storage layers that modern lakehouses depend on: the file formats that hold the actual rows, the open table formats that wrap those files in transactional semantics, the object storage services that host them, and the partitioning strategies that make billion-row queries feel instantaneous.
Think of these layers like a library. The file format is how each book is printed — paper, font, binding. The table format is the catalog system that tells you which books exist, which editions are current, and which are archived. The object store is the building itself with its shelving and climate control. And partitioning is the floor plan that keeps biographies separate from cookbooks so a librarian doesn’t have to walk every aisle to find one title. Get any of these wrong and the whole library slows down.
File Formats for Analytics
CSV, JSON, Avro: row-oriented and human-readable
Row-oriented formats store records the way you’d write them on a notepad — all the fields for one record together, then all the fields for the next record, and so on. CSV (comma-separated values) is the lowest common denominator: a plain-text grid that every spreadsheet, database, and scripting language can read. JSON layers in nested structures and types, which is invaluable when records have arrays or sub-objects (think API payloads or event logs). Avro is a compact binary row format with an embedded schema, designed for high-throughput streaming systems like Apache Kafka where producers and consumers may evolve independently.
Row formats shine when you need to read or write whole records: ingesting events from a webhook, exporting a customer profile, or replaying a Kafka topic. They are terrible for analytics because answering “what was the average order value last quarter?” forces the engine to read every byte of every order — including customer addresses, SKUs, and shipping notes you don’t care about — just to grab one column.
A real-world analogy: CSV is like a paper receipt. Easy to print, easy to read with your eyes, painful if you need to total the tax column across ten thousand receipts.
Parquet and ORC: columnar compression and predicate pushdown
Parquet and ORC flip the layout. Instead of storing rows together, they group values from the same column together, so a query that touches three of fifty columns reads roughly 6% of the file rather than 100%. Apache Parquet has become the de facto standard for analytical lakes; ORC (Optimized Row Columnar) is its close cousin, more common in legacy Hive deployments.
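The 6% figure follows from simple arithmetic, which the sketch below makes explicit. It assumes fifty equally wide, uncompressed 8-byte columns, which real tables rarely have; the `bytes_read` helper and column names are hypothetical.

```python
def bytes_read(layout, num_rows, column_widths, needed):
    """Estimate bytes scanned for a query touching only the `needed` columns."""
    if layout == "row":
        # Row layout: every field of every record is read.
        return num_rows * sum(column_widths.values())
    # Columnar layout: only the requested column chunks are read.
    return num_rows * sum(column_widths[c] for c in needed)


columns = {f"col_{i}": 8 for i in range(50)}  # 50 columns, 8 bytes each (assumed)
needed = ["col_0", "col_1", "col_2"]

row_bytes = bytes_read("row", 1_000_000, columns, needed)
col_bytes = bytes_read("columnar", 1_000_000, columns, needed)
fraction = col_bytes / row_bytes  # 3 / 50
```

In practice the saving is usually larger than this estimate, because columnar files also compress better than row-oriented text.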
Parquet organizes a file hierarchically: the file is split into row groups of around 128 MB, each row group is sliced into per-column column chunks, and each chunk is further broken into pages that hold the actual encoded values [Source: https://last9.io/blog/parquet-vs-csv/]. At every level, Parquet stores rich metadata: min/max values, null counts, distinct counts, and optional bloom filters [Source: https://www.youtube.com/watch?v=OsJvgTmeyeE].
Figure 3.1: Parquet file hierarchical layout with row groups, column chunks, and pages
flowchart TD
File["Parquet File"]
File --> RG1["Row Group 1 (~128 MB)"]
File --> RG2["Row Group 2 (~128 MB)"]
File --> Footer["Footer (file metadata + schema)"]
RG1 --> CC1["Column Chunk: order_id"]
RG1 --> CC2["Column Chunk: customer_id"]
RG1 --> CC3["Column Chunk: amount"]
CC1 --> P1["Page 1 (encoded values + stats)"]
CC1 --> P2["Page 2 (encoded values + stats)"]
CC1 --> P3["Page N ..."]
Footer --> Stats["Min/Max, null counts, bloom filters"]
Stats -. "predicate pushdown skips chunks/pages" .-> RG2
That metadata enables predicate pushdown — the ability for a query engine to consult the statistics and skip data that cannot possibly match the filter, without reading the underlying values [Source: https://last9.io/blog/parquet-vs-csv/]. If a query asks WHERE shipdate <= '1996-09-02' and a row group’s max shipdate is '1996-08-15', the engine reads it. If the row group’s min shipdate is '1997-01-01', the engine skips the entire row group. On the TPC-H SF20 benchmark, this kind of pruning skips roughly 30% of data on selective filters [Source: https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html].
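The skip decision itself is a comparison against per-row-group statistics. The following sketch mimics it for the `shipdate <= cutoff` filter; the `RowGroupStats` class and the three sample row groups are invented for illustration and do not reflect any engine's internal API.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    name: str
    min_shipdate: str  # ISO dates compare correctly as strings
    max_shipdate: str


def groups_to_read(groups, cutoff):
    """Keep only row groups that could contain rows with shipdate <= cutoff."""
    # A group can be skipped when even its smallest value exceeds the cutoff.
    return [g.name for g in groups if g.min_shipdate <= cutoff]


groups = [
    RowGroupStats("rg1", "1996-01-01", "1996-08-15"),  # entirely before cutoff
    RowGroupStats("rg2", "1996-08-20", "1996-12-31"),  # straddles cutoff
    RowGroupStats("rg3", "1997-01-01", "1997-06-30"),  # entirely after cutoff
]

selected = groups_to_read(groups, "1996-09-02")
```

Here `rg3` is never opened. `rg2` is still read because the statistics cannot rule it out; row-level filtering happens after decoding.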
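Parquet also shrinks files dramatically by applying lightweight encodings to each column before general-purpose compression (such as Snappy, gzip, or Zstandard); optional bloom filters sit alongside them to speed up lookups:
| Technique | What It Does | Best For |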
|---|---|---|
| Dictionary encoding | Replaces repeated values with integer IDs | Low-cardinality columns (status, country) |
| Run-length encoding (RLE) | Compresses runs of identical values into (value, count) | Sorted or naturally clustered columns |
| Bit-packing | Stores small integers using only the bits they need | Counts, IDs, encoded categoricals |
| Bloom filters | Probabilistic structure for “definitely not present” checks | Point lookups by ID |
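To make the first three rows of the table concrete, here is a small pure-Python sketch of dictionary encoding, run-length encoding, and the bit width that bit-packing would need. It illustrates the ideas only; it is not Parquet's actual on-disk encoding, and the function names are made up for this example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer ID; return (dictionary, ids)."""
    dictionary, ids = {}, []
    for v in values:
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), ids


def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def bits_needed(ids):
    """Bits per value if the IDs were bit-packed."""
    return max(1, max(ids).bit_length())


status = ["shipped", "shipped", "shipped", "pending", "pending", "shipped"]
dictionary, ids = dictionary_encode(status)
runs = run_length_encode(ids)
width = bits_needed(ids)
```

Six strings collapse to a two-entry dictionary, three runs, and one bit per value, which is why low-cardinality and sorted columns compress so well.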
The performance numbers are striking. On a 194 GB Allstate dataset, Parquet compressed the data to 4.7 GB (a reduction of roughly 97%) and read 3.5x less data on full scans [Source: https://www.cloudera.com/blog/technical/benchmarking-apache-parquet-the-allstate-experience.html]. On TPC-H SF20, Parquet was 5x smaller than CSV (3.2 GB vs 16 GB) and 7-10x faster on joins (at the top of that range, 2 seconds versus 20 seconds) [Source: https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html]. On Amazon Athena, the same query scanned 117 KB of Parquet versus 48 MB of CSV, a difference that shows up directly on the bill because Athena charges per byte scanned [Source: https://last9.io/blog/parquet-vs-csv/].
Choosing format based on query and write pattern
The choice is rarely “Parquet always wins.” It depends on access pattern.
| Pattern | Recommended Format | Why |
|---|---|---|
| Streaming ingest with schema evolution | Avro | Compact binary, embedded schema, row-based writes |
| Webhook landing zone, debug-friendly | JSON or CSV | Human-readable, no tooling required |
| Analytical scans, BI dashboards | Parquet (or ORC on Hive) | Columnar, compressed, predicate pushdown |
| Small (<100 MB) reference data | CSV is fine | Tooling overhead exceeds benefit |
| OLAP at >100 MB scale | Parquet | 5-40x smaller, 7-100x faster queries [Source: https://last9.io/blog/parquet-vs-csv/] |
A common pattern is to land raw data as JSON or Avro for fidelity, then immediately convert to Parquet for analytics. The raw layer is the audit trail; the Parquet layer is what dashboards actually query.
Key Takeaway: Row-oriented formats (CSV, JSON, Avro) are great for ingestion and whole-record access, but they force analytical queries to read every byte of every row. Parquet’s columnar layout, hierarchical metadata, and predicate pushdown deliver 5-40x size reduction and 7-100x query speedup on analytical workloads, making it the default file format for modern lakes.
Open Table Formats
A Parquet file is just a Parquet file. It has no notion of “the current state of the orders table.” If two writers append at the same time, or one writer fails halfway through, you can end up with partial data, duplicate files, or queries that see inconsistent snapshots. Traditional data warehouses solved this with a tightly-coupled storage engine. Lakes solved it by inventing open table formats: thin metadata layers that sit on top of Parquet/ORC files and provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema evolution, and time travel — all on commodity object storage [Source: https://iceberg.apache.org/docs/latest/].
Three open table formats dominate today: Apache Iceberg, Delta Lake, and Apache Hudi. They all solve the same core problems but with different design choices, ecosystems, and personalities.
Apache Iceberg snapshots and time travel
Apache Iceberg organizes a table as a tree of metadata files. At the root sits a metadata.json file pointing at the current snapshot. Each snapshot references a manifest list, the manifest list names the manifest files that make up that snapshot, and each manifest lists the data files (Parquet or ORC) that belong to the table at that moment [Source: https://iceberg.apache.org/docs/latest/]. Writers append new data files and produce a new snapshot atomically; readers see whatever snapshot was current when their query started. This gives serializable isolation with optimistic concurrency control: two writers can prepare commits in parallel, and the one that commits second detects the conflict and retries [Source: https://iceberg.apache.org/docs/latest/].
Figure 3.2: Apache Iceberg metadata layers from catalog pointer to data files
flowchart TD
Catalog["Catalog (Glue / Hive / REST)"]
Catalog --> MetaJSON["metadata.json (current snapshot pointer)"]
MetaJSON --> Snap1["Snapshot S1 (older)"]
MetaJSON --> Snap2["Snapshot S2 (current)"]
Snap1 --> ML1["Manifest List S1"]
Snap2 --> ML2["Manifest List S2"]
ML1 --> M1["Manifest File A"]
ML2 --> M1
ML2 --> M2["Manifest File B (new)"]
M1 --> D1["data-001.parquet"]
M1 --> D2["data-002.parquet"]
M2 --> D3["data-003.parquet (newly added)"]
Because every snapshot is preserved (until expired), Iceberg supports time travel: you can query the table as of a specific timestamp or snapshot ID without restoring from backup.
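-- Query Iceberg as of a specific time (Spark SQL syntax)
SELECT * FROM iceberg_table
FOR SYSTEM_TIME AS OF '2026-05-07 14:00:00';
-- Or by snapshot ID
SELECT * FROM iceberg_table
FOR SYSTEM_VERSION AS OF 4538291;
-- Trino and Athena spell these clauses FOR TIMESTAMP AS OF and FOR VERSION AS OF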
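Conceptually, a time-travel query resolves a timestamp to the latest snapshot committed at or before it, then reads only that snapshot's files. The toy model below shows that lookup; the snapshot IDs, commit times, and the `snapshot_as_of` helper are assumptions for the sketch and not Iceberg's API.

```python
from bisect import bisect_right

# Toy snapshot history: (commit_time, snapshot_id, data_files), oldest first.
snapshots = [
    ("2026-05-07 09:00:00", 101, ["data-001.parquet"]),
    ("2026-05-07 12:00:00", 102, ["data-001.parquet", "data-002.parquet"]),
    ("2026-05-07 16:00:00", 103, ["data-001.parquet", "data-002.parquet",
                                  "data-003.parquet"]),
]


def snapshot_as_of(history, ts):
    """Return the latest snapshot committed at or before `ts`."""
    times = [commit_time for commit_time, _, _ in history]
    idx = bisect_right(times, ts) - 1
    if idx < 0:
        raise ValueError(f"no snapshot at or before {ts}")
    return history[idx]


_, snap_id, files = snapshot_as_of(snapshots, "2026-05-07 14:00:00")
```

A query as of 14:00 sees snapshot 102 and never learns that `data-003.parquet` exists, which is what makes the read consistent.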
Iceberg’s biggest superpower is schema evolution backed by stable column IDs. Each column has an immutable numeric ID, so you can rename, reorder, drop, or add columns without rewriting any data files [Source: https://iceberg.apache.org/docs/latest/]:
{
"type": "struct",
"fields": [
{"id": 1, "name": "id", "required": true, "type": "long"},
{"id": 2, "name": "new_column", "required": false, "type": "string"}
]
}
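Why column IDs make renames free can be seen in a toy reader that resolves values by field ID rather than by name. The row layout, the two schema versions, and the `read_rows` helper are hypothetical simplifications; real Iceberg readers work on Parquet column metadata, not Python dictionaries.

```python
# Toy data file: values are stored against field IDs, not column names.
data_file_rows = [{1: 1001, 2: "alice"}, {1: 1002, 2: "bob"}]

schema_v1 = {1: "id", 2: "name"}
# Schema change: field 2 renamed, field 3 added. No data file is touched.
schema_v2 = {1: "id", 2: "full_name", 3: "email"}


def read_rows(rows, schema):
    """Project stored values through the schema's ID -> name mapping."""
    # Fields missing from an old file (here, field 3) read back as None.
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


v1 = read_rows(data_file_rows, schema_v1)
v2 = read_rows(data_file_rows, schema_v2)
```

The same bytes serve both schema versions: the rename changes only metadata, and the added column surfaces as nulls for files written before it existed.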
The same goes for partition evolution. If you started partitioning by region and later want to partition by region and year, Iceberg can change the partition spec on a go-forward basis without rewriting historical data — old partitions and new partitions coexist seamlessly [Source: https://iceberg.apache.org/docs/latest/]. Iceberg has the broadest engine support: native in Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB, which makes it the safest choice for vendor-neutral lakehouses.
Delta Lake transaction log
Delta Lake takes a different route. Instead of a tree of manifests, it maintains a _delta_log directory containing an ordered series of JSON commit files, each describing one transaction (add file X, remove file Y, update schema, etc.) [Source: https://docs.delta.io/latest/index.html]. To compute the current table state, the reader replays the log from the last checkpoint forward — much like a database write-ahead log applied to object storage.
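Log replay can be sketched in a few lines. The commit structure below is deliberately simplified (real Delta actions carry paths, sizes, statistics, and more inside nested objects), and `table_state` is a made-up helper, not part of any Delta library.

```python
# Toy transaction log: one list of actions per commit, in commit order.
commits = [
    [{"add": "part-000.parquet"}, {"add": "part-001.parquet"}],      # version 0
    [{"add": "part-002.parquet"}],                                   # version 1
    [{"remove": "part-000.parquet"}, {"add": "part-003.parquet"}],   # version 2
]


def table_state(log, version=None):
    """Replay commits up to `version` (inclusive) and return the live files."""
    last = len(log) - 1 if version is None else version
    live = set()
    for actions in log[: last + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return sorted(live)


current = table_state(commits)
as_of_v1 = table_state(commits, version=1)
```

Replaying to an earlier version is all that time travel needs, provided the removed data files have not yet been physically deleted.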
Delta Lake also relies on optimistic concurrency control (its default write isolation level is WriteSerializable, with full Serializable available as a table setting), supports time travel through versioned commits, and integrates tightly with Spark and the Databricks platform.
-- Delta time travel by version
SELECT * FROM delta_table VERSION AS OF 5;
-- By timestamp
SELECT * FROM delta_table TIMESTAMP AS OF '2026-05-01';
Delta keeps transaction log history for 30 days by default, but VACUUM removes unreferenced data files after a default of 7 days, so time travel to older versions only works while those files are still retained [Source: https://docs.delta.io/latest/index.html]. Schema evolution is more limited than Iceberg’s: you can add columns at the end and merge schemas during writes, but renaming or dropping columns without a rewrite requires enabling column mapping on the table. Partitions are explicit and visible to users, so changing them requires rewrites. Where Delta really wins is the Databricks ecosystem: zero-copy table cloning, deep MERGE optimizations, and excellent ML/AI tooling integration through MLflow and Unity Catalog [Source: https://docs.delta.io/latest/index.html].
Apache Hudi merge-on-read and copy-on-write
Apache Hudi is the streaming-first member of the family. It was built at Uber to solve a problem the others didn’t initially tackle: efficient row-level upserts and deletes on a lake. Hudi maintains a timeline of instants and offers two table types [Source: https://hudi.apache.org/docs/overview/]:
| Table Type | How It Works | Best For |
|---|---|---|
| Copy-on-Write (CoW) | Each write rewrites the affected files in full. Reads are fast and uniform. | Read-heavy analytics, batch ETL |
| Merge-on-Read (MoR) | New changes are written as small delta logs and merged at read time (with periodic compaction). Writes are fast; reads pay a small merge cost. | Streaming ingestion, CDC, low-latency upserts |
Hudi also brings record-level indexing (Bloom, Simple, Bucket, HBase-backed, etc.), so updating a single row by primary key doesn’t require scanning the whole partition [Source: https://hudi.apache.org/docs/overview/]. This makes it the format of choice for change data capture (CDC) pipelines where you’re constantly applying updates from a transactional source. Hudi’s schema evolution is more limited than Iceberg’s, and ecosystem support outside Spark and Flink is thinner: Trino and BigQuery integrations exist but are less mature.
Figure 3.3: Hudi Copy-on-Write vs Merge-on-Read write paths
flowchart LR
Upsert["Incoming upsert batch"]
Upsert --> CoW["Copy-on-Write path"]
Upsert --> MoR["Merge-on-Read path"]
CoW --> Rewrite["Rewrite affected base Parquet files"]
Rewrite --> CoWRead["Reader: scan base files (fast, uniform)"]
MoR --> Delta["Append small delta log files (Avro)"]
Delta --> Compact["Periodic compaction job"]
Compact --> Base["Merged base Parquet files"]
Delta --> MoRRead["Reader: merge base + deltas at query time"]
Base --> MoRRead
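The two write paths in Figure 3.3 can be contrasted with a toy key-value model. This is only a sketch of the trade-off: the `copy_on_write` function and `MergeOnReadTable` class are hypothetical, and real Hudi tables operate on Parquet base files and Avro log files rather than dictionaries.

```python
base_file = {"k1": 10, "k2": 20, "k3": 30}   # toy base file: record key -> value
upserts = {"k2": 25, "k4": 40}               # incoming upsert batch


def copy_on_write(base, changes):
    """CoW: produce a fully rewritten base file at write time."""
    return {**base, **changes}


class MergeOnReadTable:
    """MoR: append cheap deltas, merge at read time, compact periodically."""

    def __init__(self, base):
        self.base = dict(base)
        self.delta_log = []

    def upsert(self, changes):
        self.delta_log.append(dict(changes))  # fast write, base untouched

    def read(self):
        merged = dict(self.base)
        for delta in self.delta_log:          # readers pay the merge cost
            merged.update(delta)
        return merged

    def compact(self):
        self.base = self.read()               # fold deltas into a new base
        self.delta_log = []


cow_result = copy_on_write(base_file, upserts)
mor = MergeOnReadTable(base_file)
mor.upsert(upserts)
mor_view = mor.read()
```

Both paths return the same logical table; they differ only in whether the rewrite cost is paid by the writer (CoW) or deferred to readers and compaction (MoR).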
Putting them side by side
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Metadata | metadata.json + Avro manifest tree | _delta_log transaction log | Timeline of instants (.hoodie) |
| Isolation | Serializable | WriteSerializable (default) or Serializable | Snapshot isolation |
| Schema evolution | Most advanced (rename, drop, reorder, promote) | Moderate (add at end, merge) | Basic |
| Partition evolution | Hidden + dynamic | Explicit, requires rewrite | Largely fixed |
| Time travel | Snapshot-based | Version & timestamp | Instant-based |
| Best for | Multi-engine warehouses | Databricks/Spark, ML | Streaming, CDC, upserts |
| Native engines | Spark, Flink, Trino, Athena, BigQuery, Snowflake | Spark, Databricks-centric | Spark, Flink |
A real-world analogy: Iceberg is like Git for your lake — every commit is a snapshot you can travel back to, and branching/merging are first-class. Delta Lake is like a single transaction log a la PostgreSQL’s WAL, replayed to derive state. Hudi is like a journaling filesystem optimized for many small updates with periodic compaction.
Key Takeaway: Open table formats turn a pile of Parquet files on object storage into something that behaves like a real database table — with ACID transactions, schema evolution, and time travel. Pick Iceberg for multi-engine flexibility and aggressive schema change, Delta Lake for tight Databricks/Spark integration, and Hudi when streaming upserts and CDC are the dominant workload.
Object Storage on AWS
Amazon S3 storage classes
Amazon S3 is the de facto storage backbone for AWS-based lakes. What many teams miss is that S3 is not one storage tier — it’s a family of classes with very different cost and access profiles. Choosing the right class per object can cut the storage bill by 50-95%.
| Storage Class | Designed For | Access Latency | Cost Profile |
|---|---|---|---|
| S3 Standard | Hot, frequently accessed data | Milliseconds | Highest storage, lowest retrieval |
| S3 Intelligent-Tiering | Unknown/changing access patterns | Milliseconds | Auto-moves objects between tiers |
| S3 Standard-IA (Infrequent Access) | Less frequent but rapid access | Milliseconds | ~45% cheaper storage, retrieval fee |
| S3 One Zone-IA | Recreatable infrequent data | Milliseconds | Single AZ, ~20% cheaper than Standard-IA |
| S3 Glacier Instant Retrieval | Archive with millisecond access | Milliseconds | Low storage, higher retrieval |
| S3 Glacier Flexible Retrieval | Archive, minutes-to-hours retrieval | Minutes-hours | Very low storage |
| S3 Glacier Deep Archive | Long-term compliance archives | Hours (within 12 hrs standard, up to 48 hrs bulk) | Lowest storage cost |
For a typical lake, the hot transformed Parquet that dashboards hit lives in S3 Standard. Last year’s raw landing data might live in S3 Standard-IA. Compliance archives that must be kept for years but are almost never read go to Glacier Deep Archive.
Amazon S3 Tables for managed Iceberg
Running Iceberg yourself is workable but operationally heavy: you have to schedule compaction (small file problem), expire old snapshots, clean up unreferenced files, and tune metadata layout. Most teams underinvest here, and table performance silently degrades over months.
Amazon S3 Tables, announced at AWS re:Invent 2024, is a fully managed Apache Iceberg service built on a new bucket type called table buckets [Source: https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/]. Tables are first-class AWS resources with their own ARNs, IAM policies, and dedicated endpoints [Source: https://www.theregister.com/2024/12/03/aws_introduces_s3_tables/].
Three things make S3 Tables compelling:
- Performance. AWS reports up to 3x faster query throughput and 10x higher transactions per second versus self-managed Iceberg in standard S3 buckets [Source: https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/]. The bucket type is purpose-built for tabular access patterns rather than generic object storage [Source: https://www.youtube.com/watch?v=eztA5VYH2nM].
- Automatic maintenance. S3 Tables runs background compaction (with configurable target file sizes), snapshot expiration, and unreferenced file removal without operator involvement [Source: https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/]. The “small file problem” (millions of tiny Parquet files killing query performance) disappears as a maintenance task.
- Native integrations. Tables are auto-registered in the AWS Glue Data Catalog and queryable from Athena, EMR, Spark, Redshift, Kinesis Data Firehose, and QuickSight without bespoke catalog plumbing [Source: https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/].
The trade-off is a small storage premium over S3 Standard plus per-request maintenance fees, generally offset by better query economics and reduced engineering toil [Source: https://www.singlestore.com/blog/aws-re-invent-recap-2024/].
Lifecycle policies and intelligent tiering
For data that is not in S3 Tables, you control cost through lifecycle policies — declarative rules that transition objects between storage classes based on age. A common lake lifecycle:
# Conceptual S3 lifecycle policy
- Day 0-30: S3 Standard (hot ETL output)
- Day 30-90: S3 Standard-IA (occasional ad-hoc queries)
- Day 90-365: S3 Glacier Instant Retrieval (audit access)
- Day 365+: S3 Glacier Deep Archive (compliance only)
- Day 2555 (7 years): Delete
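One way to express the conceptual policy above is as the rule dictionary that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The sketch below only builds and inspects that dictionary; the rule ID, the `raw/` prefix, and the `storage_class_at` helper are assumptions for illustration, and nothing here calls AWS.

```python
# Rule in the shape expected by
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "lake-raw-tiering",        # hypothetical rule name
            "Filter": {"Prefix": "raw/"},    # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},    # roughly 7 years
        }
    ]
}


def storage_class_at(age_days, rule):
    """Local helper: which class an object of a given age would be in."""
    if age_days >= rule["Expiration"]["Days"]:
        return "DELETED"
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
classes = [storage_class_at(d, rule) for d in (10, 45, 200, 1000, 2555)]
```

The helper is useful for sanity-checking a policy before applying it, for example to confirm that audit-age data is still in a millisecond-access class.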
If you don’t know your access pattern — common for new pipelines — S3 Intelligent-Tiering monitors object access and moves objects between Frequent, Infrequent, and Archive tiers automatically. There’s a small monitoring fee per object, but it’s worth it when you’d otherwise leave everything in Standard out of caution.
Figure 3.4: S3 storage class lifecycle transitions for a typical lake
stateDiagram-v2
[*] --> S3_Standard: Object created (Day 0)
S3_Standard --> S3_Standard_IA: Day 30 (infrequent access)
S3_Standard_IA --> Glacier_Instant: Day 90 (audit-only access)
Glacier_Instant --> Glacier_Deep_Archive: Day 365 (compliance only)
Glacier_Deep_Archive --> [*]: Day 2555 (delete after 7 years)
S3_Standard --> Intelligent_Tiering: Unknown access pattern
Intelligent_Tiering --> Intelligent_Tiering: Auto-move Frequent/Infrequent/Archive
A real-world analogy: lifecycle policies are the household rule “after a year in the closet, it goes to the attic; after five years in the attic, it goes to the storage unit.” You only pay for the climate-controlled closet for things you actually wear.
Key Takeaway: S3 is not one storage class but a tiered family, and getting the tiers right is the largest cost lever in most lakes. For Iceberg specifically, S3 Tables delivers up to 3x query throughput and 10x TPS over self-managed Iceberg by automating compaction, snapshot expiry, and unreferenced file cleanup as a first-class AWS service.
Partitioning Strategies
A 10 TB table will respond to a “last 24 hours” query in seconds or hours depending entirely on how it’s partitioned. Partitioning is the practice of physically grouping data files by the values of one or more columns so that queries with filters on those columns can prune entire directories without reading them.
Hive-style partitioning
The classical scheme, inherited from Apache Hive, encodes partition values directly into the directory path:
s3://lake/orders/
region=us/year=2026/month=05/day=07/part-0001.parquet
region=us/year=2026/month=05/day=07/part-0002.parquet
region=eu/year=2026/month=05/day=07/part-0001.parquet
A query with WHERE region='us' AND year=2026 AND month=5 only lists those directories. The query engine never even sees the EU files. Hive partitioning is simple, universal, and works on any object store, which is why it has been the default for over a decade.
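Pruning on Hive-style paths amounts to parsing `key=value` segments and discarding non-matching prefixes. The sketch below does this over the three example paths; `partition_values` and `prune` are hypothetical helpers, and the comparison is done on strings, so the month is passed as "05" to match the path (a real engine would cast the value to the partition column's type).

```python
paths = [
    "s3://lake/orders/region=us/year=2026/month=05/day=07/part-0001.parquet",
    "s3://lake/orders/region=us/year=2026/month=05/day=07/part-0002.parquet",
    "s3://lake/orders/region=eu/year=2026/month=05/day=07/part-0001.parquet",
]


def partition_values(path):
    """Extract {column: value} from key=value path segments."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(candidates, **filters):
    """Keep only files whose partition values match every filter."""
    wanted = {k: str(v) for k, v in filters.items()}
    return [
        p for p in candidates
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


us_files = prune(paths, region="us", year=2026, month="05")
```

A filter on a non-partition column would match nothing in `partition_values`, which mirrors how pruning silently stops helping when queries omit the partition columns.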
The downsides are real. Users must include the partition columns explicitly in WHERE clauses, or pruning silently fails and a full scan occurs. Partition values are tied to physical layout, so changing the partition scheme — say, from daily to hourly — requires rewriting all historical data. And too-fine partitioning produces millions of tiny files that crush metadata and query startup time.
Hidden partitioning in Iceberg
Apache Iceberg solves the awkward parts with hidden partitioning. The table’s partition spec is stored in metadata, not in the directory layout, and Iceberg automatically derives partition values from source columns using transforms [Source: https://iceberg.apache.org/docs/latest/]:
-- Define partitioning by day, derived from a timestamp column (Spark SQL with Iceberg)
CREATE TABLE orders (
order_id BIGINT,
customer_id BIGINT,
order_ts TIMESTAMP,
amount DECIMAL(12,2)
) USING iceberg
PARTITIONED BY (days(order_ts));
-- Users write natural SQL, Iceberg prunes automatically.
-- A half-open range keeps all of May 7; BETWEEN ... AND '2026-05-07' would stop at midnight.
SELECT SUM(amount)
FROM orders
WHERE order_ts >= TIMESTAMP '2026-05-01'
AND order_ts < TIMESTAMP '2026-05-08';
The user never types WHERE day_partition = ...; Iceberg figures out which partitions are relevant from the natural filter on order_ts. Available transforms include years, months, days, hours, bucket(N, col), and truncate(N, col), and the partition spec can evolve over time without rewriting old data [Source: https://iceberg.apache.org/docs/latest/].
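The day transform can be pictured as mapping each timestamp to the number of days since 1970-01-01, with the planner applying the same mapping to the query's bounds. The sketch below assumes a tiny hypothetical file index keyed by that derived value; `days_transform` and `plan_scan` are illustrative names, not Iceberg functions.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def days_transform(ts):
    """Derive the partition value: whole days since the Unix epoch."""
    return (ts.date() - EPOCH).days


# Hypothetical metadata: derived partition value -> data files.
files = {
    days_transform(datetime(2026, 4, 30, 23, 59)): ["a.parquet"],
    days_transform(datetime(2026, 5, 1, 8, 0)): ["b.parquet"],
    days_transform(datetime(2026, 5, 7, 17, 30)): ["c.parquet"],
    days_transform(datetime(2026, 5, 8, 0, 5)): ["d.parquet"],
}


def plan_scan(file_index, start, end):
    """Translate a timestamp range into the partitions that must be read."""
    lo, hi = days_transform(start), days_transform(end)
    return sorted(f for day, fs in file_index.items() if lo <= day <= hi for f in fs)


scan = plan_scan(files, datetime(2026, 5, 1), datetime(2026, 5, 7, 23, 59, 59))
```

The query author filters only on the timestamp; the derived day value never appears in the SQL, which is the sense in which the partitioning is hidden.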
A real-world analogy: Hive partitioning is like asking customers to write the warehouse aisle number on every order. Hidden partitioning is like the warehouse figuring out the aisle from the SKU automatically — and being free to reorganize the aisles tomorrow without re-printing every order form.
Bucketing, clustering, and Z-ordering
Partitioning works best for low-cardinality columns (region, day, status). For high-cardinality columns like customer_id or order_id, directory partitioning is impractical: you would get millions of tiny partitions. The answer is bucketing, clustering, and Z-ordering.
Bucketing distributes rows across a fixed number of buckets using a hash of one or more columns. A query filtering on customer_id = 12345 only needs to scan the one bucket where that customer’s hash lands. Bucketing is great for joins on the bucketed column (the engine can do bucket-by-bucket joins without shuffles) and for point lookups.
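A small sketch of the idea, assuming a CRC32 hash as a stand-in for whatever hash function a real engine uses (the function and variable names here are illustrative):

```python
import zlib

NUM_BUCKETS = 16


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# Write path: every row lands in the bucket chosen by its customer_id.
orders = [{"customer_id": cid, "amount": cid * 1.5} for cid in range(1, 1001)]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in orders:
    buckets[bucket_for(row["customer_id"])].append(row)

# Read path: a point lookup only scans the single bucket the key hashes to.
target = 345
candidate_rows = buckets[bucket_for(target)]
matches = [r for r in candidate_rows if r["customer_id"] == target]
```

If a second table is bucketed on the same column with the same hash and bucket count, matching keys are guaranteed to share a bucket number, which is what allows the bucket-by-bucket join described above.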
Clustering physically sorts rows within each file (or partition) by a column, so min/max statistics become tight and predicate pushdown becomes effective. If orders within each day-partition are sorted by customer_id, then a filter on customer_id can skip most row groups via Parquet’s min/max metadata.
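The effect of sorting on min/max statistics can be shown with a toy model of row groups. This is a simplified sketch, not Parquet's actual reader logic; `RowGroup` and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    min_id: int
    max_id: int
    ids: List[int]


def build_row_groups(ids, group_size):
    """Chunk ids into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(ids), group_size):
        chunk = ids[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def groups_to_read(groups, target):
    """A group can be skipped when the target falls outside its min/max range."""
    return [g for g in groups if g.min_id <= target <= g.max_id]


# 100 customer ids in a scrambled (but deterministic) order.
scrambled = [(i * 37) % 100 for i in range(100)]

unsorted_groups = build_row_groups(scrambled, group_size=10)
sorted_groups = build_row_groups(sorted(scrambled), group_size=10)

reads_unsorted = len(groups_to_read(unsorted_groups, 50))
reads_sorted = len(groups_to_read(sorted_groups, 50))
```

With scrambled data, most groups span nearly the whole id range, so the statistics rule almost nothing out. After sorting, each group covers a narrow range and the lookup touches a single group.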
Z-ordering is a clustering technique that interleaves the bits of multiple columns into a single space-filling curve, so files become physically clustered on multiple dimensions simultaneously. It’s most associated with Delta Lake but available in other engines too. If you Z-order a Delta table by (country, customer_id), queries filtering on either column — or both — get substantial pruning. Without Z-ordering, sorting by country then customer_id only helps queries that filter by country first.
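The bit interleaving at the heart of Z-ordering is easy to sketch for two small non-negative integers. Real engines first map arbitrary column values to comparable integers; this example skips that step and the function name is illustrative.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low bits of x and y into one Z-order (Morton) value.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sort a 4x4 grid of (x, y) points by Z-value, then cut it into "files" of 4 rows.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

# Each file ends up covering a narrow range in BOTH dimensions.
file_ranges = [
    (min(p[0] for p in f), max(p[0] for p in f), min(p[1] for p in f), max(p[1] for p in f))
    for f in files
]
```

Each file here covers a 2x2 square of the grid, so its min/max statistics are tight on both columns. A plain sort by x then y would give files that are narrow in x but span the full range of y.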
| Technique | Best For | Cardinality | Mechanism |
|---|---|---|---|
| Hive partitioning | Coarse filters on a few columns | Low (10s-1000s) | Directory paths |
| Hidden partitioning (Iceberg) | Same as Hive but with evolution | Low | Metadata + transforms |
| Bucketing | Point lookups, equi-joins | High | Hash to fixed buckets |
| Clustering / sort | Range queries on one dominant column | High | Sort within file |
| Z-ordering | Multi-column range filters | High, multi-dim | Interleaved bit clustering |
A worked example: imagine a 5 TB clickstream table queried mostly by event_date, occasionally by user_id, and sometimes by country. A solid layout is:
- Partition by days(event_ts) (Iceberg hidden partitioning): coarse pruning on the dominant filter.
- Z-order within each partition by (user_id, country): second-dimension pruning via tight min/max stats.
- Use a target file size of 128-512 MB to keep file count manageable.
Queries filtering on date alone hit the partition prune. Queries also filtering on user_id additionally skip most files within the day. And the metadata stays small because there are at most a few thousand date partitions, not millions of user partitions.
Figure 3.5: Layered pruning for a clickstream query: partition + Z-order + row group
flowchart TD
Q["Query: WHERE event_ts in last 24h AND user_id = 12345"]
Q --> P["Partition prune: keep day=2026-05-07"]
P --> Z["Z-order prune: skip files whose (user_id, country) range excludes 12345"]
Z --> RG["Parquet row-group prune: skip groups via min/max stats"]
RG --> Page["Page-level scan: read only matching pages"]
Page --> Result["Return matching rows"]
P -. "skips ~99% of partitions" .-> Skip1["Skipped TB"]
Z -. "skips most surviving files" .-> Skip2["Skipped GB"]
RG -. "skips ~30% on selective filters" .-> Skip3["Skipped MB"]
Key Takeaway: Partitioning prunes at directory granularity; clustering and Z-ordering prune at file and row-group granularity. Combine coarse partitioning on the dominant low-cardinality filter (often time) with Z-ordering on the high-cardinality query columns to make billion-row tables feel like indexed databases.
Chapter Summary
Storage is a stack of decisions, not a single choice. At the bottom, file formats determine how bytes are physically arranged: row-oriented formats like CSV, JSON, and Avro are great for ingestion and whole-record access, while columnar formats like Parquet and ORC dominate analytics through compression, column projection, and predicate pushdown — typically delivering 5-40x size reductions and 7-100x query speedups [Source: https://last9.io/blog/parquet-vs-csv/] [Source: https://www.cloudera.com/blog/technical/benchmarking-apache-parquet-the-allstate-experience.html].
On top of those files, open table formats add the ACID, schema evolution, and time travel that turn a pile of Parquet into something that behaves like a real database table. Apache Iceberg leads on multi-engine flexibility and aggressive schema/partition evolution, Delta Lake leads inside the Databricks/Spark ecosystem, and Apache Hudi leads on streaming upserts and CDC [Source: https://iceberg.apache.org/docs/latest/] [Source: https://docs.delta.io/latest/index.html] [Source: https://hudi.apache.org/docs/overview/].
The object storage layer underneath needs deliberate tiering. S3’s storage class family — Standard, Intelligent-Tiering, Standard-IA, Glacier variants — controls cost dramatically when paired with lifecycle policies, and Amazon S3 Tables offers a fully managed Iceberg service with up to 3x throughput and 10x TPS by automating the compaction and maintenance work that historically broke teams’ Iceberg deployments [Source: https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/].
Finally, partitioning strategies decide how queries actually find their data. Hive-style partitioning is the universal default; Iceberg’s hidden partitioning makes it user-friendly and evolvable; bucketing, clustering, and Z-ordering extend pruning to high-cardinality columns where directory partitioning fails [Source: https://iceberg.apache.org/docs/latest/].
The next chapters will build on this foundation: query engines that exploit Parquet pushdown and Iceberg metadata, ingestion patterns that produce well-formed files, and orchestration that keeps the whole stack maintained.
Key Terms
| Term | Definition |
|---|---|
| Parquet | An open columnar file format organizing data into row groups, column chunks, and pages, with rich metadata enabling predicate pushdown and aggressive compression; the de facto standard for analytical lake storage. |
| ORC | Optimized Row Columnar, a columnar file format similar in goals to Parquet but historically associated with the Hive ecosystem; provides columnar compression and stripe-level indexing. |
| Apache Iceberg | An open table format that uses a tree of metadata files (a JSON table-metadata file pointing to Avro manifest lists and manifests) over Parquet/ORC data files to provide ACID transactions, time travel, hidden partitioning, and full schema evolution backed by stable column IDs. |
| Delta Lake | An open table format that uses a _delta_log transaction log directory to provide ACID transactions, time travel, and schema enforcement; tightly integrated with Databricks and Spark. |
| Apache Hudi | An open table format with a Timeline-based metadata model and copy-on-write / merge-on-read table types optimized for streaming upserts, deletes, and change data capture workloads. |
| ACID | Atomicity, Consistency, Isolation, Durability — the four guarantees that make multi-step writes safe; provided by open table formats on top of object storage through metadata commits and optimistic concurrency. |
| partitioning | The practice of physically grouping data files by column values (e.g., date, region) so that queries with filters on those columns can prune entire directories or partitions without reading them. |
| S3 Tables | A fully managed Apache Iceberg service launched by AWS at re:Invent 2024 using a new “table bucket” resource type; delivers up to 3x query throughput and 10x TPS versus self-managed Iceberg with automatic compaction and maintenance. |
| Z-ordering | A multi-dimensional clustering technique that interleaves the bits of multiple columns into a space-filling curve, producing physical clustering on several columns at once and enabling effective pruning on multi-column filters. |
| predicate pushdown | A query optimization where filters are evaluated against file/row-group/page-level statistics (min/max, bloom filters) so irrelevant data is skipped without being read; the primary reason columnar formats deliver 7-100x speedups. |
Chapter 4: Batch Ingestion and ETL/ELT Pipelines
Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate ETL from ELT and choose the appropriate pattern for cloud warehouses such as Snowflake, BigQuery, and Redshift.
- Build a batch ingestion pipeline using AWS Glue, including crawlers, the Data Catalog, and Spark-backed jobs.
- Apply incremental ingestion patterns: Change Data Capture (CDC), watermarking, and Glue job bookmarks.
- Design idempotent and replayable batch jobs that survive partial failures, late-arriving data, and schema drift.
Batch ingestion is the workhorse of most data platforms. Streaming gets the headlines, but the majority of analytics workloads, financial reconciliations, machine-learning training sets, and regulatory reports are still produced by jobs that run on a schedule, read a bounded slice of data, and write it somewhere downstream. The job of a data engineer is to make those batches reliable, cheap, and correct even when sources misbehave. This chapter walks through the design choices and the AWS Glue tooling that make that possible.
ETL vs ELT in Modern Architectures
For thirty years, “ETL” was synonymous with data integration. A dedicated server pulled rows out of source systems, reshaped them in memory or in a staging database, and wrote the polished result into a warehouse. Cloud warehouses inverted that flow. Today, most pipelines load raw data first and transform it inside the warehouse, an arrangement called ELT. Knowing when to use each pattern, and where the boundary lies, is the most consequential architectural decision in a batch pipeline.
Why ELT dominates in cloud warehouses
ETL (Extract, Transform, Load) performs transformations in an intermediate system before loading the final shape into the warehouse. ELT (Extract, Load, Transform) loads raw data first and uses the warehouse’s own compute to transform it on demand [Source: https://www.cliffsnotes.com/study-notes/28411172]. The difference looks superficial but the economics are very different.
Cloud warehouses favor ELT for three reinforcing reasons:
- Compute elasticity. Snowflake, BigQuery, and Redshift can scale compute up to handle a transform and back down when it finishes. A dedicated ETL server, by contrast, costs the same whether it is processing 100 GB or sitting idle [Source: https://www.scribd.com/document/817767064/Page-12-of-25].
- Pay-per-use billing. Snowflake charges per second of warehouse compute. BigQuery charges per byte scanned. Redshift offers reserved or serverless capacity. These models reward “transform when needed” rather than “transform always.”
- Schema agility. Loading raw data first means new fields land automatically and historical reprocessing is a query, not a redeployment.
The performance gap can be dramatic. Loading 100 GB of clickstream data with traditional ETL means an intermediate server has to extract, join, aggregate, and then push the result into the warehouse, often over hours. The ELT version loads the raw 100 GB into Snowflake in minutes and applies the same transformations using the warehouse’s massively parallel compute [Source: https://www.scribd.com/document/817767064/Page-12-of-25].
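The ELT shape can be illustrated at toy scale with SQLite standing in for the warehouse. This is only a sketch of the load-then-transform ordering; the table and column names are hypothetical, and a real warehouse would run the same kind of SQL on distributed compute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw events untouched, everything as text.
conn.execute("CREATE TABLE raw_clicks (user_id TEXT, url TEXT, event_ts TEXT)")
conn.executemany(
    "INSERT INTO raw_clicks VALUES (?, ?, ?)",
    [
        ("u1", "/a", "2026-05-07T10:00:00"),
        ("u1", "/b", "2026-05-07T11:00:00"),
        ("u2", "/a", "2026-05-07T12:00:00"),
        ("u1", "/a", "2026-05-08T09:00:00"),
    ],
)

# Transform: the shaping happens inside the engine, as SQL over the raw table.
conn.execute(
    """
    CREATE TABLE daily_clicks AS
    SELECT substr(event_ts, 1, 10) AS event_date,
           user_id,
           COUNT(*) AS clicks
    FROM raw_clicks
    GROUP BY substr(event_ts, 1, 10), user_id
    """
)

daily = conn.execute(
    "SELECT event_date, user_id, clicks FROM daily_clicks ORDER BY event_date, user_id"
).fetchall()
```

Because raw_clicks is preserved, a new aggregation (say, clicks per URL) is one more query against data that is already loaded, with no change to the extraction step.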
Figure 4.1: ETL vs ELT data flow comparison
flowchart LR
subgraph ETL["ETL Pattern"]
S1[Source DB] --> E1[Extract]
E1 --> T1[Transform<br/>on intermediate server]
T1 --> L1[Load shaped data]
L1 --> W1[(Warehouse)]
end
subgraph ELT["ELT Pattern"]
S2[Source DB] --> E2[Extract]
E2 --> L2[Load raw data]
L2 --> W2[(Warehouse)]
W2 --> T2[Transform in place<br/>using warehouse compute]
T2 --> W2
end
| Factor | ETL | ELT |
|---|---|---|
| Processing power | Bound by intermediate server | Warehouse auto-scales |
| Data movement | Multiple hops, repeated I/O | Single load, transform in place |
| Latency on large datasets | Bottlenecked at transform layer | Parallelized across warehouse compute |
| Schema changes | Pipeline redeploy + backfill | New fields ride along automatically |
| Cost when idle | 24/7 server cost | Storage only |
Analogy. ETL is a meal-kit company that chops your vegetables in a central kitchen and ships pre-prepped boxes. ELT is a grocery delivery service: the raw ingredients arrive in your kitchen, and you decide what to make tonight. The grocery model wastes nothing if your menu changes; the meal kit forces an upstream change every time you want a new dish.
Each cloud warehouse has features that lean directly into ELT:
- Snowflake offers a VARIANT data type that stores semi-structured JSON or Parquet natively. You can land a JSON blob and use LATERAL FLATTEN() at query time to project columns out of it, no transform pipeline required.
- BigQuery supports nested and repeated fields. A row can contain an ARRAY<STRUCT> of events, and CROSS JOIN UNNEST(events) projects them on demand. Storage is cheap, and the engine prunes columns and partitions automatically.
- Redshift complements its native tables with Redshift Spectrum, which queries Parquet on S3 without first loading it. A common hybrid pattern is to load curated dimensions into Redshift while leaving fact tables as Spectrum external tables.
Key Takeaway: ELT wins in the cloud because elastic compute, pay-per-use billing, and schema-on-read each remove a constraint that ETL used to impose. Default to ELT when your destination is a modern cloud warehouse.
When ETL still applies: PII redaction, schema enforcement
ELT is the default, but it is not the universal answer. ETL still has a role wherever raw data must not be allowed to land in the warehouse in its original form. Three scenarios are common.
1. PII redaction and minimization. Privacy regulations such as GDPR and HIPAA require that personally identifiable information be processed only for legitimate purposes and stored only as long as necessary. If your warehouse cannot guarantee that no one will query a raw email or ssn column, you must transform that column before it arrives. Hashing, tokenization, or masking happens in an ETL stage, often inside a Glue job or a Lambda function, before the data is written to the warehouse zone that analysts can read.
2. Strict schema enforcement. Some downstream systems (regulatory feeds, financial reports, machine-learning feature stores) cannot tolerate optional fields, type coercion, or schema drift. ETL’s schema-on-write model rejects malformed records at the boundary instead of corrupting downstream tables [Source: https://www.scribd.com/document/817767064/Page-12-of-25]. If a record is missing a required field, the pipeline can quarantine it, page an engineer, or substitute a default. The warehouse never sees a half-formed row.
3. Cross-warehouse or air-gapped destinations. When data has to leave one cloud and enter another, or move between security zones, the transformation often has to happen in a neutral compute layer that can talk to both ends. AWS Glue, Apache Spark on EMR, or a vendor like Fivetran sits in that middle ground and shapes the data while it is in flight.
Most production pipelines are hybrid. The PII-bearing columns are hashed in flight (ETL); everything else is loaded raw and transformed inside the warehouse (ELT). The mental model is “ETL the unsafe, ELT the rest.”
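A minimal sketch of the "ETL the unsafe" stage, assuming a keyed hash (HMAC-SHA256) as the tokenization method and a hypothetical list of required fields. In practice the key would come from a secrets manager, and quarantined records would go to a restricted location because they still contain raw values.

```python
import hashlib
import hmac

# Assumption for illustration: in production this comes from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"
REQUIRED_FIELDS = ("order_id", "email", "amount")


def tokenize(value):
    """Replace a PII value with a keyed, deterministic, non-reversible token."""
    return hmac.new(SECRET_KEY, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def process(records):
    """Split records into (clean, quarantined) and tokenize PII in the clean set."""
    clean, quarantined = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            # Schema enforcement: malformed rows never reach the warehouse.
            quarantined.append({"record": record, "missing": missing})
            continue
        safe = dict(record)
        safe["email"] = tokenize(record["email"])
        clean.append(safe)
    return clean, quarantined


incoming = [
    {"order_id": 1, "email": "Ada@example.com", "amount": 42.0},
    {"order_id": 2, "email": "bob@example.com"},  # missing amount
]
clean_rows, quarantined_rows = process(incoming)
```

Because the token is deterministic, analysts can still join and count distinct customers on the tokenized column without ever seeing the underlying address.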
Key Takeaway: Choose ETL when raw data must not land in its original form, when downstream systems demand strict schemas, or when crossing trust boundaries. In every other case, prefer ELT.
The transformation layer: dbt and SQLMesh
If raw data lands in the warehouse, the transformations have to live somewhere. In the early days of ELT, teams wrote tangled stacks of stored procedures, scheduled SQL scripts, and ad-hoc cron jobs. The modern answer is a dedicated transformation framework. Two dominate.
dbt (data build tool) treats SQL transformations as software. Each model is a SELECT statement saved in a .sql file, plus a YAML file describing tests, descriptions, and dependencies. dbt compiles those files into a directed acyclic graph (DAG) of materializations, runs them in dependency order against the warehouse, and reports which tests passed and failed. Common features include incremental models that only process new rows, snapshots that capture slowly changing dimensions, and a generated documentation site.
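The dependency-ordered execution described above can be sketched with the standard library's topological sorter. This is not dbt's API; the model names and the dependency map are hypothetical and stand in for what a framework would derive from the references between models.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it selects from.
model_dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

# static_order() yields every model after all of its dependencies.
run_order = list(TopologicalSorter(model_dependencies).static_order())

for model in run_order:
    # A real framework would compile and execute the model's SQL here,
    # then run its tests before moving on to downstream models.
    pass
```

The sorter also raises an error if the graph contains a cycle, which is the same guarantee that lets a transformation framework refuse to run a project whose models depend on each other circularly.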
SQLMesh is a newer entrant that adds support for virtual data environments, semantic versioning of models, and a more rigorous handling of breaking changes. Where dbt re-runs the affected models, SQLMesh will compute and expose a “preview” environment that contains only the diff, then promote the changes once approved.
Both tools share a common philosophy: business logic is code, code lives in version control, code is tested before it is deployed, and the warehouse is the runtime. Without a transformation framework, ELT degenerates into the same maintenance nightmare that ETL was thirty years ago. With one, the warehouse becomes a managed application surface.
Key Takeaway: ELT pushes complexity into the warehouse, so a disciplined transformation framework like dbt or SQLMesh is non-optional. Treat SQL transformations as code: version-controlled, tested, and dependency-aware.
AWS Glue Deep Dive
AWS Glue is Amazon’s serverless data integration service. It bundles four capabilities you will use repeatedly: a managed metadata catalog, automatic schema crawlers, a visual ETL builder (Glue Studio), and a serverless Spark runtime that executes jobs. Together they cover the lifecycle of a batch pipeline from “we just got a folder of CSV files in S3” to “production-grade Parquet tables refreshing every hour” [Source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html].
Glue crawlers and Data Catalog
The AWS Glue Data Catalog is the central metadata repository for the Glue ecosystem. It stores table definitions, schemas, partition information, and pointers to the underlying data, and it is queryable by Athena, Redshift Spectrum, EMR, and other services. Think of it as the directory that lets disparate tools agree on what a table is and where its files live.
A Glue crawler is the discovery mechanism that populates the catalog. Point a crawler at an S3 prefix or a JDBC source, and it will:
- Classify the objects it finds (JSON, Parquet, CSV, Avro, ORC, fixed-width, custom classifiers).
- Infer the schema, including column names, types, and partition structure derived from S3 prefixes (year=2026/month=05/day=07/).
- Register the result as a table in a target Glue database, or update an existing table’s schema and partitions.
The crawler is what lets the rest of Glue treat raw S3 files as structured tables. A typical workflow looks like this:
Source: s3://my-data-lake/raw/orders/year=2026/month=05/...
↓
Glue Crawler scans + classifies as Parquet
↓
Glue Data Catalog: my_database.orders (columns, partitions)
↓
Athena, Glue jobs, Redshift Spectrum can all query "my_database.orders"
Two operational details matter. First, crawlers can be scheduled, but on a high-frequency-write lake the schedule has to balance freshness against API costs; many teams trigger crawlers from S3 event notifications instead. Second, crawlers infer types from samples, so an outlier file can change a column’s inferred type from int to string and break downstream queries. Schema versioning in the catalog mitigates this, but the practical answer is to write canonical Parquet from the producer side whenever possible.
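The outlier problem is easy to reproduce with a toy type-inference routine. This is a simplified illustration of sample-based inference in general, not the crawler's actual classifier logic; the function names are hypothetical.

```python
# Wider types rank higher; a column takes the widest type seen in its samples.
TYPE_RANK = {"int": 0, "double": 1, "string": 2}


def type_of(value):
    """Return the narrowest type that can represent a raw text value."""
    for name, caster in (("int", int), ("double", float)):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_column_type(samples):
    """Pick the widest type observed across all sampled values."""
    return max((type_of(v) for v in samples), key=TYPE_RANK.__getitem__)


clean_samples = ["101", "102", "103"]
with_outlier = ["101", "102", "N/A"]

clean_type = infer_column_type(clean_samples)
drifted_type = infer_column_type(with_outlier)
```

A single "N/A" is enough to widen the whole column to string, which is why writing typed Parquet at the producer is more robust than relying on inference over raw text files.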
Figure 4.2: Glue crawler-to-catalog-to-consumer pipeline
flowchart TD
S3[("S3 raw zone<br/>year=/month=/day=")] -->|scan + classify| C[Glue Crawler]
C -->|infer schema<br/>+ partitions| DC[(Glue Data Catalog<br/>my_database.orders)]
DC --> A[Athena queries]
DC --> J[Glue Spark jobs]
DC --> RS[Redshift Spectrum]
DC --> EMR[EMR / external tools]
classDef catalog fill:#1f3a5f,stroke:#58a6ff,color:#fff;
class DC catalog;
Key Takeaway: Crawlers turn raw files into queryable tables by populating the Data Catalog. Treat the catalog as your single source of truth for schema; without it, Glue jobs, Athena, and Spectrum cannot find your data.
Glue Studio visual ETL
Not every transformation deserves a hand-written Spark job. Glue Studio is a drag-and-drop interface for building ETL jobs visually. You wire up source nodes (Catalog table, S3, JDBC, Kinesis), transformation nodes (apply mapping, filter, join, drop fields, aggregate, custom SQL), and sink nodes (S3, Redshift, RDS), and Glue Studio generates the underlying PySpark or Scala code.
The visual builder is appropriate for:
- Straight-through column projections, type casts, and renames.
- Joining a fact table to a small dimension and writing the result as Parquet.
- Filtering by date or status and partitioning the output.
- Reviewing what a junior engineer is doing without reading 200 lines of Spark.
It is less appropriate for jobs with intricate window functions, stateful computations, or custom Python libraries. For those, drop down to a Glue script or a notebook. The generated code is editable, so you can start visually and finish in code.
Glue Studio also exposes the same job parameters as code-based jobs, including worker type (G.1X, G.2X, G.4X, G.8X for memory-intensive work), number of workers, timeout, and bookmark behavior. Whatever the authoring surface, the runtime is the same Spark engine.
Key Takeaway: Glue Studio is the right authoring surface for routine column-mapping jobs and for review-friendly artifacts. Reach for hand-written Spark when the logic exceeds simple flow-graph transformations.
Glue Spark jobs and DynamicFrames
Under the hood, Glue jobs run on a managed Apache Spark cluster. AWS provisions executors automatically, scales them within configured limits, and tears them down when the job finishes. You pay per second of DPU (data processing unit) consumption [Source: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/].
The Glue API introduces an abstraction on top of Spark called the DynamicFrame. A DynamicFrame is similar to a Spark DataFrame but with one critical difference: it tolerates schema variance per record [Source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html]. Where a DataFrame requires every row to conform to one fixed schema, a DynamicFrame tracks types per record using a structure called a “choice type.” If half the JSON documents in your bucket have an address.zip field as a string and the other half have it as an integer, a DataFrame has to settle on a single type for the column up front; a DynamicFrame records both possibilities and lets you resolve the ambiguity later with a ResolveChoice transform.
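The idea behind a choice type can be sketched in plain Python: record which types were actually observed for a field, then resolve the ambiguity with an explicit cast. This mimics the concept only and does not use the Glue API; the function names are hypothetical.

```python
from collections import Counter


def observed_types(records, field):
    """Count the Python types seen for one field across all records."""
    return Counter(type(r[field]).__name__ for r in records if field in r)


def resolve_by_cast(records, field, caster):
    """Resolve the ambiguity by casting every occurrence of the field to one type."""
    resolved = []
    for record in records:
        fixed = dict(record)
        if field in fixed:
            fixed[field] = caster(fixed[field])
        resolved.append(fixed)
    return resolved


documents = [
    {"name": "Ada", "zip": "98101"},
    {"name": "Bob", "zip": 98101},
    {"name": "Cy", "zip": "10001"},
]

before = observed_types(documents, "zip")
after = observed_types(resolve_by_cast(documents, "zip", str), "zip")
```

The first count shows two competing types for the same field, which is the situation a choice type represents. Casting to a single type afterwards plays the role that a cast-style ResolveChoice action plays in a Glue job.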
| Aspect | DynamicFrame | Spark DataFrame |
|---|---|---|
| Schema strictness | Tolerates variance per record | Requires uniform schema |
| Native sources | Glue Catalog, S3, JDBC | Standard Spark sources |
| Built-in transforms | ApplyMapping, ResolveChoice, Relationalize, DropNullFields | Standard Spark API |
| Best for | Heterogeneous, semi-structured data | Cleaned, conforming data |
The Relationalize transform is the killer feature that earns DynamicFrames their place. Given a deeply nested JSON structure, Relationalize walks the tree and produces a set of relational tables joined by surrogate keys, exactly what you need to load nested data into a relational warehouse [Source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html]. Doing the same job in plain Spark requires hand-written explode and selectExpr calls. Once your data is clean and conforming, you can convert a DynamicFrame to a DataFrame with .toDF() and use standard Spark SQL.
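To show what that flattening produces, here is a one-level sketch in plain Python. It is not the Glue implementation: it handles only top-level structs and arrays, and the surrogate key name `_row_id` and the function name are assumptions made for the example.

```python
def flatten_to_tables(records, root_name):
    """Split nested records into a root table plus one child table per array field."""
    tables = {root_name: []}
    for row_id, record in enumerate(records):
        root_row = {"_row_id": row_id}
        for key, value in record.items():
            if isinstance(value, list):
                # Arrays become rows in a child table keyed back to the parent.
                child = tables.setdefault(f"{root_name}_{key}", [])
                for index, item in enumerate(value):
                    child_row = {"_row_id": row_id, "index": index}
                    if isinstance(item, dict):
                        child_row.update(item)
                    else:
                        child_row["val"] = item
                    child.append(child_row)
            elif isinstance(value, dict):
                # Structs are flattened into dotted column names on the parent.
                for sub_key, sub_value in value.items():
                    root_row[f"{key}.{sub_key}"] = sub_value
            else:
                root_row[key] = value
        tables[root_name].append(root_row)
    return tables


persons = [
    {"name": "Ada", "contact": {"city": "Seattle"}, "links": [{"url": "a"}, {"url": "b"}]},
    {"name": "Bob", "links": []},
]
tables = flatten_to_tables(persons, "persons_root")
```

The result is a persons_root table with scalar columns and a persons_root_links table whose rows join back on the surrogate key, which is the shape a relational warehouse can load directly.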
A complete worked example, adapted from the AWS legislators sample [Source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# 1. Extract: read the catalog table populated by the crawler.
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json"
)
# 2. Transform: rename the fields we want to keep.
# Note: apply_mapping keeps only the fields listed here, so any nested
# field you want Relationalize to split out must be mapped as well.
persons_mapped = persons.apply_mapping([
    ("family_name", "string", "last_name", "string"),
    ("given_name", "string", "first_name", "string"),
    ("id", "string", "person_id", "string"),
])
# Relationalize returns a collection of frames: the root table plus one
# child table per nested array, staged under the given S3 path.
relational = persons_mapped.relationalize(
    "persons_root",
    "s3://my-bucket/temp/"
)
# 3. Load: write the root table as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=relational.select("persons_root"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/persons/"},
    format="parquet"
)
job.commit()
The job.commit() call is more than a polite goodbye. When job bookmarks are enabled, it is the signal that tells Glue to advance the bookmark (covered later). Without it, the next run will reprocess the same data.
Three tuning levers are worth knowing before the first production deploy [Source: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/]:
- File grouping. S3 prefixes with thousands of small files create one Spark task per file, which puts memory pressure on the driver as it tracks mapStatus for every task. Setting the connection option groupFiles: 'inPartition' collapses many small files into a single task, dramatically reducing overhead.
- Adaptive Query Execution (AQE). Enabled by default in Glue 4.0+ and available with spark.sql.adaptive.enabled=true in Glue 3.0. AQE converts sort-merge joins to broadcast hash joins at runtime when one side turns out to be small, and rebalances skewed partitions. It is almost always a net win.
- Worker sizing. Default G.1X workers offer 4 vCPU and 16 GB. Move to G.2X or G.4X when you see executor OOMs, and check whether the OOM is caused by a skewed partition (one key dominates) before throwing more memory at the problem.
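To see why file grouping matters, consider a simple greedy packer that combines small files into tasks up to a target size. This is only an illustration of the effect on task count, not Glue's actual grouping algorithm; the function name and the 128 MB target are assumptions.

```python
MB = 1024 * 1024


def group_files(files, target_bytes):
    """Greedily pack (name, size) pairs into groups of at most target_bytes."""
    groups, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1,000 small files of 1 MB each under one prefix.
small_files = [(f"part-{i:04d}.json", 1 * MB) for i in range(1000)]

tasks_without_grouping = len(small_files)  # one task per file
tasks_with_grouping = len(group_files(small_files, 128 * MB))
```

The driver now tracks a handful of tasks instead of a thousand, which is where the memory savings come from.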
Key Takeaway: Glue Spark jobs run on managed Spark with DynamicFrames as the schema-flexible front door. Use Relationalize for nested data, enable file grouping and AQE for performance, and never forget job.commit().
Incremental Ingestion Patterns
Full reloads do not scale. A 10-billion-row source table cannot be re-extracted nightly without saturating the source database, the network, and the warehouse. Incremental ingestion is the practice of moving only what has changed since the last run. There are three families of approaches, and you will likely use all three across a single platform.
Change Data Capture (CDC) sources
Change Data Capture (CDC) is a technique that identifies and captures INSERT, UPDATE, and DELETE operations in source databases, enabling incremental synchronization rather than full reloads [Source: https://aws.amazon.com/video/watch/197175d22e9/]. Instead of asking “what is the current state of the table?” CDC asks “what events changed the table?” That difference, state vs. event, is what makes incremental ingestion correct.
Three CDC strategies dominate, in descending order of fidelity.
1. Log-based CDC reads the database’s transaction log directly. PostgreSQL exposes its Write-Ahead Log (WAL); MySQL exposes the binary log (binlog); Oracle exposes redo logs; SQL Server exposes its own change tracking tables. Specialized connectors (Debezium is the open-source standard, AWS DMS is the managed alternative) decode the log and emit a stream of change events with full INSERT/UPDATE/DELETE semantics, primary keys, before-and-after values, and transaction order.
Log-based CDC has three properties that make it the production default:
- Minimal performance impact on the source: it reads what the database is already writing instead of issuing queries against the tables.
- Captures every change reliably, including deletes, in transaction order.
- Supports both initial snapshots and ongoing increments.
The trade-off is operational complexity. You need a connector, a durable transport (usually Kafka), and a sink that can apply the events idempotently.
2. Trigger-based CDC installs database triggers that fire on INSERT/UPDATE/DELETE and write change rows to a side table. Consumers poll that table. Triggers work on every database engine and can capture business logic context, but they add latency to every write and can lose data if a trigger fails. Most teams avoid them when log-based options are available.
3. Query-based CDC asks the source for rows changed since a known point in time. It is the simplest pattern and the most common starting point. We cover it in the next sub-topic under the heading of watermarking.
A reference architecture for log-based CDC [Source: https://aws.amazon.com/video/watch/197175d22e9/]:
PostgreSQL (WAL) or MySQL (binlog)
↓
Debezium connector (or AWS DMS)
↓
Kafka topic (durable, replayable buffer)
↓
Stream processor (Flink / Spark Streaming / Glue)
↓
Warehouse upsert (MERGE on primary key)
AWS DMS, in particular, supports three modes: full load only (one-time snapshot), CDC only (changes after a known LSN), and full-load-plus-CDC (snapshot then continuous stream). The last mode is what you want for migrations and ongoing replication into a warehouse.
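The last step of that architecture, applying events idempotently, can be sketched as follows. The event shape (`lsn`, `op`, `key`, `after`) is a simplified, hypothetical envelope rather than the exact format Debezium or DMS emits, and a dict stands in for the warehouse table.

```python
def apply_events(table, applied_lsn, events):
    """Apply change events in log order; replays of already-applied events are no-ops.

    Returns the highest log sequence number applied so far.
    """
    for event in sorted(events, key=lambda e: e["lsn"]):
        if event["lsn"] <= applied_lsn:
            continue  # already applied on a previous run
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:
            # Inserts and updates both become an upsert on the primary key.
            table[event["key"]] = event["after"]
        applied_lsn = event["lsn"]
    return applied_lsn


change_events = [
    {"lsn": 1, "op": "insert", "key": 1, "after": {"name": "ada"}},
    {"lsn": 2, "op": "insert", "key": 2, "after": {"name": "bob"}},
    {"lsn": 3, "op": "update", "key": 1, "after": {"name": "ada l."}},
    {"lsn": 4, "op": "delete", "key": 2, "after": None},
]

warehouse_table = {}
position = apply_events(warehouse_table, 0, change_events)

# Replaying the same batch (for example after a consumer restart) changes nothing.
snapshot = dict(warehouse_table)
position_after_replay = apply_events(warehouse_table, position, change_events)
```

Tracking the last applied position alongside the data is what makes a replayable buffer such as Kafka safe to re-read after a failure.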
Figure 4.3: Log-based CDC reference architecture
flowchart LR
DB[("Source DB<br/>PostgreSQL WAL<br/>or MySQL binlog")] -->|read txn log| CN[Debezium / DMS<br/>connector]
CN -->|emit change events<br/>INSERT/UPDATE/DELETE| K[(Kafka topic<br/>durable, replayable)]
K --> SP[Stream processor<br/>Flink / Spark / Glue]
SP -->|MERGE on<br/>primary key| WH[(Warehouse<br/>upsert sink)]
Key Takeaway: Prefer log-based CDC (Debezium or DMS) for production change capture: it is the only pattern that captures deletes reliably, preserves transaction order, and avoids putting extra query load on the source database.
Watermarking and high-water-mark queries
When log-based CDC is impractical (an analytics-only data store, a third-party API without a binlog, or a small, infrequently changing table), query-based CDC with a watermark is the pragmatic choice. The pattern is old in spirit but precise in mechanics.
A watermark is a monotonically increasing column on the source: a last_modified_at timestamp, an auto-incrementing id, or a version field. The pipeline records the highest value it saw on the last run (the high-water mark, HWM) and asks the source for everything beyond that mark.
The query template:
SELECT *
FROM users
WHERE last_modified_at > :prev_hwm
AND last_modified_at <= :current_hwm
ORDER BY last_modified_at;
prev_hwm is loaded from the pipeline’s metadata store. current_hwm is captured at the start of the run, often now() - 5 minutes to give in-flight transactions time to commit. After the run succeeds, the metadata store advances prev_hwm to current_hwm.
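The full cycle can be sketched end to end with SQLite as the source and plain dicts as the metadata store and sink. The structure follows the template above; the function names, the in-memory state, and the fixed timestamps are assumptions made so the example is deterministic.

```python
import sqlite3


def extract_increment(conn, prev_hwm, current_hwm):
    """Fetch rows modified in the half-open window (prev_hwm, current_hwm]."""
    return conn.execute(
        "SELECT id, name, last_modified_at FROM users "
        "WHERE last_modified_at > ? AND last_modified_at <= ? "
        "ORDER BY last_modified_at",
        (prev_hwm, current_hwm),
    ).fetchall()


def run_once(conn, state, current_hwm, sink):
    """One pipeline run: extract, upsert into the sink, then advance the watermark."""
    rows = extract_increment(conn, state["prev_hwm"], current_hwm)
    for row_id, name, modified_at in rows:
        sink[row_id] = (name, modified_at)  # upsert keyed on the primary key
    state["prev_hwm"] = current_hwm  # only reached if the writes succeeded
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, last_modified_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ada", "2026-05-06T10:00:00"), (2, "bob", "2026-05-06T11:00:00")],
)

state = {"prev_hwm": "1970-01-01T00:00:00"}
warehouse = {}

first_run = run_once(conn, state, "2026-05-06T23:55:00", warehouse)

conn.execute(
    "UPDATE users SET name = 'bobby', last_modified_at = '2026-05-07T09:00:00' WHERE id = 2"
)
second_run = run_once(conn, state, "2026-05-07T23:55:00", warehouse)
```

ISO-8601 timestamps stored as text compare correctly as strings, which keeps the example simple. If the write step raised an exception, the watermark would not advance and the same window would be retried on the next run.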
Figure 4.4: Watermark advancement state machine
stateDiagram-v2
[*] --> LoadPrevHWM
LoadPrevHWM: Load prev_hwm<br/>from metadata store
LoadPrevHWM --> CaptureCurrent
CaptureCurrent: Capture current_hwm<br/>(now() - lag window)
CaptureCurrent --> QuerySource
QuerySource: SELECT WHERE col > prev_hwm<br/>AND col <= current_hwm
QuerySource --> WriteSink
WriteSink: Idempotent write<br/>to warehouse
WriteSink --> Success
WriteSink --> Failure
Success: Advance prev_hwm := current_hwm
Success --> [*]
Failure: Leave prev_hwm unchanged<br/>(safe retry)
Failure --> LoadPrevHWM
The pattern’s strengths are simplicity and portability. It works against any database, any view, any REST API that exposes a sortable timestamp. The weaknesses are real and worth memorizing:
- No deletes. A row removed from the source vanishes silently because the watermark query cannot find what is no longer there. The fix is an is_deleted soft-delete flag in the source. If you cannot add one, you must periodically reconcile by full-comparing primary keys, an expensive escape hatch.
- Clock skew and late writes. If a source transaction commits with last_modified_at = T but is not visible until T + 30s, an HWM advanced past T will miss the row. Mitigations include watermark windows that lag behind real time, and overlap reads where prev_hwm is set slightly behind the previous current_hwm. The downstream upsert (next section) absorbs the duplicates.
- Full table scans. Without an index on the watermark column, every run reads the whole table. Index the column or partition the source by it.
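The periodic primary-key reconciliation mentioned under "No deletes" can be sketched as a set comparison. This is only an illustration: the function name reconcile_deletes and the in-memory dictionary standing in for the target table are assumptions, and a real implementation would compare keys in the warehouse rather than in Python memory.

```python
def reconcile_deletes(source_ids, target_rows):
    """Flag target rows whose primary key no longer exists in the source.

    source_ids: iterable of primary keys currently present in the source.
    target_rows: dict mapping primary key -> row dict (mutated in place).
    Returns the sorted list of keys newly flagged as deleted.
    """
    live = set(source_ids)
    flagged = []
    for pk, row in target_rows.items():
        # Only flag rows that are missing upstream and not already flagged.
        if pk not in live and not row.get("is_deleted", False):
            row["is_deleted"] = True
            flagged.append(pk)
    return sorted(flagged)
```

Running the function twice against the same inputs flags nothing the second time, so the reconciliation itself is safe to retry.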
| Watermark column | Pros | Cons |
|---|---|---|
| last_modified_at (timestamp) | Universal, easy to reason about | Clock skew, late writes |
| Auto-increment id | Strict monotonicity, no skew | Cannot detect updates, only inserts |
| Database lsn / sequence | Captures inserts and updates | Engine-specific, may not be queryable |
The right answer is often a combination: insert-only fact tables keyed on id, mutable dimension tables keyed on last_modified_at plus a soft-delete column, and a periodic full-compare to catch drift.
Key Takeaway: Watermarking is the simplest incremental pattern but has three structural weaknesses: missing deletes, clock skew, and table scans. Mitigate them with soft-deletes, watermark windows, and indexed columns, respectively.
Glue job bookmarks
When the source is S3 rather than a database, Glue offers a built-in incremental mechanism called job bookmarks [Source: https://aws.amazon.com/video/watch/197175d22e9/]. A bookmark is metadata Glue persists between job runs, recording which files (by path), timestamps, and row counts have already been processed. The next run automatically skips them.
A canonical use case:
Run 1 (Monday): Process files 1 to 100. Bookmark stores last_path=file_100.
Run 2 (Tuesday): Process files 101 to 150. Files 1 to 100 are skipped automatically.
Run 3 (Wednesday): No new files. Job runs but processes zero rows.
Bookmarks are enabled in the job configuration and require two cooperating calls in your script. The DynamicFrame must be created from a bookmark-aware source (create_dynamic_frame.from_catalog or from_options with transformation_ctx set), and the script must end with job.commit(). Without job.commit(), the bookmark is not advanced and the next run will reprocess the same files. With it, only new or modified files since the last successful commit are read.
Three operational notes are worth knowing:
- Bookmark scope is per transformation_ctx. If your job reads two sources, give each its own context name so their bookmarks advance independently.
- Bookmarks survive job edits. As long as the transformation_ctx does not change, you can edit the script and the bookmark persists across versions.
- Resetting a bookmark requires a deliberate operation. The Glue console offers “Reset bookmark” and “Run with bookmark disabled” options, useful for backfills.
A common pitfall: if a job partially fails and is rerun without job.commit() having run, the bookmark stays at the previous mark and the rerun reprocesses everything since. That is the correct behavior (reprocess rather than skip data), but it does mean partial-output cleanup is the engineer’s responsibility (next section).
Key Takeaway: Glue job bookmarks make incremental S3 ingestion automatic, but only when paired with job.commit() and stable transformation_ctx names. Treat them as the S3-side analog of a database watermark.
Reliability Patterns
Incremental ingestion is necessary but not sufficient. The same job will be retried after a network blip, replayed after a bad upstream change, and rerun after a DDL evolution that broke its assumptions. The reliability of a batch pipeline is determined less by its happy-path code than by what happens on the second, third, and fourth runs. Three patterns make those reruns safe.
Idempotent writes and exactly-once semantics
A write is idempotent when applying it twice produces the same result as applying it once. Idempotency is the single most important property of a reliable batch job, because partial failures, retries, and replays are not edge cases; they are the norm. Three implementation patterns deliver idempotency in practice.
1. Upsert with version guards. The strongest pattern for mutable rows is a MERGE statement keyed on the primary key, with a guard clause that ignores out-of-order updates [Source: https://aws.amazon.com/video/watch/197175d22e9/]:
MERGE INTO target_table t
USING staged_changes s
ON t.id = s.id
WHEN MATCHED AND (s.version > t.version OR t.version IS NULL) THEN
UPDATE SET t.value = s.value,
t.version = s.version,
t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (id, value, version, updated_at)
VALUES (s.id, s.value, s.version, s.updated_at);
The s.version > t.version clause is the load-bearing piece. It means a stale event arriving after a fresher one is silently discarded. Replays are safe; out-of-order CDC streams are safe.
Figure 4.5: Idempotent upsert decision flow with version guard
flowchart TD
Start([Incoming change<br/>s.id, s.version, s.value]) --> Lookup{Row exists<br/>in target?}
Lookup -->|No| Insert[INSERT new row<br/>id, value, version]
Lookup -->|Yes| Compare{s.version ><br/>t.version?}
Compare -->|Yes<br/>fresher event| Update[UPDATE value, version,<br/>updated_at]
Compare -->|No<br/>stale or duplicate| Skip[No-op<br/>discard silently]
Insert --> End([Commit])
Update --> End
Skip --> End
classDef safe fill:#1f3a5f,stroke:#58a6ff,color:#fff;
class Skip safe;
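The version guard can be exercised locally with a small sketch. It assumes SQLite as a stand-in for the warehouse, and because SQLite has no MERGE statement it uses INSERT ... ON CONFLICT DO UPDATE (available from SQLite 3.24) to express the same decision flow. The helper names make_target and apply_changes are invented for the example.

```python
import sqlite3

# Upsert with a version guard: the WHERE clause turns stale or duplicate
# changes into no-ops instead of overwriting fresher data.
UPSERT = """
INSERT INTO target_table (id, value, version, updated_at)
VALUES (:id, :value, :version, :updated_at)
ON CONFLICT(id) DO UPDATE SET
    value = excluded.value,
    version = excluded.version,
    updated_at = excluded.updated_at
WHERE excluded.version > target_table.version
   OR target_table.version IS NULL
"""


def make_target():
    """Create an in-memory target table keyed on id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE target_table ("
        " id INTEGER PRIMARY KEY, value TEXT, version INTEGER, updated_at TEXT)"
    )
    return conn


def apply_changes(conn, changes):
    """Apply a batch of change dicts in a single transaction."""
    with conn:
        conn.executemany(UPSERT, changes)
```

Applying version 2, then a late-arriving version 1, then version 2 again leaves the row at version 2, which is the replay-safe behavior the guard is meant to provide.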
2. Deduplication tables. When MERGE is unavailable (large append-only Parquet partitions, for example), you can maintain a small ledger of processed event IDs and filter incoming events against it:
INSERT INTO target_table
SELECT s.*
FROM staged_changes s
WHERE NOT EXISTS (
SELECT 1
FROM cdc_processed p
WHERE p.source_id = s.id
AND p.operation_sequence = s.seq
);
Pair the insert with an insert into cdc_processed inside the same transaction (or atomic write). Replays that find the row already in the ledger become no-ops.
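A minimal sketch of the ledger pattern follows, again assuming SQLite as a stand-in store. The table and column names follow the query above; the functions make_db and load_batch are hypothetical. The key point it tries to show is that the target insert and the ledger insert commit in the same transaction.

```python
import sqlite3


def make_db():
    """Create an in-memory target table and the processed-events ledger."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_table (id INTEGER, seq INTEGER, value TEXT)")
    conn.execute(
        "CREATE TABLE cdc_processed ("
        " source_id INTEGER, operation_sequence INTEGER,"
        " PRIMARY KEY (source_id, operation_sequence))"
    )
    return conn


def load_batch(conn, staged_changes):
    """Insert unseen (id, seq, value) changes; return how many were new."""
    inserted = 0
    with conn:  # target write and ledger write commit or roll back together
        for source_id, seq, value in staged_changes:
            already = conn.execute(
                "SELECT 1 FROM cdc_processed"
                " WHERE source_id = ? AND operation_sequence = ?",
                (source_id, seq),
            ).fetchone()
            if already:
                continue  # replayed event: the ledger makes it a no-op
            conn.execute(
                "INSERT INTO target_table VALUES (?, ?, ?)", (source_id, seq, value)
            )
            conn.execute(
                "INSERT INTO cdc_processed VALUES (?, ?)", (source_id, seq)
            )
            inserted += 1
    return inserted
```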
3. Atomic partition swaps. For partition-grained idempotency, write the new data to a staging path, validate it, and then swap the staging path into the table catalog atomically. If the run fails midway, the staging path is orphaned (and a janitor cleans it up); the live table is untouched.
| Pattern | When to use | Cost |
|---|---|---|
| Upsert with version guard | Mutable warehouse tables, CDC sinks | Index lookup per row |
| Dedup table | Append-only with event IDs | Extra storage and join |
| Atomic partition swap | Daily batch loads | Two-phase write, requires catalog support |
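The third pattern, the atomic partition swap, can be illustrated on a local filesystem. This is a simplified sketch: it assumes one JSON file per partition and uses os.replace as the atomic step, whereas on S3 there is no rename and the swap would typically be a catalog update that repoints the partition. The function publish_partition and the file naming are invented for the example.

```python
import json
import os


def publish_partition(table_dir, partition, rows, validate):
    """Write rows to a staging file, validate, then atomically swap it live.

    validate is a caller-supplied function returning True when rows are acceptable.
    Returns the path of the live partition file.
    """
    staging = os.path.join(table_dir, f"_staging_{partition}.json")
    live = os.path.join(table_dir, f"{partition}.json")
    with open(staging, "w", encoding="utf-8") as f:
        json.dump(rows, f)
    if not validate(rows):
        # In a real pipeline a janitor would clean up orphaned staging paths.
        os.remove(staging)
        raise ValueError("validation failed; live partition left untouched")
    # os.replace is atomic when both paths are on the same filesystem,
    # so readers see either the old file or the new one, never a mix.
    os.replace(staging, live)
    return live
```

Rerunning the same load simply replaces the live file with identical content, which is what makes the pattern idempotent at partition granularity.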
“Exactly-once” is a marketing term; “exactly-once-effective” is the engineering reality. Your job may execute twice, but the observable result is the same as if it had executed once. That is what idempotency buys you, and it is the strongest guarantee you should claim.
Key Takeaway: Idempotency is non-negotiable. Use upserts with version guards for mutable data, dedup tables for append-only data, and atomic partition swaps for daily loads. The goal is not “run exactly once” but “produce the same answer no matter how many times you run.”
Replayability via raw zone retention
A pipeline is replayable if you can reconstruct any past state of any downstream table by re-running transformations against retained raw inputs. Replayability is what saves you when (not if) a transformation has a bug, a business rule changes retroactively, or a regulator asks for a six-month-old report computed under the rules in force then.
The architectural foundation is the raw zone: an immutable, retention-tagged S3 prefix where every input file lands in its original form, with metadata recording when it arrived. The raw zone is append-only by policy; nothing in it is ever rewritten. Transformations read from the raw zone and write to a curated zone; the curated zone can be wiped and rebuilt from raw at any time.
A typical zone layout:
s3://lake/raw/ ← immutable, partitioned by ingest date
orders/ingest_date=2026-05-07/...
events/ingest_date=2026-05-07/...
s3://lake/curated/ ← rebuildable, partitioned by business date
orders_clean/order_date=2026-05-07/...
s3://lake/marts/ ← presentation, used by BI tools
fact_orders/...
Three operational practices keep replayability honest:
- Retention as policy, not afterthought. Use S3 Lifecycle rules to move raw partitions to Glacier after 90 days but never delete; or set Object Lock to enforce immutability for compliance. Decide retention by data class, not by storage cost intuition.
- Capture ingestion metadata. Stamp each raw file with ingest_timestamp, source_version, and pipeline_run_id. Replays can then ask “what would today’s transformation produce against the raw data we had on 2026-05-01?”
- Idempotent transformations. Replay only works if rerunning a transform on the same input produces the same output. Random IDs, “now()” timestamps, and external API calls inside transforms all break replayability. Replace them with deterministic equivalents (hash-based IDs, watermark timestamps, cached lookups).
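The deterministic equivalents in the last practice can be sketched as follows. The function names deterministic_id and transform, the field names, and the choice of SHA-256 truncated to 32 hex characters are all assumptions made for illustration.

```python
import hashlib


def deterministic_id(*natural_key_parts):
    """Derive a stable surrogate ID from the natural key instead of a random UUID."""
    # A separator character keeps ("a", "bc") and ("ab", "c") from colliding.
    joined = "\x1f".join(str(part) for part in natural_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:32]


def transform(raw_row, run_watermark):
    """Pure transform: the same raw row and watermark always yield the same output."""
    return {
        "order_sk": deterministic_id(raw_row["source"], raw_row["order_id"]),
        "amount": raw_row["amount"],
        # Use the run's watermark rather than now(), so replays are reproducible.
        "processed_at": run_watermark,
    }
```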
The combination of an immutable raw zone, idempotent transformations, and a transformation framework with versioned models (dbt, SQLMesh) gives you a platform where any past output can be regenerated. That capability turns “we have a bug” from a crisis into a deploy.
Key Takeaway: Treat the raw zone as your source of truth. Curated tables can always be rebuilt; raw data, once lost, cannot. Pair immutable storage with deterministic transforms and you can replay any historical state.
Schema evolution handling
Schema evolution is the inevitable change in the structure of incoming data: a new column appears, a type widens from int to bigint, an old column is renamed, an enum gains a value. A pipeline that breaks every time a producer makes a change is a pipeline that owns the producer’s roadmap. The goal is to absorb common changes automatically and surface only the changes that genuinely need engineering attention.
The taxonomy that matters in practice:
| Change type | Compatible? | Handling |
|---|---|---|
| Add nullable column | Yes (backward) | Auto-add, default to NULL |
| Add required column | No | Producer must coordinate; default value or version bump |
| Drop column | Sometimes (forward) | Keep reading, let column be NULL downstream |
| Rename column | No | Treat as drop + add; usually requires alias mapping |
| Widen type (int → bigint) | Yes | Promote downstream type |
| Narrow type (bigint → int) | No | Reject or quarantine |
| Add enum value | Depends | Allow if downstream uses string; gate if mapped to bounded type |
Three mechanisms handle most of these automatically:
- Glue crawlers detect new fields on each run and update the catalog table. Combined with Parquet’s column-level metadata, downstream queries can ignore unknown columns or pick them up immediately.
- DynamicFrame ResolveChoice lets you declare a strategy when a column has multiple inferred types in the same data: make_struct, cast, project, or match_catalog. A common pattern is to cast numeric ambiguity to the wider type and project JSON ambiguity to a string.
- Schema registries (AWS Glue Schema Registry, Confluent Schema Registry) version the contract between producers and consumers. Producers register a new version; consumers check compatibility before reading. Backward-compatible changes propagate silently; incompatible changes fail at registration time, not in production.
For producer-facing pipelines (events flowing from microservices into a lake), a registry is the right answer. For analyst-facing pipelines (curated tables consumed by dashboards), versioned dbt or SQLMesh models give you the same control with deployment gates.
The unsexy but critical practice is monitoring schema drift. A weekly report comparing today’s catalog schema to last week’s catches the slow accretion of fields that no one announced. Fields that no one explained are tomorrow’s incident.
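A drift report of this kind reduces to comparing two schema snapshots. The sketch below assumes each snapshot is a plain dictionary of column name to type string (for example, exported from the catalog); the function diff_schema, the WIDENING set, and the change labels are invented for the example and cover only a small part of the taxonomy above.

```python
# Type promotions treated as compatible in this sketch (an assumption, not a full list).
WIDENING = {("int", "bigint"), ("float", "double")}


def diff_schema(old, new):
    """Compare two {column: type} snapshots and classify each difference.

    Returns a list of (change, column, detail) tuples in a stable order.
    A rename shows up as a drop plus an add, as in the taxonomy table.
    """
    changes = []
    for col in sorted(new.keys() - old.keys()):
        changes.append(("added", col, new[col]))
    for col in sorted(old.keys() - new.keys()):
        changes.append(("dropped", col, old[col]))
    for col in sorted(old.keys() & new.keys()):
        if old[col] != new[col]:
            kind = "widened" if (old[col], new[col]) in WIDENING else "incompatible"
            changes.append((kind, col, f"{old[col]} -> {new[col]}"))
    return changes
```

Anything labeled incompatible or dropped is a candidate for human review; additions and widenings can usually be absorbed automatically.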
Key Takeaway: Schema evolution is continuous, not exceptional. Use crawlers to absorb compatible changes, ResolveChoice to handle ambiguity, and a schema registry to gate breaking changes at the producer boundary.
Chapter Summary
Batch pipelines look simple from a distance (“extract, transform, load”) and reveal subtlety up close. The first decision, ETL or ELT, hinges on whether the warehouse can be trusted with raw data; in modern cloud warehouses the answer is usually yes, with carve-outs for PII redaction and strict-schema destinations. The second decision is what to do inside the warehouse, and disciplined transformation frameworks like dbt and SQLMesh keep ELT from collapsing into spaghetti SQL.
AWS Glue gives you the components to build batch pipelines without managing infrastructure: crawlers discover schemas, the Data Catalog stores them, Glue Studio offers a visual authoring surface, and Spark-backed jobs with DynamicFrames handle the heavy lifting on heterogeneous data. The Relationalize transform, file grouping, and Adaptive Query Execution are the levers that turn a working job into a fast one.
Incremental ingestion is the difference between a pipeline that scales and one that does not. Log-based CDC (Debezium, AWS DMS) is the high-fidelity default for change capture; query-based CDC with watermarks is the simple, portable fallback; Glue job bookmarks handle S3-source increments automatically. Each pattern has its failure modes (missing deletes, clock skew, and partial-failure replays), and each is addressed by reliability patterns at the next layer.
Reliability rests on three foundations. Idempotent writes (upserts with version guards, dedup tables, atomic partition swaps) make retries safe. Replayability via an immutable raw zone and deterministic transformations means past outputs can always be regenerated. Schema evolution handling, with crawlers, ResolveChoice, and schema registries, absorbs the constant low-grade change that producers introduce. A pipeline that gets all three right can run for years across thousands of source changes and still produce correct results.
Key Terms
- ETL (Extract, Transform, Load): A data integration pattern in which data is transformed in an intermediate system before being loaded into the target warehouse. Enforces schema-on-write.
- ELT (Extract, Load, Transform): A data integration pattern in which raw data is loaded into the target warehouse first and transformed in place using the warehouse’s own compute. Enables schema-on-read and is the default in cloud architectures.
- AWS Glue: A managed serverless data integration service combining a metadata catalog, schema crawlers, a visual ETL builder (Glue Studio), and a Spark-based job runtime.
- Data Catalog: The central metadata repository in AWS Glue that stores table definitions, schemas, and partition information. Queryable by Athena, Redshift Spectrum, EMR, and Glue jobs.
- DynamicFrame: A Glue-native abstraction over Spark DataFrames that tolerates schema variance per record and offers built-in transforms such as ApplyMapping, ResolveChoice, and Relationalize.
- CDC (Change Data Capture): A technique for identifying and capturing INSERT, UPDATE, and DELETE operations in source databases to enable incremental synchronization. Implemented via log-based, trigger-based, or query-based approaches.
- Watermark: A monotonically increasing column (timestamp, sequence number, or ID) used to identify rows changed since the last pipeline run. The basis of query-based incremental ingestion.
- Idempotency: The property of a write operation such that applying it multiple times produces the same result as applying it once. Achieved via upserts with version guards, deduplication tables, or atomic partition swaps.
- Schema evolution: The process by which the structure of incoming data changes over time. Handled via crawlers, ResolveChoice transforms, and schema registries to keep pipelines stable across producer changes.
Chapter 5: Streaming Ingestion and Real-Time Pipelines
Learning Objectives
By the end of this chapter you will be able to:
- Compare Amazon Kinesis Data Streams, Amazon MSK, and Amazon Data Firehose by use case, pricing model, and delivery semantics.
- Build a streaming pipeline with Apache Flink on Amazon Managed Service for Apache Flink, including source and sink configuration, state backends, and checkpointing.
- Apply windowing, watermarks, and event-time semantics to streaming data, distinguishing tumbling, sliding, session, and global windows.
- Implement exactly-once delivery in streaming systems using barrier-based snapshots, replayable sources, and transactional or idempotent sinks.
The previous chapter built batch warehouses where data arrived in scheduled chunks. This chapter shifts the temporal axis: data now arrives continuously, and every decision — capacity, ordering, correctness, recovery — must be reconsidered through the lens of unbounded streams.
Streaming Foundations
A stream is an unbounded sequence of events ordered (loosely) by time. Streaming systems decouple producers from consumers via a durable log so fast publishers do not crush slow subscribers, and so consumers can replay history after failures. Three primitives — pub/sub, partitions, and offsets — appear everywhere from Kinesis to Kafka to Pulsar.
Pub/sub vs queue semantics
A traditional queue is a one-shot mailbox. A producer drops a message in; exactly one consumer takes it out and the message disappears. Think of a coffee shop ticket queue — once the barista calls “order 47,” the ticket is destroyed and no one else can claim that drink. Queues fan work out across workers but cannot replay history.
A pub/sub log is more like a bookshelf. Producers append events to the end; many consumer groups can each read from any position they choose, and reading does not erase the entry. New consumers can scan from the start; failed consumers can rewind to a known offset and re-read. This is the model used by Apache Kafka, Amazon Kinesis Data Streams, and Amazon MSK [Source: https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/].
With a queue you cannot reprocess yesterday’s traffic to fix a bug — the messages are gone. With a log you reset the consumer offset and replay. Modern data architectures prefer log-based ingestion because tomorrow you will want to add analytics, audit, and ML-feature consumers without re-engineering the producers.
| Property | Queue (e.g., SQS) | Pub/Sub Log (e.g., Kinesis, Kafka) |
|---|---|---|
| Read pattern | Consume-and-delete | Append-only, position-based read |
| Multiple consumers | Compete for messages | Independent groups, each tracks own offset |
| Replay | Generally no | Yes, within retention window |
| Ordering | Best-effort or per-group (FIFO queues) | Per-partition strict ordering |
| Typical use | Task distribution | Event sourcing, analytics, audit |
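The contrast in the table can be made concrete with two toy classes. These are in-memory illustrations only; WorkQueue and EventLog are invented names and do not correspond to the SQS, Kinesis, or Kafka client APIs.

```python
from collections import deque


class WorkQueue:
    """Consume-and-delete: each message is delivered once and then gone."""

    def __init__(self):
        self._items = deque()

    def send(self, message):
        self._items.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._items.popleft() if self._items else None


class EventLog:
    """Append-only log: reading never removes records; each group tracks an offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        # Rewinding the offset is all it takes to replay history.
        self._offsets[group] = offset
```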
Key Takeaway: Logs separate “how data is stored” from “who has read it,” enabling replay, multiple independent consumers, and time-travel debugging — capabilities that classical queues cannot provide.
Producers, consumers, partitions, and offsets
A producer is any process that writes events. A consumer reads them. The log between them is split into partitions (Kafka, MSK) or shards (Kinesis Data Streams) — units of parallelism that each maintain their own strictly ordered sequence of records [Source: https://www.confluent.io/blog/windowing-in-kafka-streams/].
When a producer sends a record, a partition key decides which partition it lands in. Records with the same key always land in the same partition, so they preserve order relative to each other. Records with different keys land in (potentially) different partitions and have no global order. This is the central trade-off of streaming systems: ordering is per-partition, never global, because global ordering would require a single bottleneck and destroy horizontal scalability.
Each record in a partition gets a monotonically increasing offset (Kafka) or sequence number (Kinesis). Consumers track their position — “I have processed up to offset 19,431 in partition 3” — and store that position somewhere durable so they can resume after a crash. In Kafka, offsets are stored in an internal __consumer_offsets topic. In Kinesis, the Kinesis Client Library (KCL) checkpoints sequence numbers to DynamoDB.
Consider a clickstream with 12 partitions keyed by user_id. Every event for alice@example.com lands in partition 7, so her “view → add to cart → checkout” sequence stays in order. Bob’s events land in partition 2. Twelve parallel consumer instances each own one partition, processing 1/12 of the load with full per-user ordering.
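Key-based routing can be sketched with a hash and a modulo. This is illustrative rather than the exact algorithm of either system: Kafka's default partitioner hashes the key bytes with murmur2, and Kinesis maps an MD5 hash of the partition key onto shard hash-key ranges. The functions partition_for and route are invented names, and the partition a given key lands in here will not match the numbers in the prose example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index, stable across calls and processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def route(events, num_partitions):
    """Append each event to the partition chosen by its user_id."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(event["user_id"], num_partitions)].append(event)
    return partitions
```

Because the same key always hashes to the same partition, one user's events keep their arrival order inside that partition even when other users' events are interleaved in the input.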
Figure 5.1: Pub/sub log fan-out — producers write keyed records into ordered partitions; multiple consumer groups read independently, each tracking its own offset.
flowchart LR
PA["Producer A<br/>(key=alice)"] --> P0
PB["Producer B<br/>(key=bob)"] --> P1
PC["Producer C<br/>(key=carol)"] --> P2
PD["Producer D<br/>(key=dan)"] --> P3
subgraph Log["Pub/Sub Log (per-partition strict ordering)"]
P0["Partition 0<br/>r0 r1 r2 ..."]
P1["Partition 1<br/>r0 r1 r2 ..."]
P2["Partition 2<br/>r0 r1 r2 ..."]
P3["Partition 3<br/>r0 r1 r2 ..."]
end
P0 --> CGX["Consumer Group X<br/>(analytics)<br/>offset=19431"]
P1 --> CGX
P2 --> CGY["Consumer Group Y<br/>(audit)<br/>offset=12005"]
P3 --> CGY
P0 -.replay.-> CGZ["Consumer Group Z<br/>(new ML feature)<br/>offset=0"]
P1 -.replay.-> CGZ
Producer A ──┐ ┌── Consumer Group X
Producer B ──┤── Partition 0: [r0, r1, r2, …] ──┤ (instance reads P0+P1)
│── Partition 1: [r0, r1, r2, …] ──┤
│── Partition 2: [r0, r1, r2, …] ──┤── Consumer Group Y
Producer C ──┘── Partition 3: [r0, r1, r2, …] ──┘ (different instance reads each)
(each partition is strictly ordered)
Key Takeaway: Partition keys decide both ordering (records with the same key are ordered) and parallelism (different keys can be processed in parallel), making partition-key design the most consequential schema decision in any streaming pipeline.
Backpressure and ordering guarantees
What happens when consumers are slower than producers? In a naive system, producers fill memory until the broker crashes. Real streaming systems apply backpressure — a feedback mechanism that slows producers (or buffers to disk) when consumers fall behind.
Kafka and MSK implement backpressure indirectly: brokers persist all incoming records to disk, and with acks=all producers receive ProduceResponse acknowledgments only after replication to the in-sync replicas. If brokers are overloaded, ack latency rises and clients throttle naturally. Kinesis Data Streams returns ProvisionedThroughputExceededException when a shard is saturated, forcing the producer SDK into exponential backoff [Source: https://www.confluent.io/blog/windowing-in-kafka-streams/].
Apache Flink propagates backpressure through its operator DAG: a slow sink fills its input buffer, the upstream operator’s output buffer fills, and so on back to the source — which stops fetching from Kafka. This chain prevents memory blow-up but means a single slow sink can stall a job, which is why async I/O, sink batching, and checkpoint timeout tuning matter.
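The effect of a bounded buffer between a fast source and a slow sink can be shown with a small discrete-time simulation. It is a toy model, not Flink's network stack: the function simulate, its parameters, and the single buffer standing in for the whole operator chain are assumptions made for illustration.

```python
from collections import deque


def simulate(ticks, source_rate, sink_rate, buffer_capacity):
    """Return (fetched, delivered, max_depth) after running `ticks` steps.

    Each tick the source fetches at most source_rate records, but never more
    than the free space in the buffer; the sink then drains up to sink_rate.
    """
    buffer = deque()
    fetched = delivered = max_depth = 0
    for _ in range(ticks):
        # Backpressure: a full buffer stops the source from fetching more.
        free = buffer_capacity - len(buffer)
        take = min(source_rate, free)
        buffer.extend(range(fetched, fetched + take))
        fetched += take
        max_depth = max(max_depth, len(buffer))
        # The slow sink drains what it can this tick.
        for _ in range(min(sink_rate, len(buffer))):
            buffer.popleft()
            delivered += 1
    return fetched, delivered, max_depth
```

With a source that could fetch 5 records per tick and a sink that handles 2, the buffer never exceeds its capacity and the source settles to the sink's pace, which is the memory-safety property described above.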
Ordering guarantees fall into three tiers:
| Guarantee | Meaning | Cost |
|---|---|---|
| Per-partition / per-key | Records with the same key are processed in order | Free — natural consequence of partitioning |
| Global ordering | All records across all partitions strictly ordered | Requires single partition → no parallelism |
| At-least-once vs exactly-once | Records may be duplicated vs delivered exactly once | Exactly-once requires transactions or idempotency |
The “at-least-once” default deserves attention. When a consumer fails after processing a record but before checkpointing its offset, restart reprocesses the record — double-counting a purchase or sending two notifications. The rest of this chapter is largely about how to fix that.
Key Takeaway: Streaming pipelines preserve order only within a partition, and provide at-least-once delivery by default; achieving global ordering or exactly-once semantics costs throughput, design effort, or both.
AWS Streaming Services
AWS exposes three first-party streaming primitives, each optimized for a different point on the latency / retention / operational-overhead surface. Picking incorrectly is the most common — and most expensive — mistake in AWS streaming architecture. The right framing: Kinesis Data Streams is a low-latency replayable log, Amazon MSK is managed Apache Kafka for ecosystems that need true Kafka semantics, and Amazon Data Firehose is a delivery service for fire-and-forget loading into storage targets [Source: https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/].
Amazon Kinesis Data Streams
Kinesis Data Streams (KDS) uses a shard-based capacity model. Each shard supports up to 1 MB/s or 1,000 records/s of ingest (whichever limit is reached first) and 2 MB/s of egress; you scale by adding shards (Provisioned mode) or by enabling On-Demand mode, which auto-scales shards up to a service quota. Default retention is 24 hours and can be extended to 365 days [Source: https://www.confluent.io/blog/windowing-in-kafka-streams/].
End-to-end latency is roughly 200 ms — fast enough for fraud detection and IoT telemetry, slower than Kafka’s 10–100 ms because Kinesis records are batched into shards and replicated synchronously to three Availability Zones.
The dominant failure mode is the hot shard. If your partition key has skewed distribution — say, 80% of events carry tenant_id="acme" because Acme is your largest customer — then 80% of traffic concentrates on one shard, hits the 1 MB/s ceiling, and producers see ProvisionedThroughputExceededException even though the stream as a whole is well below capacity. The fix is to design composite keys (acme:user_5, acme:user_6, …) that spread load while preserving the per-user ordering you actually need.
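The skew, and the effect of a composite key, can be measured with a short simulation. It assumes a simple MD5-modulo mapping as a stand-in for Kinesis's hash-key ranges, and the workload (80% of events from tenant acme spread over 1,000 users) is made up to match the scenario above. The names shard_for and max_shard_share are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(partition_key: str, num_shards: int) -> int:
    """Illustrative key-to-shard mapping (MD5 modulo shard count)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def max_shard_share(keys, num_shards):
    """Fraction of all events that land on the busiest shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return max(counts.values()) / len(keys)


# Synthetic workload: 8,000 acme events across 1,000 users, 2,000 events from other tenants.
events = [("acme", f"user_{i % 1000}") for i in range(8000)]
events += [(f"tenant_{i % 50}", f"user_{i}") for i in range(2000)]

# Keying by tenant sends every acme event to one shard.
by_tenant = max_shard_share([tenant for tenant, _ in events], num_shards=8)
# Keying by tenant:user spreads acme's load while keeping each user on one shard.
by_composite = max_shard_share([f"{tenant}:{user}" for tenant, user in events], num_shards=8)
```

In this sketch by_tenant is at least 0.8 by construction, while by_composite falls close to an even split across the eight shards.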
KDS provides at-least-once delivery natively. Achieving exactly-once requires application-level deduplication: read a record, persist its sequenceNumber to DynamoDB inside the same transaction as your business write, and on restart skip records you have already seen. The Kinesis Client Library (KCL) automates the checkpoint half of this pattern.
Worked example — IoT temperature ingest:
import json
import time

import boto3

kinesis = boto3.client("kinesis")


def emit_reading(device_id: str, temperature_c: float) -> None:
    """Send one temperature reading to the iot-temperature stream."""
    record = {
        "device_id": device_id,
        "ts": int(time.time() * 1000),  # epoch millis
        "temperature_c": temperature_c,
    }
    kinesis.put_record(
        StreamName="iot-temperature",
        Data=json.dumps(record).encode("utf-8"),  # payload sent as UTF-8 bytes
        PartitionKey=device_id,  # ordering per device
    )
Using device_id as the partition key guarantees that all readings from a single sensor stay ordered, while different sensors run in parallel across shards. If you have 10,000 devices each emitting one reading per second, that is 10,000 records/s. At the per-shard limit of 1,000 records/s this needs at least ten shards, so provision about thirteen to leave headroom.
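The shard arithmetic for a workload like this can be sketched as a small helper. It uses the per-shard ingest limits quoted earlier in this section (1 MB/s and 1,000 records/s); the function name shards_needed, the 30% default headroom, and the 100-byte average record size in the usage note are assumptions for illustration.

```python
import math

RECORDS_PER_SHARD = 1_000      # per-shard ingest limit in records/s
BYTES_PER_SHARD = 1_000_000    # per-shard ingest limit, taken as 1 MB/s


def shards_needed(records_per_sec, avg_record_bytes, headroom_pct=30):
    """Return (minimum_shards, shards_with_headroom) for a steady ingest rate."""
    by_records = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SHARD)
    # Whichever limit is hit first determines the minimum shard count.
    minimum = max(by_records, by_bytes, 1)
    with_headroom = math.ceil(minimum * (100 + headroom_pct) / 100)
    return minimum, with_headroom
```

For 10,000 small records per second (assumed here to average about 100 bytes), the record-count limit dominates: shards_needed(10_000, 100) returns a minimum of 10 shards and 13 with headroom.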
Key Takeaway: Choose Kinesis Data Streams when you need a sub-second AWS-native replayable log with hours-to-days retention and you can design a balanced partition key; budget engineering effort for hot-shard mitigation and application-level exactly-once.
Amazon MSK (managed Apache Kafka)
Amazon MSK runs real Apache Kafka brokers — no proprietary protocol, no Kinesis abstractions. You get the full Kafka API surface: idempotent producers, transactions, log compaction, consumer groups, the Kafka Streams DSL, Kafka Connect connectors, and read_committed isolation levels [Source: https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/]. AWS manages broker provisioning, patching, ZooKeeper (or KRaft, for newer versions), and rolling upgrades.
There are two flavors. MSK Provisioned lets you choose broker instance types (e.g., kafka.m5.large) and EBS volume sizes; it is predictable and cost-efficient for sustained high-throughput workloads. MSK Serverless auto-scales capacity per topic with simpler billing — better for spiky or unpredictable traffic, at a higher per-MB cost.
MSK’s killer feature is native exactly-once semantics. With enable.idempotence=true and a transactional.id on the producer, transactional writes (producer.beginTransaction() / commitTransaction()), and isolation.level=read_committed on the consumer, you get end-to-end exactly-once across multiple topics in a single transaction. This is invaluable for event-sourced microservices where one user action must atomically appear in five downstream topics.
End-to-end latency is 10–100 ms — generally faster than KDS because brokers replicate over a tighter network path and consumers pull continuously rather than polling shard iterators.
Choose MSK when:
- You already have a Kafka ecosystem (Connect, Streams, Schema Registry, ksqlDB) and want to lift-and-shift.
- You need exactly-once across multiple topics in a single transaction.
- Your team has Kafka operational expertise or accepts the steeper learning curve.
Avoid MSK when you only need log delivery to S3 (use Firehose) or sub-second analytics without Kafka tooling needs (KDS is simpler).
Key Takeaway: Amazon MSK gives you the full power of Apache Kafka — including native idempotent producers and cross-topic transactions — at the cost of higher operational complexity than Kinesis or Firehose; pick it when you need Kafka semantics, not just a stream.
Amazon Data Firehose for delivery
Amazon Data Firehose (formerly Kinesis Data Firehose) is not a stream — it has no consumer API, no replay, no offsets. It is a managed pipeline that receives records via PUT or from a Kinesis/MSK source, optionally transforms them with AWS Lambda, optionally converts JSON to Parquet/ORC, and pushes them to S3, Redshift, OpenSearch, Splunk, or HTTP endpoints [Source: https://www.confluent.io/blog/windowing-in-kafka-streams/].
The data model is “buffer and flush.” Firehose accumulates records until either a buffer-size threshold (default 5 MB, configurable up to 128 MB) or a buffer-interval threshold is hit, then writes the batch to the destination. The interval is configurable from 0 to 900 seconds, and defaults differ by destination (300 seconds for S3). With a short buffer interval, latency is typically 1–60 seconds: fine for log aggregation, dashboards, and warehouse loading; unsuitable for fraud detection.
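The two flush triggers can be modeled in a few lines. This is a toy in-memory buffer, not the Firehose API: the class BufferAndFlush, its method names, and the explicit now argument (used instead of a real clock so the behavior is easy to follow) are all assumptions for illustration.

```python
class BufferAndFlush:
    """Accumulate records and flush when either the size or the age threshold is hit."""

    def __init__(self, max_bytes, max_seconds, sink):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.sink = sink            # callable that receives a list of records
        self._records = []
        self._bytes = 0
        self._opened_at = None      # time the first record of the batch arrived

    def put(self, record: bytes, now: float):
        if self._opened_at is None:
            self._opened_at = now
        self._records.append(record)
        self._bytes += len(record)
        self._maybe_flush(now)

    def tick(self, now: float):
        """Called periodically so an idle buffer still flushes on age."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float):
        if not self._records:
            return
        too_big = self._bytes >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            self.sink(self._records)
            self._records, self._bytes, self._opened_at = [], 0, None
```

Delivery latency for any record is bounded by max_seconds in this model, which is why the buffer interval is the main knob for trading freshness against larger, cheaper batches.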
Pricing is volume-based (~$0.029/GB ingested, plus ~$0.018–0.025/GB for format conversion). At high volumes this is dramatically cheaper than running shards or brokers for the same job, which is why “send CloudWatch Logs to S3” is the canonical Firehose use case.
Delivery is at-least-once. Firehose retries on destination failures and may produce duplicates. “Exactly-once” is achieved at the destination: S3 output is deduplicated by Glue jobs, Redshift COPY loads are followed by a MERGE, and OpenSearch uses the document _id for upserts.
A common pattern is to fan out from Kinesis Data Streams: one consumer is a Flink job for real-time fraud detection, while a separate Firehose subscription writes the same events to S3 in Parquet for Athena or Redshift Spectrum analytics — sub-second processing where it matters, cheap durable storage everywhere else.
Figure 5.2: AWS streaming fan-out — Kinesis Data Streams as the replayable hub, Flink for sub-second analytics, Firehose for cheap durable delivery.
flowchart LR
Prod["Producers<br/>(SDK / KPL / Agent)"] --> KDS["Kinesis Data Streams<br/>shards, ~200ms, 24h-365d retention"]
KDS --> Flink["Flink on MSF<br/>stateful, exactly-once<br/>~100ms"]
KDS --> FH["Data Firehose<br/>buffer + flush<br/>1-60s latency"]
Flink --> DDB["DynamoDB / SNS<br/>(fraud alerts)"]
FH --> S3["S3 (Parquet)"]
S3 --> Ath["Athena /<br/>Redshift Spectrum"]
MSK["Amazon MSK<br/>(alt: full Kafka API,<br/>cross-topic transactions)"] -.alternative source.-> Flink
[Producers] ──> [Kinesis Data Streams]
│
├──> [Flink on MSF] ──> Fraud alerts (DynamoDB / SNS)
│
└──> [Data Firehose] ──> S3 (Parquet) ──> Athena / Redshift Spectrum
Key Takeaway: Data Firehose is the cheapest, simplest path from a stream into durable storage; it is fire-and-forget delivery with 1–60-second latency and at-least-once semantics handled by destination-side deduplication.
Stream Processing with Flink
A stream becomes useful when something processes it. Apache Flink is the de-facto standard for stateful, exactly-once stream processing: it runs aggregations, joins, windowing, and complex event processing over unbounded data with end-to-end correctness guarantees [Source: https://www.instaclustr.com/blog/apache-flink-vs-apache-kafka-streams/]. AWS exposes it as Amazon Managed Service for Apache Flink (MSF, formerly Kinesis Data Analytics for Apache Flink) — a fully managed runtime that removes the operational burden of cluster ownership.
Amazon Managed Service for Apache Flink
MSF runs Flink TaskManagers on AWS-managed compute units called Kinesis Processing Units (KPUs). One KPU is approximately 1 vCPU, 4 GB RAM, and 50 GB local storage. You pay per KPU-hour; AWS handles patching, scaling, JobManager high availability, and durable checkpoint storage to S3 [Source: https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/].
Native connectors exist for Kinesis Data Streams, MSK, Data Firehose, DynamoDB Streams, S3, OpenSearch, and Lambda. The default state backend is RocksDB with incremental checkpoints, which scales to gigabytes of operator state without long checkpoint pauses. The default checkpoint interval is 60 seconds — tune downward for tighter recovery objectives, upward for less I/O overhead.
Operational metrics flow into CloudWatch:
| Metric | What it tells you |
|---|---|
| lastCheckpointDuration | How long the most recent snapshot took (rising = backpressure) |
| lastCheckpointSize | State size (rising = key cardinality growing) |
| numberOfFailedCheckpoints | Recovery health (non-zero = investigate sinks/state) |
| currentInputWatermark | How far the pipeline has advanced in event time |
| currentOutputWatermark | What the downstream sees (large gap = late firing) |
Auto-scaling can adjust the KPU count based on CPU utilization, or you can set parallelism manually. For most production jobs, start with source parallelism equal to the source’s partition count (e.g., 12 KDS shards → a source parallelism of 12) and scale stateful operators independently if needed.
Key Takeaway: Amazon Managed Service for Apache Flink hides cluster operations behind KPUs and S3-backed checkpoints, exposing the full Flink runtime with sensible defaults — RocksDB state backend, 60-second checkpoints, and native AWS source/sink connectors.
DataStream API vs Table API
Flink offers two programming models that compile to the same underlying execution graph.
The DataStream API is imperative Java/Scala/Python: you write operators (map, filter, keyBy, window, process) explicitly and have full control over state, timers, and side outputs. Use it for complex event processing, custom joins, and anything requiring fine-grained control.
// Deserialize raw JSON from Kinesis into typed readings.
DataStream<TempReading> readings = env
    .fromSource(kinesisSource, watermarkStrategy, "iot-temperature")
    .map(json -> mapper.readValue(json, TempReading.class));
// Average each device's temperature per 1-minute event-time window; alert above 80.0.
DataStream<Alert> alerts = readings
    .keyBy(TempReading::getDeviceId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AvgTemperature())
    .filter(avg -> avg.value > 80.0)
    .map(avg -> new Alert(avg.deviceId, "Overheating", avg.value));
alerts.sinkTo(kinesisAlertSink);
The Table API / Flink SQL is declarative: you describe the computation in SQL or table expressions, and Flink’s planner produces an optimal execution graph. It is shorter, more maintainable, and accessible to analysts who do not write Java. Use it for standard aggregations, joins, and windowing — which covers most pipelines [Source: https://conduktor.io/glossary/kafka-streams-vs-apache-flink].
CREATE TABLE iot_temperature (
  device_id STRING,
  ts TIMESTAMP_LTZ(3),
  temperature_c DOUBLE,
  WATERMARK FOR ts AS ts - INTERVAL '5' SECONDS
) WITH ('connector' = 'kinesis', 'stream' = 'iot-temperature', ...);
-- Group by both window_start and window_end so the planner treats this as a window aggregation.
SELECT
  device_id,
  window_start,
  window_end,
  AVG(temperature_c) AS avg_c
FROM TABLE(TUMBLE(TABLE iot_temperature, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
GROUP BY device_id, window_start, window_end
HAVING AVG(temperature_c) > 80.0;
The two APIs interoperate — you can convert a DataStream to a Table and back. A common pattern is to do raw deserialization and enrichment in the DataStream API, expose the enriched stream as a Table, and let analysts build downstream analytics in SQL.
Key Takeaway: Use Flink SQL / Table API for standard aggregations and joins (most pipelines); drop into the DataStream API only when you need custom state machines, side outputs, or operator-level control.
State management and checkpointing
Stateful streaming means operators remember things between events: a windowed aggregator remembers running totals, a join remembers unmatched left-side records, a session detector remembers each user’s last seen timestamp. Flink stores this state in a state backend and periodically snapshots it via checkpoints [Source: https://lists.apache.org/thread/4hzgosgy5okt7spgb96p9fxsmcfh5f0d].
The state backend choice matters:
| Backend | Storage | When to use |
|---|---|---|
| HashMap | JVM heap | Small state, lowest latency, easy debugging |
| RocksDB (incremental) | Local disk + remote upload | Production default; gigabytes of state, low pause times |
| Filesystem (legacy) | Heap + full snapshot to FS | Mostly superseded by RocksDB |
The checkpoint algorithm is a barrier-based asynchronous distributed snapshot (a Chandy-Lamport variant) [Source: https://lists.apache.org/thread/4hzgosgy5okt7spgb96p9fxsmcfh5f0d]:
- The JobManager triggers a checkpoint at a configured interval (e.g., every 60 seconds).
- Sources receive a numbered barrier and record their current offsets/sequence numbers.
- Barriers flow through the operator DAG alongside data, preserving order.
- When an operator receives barriers from all input streams (in EXACTLY_ONCE mode it aligns them, buffering records from already-arrived inputs until all inputs reach the barrier), it asynchronously snapshots its state to the state backend.
- Operators acknowledge the checkpoint to the JobManager.
- Once all tasks acknowledge, the checkpoint is complete and globally durable.
Imagine a parade. The JobManager periodically inserts a flag-bearer (the barrier). Each viewing station (operator) waits until flag-bearers from every parallel route arrive, then photographs itself (snapshot). The photographs collectively form a consistent global snapshot — each operator’s state corresponds to having processed exactly the records before the flag.
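Barrier alignment is easier to see in a small simulation. The sketch below models a single operator that sums records from two input channels and snapshots the sum once a barrier has arrived on every channel. It is a single-threaded toy, not Flink code, and the names `run_aligned_sum` and `BARRIER` are assumptions.

```python
from typing import Dict, List, Tuple, Union

BARRIER = "BARRIER"
Event = Tuple[str, Union[int, str]]  # (input channel, record value or BARRIER)


def run_aligned_sum(events: List[Event], channels: List[str]) -> Tuple[List[int], int]:
    """Sum records from several inputs; snapshot the sum when barriers align.

    Returns (snapshots, final_sum).
    """
    total = 0
    blocked: Dict[str, List[int]] = {}  # channel -> records held back after its barrier
    snapshots: List[int] = []
    for channel, item in events:
        if item == BARRIER:
            blocked[channel] = []
            if len(blocked) == len(channels):   # barrier seen on every input
                snapshots.append(total)         # state covers pre-barrier records only
                held = [r for records in blocked.values() for r in records]
                blocked = {}
                total += sum(held)              # release the buffered post-barrier records
        elif channel in blocked:
            blocked[channel].append(item)       # arrived after this channel's barrier
        else:
            total += item
    return snapshots, total


events: List[Event] = [
    ("a", 1), ("b", 10),
    ("a", BARRIER),   # channel a reaches the barrier first
    ("a", 2),         # post-barrier record on a: must not enter the snapshot
    ("b", 20),        # pre-barrier record on b: must enter the snapshot
    ("b", BARRIER),
    ("b", 30),
]
snapshots, final_sum = run_aligned_sum(events, ["a", "b"])
```

The snapshot holds 31 (1 + 10 + 20): the record that arrived on channel a after its barrier was buffered and counted only after the snapshot was taken.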
Figure 5.3: Flink barrier-based checkpoint — barriers flow through the operator DAG; each operator aligns inputs, snapshots state to S3, and acknowledges the JobManager.
sequenceDiagram
participant JM as JobManager
participant Src as Source (Kinesis/MSK)
participant Op as Keyed Operator
participant Sink as Transactional Sink
participant S3 as S3 (state backend)
JM->>Src: trigger checkpoint N (every 60s)
Src->>Src: record offsets / sequence numbers
Src->>Op: barrier N (alongside data)
Op->>Op: align inputs<br/>(EXACTLY_ONCE buffers early arrivals)
Op->>S3: async snapshot (RocksDB incremental)
Op->>JM: ack checkpoint N
Op->>Sink: barrier N
Sink->>Sink: preCommit() — flush buffers,<br/>prepare 2PC transaction
Sink->>JM: ack checkpoint N
JM->>JM: all tasks acked → CP N durable
JM->>Sink: notifyCheckpointComplete(N)
Sink->>Sink: commit() — atomic publish<br/>(Kafka txn / S3 rename)
Incremental checkpoints (RocksDB only) persist only the SST files that changed since the previous snapshot, dramatically reducing checkpoint duration and S3 cost for large state.
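Because SST files are immutable once written, deciding what an incremental checkpoint must upload is a set difference. The sketch below is a simplified model with hypothetical file names. It ignores reference counting and the cleanup of files that no checkpoint needs any longer.

```python
from typing import Set, Tuple


def plan_incremental_checkpoint(live_files: Set[str],
                                previously_uploaded: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (files to upload now, files reused by reference)."""
    to_upload = live_files - previously_uploaded
    reused = live_files & previously_uploaded
    return to_upload, reused


cp1_files = {"000001.sst", "000002.sst", "000003.sst"}
upload1, reused1 = plan_incremental_checkpoint(cp1_files, set())

# Between checkpoints, compaction merged 000001 and 000002 into 000004 (hypothetical names).
cp2_files = {"000003.sst", "000004.sst"}
upload2, reused2 = plan_incremental_checkpoint(cp2_files, cp1_files)
```

The first checkpoint uploads everything. The second uploads only the newly written file and references the unchanged one.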
Savepoints are user-triggered checkpoints used for upgrades, parallelism changes, and version migrations. You stop the job with a savepoint, deploy a new application JAR, and restart from the savepoint — losing zero data and zero state.
In AT_LEAST_ONCE mode, barriers are not aligned, which lowers latency but allows duplicates on recovery. Use EXACTLY_ONCE unless you have a specific reason not to.
Key Takeaway: Flink’s barrier-based snapshots produce globally consistent checkpoints without stopping the world; pair RocksDB incremental checkpoints with EXACTLY_ONCE mode and a 60-second interval as your production starting point.
Time, Windows, and Correctness
Streaming correctness is fundamentally about time. The same record can produce different aggregates depending on whether you bin events by when they happened (“event time”) or when your system saw them (“processing time”). Choosing wrongly gives correct-looking results that fail under network delays, replays, or backfills.
Event time vs processing time
Event time is the timestamp embedded in the record — when the sensor sampled the temperature, when the user clicked the button, when the financial trade was executed. Processing time is the wall-clock time when the streaming engine sees the record. They diverge whenever there is network delay, mobile-device offline buffering, queue lag, or pipeline backfill [Source: https://www.confluent.io/blog/windowing-in-kafka-streams/].
Consider a mobile app that records GPS pings while the user is in the subway. The phone buffers points locally for 20 minutes underground. When the user surfaces, all 20 minutes of pings flush at once. The processing-time view says “1,200 events arrived at 10:23 AM”; the event-time view says “the user moved through these stations between 10:00 and 10:20 AM, in this order.” The event-time view is the correct one for any meaningful analytics.
| Aspect | Event time | Processing time |
|---|---|---|
| Source of timestamp | Record field | System clock |
| Reproducibility | Deterministic under reprocessing | Non-deterministic |
| Latency | Higher (must wait for late data) | Lower |
| Correctness under delay | Correct | Skewed |
| Use cases | Business analytics, billing, audit | Monitoring, alerting, dashboards |
The rule of thumb: use event time for any business logic that must produce the same result if you replay yesterday’s data. Use processing time only for “is the pipeline currently alive” monitoring.
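The subway example can be reproduced with a few buffered pings. The data below is hypothetical: one ping every five minutes of event time, all arriving together at minute 23.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (event_minute, arrival_minute), in minutes after 10:00. Hypothetical pings that
# were buffered underground and flushed together at 10:23.
pings: List[Tuple[int, int]] = [(0, 23), (5, 23), (10, 23), (15, 23), (20, 23)]

BUCKET = 10  # 10-minute tumbling buckets


def bucket_counts(timestamps: List[int]) -> Dict[int, int]:
    """Count records per bucket, keyed by the bucket's start minute."""
    return dict(Counter((t // BUCKET) * BUCKET for t in timestamps))


by_event_time = bucket_counts([event for event, _ in pings])
by_processing_time = bucket_counts([arrival for _, arrival in pings])
```

Binned by event time, the pings spread across three buckets in the order they happened. Binned by processing time, all five collapse into the 10:20 bucket, and a replay at a different time of day would move them to yet another bucket.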
Key Takeaway: Event time produces reproducible, correct aggregates regardless of when records arrive; processing time is convenient but skews under any network or buffering delay — pick event time for business logic.
Tumbling, sliding, and session windows
Tumbling windows are fixed-size, non-overlapping. Every record belongs to exactly one window. Use them for periodic reports: “hourly revenue by region,” “minute-by-minute API error counts” [Source: https://www.slideshare.net/slideshow/windowing-in-kafka-streams-and-flink-sql/267076025].
SELECT window_start, window_end, region, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_ts), INTERVAL '1' HOUR))
GROUP BY window_start, window_end, region;
Sliding windows are fixed-size but advance by a smaller increment, so windows overlap. Each event belongs to multiple windows (window_size / slide), increasing state size. Use them for moving averages and continuous monitoring: “5-minute rolling CPU average updated every minute” [Source: https://softwaremill.com/windowing-in-big-data-streams-spark-flink-kafka-akka/].
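-- HOP arguments are (slide, size): a 5-minute window that advances every minute.
SELECT window_start, window_end, AVG(cpu_pct) AS avg_cpu_pct
FROM TABLE(HOP(TABLE host_metrics, DESCRIPTOR(ts), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
GROUP BY window_start, window_end;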
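The difference between tumbling and sliding assignment comes down to how many windows contain a given timestamp. The following sketch computes that directly. The function names are illustrative, and windows are half-open intervals [start, end) aligned to zero.

```python
from typing import List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """Return the single [start, end) window that contains ts."""
    start = ts - (ts % size)
    return start, start + size


def sliding_windows(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window that contains ts, earliest first."""
    last_start = ts - (ts % slide)                  # latest window start at or before ts
    starts = range(last_start, ts - size, -slide)   # keep starts where start + size > ts
    return sorted((s, s + size) for s in starts)


# A reading at minute 7, with a 5-minute window.
one_window = tumbling_window(7, size=5)
rolling = sliding_windows(7, size=5, slide=1)
```

With a 1-minute slide, the single reading lands in five windows, which is the state multiplication that sliding windows pay for their smoother output.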
Session windows are dynamic. They group events separated by less than a configurable inactivity gap, then close when the gap is exceeded. Use them for activity-based grouping: “pages per user session (30-minute idle gap),” “phone-call detail records,” “IoT bursts” [Source: https://conduktor.io/glossary/kafka-streams-vs-apache-flink].
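-- PARTITION BY user_id applies the inactivity gap per user rather than across all users.
SELECT user_id, window_start, window_end, COUNT(*) AS pageviews
FROM TABLE(SESSION(TABLE clickstream PARTITION BY user_id, DESCRIPTOR(ts), INTERVAL '30' MINUTES))
GROUP BY user_id, window_start, window_end;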
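Session assignment for a single key can be sketched as a merge over sorted timestamps. The function below is a batch-style illustration, not how a streaming engine merges sessions incrementally. It treats an event as part of the open session when it arrives strictly less than the gap after the previous event, which is an assumption about boundary handling.

```python
from typing import List, Tuple


def session_windows(timestamps: List[int], gap: int) -> List[Tuple[int, int, int]]:
    """Group one key's event times into sessions.

    Returns (start, end, count) per session, where end = last event time + gap.
    """
    sessions: List[Tuple[int, int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            start, _, count = sessions[-1]
            sessions[-1] = (start, ts + gap, count + 1)  # extend the open session
        else:
            sessions.append((ts, ts + gap, 1))           # gap exceeded: new session
    return sessions


# Page-view times in minutes for one hypothetical user, with a 30-minute idle gap.
views = [0, 10, 35, 90, 100]
result = session_windows(views, gap=30)
```

The view at minute 35 extends the first session because it falls inside the window that the view at minute 10 left open; the view at minute 90 starts a new one.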
Global windows are a single never-closing window per key, fired by a custom trigger (e.g., “every 100 events”). Use them for lifetime aggregates and count-based emission [Source: https://www.instaclustr.com/blog/apache-flink-vs-apache-kafka-streams/].
Tumbling (size = 5): consecutive, non-overlapping
┌─────┐┌─────┐┌─────┐
│ W1  ││ W2  ││ W3  │
└─────┘└─────┘└─────┘
Sliding (size = 5, slide = 2): overlapping
┌─────┐
│ W1  │
└─────┘
  ┌─────┐
  │ W2  │
  └─────┘
    ┌─────┐
    │ W3  │
    └─────┘
Session (gap = 3): a window closes after a quiet gap
events: •••      • •
┌──────┐       ┌────┐
│  S1  │       │ S2 │
└──────┘       └────┘
Picking the right window is half of getting streaming analytics right. The other half is the watermark.
| Window | Best for | Watch out for |
|---|---|---|
| Tumbling | Periodic reports, billing | Boundary spikes (records hopping windows) |
| Sliding | Rolling averages | State multiplication = memory pressure |
| Session | User-activity analytics | Gap tuning is application-specific |
| Global | Lifetime totals, count triggers | State grows unbounded without TTL |
Key Takeaway: Tumbling windows answer “what happened in this period,” sliding windows answer “what is the trend right now,” session windows answer “what did each user do in one sitting,” and global windows answer “what has happened over all time per key” — match the window to the question.
Watermarks and late-arriving data
A watermark W(t) is a monotonically increasing assertion that “no more events with event-time timestamp ≤ t will arrive” [Source: https://lists.apache.org/thread/4hzgosgy5okt7spgb96p9fxsmcfh5f0d]. Watermarks are how the engine knows it is safe to fire an event-time window — once the watermark advances past the window’s end time, no more “in-time” data should be coming.
Flink generates watermarks via a WatermarkStrategy:
| Strategy | Use when |
|---|---|
| forMonotonousTimestamps() | Records strictly in order (rare in practice) |
| forBoundedOutOfOrderness(Duration) | Most common; allows N seconds of out-of-order tolerance |
| noWatermarks() | Processing-time pipelines |
| Custom WatermarkGenerator | Domain-specific logic (e.g., heartbeat-based) |
| withIdleness(Duration) | Mark idle source partitions inactive so downstream watermarks can advance |
WatermarkStrategy<TempReading> strategy = WatermarkStrategy
    .<TempReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, ts) -> event.getTs())
    .withIdleness(Duration.ofMinutes(1));
This says: “I expect events to arrive at most 5 seconds out of order, the timestamp is event.getTs(), and if a source partition is silent for a minute, treat it as idle so windows can still fire.”
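The bounded-out-of-orderness idea fits in a few lines: the watermark trails the largest timestamp seen so far by the allowed delay. The Python class below is a toy model of that idea rather than Flink's implementation. Subtracting one extra millisecond is a modeling choice that keeps the assertion "no more events with timestamp ≤ t" consistent with tolerating events exactly `max_delay_ms` behind.

```python
from typing import Iterable, List, Optional, Tuple


class BoundedOutOfOrderness:
    """Toy watermark generator: trails the largest timestamp seen by max_delay_ms."""

    def __init__(self, max_delay_ms: int) -> None:
        self.max_delay_ms = max_delay_ms
        self.max_ts: Optional[int] = None  # no events seen yet

    def on_event(self, ts_ms: int) -> None:
        if self.max_ts is None or ts_ms > self.max_ts:
            self.max_ts = ts_ms

    def current_watermark(self) -> Optional[int]:
        """Largest t for which we assert no further event with timestamp <= t arrives."""
        if self.max_ts is None:
            return None
        return self.max_ts - self.max_delay_ms - 1


def classify(events_ms: Iterable[int], max_delay_ms: int) -> List[Tuple[int, str]]:
    """Label each event against the watermark in force just before it arrived."""
    gen = BoundedOutOfOrderness(max_delay_ms)
    out: List[Tuple[int, str]] = []
    for ts in events_ms:
        wm = gen.current_watermark()
        out.append((ts, "late" if wm is not None and ts <= wm else "on-time"))
        gen.on_event(ts)
    return out


labels = classify([10_000, 12_000, 9_000, 20_000, 14_000], max_delay_ms=5_000)
```

The event at 9,000 ms is out of order but within the 5-second bound, so it is on time. The event at 14,000 ms arrives after the watermark has reached 14,999 ms and is late.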
Watermark propagation through multi-input operators (joins, unions, keyed windows downstream of a parallel source) follows a strict rule: the operator’s output watermark equals the minimum of its input watermarks. This guarantees correctness but introduces the watermark stall — one slow or idle partition holds back the entire pipeline. Fix it with withIdleness(...) and instrument currentInputWatermark to detect stalls early.
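The minimum rule and the effect of marking an input idle can be shown with a small function. The shard names and watermark values below are hypothetical.

```python
from typing import Dict, Optional, Set


def operator_watermark(input_watermarks: Dict[str, int],
                       idle: Optional[Set[str]] = None) -> Optional[int]:
    """Output watermark = minimum over the inputs that are not marked idle."""
    idle = idle or set()
    active = [wm for name, wm in input_watermarks.items() if name not in idle]
    return min(active) if active else None


# Hypothetical shard watermarks in epoch seconds; shard-2 has stopped receiving data.
shards = {"shard-0": 1_700_000_060, "shard-1": 1_700_000_058, "shard-2": 1_700_000_000}

stalled = operator_watermark(shards)                      # held back by shard-2
advancing = operator_watermark(shards, idle={"shard-2"})  # shard-2 excluded
```

With the silent shard included, the operator's watermark sits a minute behind the live data; excluding it lets the watermark follow the slowest active shard instead.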
Late events — records arriving after the watermark has passed their bucket — get one of three treatments:
- Dropped (default in many APIs).
- Allowed lateness via allowedLateness(Duration): the window state stays alive past the watermark, and each late event re-fires the window with an updated result. Downstream consumers must handle multiple emissions for the same window (use upsert sinks).
- Side output via sideOutputLateData(OutputTag): late events are routed to a separate stream for inspection, repair, or audit.
Figure 5.4: Watermark-driven window firing — events flow into an event-time window; the watermark advances past the window end, triggering emission, and late events are routed to allowed-lateness re-firing or a side output.
flowchart TD
In["Source records<br/>(event-time ts embedded)"] --> WM{"WatermarkStrategy<br/>forBoundedOutOfOrderness(5s)<br/>+ withIdleness(1m)"}
WM --> Assign["Timestamp + Watermark<br/>assigner"]
Assign --> KB["keyBy(deviceId)"]
KB --> Win["TumblingEventTimeWindow<br/>(1 minute)"]
Win --> Check{"event ts vs<br/>current watermark W(t)?"}
Check -->|"ts in window<br/>and W(t) < end"| Buffer["Buffer in window state<br/>(RocksDB)"]
Check -->|"W(t) >= window end"| Fire["Fire window<br/>emit aggregate"]
Check -->|"ts < W(t)<br/>(late event)"| Lateness{"allowedLateness<br/>still open?"}
Lateness -->|yes| ReFire["Re-fire window<br/>(upsert sink)"]
Lateness -->|no| SideOut["sideOutputLateData<br/>→ audit / repair sink"]
Fire --> Sink["Transactional sink<br/>(2PC: preCommit on barrier,<br/>commit on CP complete)"]
ReFire --> Sink
DataStream<TempReading> stream = ...;
// Tag that identifies the side output carrying records that arrive too late.
OutputTag<TempReading> lateTag = new OutputTag<TempReading>("late") {};
SingleOutputStreamOperator<Aggregate> mainOut = stream
    .keyBy(TempReading::getDeviceId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(5))   // keep window state 5 minutes past the watermark
    .sideOutputLateData(lateTag)        // anything later than that goes to the side output
    .aggregate(new AvgTemperature());
DataStream<TempReading> lateData = mainOut.getSideOutput(lateTag);
lateData.addSink(auditSink); // dead-letter / repair pipeline
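The interplay of first firing, late re-firing, and the side output can be traced with a small single-key simulation. It is a sketch of the semantics described above, not Flink code. The stream encoding, the constants, and the function name `process` are assumptions, and window state is never purged, to keep the example short.

```python
from typing import Dict, List, Set, Tuple

WINDOW = 60             # 1-minute tumbling windows, in seconds
ALLOWED_LATENESS = 300  # keep windows open 5 minutes past the watermark


def process(stream: List[Tuple[str, int]]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Stream items are ("event", ts) or ("watermark", t).

    Returns (emissions as (window_start, count), timestamps sent to the side output).
    """
    counts: Dict[int, int] = {}
    fired: Set[int] = set()
    emissions: List[Tuple[int, int]] = []
    late: List[int] = []
    watermark = -1
    for kind, value in stream:
        if kind == "watermark":
            watermark = max(watermark, value)
            for start in sorted(counts):
                if start not in fired and watermark >= start + WINDOW - 1:
                    fired.add(start)
                    emissions.append((start, counts[start]))  # first firing
            continue
        start = value - value % WINDOW
        last_ts = start + WINDOW - 1                          # window's max timestamp
        if watermark >= last_ts + ALLOWED_LATENESS:
            late.append(value)                                # too late: side output
            continue
        counts[start] = counts.get(start, 0) + 1
        if watermark >= last_ts:
            fired.add(start)
            emissions.append((start, counts[start]))          # late but allowed: re-fire
    return emissions, late


emissions, late = process([
    ("event", 5), ("event", 30),
    ("watermark", 59),    # passes the end of window [0, 60): first result
    ("event", 45),        # late, but within allowed lateness: updated result
    ("watermark", 400),   # allowed lateness for [0, 60) has now expired
    ("event", 10),        # routed to the side output
])
```

The window starting at 0 emits twice, first with a count of 2 and then with 3, which is why the downstream sink needs upsert semantics.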
End-to-end exactly-once is the composition of three layers [Source: https://lists.apache.org/thread/4hzgosgy5okt7spgb96p9fxsmcfh5f0d]:
- Replayable sources — Kafka offsets and Kinesis sequence numbers are stored inside the checkpoint. On recovery, the source resumes from the offset of the last successful checkpoint.
- Internal state — checkpoint barriers ensure operator state is consistent across the snapshot boundary.
- Transactional or idempotent sinks: the sink either supports two-phase commit (TwoPhaseCommitSinkFunction) or uses an idempotent operation (key-based upsert) so duplicates produce no observable effect.
The two-phase commit pattern is worth memorizing:
beginTransaction() ──> invoke(record) ──> preCommit() ──> commit()
(start of CP)          (during CP)        (CP barrier)    (CP complete)
When the checkpoint barrier arrives, preCommit() flushes buffers and prepares the transaction (e.g., a Kafka producer flush()). After the JobManager confirms the checkpoint is durable, notifyCheckpointComplete(checkpointId) triggers commit(), which atomically exposes the writes (Kafka transaction commit, S3 staging-file rename to final). On a crash between pre-commit and commit, the transaction is recovered from checkpoint state and, on restart, committed if its checkpoint completed or aborted otherwise, so the writes never become visible until the corresponding checkpoint is durable.
This is the algorithm behind Flink’s exactly-once Kafka sink, S3 FileSink with rolling policy, and most production exactly-once connectors.
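The lifecycle can be modeled as a toy sink that stages writes per transaction and publishes them only after the matching checkpoint completes. This is an in-memory illustration with assumed class and method names, not the TwoPhaseCommitSinkFunction API; a real sink would stage writes in an external system.

```python
from typing import Dict, List


class TwoPhaseCommitSink:
    """Toy sink: writes are staged per transaction and published only on commit."""

    def __init__(self) -> None:
        self.visible: List[str] = []             # what downstream readers can see
        self.open_txn: List[str] = []            # writes staged in the current transaction
        self.pending: Dict[int, List[str]] = {}  # pre-committed, keyed by checkpoint id

    def invoke(self, record: str) -> None:
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id: int) -> None:
        """Barrier arrived: seal the current transaction and begin a new one."""
        self.pending[checkpoint_id] = self.open_txn
        self.open_txn = []

    def notify_checkpoint_complete(self, checkpoint_id: int) -> None:
        """Checkpoint is durable: publish every sealed transaction up to this id."""
        for cp in sorted(c for c in self.pending if c <= checkpoint_id):
            self.visible.extend(self.pending.pop(cp))

    def recover(self, last_completed_checkpoint: int) -> None:
        """Restart: commit transactions covered by the checkpoint, abort the rest."""
        self.notify_checkpoint_complete(last_completed_checkpoint)
        self.pending.clear()
        self.open_txn = []


sink = TwoPhaseCommitSink()
sink.invoke("a")
sink.invoke("b")
sink.pre_commit(1)
visible_before_commit = list(sink.visible)  # nothing is published yet
sink.notify_checkpoint_complete(1)
sink.invoke("c")
sink.pre_commit(2)
sink.invoke("d")
sink.recover(last_completed_checkpoint=1)   # crash before checkpoint 2 completed
```

Records "c" and "d" are discarded on recovery because their checkpoint never completed; the replayable source will deliver them again, so they are published exactly once.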
Common Flink failure modes and fixes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Watermark stalls; windows never fire | Idle source partition pulls min watermark to -∞ | withIdleness(...) |
| Checkpoint timeout | Sink backpressure or slow state persistence | Increase checkpointTimeout, relieve the backpressure (scale out the slow operator or sink), switch to incremental checkpoints |
| Duplicates downstream after restart | Sink is non-transactional/non-idempotent | Use FileSink commit-on-checkpoint or transactional Kafka producer |
| Growing checkpoint size | Unbounded keyed state | Set TTL on state, or use session windows with gap |
Key Takeaway: Watermarks are the engine’s promise that no more in-time data is coming, and the foundation of correct event-time windows; combine forBoundedOutOfOrderness, withIdleness, allowedLateness, and side outputs to handle real-world out-of-order data, then pair replayable sources with two-phase-commit sinks for end-to-end exactly-once.
Chapter Summary
Streaming pipelines invert batch-ETL assumptions: data is unbounded, ordering is per-partition, and correctness depends on careful time semantics. AWS provides three streaming primitives — Kinesis Data Streams (low-latency replayable shard log), Amazon MSK (managed Apache Kafka with native exactly-once), and Amazon Data Firehose (fire-and-forget delivery to S3, Redshift, OpenSearch, Splunk) — and the right choice depends on whether you need replay, transactions, or cheap durable delivery.
Apache Flink, most easily run on Amazon Managed Service for Apache Flink, provides stateful exactly-once processing through three composed mechanisms: barrier-based distributed snapshots for internal state, replayable sources whose offsets live inside the checkpoint, and transactional or idempotent sinks via two-phase commit. The DataStream API and Table / SQL API compile to the same execution engine.
Correctness depends on event time, not processing time. Watermarks (forBoundedOutOfOrderness, withIdleness) tell the engine when to fire event-time windows; late events are handled via allowedLateness or sideOutputLateData. Tumbling windows answer “what happened this hour,” sliding windows “what is the trend now,” session windows “what did each user do in one sitting,” and global windows “what has happened across all time per key.”
The next chapter turns from streaming engines to distributed processing with Apache Spark and Amazon EMR, covering how Spark schedules work across a cluster and which EMR deployment option fits which workload.
Key Terms
- Kinesis — Amazon Kinesis Data Streams; AWS-native shard-based replayable log with ~200 ms latency and up to 365-day retention.
- MSK — Amazon Managed Streaming for Apache Kafka; fully managed Kafka brokers with native exactly-once and the full Kafka ecosystem (Connect, Streams, transactions).
- Firehose — Amazon Data Firehose; managed delivery service that buffers and pushes records to S3, Redshift, OpenSearch, Splunk, or HTTP endpoints with 1–60-second latency.
- Apache Flink — Open-source stateful stream processing engine with end-to-end exactly-once semantics, available as a managed service on AWS (Amazon Managed Service for Apache Flink).
- Apache Kafka — Open-source distributed log; the protocol and architecture that Amazon MSK manages.
- Windowing — Grouping streaming events into bounded buckets (tumbling, sliding, session, global) for aggregation.
- Watermark — Monotonically increasing event-time assertion that “no more events with timestamp ≤ t will arrive,” used to trigger event-time window firing.
- Checkpoint — Flink’s barrier-based asynchronous distributed snapshot of operator state plus source offsets, written to a durable backend (S3 on MSF) for recovery.
- Exactly-once — Delivery semantic where each record produces exactly one observable effect, achieved by composing replayable sources, snapshot-consistent state, and transactional / idempotent sinks.
- Event time — Timestamp embedded in the record itself (when the event happened in the real world), as opposed to processing time (when the engine sees it); the basis of reproducible streaming analytics.
Chapter 6: Distributed Processing with Spark and EMR
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the Spark execution model, including how the driver, executors, stages, and tasks coordinate to process data in parallel.
- Use the DataFrame API and Spark SQL to express transformations that run efficiently at scale.
- Tune Spark jobs using partitioning strategies, caching, broadcast joins, and Adaptive Query Execution (AQE).
- Choose between EMR on EC2, EMR Serverless, and EMR on EKS based on workload characteristics, cost profile, and operational requirements.
When pipelines outgrow a single machine, Apache Spark answers the question of how to coordinate hundreds of machines for batch and streaming analytics, and Amazon EMR is the AWS-managed substrate that runs Spark in production. This chapter tours Spark’s scheduling model and the deployment options that shape your cost and operational profile.
Apache Spark Internals
Apache Spark is a unified analytics engine for large-scale data processing. To use it well, you need a mental model of three layers: the cluster topology (driver and executors), the data abstractions (RDD, DataFrame, Dataset), and the optimization machinery (Catalyst and Tungsten).
Driver, Cluster Manager, and Executors
A Spark cluster is like a kitchen running a banquet. The driver is the head chef who reads the orders and decides what dishes to prepare; the cluster manager is the restaurant manager who allocates kitchen stations; the executors are the line cooks who chop, sauté, and plate the food.
The Spark driver runs the application’s main() function and maintains the state of the SparkContext. It converts user actions into tasks, schedules those tasks on executors, collects results, and communicates with the cluster manager [Source: https://spark.apache.org/docs/latest/cluster-overview.html]. If the driver dies, the application dies.
Executors are JVM worker processes on cluster nodes. Each executor runs tasks, caches RDDs and DataFrames in memory, and sends heartbeats back to the driver [Source: https://spark.apache.org/docs/latest/cluster-overview.html]. Executor cores and memory budget are among the most consequential tuning knobs in Spark.
The cluster manager allocates executor processes and handles node failures. Spark supports Standalone, YARN (the default on EMR on EC2), Kubernetes (used by EMR on EKS; EMR Serverless hides the cluster manager from you entirely), and Mesos (deprecated since Spark 3.2) [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
| Component | Role | Lifetime |
|---|---|---|
| Driver | Plans and coordinates execution | Application |
| Cluster Manager | Allocates resources | Cluster |
| Executor | Runs tasks, caches data | Application (typically) |
| Task | Processes one partition | Stage |
The driver-executor split is the source of one of the most common Spark gotchas. df.collect() pulls all partitions back to the driver and can OOM it if the result is large. Conversely, you cannot reference a SparkContext inside a UDF, because the UDF runs on executors that have no access to it.
Figure 6.1: Spark cluster topology — driver, cluster manager, and executors
flowchart TD
User[User Application / spark-submit]
Driver[Spark Driver<br/>SparkContext + DAG Scheduler]
CM[Cluster Manager<br/>YARN / K8s / Standalone]
subgraph Worker1[Worker Node 1]
E1[Executor JVM<br/>cores + cache]
T1[Task]
T2[Task]
end
subgraph Worker2[Worker Node 2]
E2[Executor JVM<br/>cores + cache]
T3[Task]
T4[Task]
end
User --> Driver
Driver -->|requests resources| CM
CM -->|allocates| E1
CM -->|allocates| E2
Driver -->|schedules tasks| E1
Driver -->|schedules tasks| E2
E1 --> T1
E1 --> T2
E2 --> T3
E2 --> T4
E1 -.heartbeat.-> Driver
E2 -.heartbeat.-> Driver
Key Takeaway: The driver plans the work, the cluster manager allocates the workers, and the executors do the work. Knowing where each piece of your code runs is the foundation for both correctness and performance.
RDDs, DataFrames, and Datasets
Spark exposes three layered abstractions for distributed data, each with different trade-offs between flexibility and optimization.
The Resilient Distributed Dataset (RDD) is Spark’s foundational abstraction: a fault-tolerant, partitioned collection of records. The RDD abstraction “enables developers to materialize any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same dataset need not recompute it or reload it from disk” [Source: http://archive.gersteinlab.org/meetings/s/2015/05.05/Advanced_Analytics_with_Spark-2.pdf]. This in-memory caching is Spark’s headline advantage over disk-based MapReduce. RDDs are flexible but Spark cannot optimize their opaque records.
The DataFrame is a distributed table with a named, typed schema. Because the schema is known, Spark can rearrange operations, push predicates into file readers, and generate efficient code. It is the recommended API for nearly all workloads.
The Dataset is a typed extension of DataFrame in Scala and Java that adds compile-time type safety. PySpark has no separate Dataset API.
| Abstraction | Schema | Type Safety | Optimizer Visibility | Typical Use |
|---|---|---|---|---|
| RDD | None | Compile-time (Scala) / runtime (Python) | Opaque | Custom partitioning, unstructured data |
| DataFrame | Yes | Runtime | Full | 95% of analytics workloads |
| Dataset | Yes | Compile-time | Full (Scala/Java only) | Type-safe pipelines in JVM languages |
A common analogy: if RDD is hand-written assembly code, DataFrame is C — high enough to let a compiler optimize aggressively, low enough to express almost any computation.
Key Takeaway: Default to DataFrames. Reach for RDDs only when you need control the optimizer cannot give you, such as custom partitioners or operations on opaque binary records.
Catalyst Optimizer and Tungsten
Two engines make DataFrame operations fast: the Catalyst optimizer and the Tungsten execution engine. Together, they are the reason a casually written DataFrame query often outperforms a carefully hand-tuned RDD pipeline.
Catalyst is Spark’s query optimizer. It translates a DataFrame query into a logical plan, applies rule-based optimizations (predicate pushdown, constant folding, column pruning), then explores cost-based physical plan alternatives. Catalyst will push a filter down into the Parquet reader so the file format itself skips disqualified row groups, dramatically cutting I/O.
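The row-group skipping that predicate pushdown enables can be illustrated with per-group min/max statistics. The sketch below uses plain Python dicts as a stand-in for Parquet metadata. The layout and field names are hypothetical.

```python
from typing import Dict, List

# Each row group carries min/max statistics for the amount column (hypothetical layout).
row_groups: List[Dict[str, object]] = [
    {"min_amount": 1, "max_amount": 40, "rows": [{"amount": 5}, {"amount": 40}]},
    {"min_amount": 50, "max_amount": 90, "rows": [{"amount": 50}, {"amount": 90}]},
    {"min_amount": 95, "max_amount": 300, "rows": [{"amount": 120}, {"amount": 300}]},
]


def scan_amount_greater_than(groups, threshold):
    """Return (matching rows, number of row groups actually read)."""
    groups_read, out = 0, []
    for group in groups:
        if group["max_amount"] <= threshold:
            continue  # the statistics prove no row can match, so skip the I/O
        groups_read += 1
        out.extend(r for r in group["rows"] if r["amount"] > threshold)
    return out, groups_read


rows, groups_read = scan_amount_greater_than(row_groups, threshold=100)
```

Only one of the three row groups is read. The filter still runs on the rows of that group, because statistics can rule a group out but cannot prove that every row in it matches.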
Tungsten is the physical execution engine. It introduces off-heap memory management to avoid JVM GC pauses, cache-friendly binary row formats, and whole-stage code generation that fuses operators into a single tight loop of bytecode. DataFrame execution often approaches the speed of hand-written code.
Consider this PySpark snippet.
sales.filter("region = 'EU'") \
    .join(customers, "customer_id") \
    .groupBy("country") \
    .agg({"amount": "sum"}) \
    .show()
Catalyst pushes region = 'EU' into the sales scan, prunes customer columns to customer_id and country, decides whether customers fits a broadcast join, and reorders operations to minimize shuffle volume. Tungsten then generates a single fused operator that does scan → filter → project → join probe → partial aggregate, without materializing intermediate row collections.
Figure 6.2: Catalyst optimizer phases — from DataFrame to executable RDDs
flowchart LR
A[DataFrame / SQL] --> B[Unresolved<br/>Logical Plan]
B -->|Catalog lookup| C[Resolved<br/>Logical Plan]
C -->|Rule-based:<br/>predicate pushdown,<br/>column pruning,<br/>constant folding| D[Optimized<br/>Logical Plan]
D -->|Cost-based<br/>strategy selection| E[Physical Plans]
E -->|Cost model| F[Selected<br/>Physical Plan]
F -->|Tungsten<br/>whole-stage<br/>codegen| G[Executable<br/>RDDs]
Key Takeaway: Catalyst plus Tungsten is why DataFrames win. The optimizer can only optimize what it can see, so prefer declarative DataFrame and SQL operations over imperative map calls on opaque objects.
Jobs, Stages, Tasks, and the Shuffle
Spark uses lazy evaluation: transformations like filter, select, and join build a plan but do not execute. Only when an action is called, such as collect(), count(), show(), or a write (for example write.save()), does Spark run anything.
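Lazy evaluation can be demonstrated without Spark. The toy class below records transformations as a plan and runs them only when an action is called. The name `LazyFrame` and its methods are illustrative and not part of any Spark API.

```python
from typing import Callable, Iterable, List, Optional


class LazyFrame:
    """Toy lazy collection: transformations record a plan, actions run it."""

    def __init__(self, source: Iterable[int],
                 plan: Optional[List[Callable]] = None) -> None:
        self._source = source
        self._plan = plan or []

    def filter(self, pred: Callable[[int], bool]) -> "LazyFrame":
        step = lambda rows: (r for r in rows if pred(r))
        return LazyFrame(self._source, self._plan + [step])

    def map(self, fn: Callable[[int], int]) -> "LazyFrame":
        step = lambda rows: (fn(r) for r in rows)
        return LazyFrame(self._source, self._plan + [step])

    def collect(self) -> List[int]:
        """Action: only here is the recorded plan executed."""
        rows = iter(self._source)
        for step in self._plan:
            rows = step(rows)
        return list(rows)


calls: List[int] = []


def noisy_double(x: int) -> int:
    calls.append(x)  # side effect that reveals when the work actually happens
    return x * 2


plan = LazyFrame(range(5)).filter(lambda x: x % 2 == 1).map(noisy_double)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = plan.collect()
```

Building the plan calls `noisy_double` zero times; `collect()` triggers it once per surviving row.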
A job is created when an action runs. The DAG scheduler splits each job into stages separated by shuffle boundaries. Within a stage, operators pipeline together because each task reads only data that lives on its partition. Stages form a dependency graph: a stage cannot begin until every parent stage whose shuffle output it reads has completed, although stages that do not depend on each other can run concurrently [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
A task is the smallest unit of work; each task processes one partition on one executor core. The task scheduler dispatches tasks with a preference for data locality: PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > ANY [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
The shuffle redistributes data across the network. Wide transformations (groupByKey, join, distinct, repartition) force shuffles because rows sharing a key must land on the same partition. The protocol has four phases: map-side partitioning and write to local disk, sorted shuffle write, network fetch by reducers, and reduce-side aggregation [Source: https://spark.apache.org/docs/latest/cluster-overview.html]. Shuffle is by far the most expensive operation in Spark.
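The core of a shuffle is a deterministic partitioner that sends every row with the same key to the same reducer. The sketch below simulates that routing in memory. `zlib.crc32` is a stand-in for Spark's own hash function, and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 stands in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs: List[List[Row]], num_partitions: int) -> Dict[int, List[Row]]:
    """Route every row from every map task to the reducer that owns its key."""
    reducers: Dict[int, List[Row]] = defaultdict(list)
    for task_rows in map_outputs:
        for key, value in task_rows:
            reducers[partition_for(key, num_partitions)].append((key, value))
    return dict(reducers)


def reduce_sum(partition: List[Row]) -> Dict[str, int]:
    """Reduce side: aggregate the rows that landed on one partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


map_outputs = [[("DE", 10), ("FR", 5)], [("DE", 7), ("IT", 3)], [("FR", 1)]]
shuffled = shuffle(map_outputs, num_partitions=2)
totals: Dict[str, int] = {}
for rows in shuffled.values():
    totals.update(reduce_sum(rows))
```

Because each key lives on exactly one reducer, the per-partition sums can be merged without any further coordination. In a real cluster the routing step is where rows cross the network, which is what makes the shuffle expensive.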
Simplified execution flow: action → DAG construction → stage creation at shuffle boundaries → task set creation (one task per partition) → task scheduler dispatches tasks → executors run tasks → shuffle between stages → driver collects results [Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html]. If a partition is lost, Spark recomputes it from lineage — the “resilient” in RDD.
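Stage creation at shuffle boundaries can be sketched for a linear chain of operations. This is a deliberate simplification: a real join has two parent stages, and the partition counts used below are hypothetical.

```python
from typing import List

WIDE = {"groupBy", "join", "distinct", "repartition"}  # operations that force a shuffle


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Cut a linear chain of operations into stages at each wide operation."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: the wide operation opens a new stage
        stages[-1].append(op)
    return stages


stages = split_into_stages(["scan", "filter", "join", "select", "groupBy", "agg", "write"])

# One task per partition per stage (hypothetical partition counts).
partitions_per_stage = [8, 200, 200]
total_tasks = sum(n for _, n in zip(stages, partitions_per_stage))
```

Two wide operations produce three stages, and the task count follows directly from the number of partitions each stage processes.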
Figure 6.3: Job → Stage → Task hierarchy with shuffle boundary
flowchart TD
Action[Action: df.write / collect] --> Job[Job]
Job --> S1[Stage 1<br/>narrow ops: scan, filter, map]
S1 --> T1a[Task: partition 0]
S1 --> T1b[Task: partition 1]
S1 --> T1c[Task: partition 2]
T1a --> Shuffle{{Shuffle Boundary<br/>groupBy / join / repartition}}
T1b --> Shuffle
T1c --> Shuffle
Shuffle --> S2[Stage 2<br/>narrow ops: aggregate, write]
S2 --> T2a[Task: partition 0]
S2 --> T2b[Task: partition 1]
T2a --> Result[Result to Driver / Sink]
T2b --> Result
Key Takeaway: A job becomes stages at every shuffle boundary, and a stage becomes one task per partition. Watching stage and task behavior in the Spark UI is the most direct way to diagnose performance problems.
Writing Spark Jobs
This section covers the day-to-day Spark APIs, when to reach for SQL versus DataFrame versus PySpark versus Scala, and how the metastore ties data sources together.
DataFrame Transformations and Actions
DataFrame operations come in two flavors: transformations that build a logical plan (lazy) and actions that trigger execution (eager). Transformations include select, filter, withColumn, groupBy, agg, join, union, distinct, repartition. Actions include show, collect, count, take, write, foreach.
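The lazy/eager split can be mimicked in a few lines of plain Python. The LazyFrame class below is a toy stand-in, not the DataFrame API: its method names echo Spark's for familiarity, but everything else about it is an assumption made for illustration.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations record steps, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, projection):
        return LazyFrame(self._rows, self._plan + (("select", projection),))

    def collect(self):
        """Action: execute every recorded step and return the result."""
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out

    def count(self):
        """Action: run the plan and return the number of resulting rows."""
        return len(self.collect())

calls = []

def is_large(row):
    calls.append(row)  # record each evaluation to show when work actually happens
    return row["amount"] > 100

orders = LazyFrame([{"id": 1, "amount": 50}, {"id": 2, "amount": 250}])
pending = orders.filter(is_large).select(lambda r: r["id"])
calls_before_action = len(calls)   # the predicate has not run yet
result = pending.collect()         # the action triggers execution
```

Chaining filter and select only grows the recorded plan; the predicate runs for the first time inside collect().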
A typical PySpark ETL job — bronze to silver — might look like this.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lower, when
spark = SparkSession.builder.appName("orders-silver").getOrCreate()
# Read one day of raw (bronze) JSON orders
raw = (spark.read
    .format("json")
    .option("multiLine", "true")
    .load("s3://lake/bronze/orders/dt=2026-05-07/"))
# Clean, normalize, and de-duplicate into the silver shape
silver = (raw
    .filter(col("status").isNotNull())
    .withColumn("order_date", to_date(col("created_at")))
    .withColumn("region", lower(col("region")))
    .withColumn("is_refund", when(col("amount") < 0, True).otherwise(False))
    .dropDuplicates(["order_id"]))
# Write Parquet partitioned by date and region; .save() is the action
(silver.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("order_date", "region")
    .save("s3://lake/silver/orders/"))
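The pipeline is a chain of transformations that build a single logical plan; nothing runs until .save(). The output is partitioned by date and region so downstream queries can skip files via partition pruning. Be aware that mode("overwrite") in its default static mode replaces every existing partition under the target path, not just the date being loaded; set spark.sql.sources.partitionOverwriteMode=dynamic to overwrite only the partitions present in the incoming data. For production jobs supply an explicit schema rather than relying on inference.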
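Partition pruning works because the partition values are encoded in the directory names. The following pure-Python sketch shows the idea on Hive-style paths; parse_partition and prune are hypothetical helpers and the file names are invented, so treat it as an illustration rather than how Spark's file index is implemented.

```python
def parse_partition(path):
    """Extract Hive-style key=value segments from a file path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [p for p in paths
            if all(parse_partition(p).get(k) == v for k, v in wanted.items())]

# Hypothetical file listing under the silver table's location
files = [
    "silver/orders/order_date=2026-05-06/region=eu/part-0.parquet",
    "silver/orders/order_date=2026-05-07/region=eu/part-0.parquet",
    "silver/orders/order_date=2026-05-07/region=us/part-0.parquet",
]
to_read = prune(files, order_date="2026-05-07", region="us")
print(to_read)
```

A query filtered on both partition columns touches one file out of three here; files for other dates and regions are never opened.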
Writing DataFrame code is more like writing a SQL query than a for loop — you describe the result and Spark figures out how to compute it. Resist dropping into rdd.map() for things expressible declaratively; the moment you do, Catalyst loses visibility.
Key Takeaway: Build pipelines as long chains of declarative transformations and trigger execution with a single action. The optimizer rewards code it can see.
Spark SQL and the Hive Metastore
Spark SQL is the same engine as DataFrame with SQL as another front end. Any DataFrame can be registered as a temporary view (df.createOrReplaceTempView("orders")) and queried with SQL; any SQL query returns a DataFrame.
spark.sql("""
SELECT region, COUNT(*) AS cnt, SUM(amount) AS total
FROM orders
WHERE order_date = DATE '2026-05-07'
GROUP BY region
""").show()
For long-lived tables Spark uses a metastore — typically the Hive Metastore or AWS Glue Data Catalog on EMR — to persist schema, location, partition layout, and storage format. With a metastore, SELECT * FROM analytics.orders works from any Spark job, notebook, or BI tool.
EMR integrates with the AWS Glue Data Catalog so the same table definition is visible to Spark, Athena, Redshift Spectrum, and Trino, a critical interoperability win for the lakehouse architectures introduced in Chapter 2 [Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html].
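As a rough mental model, a metastore can be pictured as a shared lookup table from table names to physical details. The sketch below is plain Python, not the Hive Metastore or Glue API; TableEntry, register, and resolve are hypothetical names, and the column types are assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableEntry:
    location: str              # where the files live
    data_format: str           # how to read them
    columns: dict              # column name -> type name (types assumed here)
    partition_keys: tuple = ()

catalog = {}

def register(name, entry):
    """Persist a table definition under a fully qualified name."""
    catalog[name] = entry

def resolve(name):
    """What any engine does first: look the table up by name."""
    return catalog[name]

register("analytics.orders", TableEntry(
    location="s3://lake/silver/orders/",
    data_format="parquet",
    columns={"order_id": "string", "amount": "double"},
    partition_keys=("order_date", "region"),
))
entry = resolve("analytics.orders")
print(entry.location, entry.partition_keys)
```

Any engine that can call resolve gets the same location, format, and partition layout, which is the property that lets several query engines share one set of files.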
Key Takeaway: Spark SQL and DataFrame are two faces of the same engine. The metastore is what turns a pile of S3 files into a queryable warehouse that multiple engines can share.
PySpark vs Scala vs SQL
Spark exposes the same engine through Scala (native), PySpark, and SQL. Choice is mostly about team skills with a few performance nuances.
| Aspect | Scala | PySpark | SQL |
|---|---|---|---|
| Native runtime | Yes (JVM) | Python ↔ JVM bridge | JVM |
| Performance for DataFrame ops | Fastest baseline | Same as Scala (operations run in JVM) | Same as DataFrame |
| Performance for UDFs | Fast (JVM UDFs) | Slower (Python serialization), unless using pandas UDFs / Arrow | N/A |
| Type safety | Compile-time (Dataset API) | Runtime | Runtime |
| Ecosystem | Spark library ecosystem, sbt | Pandas, scikit-learn, NumPy | BI tools, ad-hoc analysis |
| Best for | Performance-critical libraries, custom UDFs, framework code | ML pipelines, ETL, data science teams | Reports, ad-hoc queries, transformations expressible in SQL |
DataFrame and SQL operations run inside the JVM regardless of front-end language, so PySpark matches Scala speed for declarative work. PySpark falls behind only on custom Python UDFs, which serialize every row to Python and back — pandas UDFs (vectorized, Arrow-based) close most of that gap.
Pragmatic recommendation: write pipelines in PySpark for team familiarity and ML library access, push reusable transformations into Spark SQL views or Scala libraries, and reserve raw RDDs for genuinely custom situations.
Key Takeaway: Use the language your team knows best for orchestration, and the most declarative API (SQL or DataFrame) for transformation logic. Avoid Python UDFs in hot paths.
Amazon EMR Deployment Models
Amazon EMR is AWS’s managed big-data platform. It supports Spark, Hadoop, Hive, HBase, Trino, and Flink, but for modern data engineering it is shorthand for “managed Spark.” EMR offers three deployment models. AWS reports EMR Spark “runs up to 5.4x faster than open-source Apache Spark” thanks to runtime optimizations [Source: https://aws.amazon.com/emr/].
EMR on EC2 (Classic Clusters)
EMR on EC2 is the original deployment model: a cluster of EC2 instances configured as Hadoop master and worker nodes, with YARN as the cluster manager and HDFS or EMRFS (S3) for storage [Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html].
Characteristics:
- Infrastructure control: Full management of cluster configuration with direct EC2 instance control [Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html].
- Cluster lifecycle: Supports both transient clusters (provision, run, terminate) and long-running persistent clusters.
- Cost profile: Pay for instances whether idle or active; flexible instance selection across the EC2 catalog. Spot Instance integration enables 50-90% cost reductions for non-critical workloads [Source: https://aws.amazon.com/emr/].
- Cold start: 3-10 minutes for cluster initialization [Source: https://aws.amazon.com/emr/].
- Autoscaling: Manual scaling and automatic scaling that resizes the cluster to match workload demand [Source: https://aws.amazon.com/emr/features/].
- High availability: Multi-master configurations for HA on YARN, HDFS, Spark, and HBase [Source: https://aws.amazon.com/emr/features/].
- Customization: Run any custom application alongside the standard frameworks — install Python libraries via bootstrap actions, configure JVM flags, mount custom volumes.
When to choose: Long-running clusters that process data 24/7, workloads needing fine-grained configuration control (custom JARs, kernel tuning, mixed framework deployments), use cases requiring multi-master HA, or sustained workloads where Spot Instances on persistent clusters offer compelling savings.
A typical deployment: a 50-node cluster running Spark jobs orchestrated by Airflow against Glue Data Catalog tables, scaling to 200 nodes during overnight batch windows.
Key Takeaway: EMR on EC2 is the right fit for sustained, configuration-heavy workloads where cluster uptime is justified by continuous use.
EMR Serverless
EMR Serverless removes cluster management entirely. You define an “application” with a framework (Spark or Hive) and submit jobs; AWS provisions and tears down capacity transparently [Source: https://aws.amazon.com/emr/].
Characteristics:
- Infrastructure-free: Fully managed, no cluster provisioning, no node sizing decisions [Source: https://aws.amazon.com/emr/].
- Job-based model: Submit jobs directly; AWS handles container orchestration internally.
- Cost profile: Pay-per-job billing based on actual vCPU-hours and memory-hours consumed; no idle costs [Source: https://aws.amazon.com/emr/].
- Cold start: 30-60 seconds for environment initialization (substantially faster than EC2) [Source: https://aws.amazon.com/emr/].
- Autoscaling: Fully automatic and transparent; no scaling policies to configure [Source: https://aws.amazon.com/emr/].
- Framework support: Apache Spark and Apache Hive [Source: https://aws.amazon.com/emr/].
- Customization: Limited — you cannot install arbitrary system-level dependencies the way you can on EC2, though custom Python virtualenvs and Docker-based custom images are supported.
When to choose: Bursty, unpredictable workloads where idle clusters would waste money; scheduled analytics that runs for minutes or a few hours; development and test environments where teams want zero infrastructure management; cost-sensitive workloads with no sustained baseline [Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html].
Typical deployment: a daily 30-minute batch job aggregating yesterday’s events. A long-lived EC2 cluster would waste 23.5 hours of paid capacity per day; EMR Serverless bills only for the 30 minutes of compute.
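The arithmetic behind that comparison is simple enough to write down. In this sketch the hourly rate is a made-up figure and the same rate is assumed for both models, which will not hold exactly in practice; only the 24-hour versus 30-minute ratio comes from the example above.

```python
def idle_hours_per_day(job_hours):
    """Hours per day an always-on cluster is paid for but not running the job."""
    return 24 - job_hours

def daily_cost(billed_hours, hourly_rate):
    """Cost of one day given the number of billed hours."""
    return billed_hours * hourly_rate

HOURLY_RATE = 10.0  # hypothetical cost of the required capacity, dollars per hour
job_hours = 0.5     # the daily 30-minute batch job

always_on = daily_cost(24, HOURLY_RATE)
on_demand = daily_cost(job_hours, HOURLY_RATE)
wasted = idle_hours_per_day(job_hours)
print(f"idle hours: {wasted}, always-on: ${always_on:.2f}, per-job: ${on_demand:.2f}")
```

Under these assumptions the always-on cluster costs 48 times as much per day, so even a substantially higher per-hour serverless rate would leave a wide margin.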
Key Takeaway: EMR Serverless is the default for sporadic, on-demand jobs because it eliminates the idle cost that traditional clusters impose.
EMR on EKS for Containerized Workloads
EMR on EKS runs Spark jobs as Kubernetes pods on an existing Amazon EKS cluster. Instead of spinning up a parallel EMR cluster, you reuse the EKS infrastructure your organization already operates [Source: https://aws.amazon.com/emr/].
Characteristics:
- Kubernetes-native: Spark jobs run as pods, with native Kubernetes scheduling, RBAC, network policies, and service mesh integration [Source: https://aws.amazon.com/emr/].
- Hybrid platform: Run analytics alongside microservices and other containerized applications on the same cluster.
- Cost profile: You pay for the EKS cluster regardless of EMR usage; EMR on EKS adds a per-vCPU surcharge for Spark jobs. Resource sharing across workloads improves utilization [Source: https://aws.amazon.com/emr/].
- Cold start: 20-60 seconds for pod startup; no cluster provisioning since the EKS cluster already exists [Source: https://aws.amazon.com/emr/].
- Autoscaling: Leverages EKS Cluster Autoscaler and horizontal pod autoscaling [Source: https://aws.amazon.com/emr/features/].
- Customization: Full Kubernetes flexibility — custom container images, node pools with GPUs or specialized instances, init containers, sidecars.
When to choose: Organizations already running EKS in production; teams that want a unified platform for data and application workloads; environments needing advanced Kubernetes scheduling features (taints, tolerations, GPU nodes); maximizing utilization across mixed workload types [Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html].
Typical deployment: a fintech platform runs API microservices on EKS and adds EMR on EKS to process transaction data; Spark pods schedule onto the same node pool during off-peak hours, recovering otherwise-idle capacity.
Comparing the Three Models
| Dimension | EMR on EC2 | EMR Serverless | EMR on EKS |
|---|---|---|---|
| Infrastructure management | Manual cluster | Fully managed | Kubernetes-managed |
| Cold start | 3-10 minutes | 30-60 seconds | 20-60 seconds |
| Cost model | Instance-based | Pay-per-job (vCPU-hours) | EKS cluster + pod resources |
| Autoscaling | Manual + automatic | Fully automatic | Kubernetes-native |
| Idle cost | High | None | EKS baseline |
| Cluster lifetime | Hours to months | Minutes to hours | Continuous (shared) |
| Configuration control | Maximum | Minimal | Medium-high (K8s) |
| Multi-master HA | Yes | N/A | Via Kubernetes |
| Best for | Sustained 24/7 ETL | Bursty / scheduled / dev | Existing EKS shops |
[Source: https://aws.amazon.com/emr/]
Mature platforms often mix the models: EMR on EC2 (or EKS) for always-on baseline pipelines, EMR Serverless for the long tail of analyst queries, backfills, and data-science exploration. This minimizes idle cost while preserving operational control where it matters.
Key Takeaway: Choose EMR on EC2 for sustained, configuration-heavy work; EMR Serverless for sporadic and unpredictable jobs; EMR on EKS when you already run Kubernetes and want unified infrastructure. Many shops use more than one.
Performance Tuning
A correctly written Spark job can still be slow because of partitioning, join strategy, and memory allocation. Tuning is about shuffle, skew, and the optimizer’s runtime behavior. Adaptive Query Execution (AQE), introduced in Spark 3.0, automates much of what used to be manual.
Partition Sizing and Skew
Partitions are the unit of parallelism. Too few underutilize the cluster; too many drown in scheduling overhead.
Sizing rule of thumb: target 100-200 MB per partition. Tasks then run for tens of seconds, amortizing scheduling overhead while keeping retries cheap. spark.sql.shuffle.partitions (default 200) is rarely the right answer for real workloads.
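A quick way to turn the rule of thumb into a number is to divide the expected shuffle volume by the target partition size. The helper below is a back-of-the-envelope sketch; shuffle_partitions is a hypothetical function name and the 128 MB target is simply one value inside the 100-200 MB range.

```python
import math

def shuffle_partitions(shuffle_bytes, target_mb=128):
    """Partition count that keeps each partition near target_mb (at least 1)."""
    target_bytes = target_mb * 1024 * 1024
    return max(1, math.ceil(shuffle_bytes / target_bytes))

GIB = 1024 ** 3
for size_gib in (1, 100, 1024):
    n = shuffle_partitions(size_gib * GIB)
    print(f"{size_gib} GiB shuffled -> {n} partitions")
```

The result is a starting value for spark.sql.shuffle.partitions; a 1 GiB shuffle wants far fewer than 200 partitions and a 1 TiB shuffle wants far more.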
Skew is the silent killer. If most join keys have a few thousand rows but customer_id = 'enterprise_account' has 50 million, that one partition becomes a “straggler” while the cluster idles [Source: https://dataninjago.com/2022/02/21/spark-sql-query-engine-deep-dive-20-adaptive-query-execution-part-2/]. The Spark UI’s task duration histogram surfaces this immediately.
Manual mitigation strategies:
- Repartition before joins: df.repartition(200, "join_key") distributes data more evenly before the shuffle.
- Salt hot keys: append a random suffix (enterprise_account_0, enterprise_account_1, …) and replicate the matching side, breaking one giant partition into many.
- Filter early: skew on a value you do not need? Drop it before the join.
- Pre-aggregate: if the downstream operation is an aggregation, partial aggregation reduces the data before the shuffle.
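Salting is the least obvious of these strategies, so a small pure-Python sketch may help. It models the join on in-memory lists rather than Spark DataFrames; NUM_SALTS, the helper names, and the sample rows are all assumptions made for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed fan-out for the hot key

def salt_large_side(rows, hot_keys, rng):
    """Attach a random salt to hot keys; all other keys get salt 0."""
    return [((k, rng.randrange(NUM_SALTS) if k in hot_keys else 0), v)
            for k, v in rows]

def replicate_small_side(rows, hot_keys):
    """Emit one copy of each hot-key row per salt so every salted key finds a match."""
    out = []
    for k, v in rows:
        salts = range(NUM_SALTS) if k in hot_keys else (0,)
        out.extend(((k, s), v) for s in salts)
    return out

def join(left, right):
    """Inner join on the (key, salt) pair, returning (key, left_value, right_value)."""
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    return [(k[0], lv, rv) for k, lv in left for rv in lookup.get(k, [])]

rng = random.Random(0)
facts = [("enterprise_account", i) for i in range(1000)] + [("small_co", 1), ("tiny_co", 2)]
dims = [("enterprise_account", "gold"), ("small_co", "bronze"), ("tiny_co", "bronze")]
hot = {"enterprise_account"}

salted = salt_large_side(facts, hot, rng)
joined = join(salted, replicate_small_side(dims, hot))
rows_per_salted_key = Counter(k for k, _ in salted)
print(rows_per_salted_key.most_common(5))
```

The join result is unchanged, but the 1,000 hot-key rows are now spread over several distinct (key, salt) pairs that can hash to different partitions. The cost is that the small side's hot rows are duplicated once per salt.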
AQE’s OptimizeSkewedJoin rule addresses this automatically by detecting oversized partitions at runtime, splitting them, and replicating the matching side. Published benchmarks show reductions from ~7.7 minutes to ~1 minute on heavily skewed joins [Source: https://medium.datadriveninvestor.com/optimizing-spark-performance-with-aqe-a-deep-dive-into-apache-sparks-adaptive-query-execution-ada33916cbdd].
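The detection step can be sketched as a simple size test. The logic below follows the rule described in the Spark documentation for spark.sql.adaptive.skewJoin.skewedPartitionFactor and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes: a partition is treated as skewed when it exceeds both the factor times the median size and the absolute threshold. The default values used here (5 and 256 MB) are assumed and may differ by version, and skewed_partitions is a hypothetical helper, not Spark code.

```python
from statistics import median

MB = 1024 * 1024

def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return indexes of partitions larger than both factor*median and the threshold."""
    med = median(sizes_bytes)
    return [i for i, size in enumerate(sizes_bytes)
            if size > factor * med and size > threshold_bytes]

# Four ordinary partitions and one straggler (sizes are invented)
sizes = [64 * MB, 70 * MB, 58 * MB, 4096 * MB, 66 * MB]
flagged = skewed_partitions(sizes)
print(flagged)
```

Requiring both conditions avoids splitting partitions that are large only in relative terms, such as a 100 MB partition among 1 MB neighbors.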
Key Takeaway: Aim for 100-200 MB partitions, watch the Spark UI for straggler tasks, and let AQE handle most skew automatically. Reach for salting only when AQE cannot rescue you.
Broadcast Joins and Shuffle Reduction
A regular join shuffles both sides so matching keys land on the same partition — expensive. A broadcast join sidesteps the shuffle by sending the smaller table to every executor as a hash table; each executor probes its local hash table with no shuffle on the large side [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
When broadcast joins win:
- One side is small enough to fit under the broadcast threshold. The threshold defaults to 10 MB in open-source Spark and is configurable via spark.sql.autoBroadcastJoinThreshold, or via spark.sql.adaptive.autoBroadcastJoinThreshold for AQE's runtime conversion; the baseline configuration later in this chapter raises it to 30 MB [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
- You have severely skewed data on the large side. Broadcast joins are immune to skew because they do not partition the large side at all [Source: https://medium.datadriveninvestor.com/optimizing-spark-performance-with-aqe-a-deep-dive-into-apache-sparks-adaptive-query-execution-ada33916cbdd].
Two ways to invoke a broadcast join:
from pyspark.sql.functions import broadcast
# Explicit hint - recommended when you know the table is small
sales.join(broadcast(dim_region), "region_id")
# Automatic via AQE - converts at runtime when size < threshold
# Requires: spark.sql.adaptive.enabled = true
Broadcast joins OOM the driver if the “small” side turns out to be 5 GB. Validate the size before using an explicit hint, and rely on AQE’s runtime size detection when unsure.
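The mechanics of a broadcast join fit in a short pure-Python sketch: build a hash table from the small side once, then probe it from each large-side partition in place. This models the idea only; broadcast_hash_join and the sample data are hypothetical and no Spark API is involved.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each large-side partition in place against a replicated hash table."""
    # Build once, then "broadcast": every partition probes the same lookup table.
    hash_table = {}
    for key, value in small_rows:
        hash_table.setdefault(key, []).append(value)

    joined_partitions = []
    for partition in large_partitions:  # the large side is never repartitioned
        joined_partitions.append([
            (key, left, right)
            for key, left in partition
            for right in hash_table.get(key, [])
        ])
    return joined_partitions

# Two partitions of (region_id, amount) and a small (region_id, name) dimension
sales = [[(1, 100.0), (2, 40.0)], [(1, 15.0), (3, 9.5)]]
dim_region = [(1, "emea"), (2, "amer")]
result = broadcast_hash_join(sales, dim_region)
print(result)
```

The output keeps the large side's original partitioning, which is why a skewed key on that side causes no extra data movement. The whole small table must fit in memory wherever it is replicated, which is the source of the out-of-memory risk noted above.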
Figure 6.4: Shuffle hash join versus broadcast join
flowchart TB
subgraph Shuffle[Shuffle Hash Join - both sides shuffled]
direction LR
L1[Large Table<br/>partitions] -->|shuffle by key| LS[Repartitioned Large]
S1[Small Table<br/>partitions] -->|shuffle by key| SS[Repartitioned Small]
LS --> J1[Join: matching keys<br/>co-located]
SS --> J1
end
subgraph Broadcast[Broadcast Join - small side replicated]
direction LR
L2[Large Table<br/>partitions stay put] --> J2[Local Hash Probe<br/>on every executor]
S2[Small Table] -->|broadcast<br/>to all executors| HT[Hash Table<br/>in executor memory]
HT --> J2
end
Other shuffle reduction strategies:
- Filter and project early so less data crosses the shuffle.
- Pre-aggregate: a reducing aggregate such as groupBy().agg(sum(...)) shuffles far less data than collecting raw values with collect_list().
- Bucketing at write time: pre-bucket by join keys so subsequent joins skip the shuffle.
- Local shuffle reading: spark.sql.adaptive.localShuffleReader.enabled=true keeps post-broadcast shuffle reads local [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
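The benefit of pre-aggregation can be shown by counting records before and after a map-side combine. This is a pure-Python illustration with invented data; map_side_combine is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def map_side_combine(partition):
    """Sum values per key inside one partition before anything is shuffled."""
    partial = Counter()
    for key, amount in partition:
        partial[key] += amount
    return list(partial.items())

partitions = [
    [("eu", 5), ("eu", 7), ("us", 1), ("eu", 2)],
    [("us", 4), ("us", 6), ("eu", 1)],
]
raw_records = sum(len(p) for p in partitions)
combined = [map_side_combine(p) for p in partitions]
shuffled_records = sum(len(p) for p in combined)

totals = Counter()
for part in combined:  # the reduce side merges the partial sums
    for key, amount in part:
        totals[key] += amount

print(f"{raw_records} raw records -> {shuffled_records} shuffled records")
```

Each partition sends at most one record per key across the network, so shuffle volume scales with the number of distinct keys rather than the number of rows. Collecting full lists offers no such reduction because every value must still travel.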
Key Takeaway: The fastest shuffle is the one you avoid. Broadcast joins eliminate the shuffle on the large side; AQE will often choose them automatically when runtime statistics say the small side fits.
Caching, Persisting, and Adaptive Query Execution
Caching keeps a DataFrame’s partitions in memory (or memory + disk) so subsequent actions reuse them instead of recomputing. Two APIs:
- df.cache(): equivalent to persist(MEMORY_AND_DISK).
- df.persist(StorageLevel.MEMORY_ONLY_SER): full control over storage level (memory only, memory + disk, serialized, replicated).
When to cache:
- A DataFrame is referenced by multiple actions (df.count() then df.write(...)).
- An iterative algorithm (ML training, graph traversal) revisits the same data.
- Expensive transformations (large joins) feed multiple downstream queries.
When not to cache:
- The DataFrame is used only once — caching adds overhead with no reuse benefit.
- The data does not fit in memory and disk caching would be slower than recomputing.
Always pair .cache() with .unpersist() once you are done; orphaned caches eat executor memory.
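Why caching matters follows from lazy evaluation: without it, each action replays the lineage from the source. The toy class below counts recomputations to make that visible. Lineage and its methods are hypothetical names that only imitate the shape of the Spark API.

```python
class Lineage:
    """Toy dataset that recomputes from its source on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.computations = 0

    def cache(self):
        self._use_cache = True
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1        # count every replay of the lineage
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())

def expensive():
    return [x * x for x in range(5)]

uncached = Lineage(expensive)
uncached.count()
uncached.collect()

cached = Lineage(expensive).cache()
cached.count()
cached.collect()
cached.unpersist()  # release the memory once the reuse is over
print(uncached.computations, cached.computations)
```

Two actions cost two computations without caching and one with it. With a single action the cached version saves nothing and still pays to store the result.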
Adaptive Query Execution (AQE), enabled by default since Spark 3.2, is arguably the most important tuning feature added in the last decade. AQE uses runtime statistics (the actual intermediate result sizes after each stage) to re-plan the rest of the query [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html]. On earlier 3.x releases, enable it explicitly with spark.sql.adaptive.enabled=true.
The three AQE rules every Spark user should know:
| Rule | What it does | Typical benefit |
|---|---|---|
| CoalesceShufflePartitions | Merges small post-shuffle partitions into larger ones | Removes scheduling overhead; e.g., 200 partitions coalesced to 4 [Source: https://dataninjago.com/2022/02/21/spark-sql-query-engine-deep-dive-20-adaptive-query-execution-part-2/] |
| OptimizeSkewedJoin | Splits oversized partitions and replicates matching side | 7.7 min → 1 min on real skew benchmarks [Source: https://medium.datadriveninvestor.com/optimizing-spark-performance-with-aqe-a-deep-dive-into-apache-sparks-adaptive-query-execution-ada33916cbdd] |
| Adaptive join conversion | Converts sort-merge joins to broadcast at runtime when small side fits | Eliminates shuffle when statistics confirm it [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html] |
A solid baseline configuration for production Spark on EMR:
# Enable AQE and its sub-rules
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.localShuffleReader.enabled=true
# Broadcast join threshold (after AQE runtime sizing)
spark.sql.adaptive.autoBroadcastJoinThreshold=30MB
# Dynamic partition pruning (for partitioned fact tables)
spark.sql.dynamicPartitionPruning.enabled=true
# Initial shuffle partition count (AQE will coalesce as needed)
spark.sql.shuffle.partitions=400
[Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html]
AQE represents a shift from static optimization (plans set before execution from table stats) to dynamic optimization (plans adapt to runtime data sizes). For most workloads, turning AQE on is the single biggest tuning win [Source: https://medium.datadriveninvestor.com/optimizing-spark-performance-with-aqe-a-deep-dive-into-apache-sparks-adaptive-query-execution-ada33916cbdd].
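Partition coalescing is the easiest of the AQE rules to picture. The sketch below greedily merges adjacent post-shuffle partitions up to a target size, which approximates the behavior governed by spark.sql.adaptive.advisoryPartitionSizeInBytes. The 64 MB target is assumed to match that setting's default, coalesce is a hypothetical helper, and Spark's real algorithm has more conditions than this.

```python
MB = 1024 * 1024

def coalesce(sizes_bytes, advisory_bytes=64 * MB):
    """Greedily merge adjacent partitions until adding another would exceed the target."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_bytes):
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes = [1 * MB] * 200  # 200 tiny post-shuffle partitions
merged = coalesce(sizes)
print(f"{len(sizes)} partitions coalesced to {len(merged)}")
```

Only adjacent partitions are merged, so the key ranges each task reads stay contiguous. Partitions already at or above the target are left on their own.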
Figure 6.5: Adaptive Query Execution — runtime re-planning loop
stateDiagram-v2
[*] --> InitialPlan: Catalyst optimizes<br/>using static stats
InitialPlan --> RunStage: Submit next stage
RunStage --> CollectStats: Stage completes
CollectStats --> Decide: Inspect actual<br/>shuffle output sizes
Decide --> Coalesce: Many tiny<br/>partitions?
Decide --> SkewSplit: Oversized<br/>skewed partition?
Decide --> ConvertJoin: Small side<br/>now fits broadcast?
Decide --> RunStage: No change needed
Coalesce --> RunStage: Merge partitions<br/>and continue
SkewSplit --> RunStage: Split + replicate<br/>matching side
ConvertJoin --> RunStage: Sort-merge to<br/>broadcast hash join
RunStage --> Done: Final stage complete
Done --> [*]
Validation workflow:
- Enable AQE and run the job; capture runtime in the Spark UI.
- Check the SQL tab for AQE annotations showing coalesced partitions or converted joins.
- Look for straggler tasks in the Stages tab; add broadcast() hints or salt if AQE missed a skew case.
- Check executor memory and GC time; if spilling to disk, increase memory or reduce partition size.
- Use EXPLAIN FORMATTED to confirm predicate pushdown to file readers.
Key Takeaway: Cache only what you reuse, enable AQE everywhere, and let the optimizer adapt at runtime. Manual tuning (broadcast hints, salting) is now the exception, not the rule.
Chapter Summary
Apache Spark distributes work through a driver-cluster manager-executor topology. The driver translates code into a DAG of stages separated by shuffle boundaries; executors run tasks (one per partition). Shuffles are the most expensive operation in the system, so reducing or avoiding them is the primary tuning lever.
The DataFrame API and Spark SQL are canonical because they let Catalyst and Tungsten see the entire computation. PySpark, Scala, and SQL share the same engine, so language choice is mostly about team productivity — except for Python UDFs, which incur serialization overhead.
Amazon EMR offers three deployment models. EMR on EC2 gives classic clusters with maximum configurability for sustained 24/7 workloads. EMR Serverless eliminates infrastructure management with pay-per-job billing for bursty jobs. EMR on EKS runs Spark as Kubernetes pods on existing EKS clusters. AWS reports up to 5.4x faster Spark performance on EMR than open-source.
Performance tuning has shifted from manual to dynamic. Adaptive Query Execution coalesces shuffle partitions, splits skewed joins, and converts sort-merge joins to broadcast at runtime. Combined with sensible partition sizing (100-200 MB), strategic caching of reused data, and broadcast hints for known-small dimensions, AQE handles most tuning that used to require deep expertise.
Key Terms
- Apache Spark: A unified analytics engine for large-scale distributed data processing, providing in-memory computation, lazy evaluation, and a layered API that runs on a cluster of machines coordinated by a driver and executors [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
- RDD (Resilient Distributed Dataset): Spark’s foundational abstraction: a fault-tolerant, partitioned collection of records that can be transformed in parallel and recomputed from lineage if a partition is lost [Source: http://archive.gersteinlab.org/meetings/s/2015/05.05/Advanced_Analytics_with_Spark-2.pdf].
- DataFrame: A distributed collection of rows organized into named, typed columns. The schema enables Catalyst optimization and is the canonical Spark API for nearly all modern workloads.
- Catalyst: Spark’s query optimizer. Translates DataFrame and SQL queries into logical plans, applies rule-based and cost-based optimizations, and produces physical execution plans [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
- Executor: A JVM worker process that runs on a cluster node, executes tasks assigned by the driver, caches partitions in memory, and reports heartbeats and results back [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
- Shuffle: The redistribution of data across partitions over the network, triggered by wide transformations like groupBy, join, and repartition. The most expensive operation in Spark and the primary tuning target [Source: https://spark.apache.org/docs/latest/cluster-overview.html].
- Broadcast join: A join strategy that sends the smaller table to every executor as a local hash table, eliminating the shuffle on the large side. The size threshold is configurable via spark.sql.autoBroadcastJoinThreshold (10 MB by default in open-source Spark) or spark.sql.adaptive.autoBroadcastJoinThreshold for AQE; this chapter's baseline configuration uses 30 MB [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
- EMR (Amazon EMR): AWS-managed big-data platform that runs Spark, Hadoop, Hive, HBase, Trino, and other frameworks. AWS reports EMR Spark runs up to 5.4x faster than open-source Spark [Source: https://aws.amazon.com/emr/].
- EMR Serverless: Fully managed EMR deployment model with pay-per-job billing (vCPU-hours and memory-hours), no idle cost, and 30-60 second cold start. Best for bursty, unpredictable workloads [Source: https://aws.amazon.com/emr/].
- AQE (Adaptive Query Execution): Spark 3.0+ feature that uses runtime statistics to re-plan queries during execution, automatically coalescing shuffle partitions, splitting skewed joins, and converting sort-merge joins to broadcast joins. Controlled by spark.sql.adaptive.enabled=true [Source: https://spark.apache.org/docs/latest/sql-performance-tuning.html].
Chapter 7: Cloud Data Warehousing with Redshift
Learning Objectives
By the end of this chapter, you will be able to:
- Describe the Redshift architecture, including the responsibilities of the leader and compute nodes, the role of slices in parallel execution, and how RA3 managed storage decouples compute from storage.
- Choose appropriate distribution and sort keys to optimize query performance for fact, dimension, and analytical workloads.
- Use Redshift Spectrum and Redshift Serverless to query data lake files in Amazon S3 without first loading them into the warehouse.
- Apply Workload Management (WLM), concurrency scaling, result caching, and materialized views to deliver predictable performance under mixed workloads.
Cloud data warehouses are the muscle of modern analytics platforms. They take the messy, distributed output of pipelines and lakes and turn it into something a business analyst can query in seconds, not hours. Amazon Redshift is one of the longest-running and most widely deployed cloud warehouses, and its architecture demonstrates many of the design ideas you will encounter in Snowflake, BigQuery, and Databricks SQL. If you understand how Redshift distributes work across compute nodes, sorts data on disk, and reaches into S3 through Spectrum, the rest of the cloud-warehouse universe becomes much easier to navigate.
Redshift Architecture
Leader Node and Compute Nodes
A Redshift cluster is a tightly coordinated team of specialists. The leader node is the conductor: when a SQL statement arrives, it parses the query, optimizes it, generates compiled C++ code, and ships that code along with execution instructions to the worker nodes. Critically, the leader node does not store user data and does not do the heavy lifting of scanning rows itself; its job is planning and coordination [Source: https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html]. Think of it as the air-traffic controller in a busy airport: it doesn’t fly the planes, but nothing lands or takes off without its instructions.
The compute nodes are the planes. Each compute node is a server with its own CPU cores, memory, high-bandwidth network interface, and (for RA3 node types) local SSD storage. Compute nodes execute the compiled query code in parallel, scanning data, evaluating predicates, computing aggregates, and shipping intermediate results back to the leader for final assembly [Source: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html].
| Component | Stores Data? | Executes Queries? | Talks to Clients? |
|---|---|---|---|
| Leader node | No | Plans only | Yes (single endpoint) |
| Compute node | Yes (via slices) | Yes (in parallel) | No (internal network) |
When a client application connects to the cluster, it always connects through the leader node’s endpoint. The leader node is therefore both the brain and the front door, while the compute nodes form the engine room.
Key Takeaway: The leader node plans queries and returns results; compute nodes do the actual scanning and computation. This separation lets Redshift add compute capacity without changing how clients connect.
Slices and Massively Parallel Processing
The unit of parallelism inside Redshift is not the node — it is the slice. Each compute node is partitioned into a fixed number of slices based on its instance type. For example, an ra3.xlplus node has 2 slices, while larger node types have more [Source: https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/]. Each slice receives a slice (pun intended) of the node’s memory and disk and is responsible for processing its share of the data in parallel with every other slice in the cluster.
This is the essence of Massively Parallel Processing (MPP): rather than a single server churning through a query, dozens or hundreds of slices work concurrently on disjoint partitions of the same dataset. If a query needs to scan a 10 billion-row fact table on a cluster with 32 slices, each slice scans roughly 312 million rows simultaneously. The query that would have taken minutes on a single machine completes in seconds [Source: https://curatepartners.com/general/decoding-redshift-architecture-how-node-types-mpp-design-impact-performance/].
A useful analogy is sorting mail in a giant post office. A single clerk (a traditional database) sorts every envelope sequentially. An MPP system hires 32 clerks (slices), gives each a bin labeled with a range of zip codes, and lets them all sort simultaneously. The supervisor (leader node) hands out the work and assembles the final stacks.
                Client SQL
                    |
                    v
              [Leader Node]
              plan + compile
                    |
          ----------+----------
          |         |         |
      [Node 1]  [Node 2]  [Node N]
       /   \     /   \     /   \
      S1   S2   S1   S2   S1   S2   <-- slices process in parallel
The cluster’s total parallel capacity is nodes x slices_per_node. Doubling the node count doubles the parallelism, which in well-designed schemas roughly halves query time.
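The capacity formula and the earlier 10-billion-row example can be checked with a few lines of arithmetic. The node counts below are hypothetical (16 nodes with 2 slices each is just one way to reach 32 slices), and the calculation assumes perfectly even data distribution.

```python
def total_slices(nodes, slices_per_node):
    """Total parallel capacity of the cluster."""
    return nodes * slices_per_node

def rows_per_slice(total_rows, nodes, slices_per_node):
    """Rows each slice scans when data is spread perfectly evenly."""
    return total_rows / total_slices(nodes, slices_per_node)

TOTAL_ROWS = 10_000_000_000

per_slice = rows_per_slice(TOTAL_ROWS, nodes=16, slices_per_node=2)
doubled = rows_per_slice(TOTAL_ROWS, nodes=32, slices_per_node=2)
print(f"32 slices: {per_slice:,.0f} rows each; 64 slices: {doubled:,.0f} rows each")
```

Doubling the node count halves the rows per slice, which is where the "roughly halves query time" estimate comes from. Uneven distribution or leader-side work breaks that proportionality.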
Figure 7.1: Redshift leader/compute/slice MPP architecture
flowchart TD
Client[Client Application / BI Tool] -->|SQL via single endpoint| Leader[Leader Node<br/>parse, optimize, compile]
Leader -->|compiled code + plan| N1[Compute Node 1]
Leader -->|compiled code + plan| N2[Compute Node 2]
Leader -->|compiled code + plan| N3[Compute Node N]
N1 --> S1A[Slice 1]
N1 --> S1B[Slice 2]
N2 --> S2A[Slice 1]
N2 --> S2B[Slice 2]
N3 --> S3A[Slice 1]
N3 --> S3B[Slice 2]
S1A -->|partial results| Leader
S1B -->|partial results| Leader
S2A -->|partial results| Leader
S2B -->|partial results| Leader
S3A -->|partial results| Leader
S3B -->|partial results| Leader
Leader -->|final result set| Client
Key Takeaway: Slices, not nodes, are the unit of parallel work. A query is fast when its data is spread evenly across slices and slow when one slice ends up with most of the rows.
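How evenly rows land on slices depends heavily on the column used to distribute them. The sketch below hashes two candidate columns to slices and compares the busiest slice against the ideal even share. It is plain Python with invented data: CRC32 stands in for Redshift's internal hash, and slice_for and skew_ratio are hypothetical helpers.

```python
import zlib
from collections import Counter

def slice_for(value, num_slices):
    """Hash a distribution-key value to a slice (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_slices

def skew_ratio(values, num_slices):
    """Row count of the busiest slice divided by the ideal even share."""
    counts = Counter(slice_for(v, num_slices) for v in values)
    ideal = len(values) / num_slices
    return max(counts.values()) / ideal

NUM_SLICES = 8
order_ids = range(100_000)                               # high-cardinality column
statuses = ["shipped"] * 90_000 + ["returned"] * 10_000  # low-cardinality column

even = skew_ratio(order_ids, NUM_SLICES)
skewed = skew_ratio(statuses, NUM_SLICES)
print(f"order_id skew: {even:.2f}x, status skew: {skewed:.2f}x")
```

A ratio near 1 means every slice does a similar amount of work. Distributing on the two-valued status column leaves most slices empty and one slice holding at least 90% of the rows, so the query runs at the speed of that one slice.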
RA3 Nodes and Managed Storage
The original Redshift node families (DS2, DC2) coupled compute and storage tightly: the disks lived inside the compute nodes, and growing the warehouse meant adding nodes you didn’t otherwise need. The RA3 family (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouples the two by introducing Redshift Managed Storage (RMS) [Source: https://aws.amazon.com/redshift/features/ra3/].
RMS uses a two-tier model:
| Tier | Location | Role |
|---|---|---|
| Tier 1 (hot) | Local NVMe SSDs on each RA3 node | Frequently accessed blocks; queried at SSD speed |
| Tier 2 (cold) | Amazon S3 (managed by Redshift) | Cold blocks, automatically offloaded; petabyte-scale |
Redshift continuously analyzes block temperature, age, and workload patterns, then prefetches hot blocks to local SSD before queries need them. From the user’s perspective, the storage just looks like one big disk, but in reality the hot working set lives next to the CPU while cold history sleeps in S3 [Source: https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/].
The financial impact is significant. With DC2 nodes, a 100 TB warehouse forced you to provision a large number of nodes simply to hold the data, even if your active queries only touched 5 TB. With RA3, you size the cluster for compute (vCPU and RAM) and pay for storage separately at S3-like rates. An ra3.xlplus cluster can address up to 32 TB per node and up to 1024 TB total [Source: https://aws.amazon.com/redshift/features/ra3/].
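A rough sizing sketch makes the difference concrete. The 32 TB managed-storage figure comes from the text above; the 2.5 TB of local disk per coupled node and the four nodes of compute are hypothetical numbers chosen only for illustration.

```python
import math


def nodes_required(data_tb: float, addressable_tb_per_node: float,
                   nodes_for_compute: int) -> int:
    """Node count is driven by whichever is larger: storage or compute need."""
    nodes_for_storage = math.ceil(data_tb / addressable_tb_per_node)
    return max(nodes_for_storage, nodes_for_compute)


DATA_TB = 100
NODES_FOR_COMPUTE = 4          # hypothetical: enough CPU/RAM for the 5 TB hot set

# Coupled storage: every terabyte must fit on node-local disk (assumed 2.5 TB).
coupled = nodes_required(DATA_TB, addressable_tb_per_node=2.5,
                         nodes_for_compute=NODES_FOR_COMPUTE)

# RA3: each ra3.xlplus node can address up to 32 TB of managed storage.
decoupled = nodes_required(DATA_TB, addressable_tb_per_node=32,
                           nodes_for_compute=NODES_FOR_COMPUTE)

print(f"coupled: {coupled} nodes, decoupled: {decoupled} nodes")
```

Under these assumptions the coupled cluster is sized almost entirely by disk, while the RA3 cluster is sized by the compute the workload actually needs.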
RA3 also supports cluster relocation — moving a cluster between Availability Zones with the same endpoint and zero RPO — because the persistent data already lives in regional S3 rather than node-local disk.
Figure 7.2: RA3 nodes with Redshift Managed Storage two-tier layout
flowchart LR
Q[Incoming Query] --> CL[Compute Layer<br/>RA3 Nodes vCPU + RAM]
CL -->|hot blocks| T1[Tier 1: Local NVMe SSD<br/>frequently accessed blocks]
CL -->|cold blocks fetched on demand| T2[Tier 2: Amazon S3<br/>Redshift Managed Storage]
T2 -.->|prefetch + tiering<br/>by block temperature| T1
T1 -.->|evict cold blocks| T2
T2 --> AZ[Cross-AZ Durability<br/>cluster relocation, zero RPO]
Key Takeaway: RA3 with Redshift Managed Storage lets you scale compute and storage independently, paying compute prices only for the cores you need and S3-tier prices for cold history.
Schema Design for Performance
Schema design is the single biggest lever for Redshift performance. Two tables with identical columns can have query times that differ by 100x simply because of distribution and sort key choices. Two concepts dominate this section: distribution styles decide which slice each row lives on, and sort keys decide where on disk the rows go within a slice.
Distribution Styles: KEY, ALL, EVEN, AUTO
When you load a row into Redshift, it has to land on exactly one slice. The distribution style of the table tells Redshift how to choose [Source: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-distribution-styles-and-distribution-keys/].
KEY distribution picks one column as the distribution key (DISTKEY). Redshift hashes that column’s value and routes the row to the slice corresponding to that hash. The killer feature is that two tables distributed on the same join column will have rows with matching key values landing on the same slice. When you join them on that column, no data has to move across the network; the join is collocated [Source: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html].
ALL distribution stores a complete copy of the table on the first slice of every node. Joins against an ALL-distributed table are always collocated regardless of the other side’s distribution, because every node already has every row. The trade-off is storage: a 1 GB ALL-distributed table on a 10-node cluster consumes 10 GB. ALL is therefore reserved for small, slowly changing dimension tables.
EVEN distribution uses round-robin: row 1 to slice 1, row 2 to slice 2, and so on. It guarantees uniform data spread but offers no help to joins: unless the other table is ALL-distributed, a join has to redistribute or broadcast rows across the network at query time.
AUTO distribution is the default when you do not specify a style. Redshift starts with ALL for small tables, switches to EVEN as they grow, and may convert to KEY based on observed query patterns [Source: https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html].
| Style | Best For | Storage Cost | Join Cost |
|---|---|---|---|
| KEY | Large fact-dimension joins on a stable key | 1x | Free if collocated |
| ALL | Small dimension tables joined many ways | N nodes x 1x | Always free |
| EVEN | Tables with no good join key | 1x | Redistribution required |
| AUTO | Unknown access pattern; let Redshift decide | Adapts | Adapts |
A canonical pattern is the fact + dimension schema. Suppose you have a sales fact table with 10 billion rows and a customer dimension with 50 million rows. They join on customer_id. You should:
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT NOT NULL,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);
CREATE TABLE customer (
    customer_id BIGINT NOT NULL,
    name        VARCHAR(200),
    region      VARCHAR(50)
)
DISTKEY (customer_id)
SORTKEY (customer_id);
Both tables now use customer_id as the DISTKEY, so all sales for customer 12345 sit on the same slice as customer 12345’s row in the dimension table. The join is collocated.
For a tiny dimension like country (200 rows), use DISTSTYLE ALL instead, so it joins for free against any fact table regardless of that table’s distribution.
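Collocation can be demonstrated with a toy model. The sketch below is an illustration only: CRC32 stands in for Redshift's internal hash function, and the four-slice cluster and sample rows are hypothetical.

```python
import zlib

NUM_SLICES = 4  # hypothetical tiny cluster


def slice_for_key(key: int, num_slices: int = NUM_SLICES) -> int:
    """Route a row to a slice by hashing its distribution key.

    CRC32 is a stand-in; Redshift's real hash function is internal.
    """
    return zlib.crc32(str(key).encode()) % num_slices


def round_robin(rows, num_slices: int = NUM_SLICES) -> dict:
    """EVEN distribution: the slice depends on arrival order, not on the key."""
    return {row[0]: i % num_slices for i, row in enumerate(rows)}


sales = [(1, 12345, 19.99), (2, 67890, 5.00), (3, 12345, 42.50)]  # (sale_id, customer_id, amount)
customers = [(12345, "Ada"), (67890, "Grace")]                     # (customer_id, name)

sales_slices = {sale_id: slice_for_key(cust) for sale_id, cust, _ in sales}
customer_slices = {cust: slice_for_key(cust) for cust, _ in customers}

# With KEY distribution, every sale sits on the same slice as its customer row,
# so the join needs no cross-slice data movement.
collocated = all(sales_slices[sid] == customer_slices[cust] for sid, cust, _ in sales)
print("KEY collocated:", collocated)

# With EVEN distribution, two sales for customer 12345 land on different slices.
print("EVEN placement:", round_robin(sales))
```

The same hash applied to the same value always yields the same slice, which is the whole mechanism behind a collocated join.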
Figure 7.3: Distribution style decision flow (KEY / ALL / EVEN / AUTO)
flowchart TD
Start[New Table] --> Q1{Is the table small<br/>and slowly changing?}
Q1 -->|Yes| ALL[DISTSTYLE ALL<br/>full copy on every node<br/>any join collocated]
Q1 -->|No| Q2{Is there a stable<br/>JOIN column?}
Q2 -->|Yes| KEY[DISTKEY column<br/>hash route to slice<br/>collocated joins]
Q2 -->|No| Q3{Need uniform spread<br/>without join hint?}
Q3 -->|Yes| EVEN[DISTSTYLE EVEN<br/>round-robin<br/>redistribution at join]
Q3 -->|Unknown / mixed| AUTO[DISTSTYLE AUTO<br/>ALL then EVEN then KEY<br/>adapts over time]
Key Takeaway: Choose DISTKEY by JOIN pattern. Co-locate the fact table and its primary dimension on the shared foreign key; replicate small dimensions with ALL.
Sort Keys: Compound vs Interleaved
Once a row is on its slice, the sort key decides where on disk it lands. Redshift stores each column’s data in 1 MB blocks and keeps a zone map recording the minimum and maximum value held in each block. When a query has a WHERE predicate on a sorted column, the engine consults the zone maps and skips entire blocks whose range cannot match the filter [Source: https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html]. On a 10 TB table, a good sort key can turn a full scan into a lookup that finishes in a few hundred milliseconds.
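Zone-map skipping can be simulated in a few lines. This is a minimal sketch, assuming a hypothetical block size of 1,000 rows in place of a real 1 MB block and a synthetic column of day numbers; it is not how Redshift stores data internally.

```python
import random

BLOCK_ROWS = 1000  # hypothetical rows per block, standing in for a 1 MB block


def build_zone_map(values, block_rows: int = BLOCK_ROWS):
    """Return a (min, max) pair for each consecutive block of values."""
    return [
        (min(values[i:i + block_rows]), max(values[i:i + block_rows]))
        for i in range(0, len(values), block_rows)
    ]


def blocks_to_scan(zone_map, lo, hi) -> int:
    """Count blocks whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


random.seed(7)
day_numbers = [random.randint(1, 365) for _ in range(100_000)]

unsorted_map = build_zone_map(day_numbers)
sorted_map = build_zone_map(sorted(day_numbers))

# Equivalent of: WHERE day_number = 127
unsorted_hits = blocks_to_scan(unsorted_map, 127, 127)
sorted_hits = blocks_to_scan(sorted_map, 127, 127)
print(f"unsorted: {unsorted_hits} of {len(unsorted_map)} blocks")
print(f"sorted:   {sorted_hits} of {len(sorted_map)} blocks")
```

In the unsorted layout nearly every block's range spans the target value, so nothing can be skipped; once the column is sorted, the matching rows are contiguous and only one or two blocks qualify.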
There are two sort-key flavors:
Compound sort key (the default) sorts rows by the first column, then by the second within ties, then by the third within those ties, and so on — exactly like a phone book sorted by last name, then first name, then middle initial. Compound keys deliver dramatic speedups when queries filter on the leading column or columns of the key. A (sale_date, region, state) compound key shines for WHERE sale_date = '2026-05-07', still helps for WHERE sale_date = '2026-05-07' AND region = 'us-east', but provides no benefit for WHERE state = 'CA' alone.
Interleaved sort key gives equal weight to each column in the key, no matter the order, by interleaving values via a space-filling curve. It helps when queries filter on different subsets of the key columns over time. The trade-offs are real: loads and VACUUM REINDEX operations are significantly slower, and interleaved keys should not be used on monotonically increasing columns like identity IDs or timestamps unless the column has long shared prefixes [Source: https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-interleaved.html].
| Aspect | Compound | Interleaved |
|---|---|---|
| Filter pattern | Leading-column predicates | Any subset of key columns |
| Load cost | Low | Higher |
| VACUUM cost | Standard | Significant (REINDEX) |
| Best on | Time-series, ordered keys | Multi-dimensional cubes |
| Avoid on | (works broadly) | Monotonic IDs, dates alone |
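The contrast in the table can be reproduced with a toy experiment. The sketch below uses a Z-order (Morton) curve as one example of a space-filling curve; that choice, the 16 x 16 grid, and the 16-row blocks are assumptions made for illustration, not a description of Redshift's internal implementation.

```python
def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Morton code: x occupies the even bit positions, y the odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def blocks_scanned(rows, column: int, value: int, block_rows: int = 16) -> int:
    """Count blocks whose min/max range for `column` could contain `value`."""
    hits = 0
    for i in range(0, len(rows), block_rows):
        col_values = [row[column] for row in rows[i:i + block_rows]]
        if min(col_values) <= value <= max(col_values):
            hits += 1
    return hits


grid = [(x, y) for x in range(16) for y in range(16)]
compound = sorted(grid)                                            # sort by x, then y
interleaved = sorted(grid, key=lambda r: interleave_bits(r[0], r[1]))

for label, rows in (("compound", compound), ("interleaved", interleaved)):
    print(label,
          "| filter x=5:", blocks_scanned(rows, 0, 5),
          "| filter y=5:", blocks_scanned(rows, 1, 5))
```

On this grid the compound ordering reads a single block for the leading column but all 16 blocks for the second column, while the interleaved ordering reads 4 blocks for either filter: worse on the leading column, far better on the other.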
A useful analogy: the compound sort key is a strict library card catalog filed by author then title; you find books quickly only if you know the author. The interleaved key is more like a magic catalog where any single attribute (author, title, subject) gets you close to the row, at the cost of a much more expensive nightly re-shelving.
The general rule is to start with a compound sort key on the most common filter column (often a date) and only consider interleaved if you have hard evidence of multi-dimensional ad-hoc filtering at scale.
Key Takeaway: Sort keys reduce I/O by letting Redshift skip blocks via zone maps. Compound is best for predictable, leading-column filters; interleaved is for genuinely multi-dimensional filtering, with higher maintenance cost.
Compression Encodings
Every column in Redshift is stored in a columnar format with a compression encoding. Because columns store homogeneous data, they compress phenomenally well: a column of country codes might compress 20:1, a date column 8:1, a wide string 3:1.
Common encodings include:
- AZ64 — Amazon’s proprietary encoding, well suited to numeric and date types; the modern default for most numeric columns.
- ZSTD — General-purpose, high-ratio compression; a strong default for VARCHAR and CHAR columns.
- LZO — A legacy general-purpose encoding; mostly superseded by ZSTD.
- BYTEDICT, DELTA, MOSTLY8/16/32, RUNLENGTH, TEXT255, TEXT32K — Specialized encodings for low-cardinality, sequential, or sparse data.
- RAW — No compression; used for sort key columns where the encoding overhead would slow scans.
Modern best practice is to let COPY choose encodings via COMPUPDATE ON, or run ANALYZE COMPRESSION and apply the recommendations. The savings translate directly to performance: smaller blocks mean more rows fit in memory and fewer bytes traverse the network and SSD.
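To see why low-cardinality columns compress so well, consider simplified versions of the ideas behind the RUNLENGTH and BYTEDICT encodings. The functions below are teaching sketches with hypothetical names; they do not reproduce Redshift's on-disk formats.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


country = ["DE"] * 4 + ["FR"] * 3 + ["US"] * 5   # 12 values, 3 distinct

runs = run_length_encode(country)
dictionary, codes = dictionary_encode(country)

print(runs)               # three (value, count) pairs instead of twelve strings
print(dictionary, codes)  # three strings plus twelve small integers
```

Run-length encoding pays off when equal values sit next to each other, which is one more reason sort order and compression interact.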
Key Takeaway: Columnar storage with per-column compression slashes both storage cost and scan I/O. Use AZ64 for numerics and ZSTD for strings unless analysis suggests otherwise.
Redshift Spectrum and Serverless
Redshift’s relevance has grown precisely because it is no longer just a closed warehouse; it can reach out and query the data lake directly. Redshift Spectrum is the bridge to S3, and Redshift Serverless changes the operational model so you don’t have to babysit clusters.
Querying S3 Data with Spectrum
Redshift Spectrum lets a Redshift cluster run SQL directly against files in S3 — Parquet, ORC, CSV, JSON, Avro — without first loading them into Redshift tables [Source: https://aws-reference-architectures.gitbook.io/datalake/data-analytics/redshift-spectrum]. The mental model is a federation layer: Redshift sees the S3 data as external tables living in an external schema that is backed by the AWS Glue Data Catalog (or a Hive Metastore).
The wiring looks like this:
Redshift Cluster
       |
       v
External Schema --(metadata)-->   Glue Data Catalog
       |                                  |
       v                                  v
External Tables --(file paths)--> S3 Buckets (Parquet/ORC/...)
A typical setup creates an IAM role granting Redshift S3 read access plus Glue/Athena permissions, and then runs:
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
REGION 'us-east-1';
After this single command, every table cataloged in the analytics_lake Glue database appears in Redshift as spectrum_schema.<table_name> and is queryable with ordinary SQL [Source: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html].
The most powerful pattern is mixing local and external data in one query. Imagine recent hot data lives in a Redshift fact table while colder historical data has been archived to S3. A single query can join across both:
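-- Pre-aggregate each source per customer so the two joins cannot multiply rows.
WITH hot AS (
    SELECT customer_id, SUM(amount) AS amount
    FROM public.sales
    WHERE sale_date >= DATEADD(month, -3, CURRENT_DATE)
    GROUP BY customer_id
),
cold AS (
    SELECT customer_id, SUM(amount) AS amount
    FROM spectrum_schema.archived_sales
    GROUP BY customer_id
)
SELECT c.region,
       SUM(hot.amount)  AS hot_amount,
       SUM(cold.amount) AS cold_amount
FROM public.customer c
LEFT JOIN hot  ON hot.customer_id  = c.customer_id
LEFT JOIN cold ON cold.customer_id = c.customer_id
GROUP BY c.region;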
Spectrum runs on a separate, massively scaled fleet of workers behind the scenes. They perform predicate pushdown (sending WHERE filters down to the file scan), projection pushdown (reading only the needed columns), and partition pruning (using Hive-style partition layouts like s3://lake/sales/year=2026/region=us-east/ to skip whole prefixes) [Source: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html]. Because Spectrum is priced per terabyte scanned, partition pruning and columnar formats translate directly into dollars saved.
A practical cost example: a 10 TB CSV-format data lake fully scanned costs roughly 10 x $5 = $50 per query. The same data stored as Parquet with year and region partitions, queried with WHERE year = 2026 AND region = 'us-east', might scan only 100 GB — about $0.50 per query. Same answer, 100x cheaper.
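The same comparison can be expressed as a tiny calculator. It assumes the $5 per terabyte rate used in the example and treats 100 GB as 0.1 TB; the function name is illustrative.

```python
PRICE_PER_TB_USD = 5.00  # per-terabyte rate used in the example above


def scan_cost_usd(terabytes_scanned: float,
                  price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query that scans the given volume of data."""
    return terabytes_scanned * price_per_tb


full_csv_scan = scan_cost_usd(10)      # 10 TB of CSV, nothing pruned
pruned_parquet = scan_cost_usd(0.1)    # 100 GB after partition and column pruning

print(f"full scan:   ${full_csv_scan:.2f}")
print(f"pruned scan: ${pruned_parquet:.2f}")
print(f"ratio:       {full_csv_scan / pruned_parquet:.0f}x")
```

Multiply either figure by the number of times a dashboard or report runs per day and the gap becomes a line item worth optimizing.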
Figure 7.4: Spectrum query path from Redshift to S3 via the Glue Data Catalog
sequenceDiagram
participant C as Client
participant L as Redshift Leader Node
participant G as AWS Glue Data Catalog
participant SF as Spectrum Worker Fleet
participant S3 as Amazon S3 (Parquet/ORC)
participant CN as Redshift Compute Nodes
C->>L: SQL referencing spectrum_schema.archived_sales JOIN public.sales
L->>G: Resolve external schema + table metadata
G-->>L: File paths, partitions, column stats
L->>L: Plan: pushdown WHERE/projection, pick partitions
L->>SF: Dispatch external scan with predicates
SF->>S3: Read only matching partitions/columns
S3-->>SF: Filtered Parquet row groups
SF-->>CN: Stream filtered rows to compute nodes
CN->>CN: Join with local Redshift tables
CN-->>L: Partial aggregates
L-->>C: Final result set
Key Takeaway: Spectrum extends Redshift SQL to the S3 data lake without ETL, charging per terabyte scanned. Columnar formats, partitioning, and predicate pushdown are the levers that keep both latency and cost in check.
Redshift Serverless Capacity Model
For many workloads — ad-hoc analytics, data-app backends, sporadic dashboards — running an always-on cluster is overkill. Redshift Serverless replaces the cluster abstraction with a capacity unit called the RPU (Redshift Processing Unit). You set a base capacity (in RPUs) and an optional max, and Redshift auto-scales between them as workloads demand.
| Dimension | Provisioned | Serverless |
|---|---|---|
| Pricing unit | Cluster-hour | RPU-hour |
| Scaling | Manual (resize) | Automatic |
| Cold start | None (always on) | ~30 seconds first query |
| Idle cost | Yes (cluster always billed) | None (pause when idle) |
| WLM | Manual queues | Automatic |
| Best for | Steady, predictable load | Bursty, unpredictable load |
Serverless still supports Spectrum with identical SQL, so a common pattern is “Serverless + Spectrum” for a low-touch data-lake query layer: you build a Glue Data Catalog over an S3 lakehouse, point a Redshift Serverless workgroup at it, and pay only for the RPU-seconds that actual queries consume.
Data Sharing Across Clusters
Redshift data sharing lets one producer cluster expose specific schemas and tables to one or more consumer clusters — even across AWS accounts and regions — without copying data. The producer creates a DATASHARE, grants objects to it, and authorizes specific consumers; the consumer creates a database from that share and queries it as if it were local. This is the standard pattern for separating an ETL/ELT producer cluster from BI consumer clusters, or for sharing curated data with another business unit, while keeping a single source of truth in RMS.
Key Takeaway: Redshift Serverless trades cluster management for RPU-based auto-scaling, while data sharing lets multiple clusters read from one canonical dataset without copies. Together, they form the substrate of a flexible, multi-team analytics platform.
Workload Management
Even on a perfectly designed schema, performance falls apart when ETL jobs, executive dashboards, and data-scientist ad-hoc queries all hit the same cluster simultaneously. Workload Management (WLM) is Redshift’s mechanism for keeping mixed workloads predictable.
Automatic WLM and Queues
WLM organizes incoming queries into queues. Each queue gets a share of the cluster’s memory and a maximum concurrency (the number of queries that can run simultaneously). Queries are routed to queues based on user group, query group label, or query characteristics.
Manual WLM lets you define queues by hand: “ETL queue gets 50% memory and concurrency 4; BI queue gets 30% memory and concurrency 8; ad-hoc queue gets the remaining 20% memory and concurrency 5.” This works but requires constant tuning as workloads evolve.
Automatic WLM — now the recommended mode — lets Redshift learn from query history and dynamically allocate memory and concurrency per query. You specify only query priorities (LOWEST, LOW, NORMAL, HIGH, HIGHEST), and the engine does the rest, including allocating more memory to a complex aggregate and less to a small lookup [Source: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html].
A good operational pattern:
| Workload | Queue / Priority | Rationale |
|---|---|---|
| Critical ETL loads | HIGHEST | Must finish in their batch window |
| Executive dashboards | HIGH | User-facing, low-latency expectations |
| Analyst ad-hoc | NORMAL | Many users, tolerate seconds |
| Long-running reports | LOW | Background, can wait |
| Experimental queries | LOWEST | Don’t impact others |
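The effect of priorities on admission order can be sketched with a simple priority queue. This is a deliberately simplified model: real Automatic WLM also manages memory and concurrency dynamically, and the class and query names here are hypothetical.

```python
import heapq
from itertools import count

PRIORITY_RANK = {"HIGHEST": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3, "LOWEST": 4}


class PriorityAdmission:
    """Toy admission queue: higher priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def submit(self, name: str, priority: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._arrival), name))

    def next_query(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = PriorityAdmission()
queue.submit("analyst ad-hoc", "NORMAL")
queue.submit("experimental", "LOWEST")
queue.submit("nightly ETL load", "HIGHEST")
queue.submit("exec dashboard", "HIGH")

order = [queue.next_query() for _ in range(4)]
print(order)
```

Even though the ETL load arrived third, it is admitted first, which is the behavior the priority table above is designed to produce.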
Concurrency Scaling
WLM can only redistribute the cluster’s fixed capacity. When every queue is full, queries pile up. Concurrency scaling solves this by automatically spinning up transient secondary clusters when read-only queries queue up; results return through the original cluster, so applications never see the secondary clusters [Source: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html].
Concurrency scaling credits accrue at one free hour for every 24 hours the main cluster runs; usage beyond the accrued credits is metered per second. For a Black Friday spike or a Monday-morning dashboard rush, this turns a 30-second queued query back into a 2-second query without permanent provisioning.
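A short sketch shows how the free allowance offsets usage. It assumes only the accrual rule stated above (one free hour per 24 cluster-hours) and ignores any caps or regional pricing details; the 30-day month and 40 hours of usage are hypothetical.

```python
def billable_scaling_seconds(cluster_hours_run: float,
                             scaling_seconds_used: float) -> float:
    """Seconds of concurrency scaling billed after free credits are applied.

    Assumes one free hour accrues per 24 hours the main cluster runs.
    """
    free_seconds = (cluster_hours_run / 24) * 3600
    return max(0.0, scaling_seconds_used - free_seconds)


# Hypothetical month: the cluster ran for 30 days and used 40 hours of scaling.
billable = billable_scaling_seconds(30 * 24, 40 * 3600)
print(f"billable hours: {billable / 3600:.1f}")

# A lighter month that stays within the accrued credits costs nothing extra.
print(billable_scaling_seconds(30 * 24, 12 * 3600))
```

For workloads whose bursts fit inside the accrued credits, concurrency scaling is effectively free.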
Result Caching and Materialized Views
The fastest query is the one you don’t run. Result caching stores the results of recent queries in the leader node’s memory; if an identical query arrives and the underlying data has not changed, Redshift returns the cached result in milliseconds. This is invisible — no configuration required — and especially powerful for repetitive dashboard refreshes.
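The caching rule (identical query, unchanged data) can be modeled with a dictionary. This is a conceptual sketch with hypothetical names; the data version counter stands in for whatever mechanism Redshift uses to detect that underlying tables have changed.

```python
class ResultCache:
    """Toy result cache keyed by query text and a data version counter."""

    def __init__(self, run_query):
        self._run_query = run_query
        self._cache = {}
        self.data_version = 0
        self.executions = 0

    def data_changed(self) -> None:
        """Call after any write; older cached results stop matching."""
        self.data_version += 1

    def query(self, sql: str):
        key = (sql, self.data_version)
        if key not in self._cache:
            self.executions += 1          # cache miss: actually run the query
            self._cache[key] = self._run_query(sql)
        return self._cache[key]


cache = ResultCache(run_query=lambda sql: f"rows for: {sql}")
cache.query("SELECT COUNT(*) FROM sales")
cache.query("SELECT COUNT(*) FROM sales")   # identical and unchanged: cache hit
cache.data_changed()
cache.query("SELECT COUNT(*) FROM sales")   # data changed, so it runs again

print("executions:", cache.executions)
```

Three requests result in two executions: the repeat is served from the cache, and the write invalidates it.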
For more complex acceleration, materialized views precompute and persist the results of a query (typically an expensive join or aggregation) and let subsequent queries hit the precomputed data:
CREATE MATERIALIZED VIEW mv_daily_sales_by_region
AUTO REFRESH YES
AS
SELECT c.region,
s.sale_date,
SUM(s.amount) AS daily_amount,
COUNT(*) AS order_count
FROM sales s
JOIN customer c USING (customer_id)
GROUP BY c.region, s.sale_date;
With AUTO REFRESH YES, Redshift incrementally maintains the view as new rows arrive in sales. Even better, the optimizer’s automatic query rewrite feature can transparently redirect a user’s query against sales and customer to the materialized view when the view’s columns and predicates match — analysts get the speedup without changing their SQL.
Materialized views are particularly effective for:
- Daily/weekly/monthly rollups consumed by BI dashboards.
- Star-schema joins that appear in many dashboards.
- Pre-aggregated KPIs (revenue by region, active users by cohort).
Key Takeaway: WLM and concurrency scaling keep mixed workloads honest, while result caching and materialized views make the most expensive queries effectively free for repeat callers. Together they let one cluster serve ETL, BI, and ad-hoc users without anyone fighting for resources.
Chapter Summary
Redshift is a textbook example of how a cloud data warehouse trades complexity for performance. The leader node plans queries; the compute nodes execute them in parallel; slices are the actual unit of parallelism; and MPP is the design philosophy that ties them together. The RA3 node family with Redshift Managed Storage decouples compute from storage, letting hot data live on local SSD while cold data sleeps cheaply in S3 — enabling petabyte-scale warehouses sized purely for compute.
Schema design — distribution styles (KEY, ALL, EVEN, AUTO) and sort keys (compound, interleaved) — is by far the largest performance lever. KEY distribution co-locates joining tables; ALL replicates small dimensions; sort keys with zone maps let the engine skip 1 MB blocks entirely. Columnar storage with per-column compression encodings (AZ64, ZSTD, and friends) compounds these savings.
Redshift Spectrum turns the cluster into a federated query engine over S3, with external schemas backed by the Glue Data Catalog and aggressive predicate, projection, and partition pushdown. Redshift Serverless removes cluster management with RPU-based auto-scaling, while data sharing lets multiple clusters read one canonical copy of the data.
Operational excellence comes from Workload Management (preferably the automatic flavor), which routes queries to priority-based queues; from concurrency scaling, which adds transient capacity during bursts; and from result caching and materialized views, which collapse the cost of repetitive analytics to near zero. Master these levers and Redshift becomes a serious general-purpose platform for analytics, BI, and lakehouse workloads, and a foundation for the federated querying, orchestration, and governance patterns in the chapters ahead.
Key Terms
- Amazon Redshift — AWS’s managed, petabyte-scale, MPP cloud data warehouse, accessed through a single SQL endpoint and capable of querying both warehouse-resident and S3-resident data.
- MPP (Massively Parallel Processing) — An architecture in which many compute units (in Redshift, slices) execute the same query in parallel against disjoint partitions of the data, then combine results.
- Leader node — The single coordinator node in a Redshift cluster; parses, optimizes, compiles, and dispatches queries, but does not store user data.
- Compute node — A worker node that holds data on local SSD (RA3) and executes the leader’s compiled query code in parallel with peers.
- Distribution key (DISTKEY) — A column whose hash decides which slice each row of a table is stored on; choosing the join column as the DISTKEY co-locates joins and eliminates network traffic.
- Sort key (SORTKEY) — One or more columns that determine the on-disk order of rows within a slice, enabling Redshift to skip 1 MB blocks via zone maps when filtering.
- RA3 — The modern Redshift node family (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) that decouples compute from storage via Redshift Managed Storage, letting compute and storage scale and bill independently.
- Spectrum — Redshift’s external query engine that lets clusters run SQL directly against S3 files described by external schemas in the Glue Data Catalog, billed per terabyte scanned.
- WLM (Workload Management) — Redshift’s queue-and-priority system that allocates memory and concurrency to incoming queries; available in manual and automatic modes.
- Materialized view — A precomputed, persisted query result inside Redshift, optionally auto-refreshed and auto-rewritten by the optimizer to accelerate repetitive joins and aggregations.
Chapter 8: Interactive Querying and Federated Analytics
Learning Objectives
By the end of this chapter, you will be able to:
- Use Amazon Athena to run SQL queries against data in Amazon S3 without provisioning or managing servers.
- Optimize Athena query cost and performance through partitioning, columnar formats, compression, bucketing, and CTAS rewrites.
- Build federated queries that span S3, Amazon RDS, Amazon DynamoDB, and external systems such as Snowflake and BigQuery.
- Compare Trino/Presto-based engines (Athena, EMR Trino) with persistent warehouse engines (Redshift) and decide when each is the right tool for the job.
Earlier chapters built a curated lakehouse on top of S3 and the AWS Glue Data Catalog, and the previous chapter added a Redshift warehouse that can query it. That work pays off only when humans and applications can ask questions of the data. This chapter is about asking questions: interactively, across heterogeneous storage systems, and at a cost you can defend in a budget review. We treat Amazon Athena as the canonical serverless query engine, but the core concepts (data layout, projection pushdown, the Trino execution model, federation through connectors) carry over to other modern lakehouse query engines you might encounter.
Amazon Athena Fundamentals
Most data warehouses follow a “load before you query” model: you stand up a cluster, ingest data, and only then run SQL. Amazon Athena inverts that contract. The data already lives in Amazon S3; you point Athena at it and run SQL immediately. There are no nodes to size, no maintenance windows, and no idle capacity to amortize.
Serverless Query Model
The clearest analogy for Athena is a public library with a fleet of on-call researchers. You arrive with a question, hand it to the front desk (the Athena API), and somewhere behind the scenes a researcher is dispatched to pull exactly the books your question requires. When the answer comes back, the researcher disappears. You never see them. You never pay for the time they spend resting between visits. You pay only for the books they had to open.
That is the operational model Athena exposes. Architecturally, Athena is a managed deployment of the Trino distributed SQL engine (which began life as Facebook’s Presto project, split into the PrestoDB and PrestoSQL forks, with PrestoSQL renamed Trino in 2020). When you submit a query, Athena allocates worker capacity from a shared, multi-tenant pool, plans the query, scans S3 objects in parallel, and streams results back [Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/]. The metadata it relies on (table definitions, column types, partition locations) comes from the AWS Glue Data Catalog, introduced alongside Redshift Spectrum in Chapter 7, which means any table registered for Glue ETL or Lake Formation governance is immediately queryable from Athena without further configuration.
Because the engine is serverless, Athena’s “always-on availability” is not a marketing claim but a structural property: there is no cluster for you to pause, downsize, or lose to a node failure. The same SQL that worked yesterday at 2 a.m. will work today at 2 p.m., even if no one ran a query in between.
Figure 8.1: Athena serverless Trino execution model
flowchart LR
User[SQL Client / Console] -->|Submit query| API[Athena API]
API -->|Lookup table + partitions| Glue[(AWS Glue Data Catalog)]
API -->|Plan + dispatch| Pool[Shared Trino Worker Pool]
Pool -->|Parallel scans| S3[(Amazon S3 objects)]
S3 -->|Columnar reads| Pool
Pool -->|Stream results| API
API -->|Result set| User
Pool -.->|Workers released after query| Pool
Key Takeaway: Athena is a serverless Trino deployment that queries S3 directly using metadata from the Glue Data Catalog, removing all infrastructure management while preserving full ANSI SQL semantics.
Athena Engine Versions and the Trino Lineage
Athena exposes its underlying engine through versioned releases that workgroup administrators select. Engine version 2 was based on PrestoDB; engine version 3 (the current default for new workgroups) is based on Trino and ships with a meaningful set of upgrades — Apache Iceberg native support, improved geospatial functions, the EXPLAIN ANALYZE planner output, and a cost-based optimizer that consumes Glue table statistics [Source: https://aws.amazon.com/blogs/big-data/speed-up-queries-with-cost-based-optimizer-in-amazon-athena/].
A short genealogy clarifies why engine version matters:
| Lineage | Origin | Status in Athena |
|---|---|---|
| Presto (original) | Facebook, 2012 | Renamed PrestoDB; ancestor of engine v2 |
| PrestoSQL | 2018 fork by original Presto creators | Renamed Trino in 2020 |
| Trino | Active open-source project | Powers Athena engine v3 and EMR Trino |
Trino’s distinguishing features are its massively parallel execution, its connector architecture (which we exploit for federation later), and its memory-resident intermediate results. Unlike Spark, which writes shuffle data to disk, Trino keeps query state in memory across worker nodes — making interactive sub-second responses possible on terabyte-scale data, but causing Trino to fall over on multi-hour ETL jobs that exceed cluster memory. Trino is the tool for interactive analytics; Spark is the tool for batch transformation. Athena reflects that boundary.
Key Takeaway: Engine version 3 is built on Trino, not legacy Presto, and brings Iceberg, EXPLAIN ANALYZE, and cost-based optimization — features that materially change how you write and tune queries.
Pricing Per Terabyte Scanned
Athena bills $5.00 per terabyte of data scanned in S3 (US regions, on-demand pricing), with a 10 MB minimum charge per query [Source: https://docs.aws.amazon.com/athena/latest/ug/when-should-i-use-ate.html]. There is no charge for query planning, no cluster-hour charge, and no data-stored charge (S3 storage is billed separately).
This pricing model inverts the optimization mindset most engineers bring from warehouses. In Redshift or Snowflake, you optimize for wall-clock time — a faster query frees up cluster capacity. In Athena, you optimize for bytes read from S3. A query that scans 10 GB in 45 seconds costs the same as one that scans 10 GB in 5 seconds. Engineering effort to shave wall-clock time without reducing bytes saves nothing; effort spent reducing bytes saves money on every subsequent run.
Consider a concrete example. A daily report scans a 1 TB CSV clickstream table. At $5/TB, that costs $5/day or roughly $1,825/year. Convert to Snappy-compressed Parquet partitioned by event_date, and a single-day query scans about 800 MB. The annual cost drops below $1.50. The conversion took an afternoon; the savings recur forever.
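These figures can be reproduced with a small calculator. The sketch assumes decimal units (1 TB = 10^12 bytes) for simplicity and applies the 10 MB per-query minimum mentioned above; the function names are illustrative.

```python
PRICE_PER_TB_USD = 5.00
MIN_BILLED_BYTES = 10 * 10**6   # 10 MB minimum charge per query
BYTES_PER_TB = 10**12           # decimal terabyte, an assumption for simplicity


def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of one query, honoring the per-query minimum."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD


def annual_cost_usd(bytes_scanned_per_run: int, runs_per_year: int = 365) -> float:
    """Cost of running the same query once a day for a year."""
    return query_cost_usd(bytes_scanned_per_run) * runs_per_year


csv_annual = annual_cost_usd(10**12)            # 1 TB CSV scan per day
parquet_annual = annual_cost_usd(800 * 10**6)   # about 800 MB per day

print(f"CSV:     ${csv_annual:,.2f} per year")
print(f"Parquet: ${parquet_annual:,.2f} per year")
```

Note that the minimum charge means very small queries all cost the same, so shrinking a scan below 10 MB yields no further savings.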
Key Takeaway: Athena’s per-terabyte-scanned pricing makes data layout, not query complexity, the dominant cost driver — every optimization in this chapter is, at its core, about reducing bytes read from S3.
Query Optimization
Once you internalize “minimize bytes scanned,” the optimization toolkit organizes itself naturally. There are three layers to attack: what you read (partitioning), how you read it (columnar formats), and what you compute on it once read (CTAS for materialization).
Partition Projection
Partitioning is the practice of arranging files in S3 under a directory hierarchy that encodes a column value in the path. The Hive convention (which Athena, Glue, and EMR all follow) uses key=value segments:
s3://my-lake/sales/year=2024/month=11/day=07/sales-001.parquet
s3://my-lake/sales/year=2024/month=11/day=07/sales-002.parquet
s3://my-lake/sales/year=2024/month=11/day=08/sales-001.parquet
When you submit SELECT * FROM sales WHERE year=2024 AND month=11 AND day=07, Athena consults the Glue Catalog for partition locations, identifies that only one prefix matches, and instructs Trino workers to read only that prefix. Partitions outside the predicate are pruned before any S3 GET request fires. On a table with 365 day-partitions, a single-day query touches roughly 1/365th of the data — a 99.7% reduction in bytes scanned [Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/].
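Pruning on Hive-style paths amounts to parsing key=value segments and filtering on them. The sketch below illustrates the idea over a list of object keys; Athena itself works from partition metadata rather than by parsing every object path, and the helper names are hypothetical.

```python
def parse_partition(key: str) -> dict:
    """Extract key=value segments from a Hive-style object path."""
    parts = (segment.split("=", 1) for segment in key.split("/") if "=" in segment)
    return {name: int(value) for name, value in parts}


def prune(keys, **predicate):
    """Keep only objects whose partition values match every predicate."""
    return [
        key for key in keys
        if all(parse_partition(key).get(col) == val for col, val in predicate.items())
    ]


objects = [
    "s3://my-lake/sales/year=2024/month=11/day=07/sales-001.parquet",
    "s3://my-lake/sales/year=2024/month=11/day=07/sales-002.parquet",
    "s3://my-lake/sales/year=2024/month=11/day=08/sales-001.parquet",
]

# Equivalent of: WHERE year=2024 AND month=11 AND day=07
matching = prune(objects, year=2024, month=11, day=7)
for key in matching:
    print(key)
```

Only the two files under day=07 survive; the day=08 file is never opened, which is exactly where the bytes-scanned savings come from.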
The traditional approach requires you to call MSCK REPAIR TABLE or run a Glue crawler whenever new partitions appear. For a daily-partitioned table that runs for years, the catalog accumulates thousands of partition entries, and partition listing itself becomes a bottleneck. Athena’s solution is partition projection: instead of materializing every partition in the catalog, you declare the partition pattern in table properties, and Athena synthesizes the partition list on the fly from the WHERE clause.
-- The digits properties zero-pad month and day so projected paths
-- match the S3 layout shown earlier (month=11/day=07).
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://my-lake/sales/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://my-lake/sales/year=${year}/month=${month}/day=${day}/'
);
With projection enabled, Athena computes partition listings from the declared ranges and only issues S3 list operations against partitions satisfying the predicate. This is particularly valuable for tables with high partition cardinality (e.g., per-minute IoT data) where catalog lookups would dwarf the actual scan cost.
A useful analogy: traditional Hive partitioning is a library where every book is individually catalogued by hand. Partition projection is a library where the librarian knows the shelving rule and computes where any book lives without consulting an index card.
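To make the "librarian computes the shelf" idea concrete, here is a minimal sketch of how partition locations can be derived purely from declared ranges and a path template. It is a toy model, not Athena's implementation: the `PROJECTION` dictionary, the `project_partitions` function, and the Python format string are all assumptions made for illustration, with month and day zero-padded to match the example S3 paths.

```python
from itertools import product

# Declared ranges, mirroring the projection.* table properties.
PROJECTION = {
    "year": range(2020, 2031),
    "month": range(1, 13),
    "day": range(1, 32),
}
TEMPLATE = "s3://my-lake/sales/year={year}/month={month:02d}/day={day:02d}/"

def project_partitions(predicate):
    """Enumerate candidate S3 prefixes without consulting any catalog.

    `predicate` maps a partition column to the set of values the WHERE
    clause allows; columns that are absent fall back to the declared range.
    """
    per_column = []
    for column, declared in PROJECTION.items():
        allowed = predicate.get(column)
        per_column.append([v for v in declared if allowed is None or v in allowed])
    return [TEMPLATE.format(year=y, month=m, day=d)
            for y, m, d in product(*per_column)]

single_day = project_partitions({"year": {2024}, "month": {11}, "day": {7}})
whole_month = project_partitions({"year": {2024}, "month": {11}})
print(single_day)
print(len(whole_month))
```

Note that the month-level query yields 31 candidate prefixes even though November has 30 days: projection computes locations from the rule, so a projected prefix may simply turn out to be empty.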
Figure 8.2: Hive partition lookup vs partition projection
flowchart TD
Q[Query: WHERE year=2024 AND month=11 AND day=07]
Q --> Mode{Partition mode?}
Mode -->|Hive partitioning| H1[Glue Catalog<br/>list partitions API]
H1 --> H2[Materialize all<br/>registered partitions]
H2 --> H3[Filter against predicate]
H3 --> Scan[Issue S3 GET on<br/>matching prefix]
Mode -->|Partition projection| P1[Read TBLPROPERTIES<br/>projection.* ranges]
P1 --> P2[Compute partition path<br/>from storage.location.template]
P2 --> Scan
Scan --> Result[(Parquet row groups)]
Key Takeaway: Partitioning prunes data before it is read; partition projection eliminates the catalog-lookup overhead of traditional partitioning, making it the default choice for high-cardinality time-series tables.
Columnar Formats and Compression
Once partitioning has narrowed the file set, the next question is how each file is laid out internally. Row-based formats (CSV, JSON, Avro) store all columns of one row together; column-based formats (Parquet, ORC) store all values of one column together, then move to the next column. The shape difference looks small in a diagram but changes everything about how Trino reads data.
| Format | Layout | Athena cost on SELECT amount FROM sales |
|---|---|---|
| CSV | Row-based, no compression | Reads 100% of file (every row, every column) |
| JSON | Row-based, often verbose | Reads 100% + parsing overhead |
| Avro | Row-based, schema-aware, compressed | Reads 100% but smaller bytes |
| Parquet | Columnar, Snappy/ZSTD compressed | Reads only the amount column data |
| ORC | Columnar, Snappy/Zlib compressed | Reads only the amount column data |
Parquet’s column-projection capability lets Athena read just the columns named in your SELECT clause, skipping the rest entirely. For a 50-column events table where a query selects 3 columns, this is a roughly 17x reduction in bytes read (50/3, assuming columns of similar width) before any filtering happens. Combined with Snappy compression (which typically yields 75%+ size reduction on textual data at low CPU cost), Parquet routinely produces files that are 5-10x smaller than CSV equivalents [Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/].
Parquet also embeds min/max statistics for each column within each “row group” (a chunk of typically 128 MB). When your WHERE clause says amount > 1000, Trino can skip entire row groups whose maximum amount is 1000 or less, without reading any actual values. This is called predicate pushdown at the file level, and it stacks on top of partition pruning.
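The skipping logic can be sketched in a few lines. This is a toy model in plain Python rather than a real Parquet reader: `RowGroup`, `make_group`, and `scan_amount_gt` are hypothetical names, and real row groups hold millions of values rather than three.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group and its footer statistics."""
    min_amount: float
    max_amount: float
    values: list

def make_group(values):
    return RowGroup(min(values), max(values), values)

def scan_amount_gt(row_groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if rg.max_amount <= threshold:
            continue  # statistics prove no row can satisfy amount > threshold
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read

groups = [
    make_group([12, 250, 999]),
    make_group([40, 1000]),
    make_group([500, 1500, 3200]),
]
matches, groups_read = scan_amount_gt(groups, 1000)
print(matches, groups_read)
```

Only one of the three groups is read; the other two are eliminated from their statistics alone, which is why sorting data on frequently filtered columns before writing tends to make min/max pruning far more effective.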
The recommended file size is 128-256 MB per Parquet object, with internal row groups of 64-128 MB. Files much smaller than 128 MB hurt throughput: every file carries a fixed S3 request overhead, and with thousands of tiny files that overhead dominates wall-clock time and inflates tail latency [Source: https://trino.io/assets/blog/trino-fest-2024/aws-s3.pdf]. Files larger than 1 GB can underutilize the worker fleet because a single file cannot be subdivided beyond row-group boundaries. The “small files problem” is severe enough that Iceberg and Delta Lake both ship dedicated compaction commands to merge many small writes into right-sized files.
Compression choices are simpler than they seem. Snappy is the default in many Parquet writers (Spark, for example): fast decompression, modest ratio. ZSTD typically offers 10-30% better compression at slightly higher decompression cost and is increasingly common. GZIP compresses well but decompresses noticeably more slowly than Snappy or ZSTD, which makes it a legacy choice to avoid for analytical Parquet.
Key Takeaway: Convert data lake tables to Parquet with Snappy or ZSTD compression and target 128-256 MB files; this single change typically reduces Athena costs by 5-10x while accelerating queries through column pruning and predicate pushdown.
CTAS and INSERT INTO Patterns
CTAS — short for CREATE TABLE AS SELECT — is the Swiss Army knife of Athena optimization. A CTAS statement reads from an existing source, applies any transformation (projection, filter, partitioning, formatting, bucketing), and writes the result as a new table in a single operation. It is the canonical way to convert a raw CSV landing zone into an optimized analytical table:
CREATE TABLE optimized_sales
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['year', 'region'],
bucketed_by = ARRAY['customer_id'],
bucket_count = 32,
external_location = 's3://my-lake/curated/sales/'
) AS
SELECT
sale_id,
amount,
customer_id,
year(sale_date) AS year,
region
FROM raw_sales
WHERE sale_date >= DATE '2020-01-01';
This single statement reads the raw landing zone, projects the columns we care about, filters out pre-2020 noise, computes the year partition column, partitions the output by year and region, buckets it by customer_id into 32 buckets, and writes everything as Snappy Parquet to a curated S3 location [Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/].
Bucketing hashes a chosen column (here customer_id) into a fixed number of files per partition. When two bucketed tables are joined on the bucket column, Trino performs a co-located join — bucket 7 of sales joins only against bucket 7 of customers, dramatically reducing shuffle traffic. Bucketing helps for high-cardinality join columns on stable schemas; it does not help one-off ad-hoc queries.
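The mechanics of a co-located join can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: CRC32 stands in for whatever hash function the engine actually uses, and `bucket_for`, `bucketize`, and `colocated_join` are hypothetical helpers.

```python
import zlib
from collections import defaultdict

BUCKET_COUNT = 32

def bucket_for(key, bucket_count=BUCKET_COUNT):
    """Deterministic bucket assignment (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % bucket_count

def bucketize(rows, key_field):
    """Group rows by the bucket their key hashes into."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_field])].append(row)
    return buckets

def colocated_join(left_buckets, right_buckets, key_field):
    """Join bucket i of the left side only against bucket i of the right side."""
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        right_index = {r[key_field]: r for r in right_buckets.get(bucket_id, [])}
        for left_row in left_rows:
            match = right_index.get(left_row[key_field])
            if match is not None:
                joined.append({**match, **left_row})
    return joined

sales = [
    {"customer_id": "c-1", "amount": 120},
    {"customer_id": "c-2", "amount": 75},
    {"customer_id": "c-1", "amount": 30},
    {"customer_id": "c-9", "amount": 5},  # no matching customer row
]
customers = [
    {"customer_id": "c-1", "tier": "gold"},
    {"customer_id": "c-2", "tier": "silver"},
]
joined = colocated_join(bucketize(sales, "customer_id"),
                        bucketize(customers, "customer_id"),
                        "customer_id")
print(sorted((r["customer_id"], r["amount"], r["tier"]) for r in joined))
```

Because equal keys always hash to the same bucket number on both sides, no bucket ever needs to see rows from a different bucket, which is where the reduction in shuffle traffic comes from.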
CTAS has two important constraints. First, the destination S3 location must be empty — Athena refuses to overwrite. Second, a single CTAS produces at most 100 partitions. For larger datasets you use INSERT INTO ... SELECT to append additional partitions:
INSERT INTO optimized_sales
SELECT sale_id, amount, customer_id, year(sale_date) AS year, region
FROM raw_sales
WHERE sale_date >= DATE '2024-01-01' AND sale_date < DATE '2024-02-01';
A common production pattern combines CTAS for backfills with scheduled INSERT INTO for daily increments. Many teams have replaced this with Apache Iceberg tables (supported in Athena engine 3+), which provide atomic commits, hidden partitioning, and time travel [Source: https://www.vantage.sh/blog/s3-bill-increase-athena-trino-hive-fix-iceberg-caching].
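One way to drive a backfill within the 100-partition limit is to generate one INSERT INTO per month, as in the statement above. The following sketch only builds the SQL strings; it assumes that a single month never spans more than 100 (year, region) combinations, and the helper names are hypothetical.

```python
from datetime import date

INSERT_TEMPLATE = """INSERT INTO optimized_sales
SELECT sale_id, amount, customer_id, year(sale_date) AS year, region
FROM raw_sales
WHERE sale_date >= DATE '{start}' AND sale_date < DATE '{end}';"""

def month_windows(start, end):
    """Yield half-open [window_start, window_end) ranges, one per calendar month."""
    current = date(start.year, start.month, 1)
    while current < end:
        # First day of the following month (rolls the year over after December).
        nxt = date(current.year + (current.month == 12), current.month % 12 + 1, 1)
        yield max(current, start), min(nxt, end)
        current = nxt

def backfill_statements(start, end):
    """Render one INSERT INTO statement per monthly window."""
    return [INSERT_TEMPLATE.format(start=s.isoformat(), end=e.isoformat())
            for s, e in month_windows(start, end)]

statements = backfill_statements(date(2024, 1, 1), date(2024, 4, 1))
print(len(statements))
print(statements[0])
```

Each generated statement would then be submitted in sequence by whatever scheduler runs the backfill; because every window touches a single year, the partition count per statement is bounded by the number of regions.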
Key Takeaway: CTAS converts raw lake data to optimized, partitioned, bucketed Parquet in one statement; combine it with INSERT INTO (or Iceberg) for incremental loads, and the resulting tables become the persistent, low-cost foundation of your analytical workloads.
Federated Querying
Up to this point, every query example has assumed the data lives in S3. In practice, organizations have customer profiles in DynamoDB, transactional records in RDS, financial metrics in Snowflake, and event logs in S3 — and the analytical question that matters often spans all of them. Athena Federated Query is the bridge.
Athena Federated Query Connectors
Federated Query extends Athena by introducing the concept of a catalog that points not at the Glue Data Catalog but at an AWS Lambda function. That Lambda function, called a data source connector, is the runtime intermediary between Athena and the external system [Source: https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html].
The architecture, end-to-end, looks like this:
SQL Query
-> Athena Trino engine
-> Federated Query Handler
-> Lambda Data Source Connector (per source)
-> Native API call (DynamoDB GetItem, RDS SQL, Redshift, etc.)
-> Apache Arrow result blocks
-> Athena query plan continues (joins, aggregates)
-> Final result to user
Every connector implements four required handler interfaces:
| Handler | Responsibility | Analogy |
|---|---|---|
| MetadataHandler | Lists schemas, tables, columns | A library catalog |
| GetTableHandler | Returns table schema for a specific table | A book’s table of contents |
| GetSplitsHandler | Divides the data into chunks for parallel reads | Assigning aisles to multiple researchers |
| ReadRecordsHandler | Streams actual rows back as Apache Arrow | The researcher reading the book aloud |
The Apache Arrow detail is more than a technicality. Arrow is a columnar in-memory format designed so that different systems can exchange data with minimal serialization and conversion overhead. When the connector returns Arrow blocks (up to 64 MB each), Trino consumes them as columnar batches, treating remote data with the same primitives it uses for native S3 Parquet.
Figure 8.3: Federated query connector flow (Athena to external source)
sequenceDiagram
participant User as SQL Client
participant Athena as Athena Trino Engine
participant FH as Federated Query Handler
participant Lambda as Lambda Connector
participant Src as External Source<br/>(DynamoDB / RDS / Snowflake)
User->>Athena: SELECT ... FROM lambda_catalog.t
Athena->>FH: Resolve catalog -> Lambda ARN
FH->>Lambda: MetadataHandler.listSchemas/Tables
Lambda->>Src: Describe schema
Src-->>Lambda: Schema metadata
Lambda-->>FH: Table schema
FH->>Lambda: GetSplitsHandler (parallel chunks)
Lambda-->>FH: Split descriptors
par Per split
FH->>Lambda: ReadRecordsHandler(split)
Lambda->>Src: Native API (GetItem / SQL / JDBC)
Src-->>Lambda: Rows
Lambda-->>FH: Apache Arrow blocks (<=64 MB)
end
FH-->>Athena: Arrow batches
Athena->>Athena: Joins, aggregates, post-filter
Athena-->>User: Final result set
AWS publishes pre-built connectors for DynamoDB, RDS (MySQL, PostgreSQL, Oracle, SQL Server), Redshift, OpenSearch, CloudWatch, DocumentDB, Neptune, and many more in the Athena Query Federation SDK. You deploy them via the AWS Serverless Application Repository, configure connection details in Lambda environment variables (with credentials in AWS Secrets Manager), and register the catalog in Athena.
Key Takeaway: Federated Query turns Athena into a polyglot SQL engine by routing source-specific reads through Lambda connectors that return data as Apache Arrow, letting you join external systems with S3 in a single SQL statement.
Querying RDS, DynamoDB, and External Sources
Each connector maps Athena’s query operations to the source system’s native primitives. The mapping matters because it determines what gets pushed down (good — less data moves) versus pulled up to Trino for filtering (bad — more data moves).
DynamoDB is the trickiest because it is not a relational engine. A federated query like SELECT * FROM dynamodb_catalog.users WHERE user_id = 'u-123', where user_id is the table’s complete primary key, is translated by the connector into a DynamoDB GetItem call: efficient, with single-digit-millisecond latency. On a table with a composite key, an equality predicate on the partition key alone becomes a Query operation that reads only the items stored under that key. But a query that filters only on a non-key attribute (e.g., WHERE last_login > '2024-01-01') degenerates into a full-table Scan, which is slow and expensive on DynamoDB. The connector pushes down what DynamoDB supports natively and lets Trino post-filter the rest.
-- Efficient: pushes down to GetItem
SELECT user_id, email, last_login
FROM lambda_dynamo.production.user_profiles
WHERE user_id = 'user_12345';
-- Expensive: degenerates to a full Scan
SELECT user_id, email
FROM lambda_dynamo.production.user_profiles
WHERE last_login > TIMESTAMP '2024-01-01 00:00:00';
Treating DynamoDB like an OLTP key-value store and querying by primary keys is the right model; treating it like a warehouse table is a path to surprise bills.
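The rule of thumb described above can be written down as a small decision function. This follows the chapter's description only and is not taken from the connector's source code; `dynamodb_access_path` and its predicate representation (a mapping from column name to comparison operator) are hypothetical.

```python
def dynamodb_access_path(predicates, partition_key, sort_key=None):
    """Classify which DynamoDB operation a set of WHERE predicates allows.

    `predicates` maps column name -> comparison operator, e.g. {"user_id": "="}.
    """
    if predicates.get(partition_key) != "=":
        return "Scan"      # partition key not pinned: every item must be read
    if sort_key is None or predicates.get(sort_key) == "=":
        return "GetItem"   # the complete primary key is known
    return "Query"         # partition key pinned, sort key open or ranged

# Simple-key table, as in the user_profiles examples above.
print(dynamodb_access_path({"user_id": "="}, "user_id"))
print(dynamodb_access_path({"last_login": ">"}, "user_id"))
# Composite-key table (hypothetical sort key order_ts).
print(dynamodb_access_path({"user_id": "=", "order_ts": ">"}, "user_id", "order_ts"))
```

Running a check like this against the predicates of a planned federated query is a cheap way to spot a Scan before it shows up on the bill.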
RDS connectors (MySQL, PostgreSQL) are more conventional. The Lambda maintains a connection pool, receives the Trino-side WHERE clause and column list, and constructs a real SQL query against the database, so predicate pushdown is essentially complete. The pitfalls are operational: the Lambda needs network access to the RDS instance (typically by running in the same VPC), credentials live in Secrets Manager, and concurrent Athena queries can saturate RDS connection limits. Route federated reads to a read replica when possible.
A typical RDS-plus-S3 join looks like:
SELECT
c.customer_id,
c.name,
c.tier,
s.total_revenue_2024
FROM lambda_rds.prod.customers c
JOIN s3_lake.curated.revenue_summary s
ON c.customer_id = s.customer_id
WHERE c.created_date > DATE '2024-01-01'
AND c.tier IN ('gold', 'platinum');
Athena pushes the created_date and tier predicates into the RDS query, then joins those rows against the S3-resident revenue_summary table. Without federation, you would have to ETL the customer table into S3 nightly or unload the revenue summary into RDS; both options accept staleness and extra storage cost in exchange for query convenience.
Custom external connectors can target any system reachable from Lambda. The Athena Query Federation SDK provides Java scaffolding for the four handlers.
Key Takeaway: Predicate pushdown is the dividing line between cheap and expensive federated queries — write SQL that lets the source system do the filtering, especially for DynamoDB (use partition keys) and RDS (use indexed columns), or you will pay to drag entire tables across Lambda.
Connectors to Snowflake and BigQuery
The connector model extends naturally to non-AWS warehouses. AWS publishes an official Snowflake connector that uses JDBC to query a Snowflake account, and a Google BigQuery connector that uses the BigQuery client libraries. Both connectors push down filters and projections; both bill on two axes (Athena scanned bytes plus the source warehouse’s own pricing model).
A practical scenario: a company stores CRM data in Snowflake, product analytics events in S3, and support tickets in BigQuery (inherited through an acquisition). A single Athena query can answer “which platinum customers filed support tickets and stopped using feature X in the last 30 days”:
SELECT
c.customer_id,
c.tier,
t.ticket_count,
e.last_active_date
FROM snowflake_catalog.crm.customers c
JOIN bigquery_catalog.support.tickets t
ON c.customer_id = t.customer_id
LEFT JOIN s3_lake.events.feature_x_usage e
ON c.customer_id = e.customer_id
WHERE c.tier = 'platinum'
AND t.created_at > current_date - INTERVAL '30' DAY
AND (e.last_active_date IS NULL OR e.last_active_date < current_date - INTERVAL '7' DAY);
This query would be a multi-week ETL project in a pre-federation world. With Athena Federated Query, it is one SQL statement.
The cost analysis becomes more complex: you pay Athena $5/TB scanned, Snowflake for compute warehouse seconds, BigQuery for slot or on-demand bytes, plus Lambda invocation fees and cross-cloud egress. For exploratory queries this is fine; for hourly production workloads, materialize the joined result back into S3 via CTAS.
Latency is the other consideration. Pure S3 queries return in 2-5 seconds; federated DynamoDB queries in 3-8 seconds; RDS or warehouse queries in 5-15 seconds; cross-source joins in 10-30 seconds [Source: https://www.examtopics.com/discussions/amazon/view/74121-exam-aws-certified-data-analytics-specialty-topic-1-question/]. Lambda’s 15-minute hard limit caps long-running reads, and the 64 MB Arrow block size can fragment very large result sets. Federated Query is excellent for joining moderate volumes; it is not a replacement for nightly terabyte ETL.
Key Takeaway: External warehouse connectors (Snowflake, BigQuery) make Athena a true cross-cloud query engine, but the right pattern is exploration via federation, then materialization via CTAS — not running production federated joins at high frequency.
When Athena vs Redshift vs EMR
Having established what Athena does well, we need to set its boundaries against neighboring services. AWS offers three analytics engines that overlap in capability — Athena, Redshift, and EMR — and choosing among them is a recurring data-platform design question. The right framing is not “which is best” but “which workload is each optimized for.”
Ad Hoc Analysis vs Persistent Warehouse
The fundamental difference is whether your workload is interactive or persistent.
Athena is built for ad-hoc analysis. A data scientist exploring a new dataset, an SRE searching CloudTrail logs for an unauthorized API call, an analyst answering a one-time CFO question — these queries may run once and never repeat. There is no cluster sitting idle when no one queries.
Redshift is built for a persistent warehouse. A nightly executive dashboard, a customer-facing analytics product, a CFO’s daily sales close: these are workloads that have predictable concurrency and consistent SLAs, and that benefit from materialized views, sort and distribution keys, and query result caching. Redshift’s RA3 nodes separate storage from compute, giving lake-elastic storage with MPP-database latency [Source: https://www.justaftermidnight247.com/insights/redshift-vs-athena-vs-emr-aws-big-data-solutions-explained/].
EMR is built for big-data processing, particularly non-SQL. Spark for ML feature engineering, Flink for streaming, Hive for legacy batch — EMR provides the full open-source toolchain. EMR runs the heavy ETL that produces curated tables Athena and Redshift query.
Analogy: three workshops in a factory. Athena is the bench for quick repairs as customers walk in. Redshift is the assembly line running the same product day after day. EMR is the heavy machine shop fabricating custom components. All three exist in mature factories; none replaces the others.
Key Takeaway: Athena handles unpredictable, exploratory workloads with zero infrastructure overhead; Redshift handles predictable, high-concurrency BI workloads with persistent compute; EMR handles non-SQL, framework-driven big-data processing.
Cost Models Compared
The pricing models differ as much as the workloads.
| Service | Primary cost driver | Secondary costs | Idle cost |
|---|---|---|---|
| Athena | $5/TB scanned (US) | Glue catalog, S3 storage | Zero |
| Redshift Serverless | RPU-hours (~$0.36 / RPU-hour) | Managed storage, data transfer | Zero (after auto-pause) |
| Redshift Provisioned | Node-hours (24/7) | Managed storage, backups | Full cluster cost |
| EMR Provisioned | EC2 + EMR fee (per instance-hour) | EBS, S3, data transfer | Full cluster cost |
| EMR Serverless | Worker resource-seconds | S3 storage, data transfer | Zero |
A team running ten ad-hoc Athena queries per day, each scanning 5 GB of well-partitioned Parquet, scans 50 GB per day and pays about $0.25/day, or roughly $90/year. The same workload on a small Redshift provisioned cluster costs roughly $1,750/month even when idle. Conversely, a team running 200 high-concurrency BI queries per minute on a 10 TB warehouse pays Redshift’s flat rate while Athena would bill aggressively and likely throttle on concurrency limits.
The break-even point is roughly where Athena’s daily scan charges exceed the daily cost of an always-on Redshift cluster. At $1,750/month (about $58/day), that is on the order of 11-12 TB scanned per day at $5/TB. For most sub-50 GB-per-day workloads, Athena wins easily. For multi-TB predictable BI with sub-second latency, Redshift wins. EMR enters when SQL is no longer expressive enough: ML, streaming, framework-specific processing.
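A rough break-even estimate can be computed directly from the chapter's figures. The sketch below is a back-of-the-envelope model only: it assumes a 30-day month, decimal units, the $1,750/month cluster figure quoted above, and it ignores storage, concurrency, and latency considerations entirely.

```python
ATHENA_USD_PER_TB = 5.0
REDSHIFT_USD_PER_MONTH = 1750.0  # the chapter's rough small-cluster figure
DAYS_PER_MONTH = 30              # assumption for simplicity

def athena_monthly_cost(gb_scanned_per_day):
    """Monthly Athena scan charge in USD for a steady daily scan volume."""
    return gb_scanned_per_day / 1000 * ATHENA_USD_PER_TB * DAYS_PER_MONTH

def break_even_gb_per_day():
    """Daily scan volume at which Athena costs as much as the idle cluster."""
    return REDSHIFT_USD_PER_MONTH / DAYS_PER_MONTH / ATHENA_USD_PER_TB * 1000

print(f"50 GB/day on Athena: ${athena_monthly_cost(50):.2f}/month")
print(f"Break-even: {break_even_gb_per_day():,.0f} GB scanned per day")
```

On cost alone the crossover sits far above 50 GB per day; in practice latency and concurrency requirements usually push teams toward Redshift well before the scan bill does.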
Figure 8.4: Decision flow for Athena vs Redshift vs EMR
flowchart TD
Start[New analytics workload] --> Q1{Workload type?}
Q1 -->|Non-SQL: Spark / Flink / ML| EMR[EMR<br/>provisioned or serverless]
Q1 -->|SQL-only| Q2{Access pattern?}
Q2 -->|Ad-hoc, exploratory,<br/>unpredictable| Q3{Daily scan volume?}
Q3 -->|Sub-50 GB / day| Athena1[Athena<br/>per-TB scanned]
Q3 -->|Multi-TB sustained| Q4
Q2 -->|Predictable BI dashboards,<br/>high concurrency| Q4{Latency SLA?}
Q4 -->|Sub-second, 100s of users| Redshift[Redshift Provisioned<br/>or Serverless]
Q4 -->|Seconds OK,<br/>cost-sensitive| Athena2[Athena +<br/>partitioned Parquet]
EMR --> Output[Curated Parquet / Iceberg in S3]
Output --> Athena1
Output --> Redshift
Key Takeaway: Athena’s per-query pricing dominates for sporadic and exploratory workloads; Redshift’s amortized cluster pricing dominates for predictable, high-concurrency BI; EMR is chosen for capability (non-SQL frameworks), not for cost optimization.
Hybrid Usage Patterns
In practice, mature data platforms use all three engines, each for the workload it is optimized for, unified by the Glue Data Catalog and Lake Formation governance.
A canonical hybrid architecture looks like this:
Raw landing (S3, JSON/CSV)
|
| EMR Spark (heavy transformation)
v
Curated lake (S3, Parquet/Iceberg, Glue Catalog)
|
+--------+--------+
| |
v v
Athena (ad-hoc) Redshift Spectrum + RA3 (BI marts)
| |
v v
Data scientists BI dashboards
EMR Spark performs the heavy ETL — joining raw event streams, deduplicating, enriching with reference data, applying business rules — and writes optimized Parquet to the curated zone. Athena queries that curated zone for exploratory analysis and one-off reports, paying nothing when idle. Redshift loads the most-frequently-queried marts into its own compute layer for high-concurrency BI dashboards, while Redshift Spectrum extends the same SQL surface to the rest of the lake without copying data [Source: https://docs.aws.amazon.com/decision-guides/latest/analytics-on-aws-how-to-choose/analytics-on-aws-how-to-choose.html].
The Glue Data Catalog is the glue (literally) that makes this work. A table registered in Glue is queryable from Athena, accessible via Redshift Spectrum, and visible to EMR Spark — all without redefinition. Lake Formation layers row- and column-level access control over the catalog so that a marketing analyst querying via Athena and a data engineer querying via Spark see the same governed view of the data.
The data lakehouse pattern, popularized by Databricks but fully realizable on AWS with Iceberg, Glue, and Athena, captures this hybrid approach in a single phrase: warehouse-quality structure (ACID transactions, schema evolution, time travel) on lake-economics storage (S3, no data duplication, multi-engine access). Athena engine version 3’s native Iceberg support is what makes this practical without third-party tooling.
Key Takeaway: Mature data platforms run Athena, Redshift, and EMR side by side, sharing a Glue Catalog and Lake Formation governance; the data lakehouse pattern uses Iceberg tables on S3 to give each engine warehouse-quality semantics over lake-economics storage.
Chapter Summary
Amazon Athena replaces the “load before you query” warehouse model with a serverless Trino engine that queries S3 directly, billing $5 per terabyte scanned. The pricing model rewires the engineer’s optimization instincts: minimize bytes read, not wall-clock time. Three layers of optimization compound to deliver order-of-magnitude cost reductions — partitioning (and partition projection) prunes the file set, Parquet with Snappy or ZSTD compression skips entire columns and row groups, and CTAS (with optional bucketing or Iceberg tables) materializes optimized layouts in a single statement.
Federation extends Athena’s reach beyond S3 through Lambda-based data source connectors that translate Trino reads into native operations against DynamoDB, RDS, Redshift, Snowflake, BigQuery, and arbitrary external systems. The Apache Arrow result format lets Trino compose remote data with S3 data in the same query plan, and predicate pushdown — the discipline of letting source systems filter their own data — separates cheap federated queries from runaway bills.
Athena is not the only AWS analytics engine, and it is rarely the only one in a mature data platform. Redshift remains the right choice for high-concurrency, low-latency BI with predictable demand, while EMR remains the right choice for non-SQL big-data processing in Spark, Hive, Flink, and other frameworks. The three coexist in hybrid architectures unified by the Glue Data Catalog and Lake Formation, with Iceberg on S3 emerging as the lingua franca of the modern data lakehouse.
In the next chapter we will turn from interactive querying to workflow orchestration with Airflow and MWAA: the systems that schedule and coordinate the pipelines feeding the data lake we have just learned to query.
Key Terms
- Amazon Athena — Serverless interactive query service from AWS that runs Trino-based SQL directly against data in Amazon S3, billed per terabyte scanned with no cluster to manage.
- Trino — Distributed open-source SQL engine designed for interactive analytics over heterogeneous data sources; renamed from PrestoSQL in 2020 and the engine underlying Athena engine version 3.
- Presto — The original Facebook-built distributed SQL engine (2012), now split into PrestoDB (Linux Foundation) and Trino (renamed PrestoSQL); both lineages continue to be developed.
- CTAS (CREATE TABLE AS SELECT) — A single SQL statement that creates a new table from the result of a SELECT, optionally specifying format (Parquet/ORC), compression, partitioning, and bucketing; the canonical optimization tool for converting raw lake data to query-optimized layouts.
- Partition projection — An Athena feature that synthesizes partition listings from declared range patterns (e.g., year 2020-2030, month 1-12) instead of consulting the Glue Catalog for every partition, eliminating catalog-lookup overhead on high-cardinality time-partitioned tables.
- Federated query — A query that spans multiple data sources, executed by a single query engine that delegates source-specific reads to connectors; in Athena, implemented as Lambda functions that translate Trino requests into native API calls and return Apache Arrow blocks.
- Data lakehouse — An architecture pattern that combines lake-economics storage (S3, open formats) with warehouse-quality semantics (ACID, schema evolution, time travel), typically realized through table formats like Apache Iceberg, Delta Lake, or Apache Hudi, queried by multiple engines through a shared catalog.
- Lake Formation — AWS governance layer over the Glue Data Catalog that provides row-, column-, and tag-level access control across Athena, Redshift Spectrum, EMR, and other engines, enabling consistent security policy regardless of which engine queries the data.
Chapter 9: Workflow Orchestration with Airflow and MWAA
Learning Objectives
By the end of this chapter, you will be able to:
- Define a DAG and use Airflow operators, sensors, and task groups to express dependencies and waiting conditions in production pipelines.
- Deploy and operate Amazon MWAA (Managed Workflows for Apache Airflow), including environment sizing, S3-based DAG deployment, and VPC/IAM integration.
- Apply orchestration patterns: backfills, retries, SLA monitoring, and alerting, with idempotency as the foundation of every reliable pipeline.
- Compare Apache Airflow with AWS Step Functions, Dagster, and Prefect, and choose the right orchestrator for a given workload.
Imagine a railroad dispatcher in a busy switching yard. Trains arrive on different tracks at different times, some carrying passengers, others freight, others empty cars headed back to the depot. The dispatcher does not drive any train; instead, they decide which track is clear, which signals turn green, and what order trains depart so nothing collides and everything reaches its destination on time. Workflow orchestrators play exactly this role for data pipelines: they do not move bytes themselves, but they decide when and in what order every extractor, transformer, and loader runs, retrying the ones that stall and alerting humans when something is stuck on the track. This chapter dissects the most widely deployed dispatcher in modern data engineering, Apache Airflow, then shows how AWS packages it as a managed service (MWAA) and how it stacks up against the alternatives.
Apache Airflow Foundations
Apache Airflow began at Airbnb in 2014 and is now the de facto standard for batch workflow orchestration, with dozens of provider packages spanning AWS, GCP, Azure, Snowflake, dbt, Spark, and more [Source: https://airflow.apache.org/docs/]. Its core idea is deceptively simple: express your workflow as Python code, and let a scheduler interpret that code into time-bounded executions.
DAGs, Tasks, and Operators
A DAG (Directed Acyclic Graph) is a Python file that declares a set of tasks and the dependencies between them. “Directed” means edges have a direction (A runs before B), and “acyclic” means there are no loops; without that constraint, a scheduler would never know whether the graph was finished. An operator is a template for the actual unit of work a task performs—PythonOperator runs a Python callable, BashOperator runs a shell command, PostgresOperator runs SQL against a Postgres connection, and so on. Each invocation of an operator inside a DAG produces a task; each scheduled run of that task on a particular logical date produces a task instance.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data_team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='data_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    def extract():
        return "data"

    task_extract = PythonOperator(
        task_id='extract_data',
        python_callable=extract,
    )
The scheduler parses every Python file in the dags_folder roughly every 30 seconds, instantiates the DAG and operator objects, validates the graph for cycles, and stores the resulting metadata in the database [Source: https://airflow.apache.org/docs/]. A critical implication is that every line of top-level code in a DAG file runs on every parse. Calling a slow API or running a heavy SQL query at module scope is a classic anti-pattern; that work should happen inside an operator’s execute() method, which only runs when the task is dispatched.
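The cycle validation mentioned above is ordinary graph theory and can be demonstrated with Python's standard library alone. This is not Airflow's own implementation; it uses `graphlib.TopologicalSorter` (Python 3.9+) on a hypothetical three-task pipeline to show why an acyclic graph has a well-defined run order and a cyclic one does not.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """Return one valid run order; `dependencies` maps task -> upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())

def has_cycle(dependencies):
    """True if the dependency graph contains a loop and can never finish."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return True
    return False

pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
# Making extract wait for load closes the loop extract -> transform -> load -> extract.
broken = {**pipeline, "extract": {"load"}}

print(execution_order(pipeline))
print(has_cycle(pipeline), has_cycle(broken))
```

In the broken graph no task is ever free to start, which is exactly the situation the "acyclic" constraint rules out.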
Sensors are a special kind of operator that wait for an external condition: a file landing in /data/input/, a row appearing in a SQL table, or an S3 prefix becoming non-empty. Sensors run in one of two modes. In poke mode the sensor occupies a worker slot continuously, sleeping briefly between checks; this gives low-latency detection but ties up a slot for hours if the condition is slow to materialize. In reschedule mode the sensor releases the worker between checks, letting other tasks run; the trade-off is slightly higher detection latency [Source: https://biconsult.ru/files/Data_warehouse/Bas_P_Harenslak,_Julian_Rutger_de_Ruiter_Data_Pipelines_with_Apache.pdf].
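The slot-usage difference between the two modes can be approximated with a simple model. This is a deliberately simplified sketch, not a measurement of Airflow itself: it assumes each condition check takes a fixed `check_seconds`, ignores scheduler overhead when a rescheduled sensor is re-queued, and uses a hypothetical `slot_seconds` helper.

```python
def slot_seconds(wait_seconds, poke_interval, check_seconds, mode):
    """Worker-slot time consumed while waiting `wait_seconds` for a condition."""
    if mode == "poke":
        # The slot is held for the whole wait, plus the final successful check.
        return wait_seconds + check_seconds
    if mode == "reschedule":
        # The slot is held only while a check is actually running.
        checks = wait_seconds // poke_interval + 1
        return checks * check_seconds
    raise ValueError(f"unknown sensor mode: {mode!r}")

# A condition that takes 30 minutes to appear, checked every 30 seconds,
# with each check assumed to take about 2 seconds.
poke = slot_seconds(1800, 30, 2, "poke")
reschedule = slot_seconds(1800, 30, 2, "reschedule")
print(poke, reschedule)
```

Under these assumptions reschedule mode uses a small fraction of the slot time, which is why it is generally preferred for waits measured in minutes or hours, while poke mode remains reasonable for conditions expected within seconds.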
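from airflow.sensors.sql import SqlSensor

wait_for_data = SqlSensor(
    task_id='wait_for_db',
    conn_id='postgres_conn',
    sql="SELECT COUNT(*) FROM raw_data WHERE date = '{{ ds }}'",
    # `success` receives the first cell of the first row (here, the count)
    success=lambda row_count: row_count > 0,
    poke_interval=30,   # seconds between checks
    timeout=1800,       # give up after 30 minutes
    mode='reschedule',  # release the worker slot between checks
)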
Task Groups, introduced in Airflow 2.x to replace the legacy SubDAG pattern, give you visual and logical grouping in the UI without spawning a child DAG. Wrap related work in with TaskGroup('extract'): and the UI collapses it into a single node you can expand, which is invaluable when a DAG balloons past 50 tasks.
The TaskFlow API, introduced in Airflow 2.0, lets you write DAGs as decorated Python functions. Return values become implicit XComs and parameters become dependencies, so you write almost normal Python and Airflow infers the graph:
from airflow.decorators import dag, task
from datetime import datetime

@dag(start_date=datetime(2024, 1, 1), schedule_interval='@daily', catchup=False)
def my_pipeline():

    @task
    def extract(endpoint: str) -> dict:
        return {'records': 100, 'endpoint': endpoint}

    @task
    def transform(data: dict) -> dict:
        data['records_processed'] = data['records'] * 1.1
        return data

    transform(extract('/api/users'))

my_pipeline_dag = my_pipeline()
Key Takeaway: A DAG is just Python; operators are the verbs, sensors are the waiters, and task groups are the folders. Keep DAG files lightweight: work belongs inside execute(), not at module scope.
Schedulers, Executors, and Workers
Airflow’s runtime is a four-pillar architecture: the Scheduler parses DAGs and queues task instances; the Webserver renders the UI and serves the REST API; the Workers actually run task code; and the Metadata Database (Postgres in production, MySQL also supported) stores DAG definitions, task states, connections, variables, and XComs [Source: https://biconsult.ru/files/Data_warehouse/Bas_P_Harenslak,_Julian_Rutger_de_Ruiter_Data_Pipelines_with_Apache.pdf].
| Component | Role | Failure Impact |
|---|---|---|
| Metadata DB | Source of truth for state | Total outage; HA replication required |
| Scheduler | Parses DAGs, queues tasks | Tasks stop being scheduled; running tasks finish |
| Webserver | UI + REST API | Operators lose visibility; pipelines keep running |
| Workers | Execute operator code | Running tasks die; retry logic re-queues them |
The executor is the bridge between the scheduler and the workers, and it determines where and how tasks actually run.
The Celery Executor uses a message broker (Redis or RabbitMQ) to hand tasks to a fleet of long-running worker processes. It is mature, easy to scale horizontally (add more worker boxes), and supports task prioritization through named queues. Task startup is fast (5–10 seconds because workers are already warm), but every worker has a fixed resource shape, so a 1 vCPU / 4 GB worker cannot dynamically grow to handle a 16 GB pandas job. You also have to operate the broker.
The Kubernetes Executor spawns a fresh pod for every task, with per-task resource requests and limits supplied via executor_config. This gives near-perfect resource efficiency (the pod dies when the task finishes) and total isolation (no noisy-neighbor effect). The cost is operational complexity (you need a Kubernetes cluster) and slower startup (~30–60 seconds per pod).
| Factor | Celery | Kubernetes |
|---|---|---|
| Setup complexity | Low | High |
| Scaling | Horizontal | Horizontal + Vertical |
| Resource efficiency | Medium | High |
| Task startup latency | 5–10 s | 30–60 s |
| Cloud-native | No | Yes |
| Debugging | Easier (worker logs) | Harder (pod logs) |
Analogy: Celery is a fleet of buses on fixed routes—cheap per passenger, fast to board, but every bus is the same size. Kubernetes is a ride-share dispatching custom vehicles—any size you want, but the car has to drive to you first.
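To make the startup-latency trade-off concrete, the sketch below computes what share of a task's wall time is spent waiting to start. The startup values are the midpoints of the ranges quoted above, chosen here only for illustration; the task durations are assumed.

```python
def startup_overhead(task_seconds, startup_seconds):
    """Fraction of total wall time spent waiting for the task to start."""
    return startup_seconds / (startup_seconds + task_seconds)

# Midpoints of the ranges quoted above (assumed for illustration).
CELERY_STARTUP = 7.5
K8S_STARTUP = 45.0

short_task = 20      # seconds of real work
long_task = 3600     # one hour of real work

celery_short = startup_overhead(short_task, CELERY_STARTUP)
k8s_short = startup_overhead(short_task, K8S_STARTUP)
k8s_long = startup_overhead(long_task, K8S_STARTUP)
```

For a 20-second task, pod startup dominates the run; for an hour-long task it is close to noise, which is why per-task pods suit large, variable jobs better than swarms of tiny ones.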
Figure 9.1: Airflow four-pillar runtime architecture
flowchart LR
DAGs[(DAG Folder<br/>Python files)] -->|parse every 30s| Sched[Scheduler]
Sched -->|reads/writes state| MetaDB[(Metadata DB<br/>Postgres)]
Sched -->|enqueues task| Exec{Executor}
Exec -->|Celery: via broker| Broker[(Redis / RabbitMQ / SQS)]
Broker --> W1[Worker 1]
Broker --> W2[Worker 2]
Exec -->|K8s: spawn pod| Pod[Per-task Pod]
W1 --> MetaDB
W2 --> MetaDB
Pod --> MetaDB
Web[Webserver / UI / REST API] --> MetaDB
User([User / Operator]) --> Web
Key Takeaway: The four pillars of Airflow are scheduler, webserver, workers, and metadata DB. Pick Celery for low-latency homogeneous workloads with simple operations; pick Kubernetes when task sizes vary wildly or you already run K8s.
Connections, Variables, and XComs
Airflow ships three first-class mechanisms for moving configuration and small payloads between tasks and external systems. Connections store reusable credentials and endpoints (Postgres host/port/password, AWS key pair, HTTP base URL). They live in the metadata DB (or a secrets backend like AWS Secrets Manager) and are referenced by conn_id so DAG code never embeds passwords. Variables store key-value configuration as strings, optionally JSON-encoded (a list of S3 prefixes to crawl, a feature flag); they are accessed via Variable.get() or the templating syntax {{ var.value.my_key }}.
XCom, short for “cross-communication,” lets one task push a small Python value and another task pull it [Source: https://airflow.apache.org/docs/]. By default, XComs are stored as serialized blobs in the metadata DB, so the hard ceiling depends on the database's column type; around 64 KB (the MySQL BLOB ceiling) is a sensible practical limit. Pushing a 500 MB DataFrame through XCom will both crash and embarrass you. The correct pattern is to land large data in S3 (or any external store) and pass only the reference (the bucket key, the partition path, the row count) through XCom.
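# upload_dataframe_to_s3 / download_dataframe_from_s3 are placeholder helpers,
# and df is assumed to be defined elsewhere.
def push_task(**context):
    s3_key = upload_dataframe_to_s3(df)
    # Push only the reference, never the DataFrame itself.
    context['task_instance'].xcom_push(key='s3_key', value=s3_key)

def pull_task(**context):
    s3_key = context['task_instance'].xcom_pull(task_ids='push_task', key='s3_key')
    df = download_dataframe_from_s3(s3_key)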
In the TaskFlow API this whole dance is implicit: transform(extract()) pushes extract’s return into XCom and pulls it as transform’s argument automatically. This is elegant for orchestration metadata but does not change the storage cost—the return value still serializes through the database.
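One way to enforce the "reference, not payload" rule is a small guard that measures the serialized size before anything reaches XCom. This is a sketch under stated assumptions: the 48 KB budget is an arbitrary safety margin below the ~64 KB figure, `to_xcom_payload` is a made-up helper, and `fake_s3_put` stands in for a real object-store upload.

```python
import json

XCOM_SOFT_LIMIT_BYTES = 48 * 1024  # assumed budget, below the ~64 KB figure

def to_xcom_payload(value, store_large):
    """Return the value itself if it is small, else a reference to it.

    `store_large` is a hypothetical callable that writes the bytes to
    external storage (for example S3) and returns the object key.
    """
    encoded = json.dumps(value).encode("utf-8")
    if len(encoded) <= XCOM_SOFT_LIMIT_BYTES:
        return {"kind": "inline", "value": value}
    return {"kind": "reference", "key": store_large(encoded)}

# In-memory stand-in for an object store, used only for this example.
fake_store = {}

def fake_s3_put(blob):
    key = f"xcom/{len(fake_store)}.json"
    fake_store[key] = blob
    return key

small = to_xcom_payload({"row_count": 100}, fake_s3_put)
large = to_xcom_payload({"rows": list(range(50_000))}, fake_s3_put)
```

A task would return the resulting dictionary, and the downstream task would branch on `kind` to decide whether to read the value directly or fetch it from storage.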
Key Takeaway: Connections handle credentials, Variables handle config, and XCom handles small inter-task payloads. Anything bigger than ~64 KB belongs in object storage, with only a reference flowing through XCom.
Amazon MWAA
Self-hosting Airflow means running schedulers, webservers, workers, brokers, a Postgres instance, secrets backends, log aggregation, and patching all of it. Many teams would rather pay AWS to do that. MWAA (Amazon Managed Workflows for Apache Airflow) is a fully managed service that provisions Airflow on AWS Fargate behind a customer-controlled VPC, syncs DAGs from S3, and ships logs to CloudWatch [Source: https://aws.amazon.com/managed-workflows-for-apache-airflow/features/].
Environment Sizing and Scaling
MWAA exposes three environment classes that fix the per-worker compute shape:
| Environment Class | vCPU | Memory (GB) | Default Tasks/Worker | Hourly Price (us-east-1) | Use Case |
|---|---|---|---|---|---|
| mw1.small | 1 | 4 | 5 | ~$0.49 | Light workloads, dev environments |
| mw1.medium | 2 | 8 | 10 | ~$0.98 | Standard production ETL |
| mw1.large | 4 | 16 | 10+ (configurable) | ~$1.96 | Heavy transforms, large pandas jobs |
A typical mw1.medium environment running 24/7 with the minimum 10 workers costs roughly $720/month plus storage and data transfer [Source: https://aws.amazon.com/managed-workflows-for-apache-airflow/features/].
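The monthly figure follows from simple arithmetic on the hourly rate. The sketch below uses the approximate rates from the table above and assumes an average month of 730 hours; it covers only the flat hourly charge and leaves out storage, data transfer, and any additional worker-hours.

```python
HOURS_PER_MONTH = 730  # average month; an assumption for estimates

def monthly_cost(hourly_price, hours=HOURS_PER_MONTH):
    """Always-on cost of one environment at a flat hourly rate."""
    return hourly_price * hours

# Approximate hourly rates copied from the table above.
rates = {"mw1.small": 0.49, "mw1.medium": 0.98, "mw1.large": 1.96}
estimates = {name: round(monthly_cost(rate), 2) for name, rate in rates.items()}
```

At the table's rate, `mw1.medium` comes to about $715 for the month, in line with the rough $720 quoted above.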
Worker autoscaling is built in. MWAA tracks two CloudWatch metrics—RunningTasks and QueuedTasks—and provisions extra workers when the queue grows past the current worker pool, retiring them when work drops [Source: https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-autoscaling.html]. You configure a minimum and maximum worker count; the floor controls baseline capacity (and cost) and the ceiling caps burst.
The AdditionalWorkers CloudWatch metric is the operational dial for tuning baseline. If MWAA spends more than six hours per day above its minimum, AWS guidance is to raise the minimum so the steady-state queue does not constantly trigger scale-up latency [Source: https://aws.amazon.com/blogs/big-data/a-guide-to-airflow-worker-pool-optimization-in-amazon-mwaa/]. Three common strategies emerge from the AWS capacity-planning guide [Source: https://aws.amazon.com/blogs/big-data/a-guide-to-capacity-planning-for-airflow-worker-pool-in-amazon-mwaa/]:
- Full base — minimum equals peak; never autoscales. Simplest, most expensive, lowest latency.
- Hybrid — minimum at ~80% of peak, max higher. Best price/performance for predictable spikes.
- Minimal base — minimum near zero, max high. Cheapest, but tasks wait through cold-start.
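The three strategies can be sketched as a sizing helper. The function names are hypothetical, the 1.5x burst headroom is an assumption rather than AWS guidance, and the minimal-base floor is set to one worker as a stand-in for "near zero".

```python
import math

def workers_needed(peak_concurrent_tasks, tasks_per_worker):
    """Workers required to run the peak without queueing."""
    return math.ceil(peak_concurrent_tasks / tasks_per_worker)

def worker_bounds(peak_concurrent_tasks, tasks_per_worker, strategy):
    """Return (min_workers, max_workers) for the three strategies above."""
    peak = workers_needed(peak_concurrent_tasks, tasks_per_worker)
    if strategy == "full_base":
        return peak, peak
    if strategy == "hybrid":
        # Floor at ~80% of peak; the 1.5x headroom is an assumption.
        return max(1, math.ceil(0.8 * peak)), math.ceil(1.5 * peak)
    if strategy == "minimal_base":
        return 1, math.ceil(1.5 * peak)
    raise ValueError(f"unknown strategy: {strategy}")

def should_raise_minimum(hours_above_minimum_per_day):
    """Six-hour rule of thumb from the AWS guidance cited above."""
    return hours_above_minimum_per_day > 6
```

For example, a peak of 120 concurrent tasks at 10 tasks per worker needs 12 workers, so `worker_bounds(120, 10, "hybrid")` returns a floor of 10 and a ceiling of 18.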
The webserver also autoscales (Airflow v2.2.2+) based on CPU and active connections, which matters when many users hit the UI simultaneously [Source: https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-web-server-autoscaling.html].
Key Takeaway: MWAA pricing is dominated by your worker minimum. Start with mw1.medium, set a hybrid floor based on AdditionalWorkers observation, and raise the ceiling for safety, not the floor for vanity.
DAG Deployment via S3
MWAA’s deployment contract is unusually opinionated: there is no SSH, no airflow dags CLI, and no rsync. Everything lives in a single S3 bucket structured by convention [Source: https://docs.aws.amazon.com/mwaa/latest/userguide/]:
s3://my-mwaa-bucket/
├── dags/ # Python DAG files (synced ~every 30s)
│ ├── etl_pipeline.py
│ └── reporting.py
├── plugins.zip # Custom operators, hooks, macros
└── requirements.txt # Python dependencies installed via pip
The scheduler polls the dags/ prefix every ~30 seconds and pulls new or modified files; that is your deploy mechanism. Versioning the bucket (recommended) gives you rollback by restoring a prior object version. requirements.txt is processed at environment startup and during version updates; changing it triggers an environment update, which can take 20–30 minutes, so do not iterate on dependencies in production. plugins.zip ships custom Python modules (your MyCorpSnowflakeOperator, for example) and is loaded into the Airflow Python path.
A practical CI/CD pattern looks like this: a developer opens a pull request modifying dags/etl_pipeline.py. CI runs DAG-import tests (python -c "import etl_pipeline") and unit tests against operator callables. On merge, a GitHub Action runs aws s3 sync ./dags s3://my-mwaa-bucket/dags/, and within 30 seconds the change appears in the MWAA UI.
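The DAG-import step in CI can be generalized from a single `import` to every file in the folder. The sketch below uses only the standard library, so it catches syntax errors and missing dependencies but does not validate DAG structure the way Airflow's own loader would; `import_errors` is a made-up helper name.

```python
import importlib.util
import pathlib

def import_errors(dag_dir):
    """Import every .py file under dag_dir; return {filename: error}."""
    failures = {}
    for path in sorted(pathlib.Path(dag_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # syntax errors, missing imports, etc.
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures
```

A CI job could call `import_errors("dags")` and fail the build whenever the returned dictionary is non-empty, before `aws s3 sync` ever runs.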
Key Takeaway: S3 is the deployment surface, period. Version the bucket, automate s3 sync from CI, and treat requirements.txt changes as planned outages.
Networking and IAM Integration
MWAA always runs inside a customer-owned VPC, which is the lever you use to integrate with private resources. You supply two private subnets across two AZs; MWAA places its Fargate tasks there and peers with whatever databases, RDS instances, or VPC endpoints you expose [Source: https://docs.aws.amazon.com/mwaa/latest/userguide/]. The webserver can be public (internet-accessible UI behind AWS authentication) or private (only reachable through your VPC, typical for regulated environments).
For air-gapped operation, configure VPC endpoints for S3, CloudWatch Logs, ECR, KMS, and SQS so MWAA never traverses the public internet. The IAM execution role (AWSMWAA-*) is what every task assumes; it needs S3 read on the DAG bucket, CloudWatch Logs write on the environment’s log groups, KMS for encrypted secrets, SQS access (Celery’s broker is SQS under the hood for MWAA), and ECS for Fargate task management. Any additional AWS access your DAGs need (writing to a data lake bucket, reading from RDS, calling Bedrock) is added to this same role.
CloudWatch logging is automatic and granular: you can enable streams for the scheduler, webserver, workers, DAG processor, and task logs independently. Crucially, task logs are the ones engineers consult during incidents, and they are accessible both from the Airflow UI and through CloudWatch Logs Insights queries, so you can run something like fields @timestamp, @message | filter @message like /retry/ across thousands of task runs.
Figure 9.2: MWAA environment topology — VPC, IAM, and AWS service integration
graph LR
Dev[Developer] -->|s3 sync| S3[S3 DAG Bucket]
S3 -->|every 30s| Scheduler
subgraph VPC[Customer VPC]
Scheduler --> SQS[(SQS Broker)]
SQS --> Workers[Fargate Workers]
Workers --> RDS[(RDS Postgres metadata DB)]
Workers --> CW[CloudWatch Logs]
end
Workers -->|VPC Endpoint| DataLake[(S3 Data Lake)]
Key Takeaway: MWAA is Airflow on AWS rails: VPC for network reach, IAM execution role for AWS access, CloudWatch for logs. There is no SSH—all observability flows through the AWS plane.
Production Patterns
A DAG that runs on the happy path is a demo. A DAG that survives flaky APIs, late-arriving data, partial failures, and a 3 a.m. on-call rotation is a production pipeline. Three patterns separate the two: idempotency, principled backfill discipline, and SLA-driven alerting.
Idempotent Task Design
Idempotency means running a task twice produces the same result as running it once. It is the foundation of every reliable Airflow pipeline because retries, manual reruns, and backfills will execute the same task multiple times for the same logical date.
The single most important Airflow idiom for idempotency is keying writes on the execution date. Airflow templates {{ ds }} (date string YYYY-MM-DD) and {{ execution_date }} into operator parameters at runtime. A non-idempotent task appends new rows to a table on every run; if it re-runs, you get duplicates. An idempotent version writes to a partition keyed on {{ ds }}, deleting that partition first:
from airflow.providers.postgres.operators.postgres import PostgresOperator

upsert_partition = PostgresOperator(
    task_id='upsert_daily_partition',
    postgres_conn_id='warehouse',
    sql="""
        DELETE FROM fact_orders WHERE order_date = '{{ ds }}';
        INSERT INTO fact_orders (order_id, order_date, amount)
        SELECT order_id, order_date, amount
        FROM staging.orders
        WHERE order_date = '{{ ds }}';
    """,
)
Re-run that task ten times for 2024-03-15 and the result is identical: one clean partition for that date. The S3 equivalent uses partitioned object keys like s3://warehouse/fact_orders/dt=2024-03-15/, where re-running overwrites the same prefix.
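The delete-then-insert behavior can be checked locally with an in-memory SQLite database. This is a minimal sketch, not warehouse code: the table names mirror the example above, the sample rows are invented, and `load_partition` and `append_only` are hypothetical helpers that contrast the idempotent and non-idempotent variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, order_date TEXT, amount REAL);
    CREATE TABLE fact_orders    (order_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO staging_orders VALUES
        (1, '2024-03-15', 10.0), (2, '2024-03-15', 25.5), (3, '2024-03-16', 7.0);
""")

INSERT_SQL = (
    "INSERT INTO fact_orders "
    "SELECT order_id, order_date, amount FROM staging_orders "
    "WHERE order_date = ?"
)

def load_partition(ds):
    """Delete-then-insert for one logical date, inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (ds,))
        conn.execute(INSERT_SQL, (ds,))

def append_only(ds):
    """Non-idempotent variant: every rerun adds the same rows again."""
    with conn:
        conn.execute(INSERT_SQL, (ds,))

for _ in range(10):
    load_partition("2024-03-15")
idempotent_rows = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]

for _ in range(3):
    append_only("2024-03-16")
duplicate_rows = conn.execute(
    "SELECT COUNT(*) FROM fact_orders WHERE order_date = '2024-03-16'"
).fetchone()[0]
```

Ten runs of `load_partition` leave exactly the two rows for that date, while three runs of `append_only` triple a single source row.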
Common non-idempotent footguns include:
- Calling INSERT without a corresponding DELETE or upsert constraint.
- Using datetime.now() instead of {{ ds }} (because “now” changes on every retry).
- Sending side-effecting API calls (creating a Salesforce record) without an idempotency key.
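For the last item in the list above, one common approach is to derive the idempotency key deterministically from the run's identity, so that a retry sends the same key as the first attempt. The sketch below is an assumption-laden illustration: `idempotency_key` is a made-up helper, the DAG, task, and entity identifiers are invented, and how the key is transmitted (header, field, or parameter) depends on the target API and is not shown.

```python
import hashlib

def idempotency_key(dag_id, task_id, ds, entity_id):
    """Deterministic key: same logical run and entity give the same key."""
    raw = f"{dag_id}|{task_id}|{ds}|{entity_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]

first_try = idempotency_key("crm_sync", "create_lead", "2024-03-15", "lead-42")
retry = idempotency_key("crm_sync", "create_lead", "2024-03-15", "lead-42")
next_day = idempotency_key("crm_sync", "create_lead", "2024-03-16", "lead-42")
```

Because the key depends on the logical date rather than wall-clock time, a retry of the same run collides with the original request while the next day's run does not.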
Key Takeaway: Idempotency is non-negotiable in Airflow. Partition writes by {{ ds }}, delete-then-insert, and never use wall-clock time inside a task; use the templated execution date.
Backfills and Catchup
A backfill is the act of running historical DAG runs for execution dates earlier than today. The classic use case: you ship a new transformation on March 15 and need it applied to all data from January 1 onward. Airflow can do this two ways.
Catchup is automatic backfilling. If catchup=True (the default in older Airflow versions) and your DAG’s start_date is January 1 with schedule_interval='@daily', the scheduler will, the moment you deploy on March 15, attempt to schedule 73 historical DAG runs (74 in a leap year) in rapid succession. On a busy worker pool this is “explosive backfilling”: you can melt your warehouse before lunch. Production DAGs almost always set catchup=False to disable this, then trigger backfills explicitly when needed [Source: https://airflow.apache.org/docs/].
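The size of a catchup burst is easy to estimate before deploying. The sketch below assumes a daily schedule and that a run for a given logical date becomes eligible once that date has ended; `catchup_runs` is a hypothetical helper, not an Airflow function.

```python
from datetime import date

def catchup_runs(start_date, deploy_date):
    """Daily runs the scheduler would create when catchup=True.

    A daily run for logical date D is scheduled once D has ended, so the
    pending logical dates are start_date through deploy_date minus one day.
    """
    return max(0, (deploy_date - start_date).days)

runs_2023 = catchup_runs(date(2023, 1, 1), date(2023, 3, 15))
runs_2024 = catchup_runs(date(2024, 1, 1), date(2024, 3, 15))  # leap year
```

Multiplying the result by the number of tasks per run gives a rough sense of how many task instances would hit the worker pool at once.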
Manual backfills use the airflow dags backfill CLI (or the UI) with explicit start and end dates and respect max_active_runs so you control the parallelism. A typical pattern:
airflow dags backfill data_pipeline \
--start-date 2024-01-01 \
--end-date 2024-03-14 \
--reset-dagruns
This is only safe if every task in the DAG is idempotent—which closes the loop with the previous section.
Retries are the unit-test of orchestration: every operator should set retries (commonly 2–3) and retry_delay (start at 5 minutes, escalate with retry_exponential_backoff=True). A network blip or a transient warehouse error should heal automatically. Pair retries with execution_timeout so a stuck task does not run forever.
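These settings are usually collected in a `default_args` dictionary shared by every task in a DAG. The keys below are the operator arguments named above plus `max_retry_delay`; the values are illustrative. The `approximate_delays` helper is a simplified doubling schedule written for this sketch; Airflow's own calculation differs in detail (it adds jitter).

```python
from datetime import timedelta

# Operator argument names; the values are illustrative assumptions.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "execution_timeout": timedelta(hours=2),
}

def approximate_delays(args):
    """Simplified doubling schedule, capped at max_retry_delay."""
    delays = []
    for attempt in range(args["retries"]):
        delay = args["retry_delay"] * (2 ** attempt)
        delays.append(min(delay, args["max_retry_delay"]))
    return delays

delays = approximate_delays(default_args)
```

With these values the task waits roughly 5, 10, and then 20 minutes between attempts, and any single attempt is killed after two hours.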
Figure 9.3: Airflow task instance lifecycle — states and transitions
stateDiagram-v2
[*] --> none
none --> scheduled: scheduler picks up
scheduled --> queued: executor accepts
queued --> running: worker starts
running --> success: exit 0
running --> failed: exception
running --> up_for_retry: retries remaining
up_for_retry --> scheduled: after retry_delay
failed --> [*]
success --> [*]
running --> up_for_reschedule: sensor reschedule
up_for_reschedule --> scheduled: poke_interval elapsed
scheduled --> skipped: branch / trigger rule
skipped --> [*]
queued --> upstream_failed: dep failed
upstream_failed --> [*]
Key Takeaway: Set catchup=False in production, run backfills manually with explicit windows, and let retries + idempotency absorb transient failures.
SLA Monitoring and Alerting
A pipeline that completes successfully at 11 p.m. when it was due at 9 a.m. is a failed pipeline, even though every task returned green. Airflow’s SLA mechanism flags exactly this. Setting sla=timedelta(hours=2) on a task means “this task should finish within 2 hours of its scheduled execution time.” If it doesn’t, the run still proceeds, but Airflow records an SLA miss in the metadata DB, surfaces it on the SLA Misses page, and (if configured) calls a callback or sends an email.
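from datetime import timedelta
from airflow.operators.python import PythonOperator

# upload_fn, alert_pagerduty, and log_retry are callables defined elsewhere.
upload_to_warehouse = PythonOperator(
    task_id='upload_to_warehouse',
    python_callable=upload_fn,
    sla=timedelta(hours=2),               # record an SLA miss if not done within 2 h
    on_failure_callback=alert_pagerduty,  # called with the task context on failure
    on_retry_callback=log_retry,          # called with the task context on each retry
)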
Alerting in production typically uses three channels in order of severity:
| Channel | Trigger | Example |
|---|---|---|
| Slack/Teams | SLA miss, retry exhausted | #data-alerts posts task ID + log link |
| Email | DAG-level failure | Owner inbox |
| PagerDuty / OpsGenie | Tier-1 SLA breach | Wakes on-call engineer |
The on_failure_callback, on_success_callback, and on_retry_callback hooks let you wire any of these. For richer observability—percentile latency over time, retry rate by operator, freshness lag of downstream tables—teams plug Airflow into data observability platforms that consume Airflow’s StatsD metrics [Source: https://www.montecarlodata.com/blog-data-observability-use-cases/].
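A callback is an ordinary function that receives the task's context dictionary. The sketch below formats a failure message from that context; `build_alert` and `alert_chat` are hypothetical names, the `send` argument is a placeholder for a real chat or paging client, and `fake_context` simulates the dictionary Airflow passes at runtime, with an invented log URL.

```python
from types import SimpleNamespace

def build_alert(context):
    """Format a failure message from an Airflow-style callback context."""
    ti = context["task_instance"]
    return (
        f"Task failed: {ti.dag_id}.{ti.task_id} "
        f"(try {ti.try_number}) | {context.get('exception')} | {ti.log_url}"
    )

def alert_chat(context, send=print):
    """Callback sketch; `send` is a placeholder for a real chat client."""
    send(build_alert(context))

# Simulated context, standing in for what Airflow passes at runtime.
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="data_pipeline",
        task_id="upload_to_warehouse",
        try_number=3,
        log_url="https://airflow.example.com/log",
    ),
    "exception": "TimeoutError: warehouse unreachable",
}
message = build_alert(fake_context)
```

Passing `on_failure_callback=alert_chat` on an operator would route every failure of that task through the same formatter.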
A good production rule: every DAG has an owner, every owner has an alert channel, and every SLA-bearing task has an explicit sla argument. Silent failures are far more dangerous than loud ones.
Key Takeaway: Define SLAs as timedelta arguments on tasks, hook failures into chat and paging via callbacks, and treat “succeeded but late” as a real failure mode.
Alternative Orchestrators
Airflow is dominant, not exclusive. Three well-funded competitors target adjacent problems and sometimes fit better.
AWS Step Functions for State Machines
Step Functions is AWS’s serverless orchestrator built around the Amazon States Language—a JSON DSL for defining state machines [Source: https://docs.aws.amazon.com/step-functions/]. Where Airflow models a workflow as a DAG of tasks, Step Functions models it as states (Task, Choice, Parallel, Map, Wait, Pass, Succeed, Fail) with explicit transitions.
{
  "StartAt": "ExtractData",
  "States": {
    "ExtractData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:extract",
      "Next": "CheckCount",
      "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}]
    },
    "CheckCount": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.count", "NumericGreaterThan": 0, "Next": "Transform"}
      ],
      "Default": "Skip"
    },
    "Transform": {"Type": "Task", "Resource": "...", "End": true},
    "Skip": {"Type": "Succeed"}
  }
}
Step Functions shines for service choreography—orchestrating Lambdas, ECS tasks, SageMaker jobs, EventBridge events, and SQS messages with native retries, error catchers, and parallel/map semantics. Pricing is pay-per-state-transition (~$0.000025 per transition for Standard workflows) [Source: https://docs.aws.amazon.com/step-functions/]. There is no scheduler to manage, no DAG file to deploy, no worker pool to size.
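Per-transition pricing makes cost estimates a one-line calculation. The sketch below uses the Standard-workflow figure quoted above and an assumed workload of 1,000 executions a day with four transitions each; it covers only the transition charge, not the Lambda or other service costs the workflow incurs.

```python
PRICE_PER_TRANSITION = 0.000025  # Standard workflows, figure quoted above

def monthly_transition_cost(executions, transitions_per_execution):
    """State-transition charge only; other service costs are extra."""
    return executions * transitions_per_execution * PRICE_PER_TRANSITION

# Assumed workload: 1,000 executions a day for 30 days, 4 transitions each.
cost = monthly_transition_cost(executions=30_000, transitions_per_execution=4)
```

At that volume the orchestration charge is a few dollars a month, which is why Step Functions is attractive for bursty, event-driven workflows.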
Where it falls short relative to Airflow: scheduling is bare (cron via EventBridge, no @daily shorthand), backfill is manual rerunning of historical executions, the DSL is JSON-heavy compared to Python DAGs, and the ecosystem of “providers” (Snowflake, dbt, Spark) is far smaller—you write Lambda glue instead.
Key Takeaway: Step Functions wins for AWS-native service choreography with branching logic; Airflow wins for data-pipeline workloads with rich scheduling and a vast operator ecosystem.
Dagster and Prefect Comparisons
Dagster reframes orchestration around data assets rather than tasks [Source: https://docs.dagster.io/]. You declare a SQL table or ML model as an @asset, Dagster infers the dependency graph from the assets each one consumes, and the platform tracks lineage, freshness, and materialization automatically. Built-in data-quality checks (@asset_check) run in-line. Dagster Cloud adds run insights and asset-level observability that Airflow needs plugins to approximate. The trade-off is conceptual overhead (if your team thinks in tasks, the asset abstraction can feel forced) and a smaller, though fast-growing, ecosystem.
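The idea of inferring a graph from what each asset consumes can be illustrated with a toy in plain Python. This is not Dagster's implementation or API: the `asset` decorator, the `ASSETS` registry, and `materialize_all` are invented for the sketch, which reads each function's parameter names as its upstream dependencies and runs everything in topological order.

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}

def asset(fn):
    """Toy decorator: register a function as an asset under its own name."""
    ASSETS[fn.__name__] = fn
    return fn

@asset
def raw_orders():
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]

@asset
def daily_revenue(raw_orders):
    return sum(row["amount"] for row in raw_orders)

@asset
def revenue_report(daily_revenue):
    return f"revenue={daily_revenue:.2f}"

def materialize_all():
    """Infer edges from parameter names, then run in dependency order."""
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in ASSETS.items()
    }
    results = {}
    for name in TopologicalSorter(graph).static_order():
        kwargs = {dep: results[dep] for dep in graph[name]}
        results[name] = ASSETS[name](**kwargs)
    return results

results = materialize_all()
```

Nothing in the code declares an edge explicitly; the dependency of `revenue_report` on `daily_revenue` exists only because of the parameter name, which is the core of the asset-graph model.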
Prefect is the most Pythonic of the four [Source: https://docs.prefect.io/]. A flow is just a decorated function:
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]

@task
def transform(x):
    return x * 2

@flow
def pipeline():
    data = extract()
    return [transform(x) for x in data]  # truly dynamic
That [transform(x) for x in data] is a runtime list comprehension generating tasks dynamically based on extracted data—something Airflow only approximates with Dynamic Task Mapping (introduced in 2.3). Prefect Cloud has free and paid tiers (~$50–300/month) and emphasizes developer experience.
| Dimension | Airflow | Step Functions | Dagster | Prefect |
|---|---|---|---|---|
| Model | DAG | State machine | Asset graph | Python flow |
| Dynamic workflows | Limited (Dynamic Task Mapping) | Map state | Dynamic Graphs API | Native |
| Lineage / data quality | Plugin-based | None native | Built-in | Limited |
| AWS integration | Provider packages | Native | Provider packages | Provider packages |
| Pricing | OSS / MWAA $720+/mo | $0.000025/transition | OSS / Cloud | OSS / Cloud $50–300/mo |
| Community size | Largest | AWS-only | Growing fast | Mid-size |
| Best for | Batch ETL at scale | Service choreography | Data platforms | Python-first dynamic flows |
Choosing an Orchestrator
The decision is rarely about features in isolation; it is about fit with infrastructure, team skills, and workload shape. A useful four-question framework:
- Where does compute live? AWS-only? Step Functions cuts setup. Multi-cloud or on-prem? Airflow, Dagster, or Prefect.
- What is the workload pattern? Daily batch ETL with hundreds of DAGs? Airflow. Asset lineage and quality central? Dagster. Dynamic, data-shaped fan-out? Prefect. Cross-service call graphs? Step Functions.
- What is the team size and shape? Small Python-first team? Prefect’s DX. Large data org with platform team? Airflow’s ecosystem. AWS-shop with serverless culture? Step Functions.
- How important is data observability? Lineage and quality are first-class? Dagster. General run monitoring is enough? Any option, including Airflow with plugins.
A common pattern in mature organizations is to combine orchestrators: Airflow (or MWAA) runs the daily batch warehouse pipelines, Step Functions stitches together event-driven Lambda workflows, and Dagster powers the analytics platform. They are not mutually exclusive—the right answer is whichever tool earns its operational tax on each workload.
Figure 9.4: Orchestrator selection decision flow
flowchart TD
Start([New workflow to orchestrate]) --> Q1{Compute lives<br/>only on AWS?}
Q1 -->|Yes| Q2{Service choreography<br/>Lambda + ECS + SageMaker?}
Q1 -->|No / Multi-cloud| Q3{Daily batch ETL<br/>with rich scheduling?}
Q2 -->|Yes| SF[Step Functions]
Q2 -->|No, batch ETL| Q3
Q3 -->|Yes| Q4{Data assets<br/>and lineage central?}
Q3 -->|No, dynamic flows| Q5{Python-first team<br/>runtime fan-out?}
Q4 -->|Yes| Dagster[Dagster]
Q4 -->|No| Airflow[Airflow / MWAA]
Q5 -->|Yes| Prefect[Prefect]
Q5 -->|No| Airflow
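The decision flow in Figure 9.4 can also be written as a small function, which makes the branch order explicit. The function and parameter names are made up for this sketch, and the output is a starting suggestion rather than a verdict.

```python
def choose_orchestrator(aws_only, service_choreography, batch_etl,
                        lineage_central, runtime_fan_out):
    """Walk the decision flow of Figure 9.4 and return a suggestion."""
    if aws_only and service_choreography:
        return "Step Functions"
    if batch_etl:
        return "Dagster" if lineage_central else "Airflow / MWAA"
    return "Prefect" if runtime_fan_out else "Airflow / MWAA"

suggestion = choose_orchestrator(
    aws_only=True,
    service_choreography=False,
    batch_etl=True,
    lineage_central=False,
    runtime_fan_out=False,
)
```

An AWS-only team running daily batch ETL without lineage requirements lands on Airflow / MWAA, matching the path through Q1, Q2, Q3, and Q4 in the figure.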
Key Takeaway: Airflow for batch ETL at scale; Step Functions for AWS service choreography; Dagster for asset/quality-first platforms; Prefect for Python-native dynamic flows. Match the tool to workload and infrastructure, and don’t be afraid to mix them.
Chapter Summary
Workflow orchestration is the dispatcher that decides when every piece of your pipeline runs. Apache Airflow has won the batch orchestration market by exposing workflows as Python DAGs interpreted by a four-pillar runtime—scheduler, webserver, workers, and metadata DB—with a vast ecosystem of operators and a choice of Celery (mature, fixed shape) or Kubernetes (dynamic, cloud-native) executors. Sensors wait for external conditions, XComs pass small payloads, and the TaskFlow API keeps DAG code Pythonic.
Amazon MWAA packages all of this on AWS Fargate, with three environment classes (mw1.small/medium/large) sized by vCPU and memory, automatic worker autoscaling driven by CloudWatch task metrics, S3-based DAG deployment, and full VPC and IAM integration. The opinionated S3 contract (dags/, plugins.zip, requirements.txt) makes deploys boring—exactly what you want from a managed service.
Production-grade orchestration rests on three patterns. Idempotency through {{ ds }}-keyed partitioned writes lets retries and backfills run safely. Backfill discipline (catchup=False, manual airflow dags backfill with bounded windows) keeps history under control. SLA monitoring with task-level sla arguments and on_failure_callback hooks turns silent late completions into loud, actionable alerts.
Airflow is not the only option. AWS Step Functions wins for AWS-native service choreography with branching state-machine semantics, Dagster wins for asset- and lineage-first data platforms, and Prefect wins for Python-first teams who need dynamic, runtime-shaped flows. The best orchestrator is the one whose model fits your workload, ecosystem, and team—often a portfolio rather than a single choice.
Key Terms
- Apache Airflow — Open-source workflow orchestrator that expresses pipelines as Python DAGs interpreted by a scheduler, executor, and worker pool backed by a metadata database.
- DAG (Directed Acyclic Graph) — A Python file declaring tasks and their dependencies; “directed” means edges have direction, “acyclic” means no loops, ensuring the scheduler can determine completion.
- Operator — A template for the unit of work a task performs (e.g., PythonOperator, BashOperator, PostgresOperator); each invocation in a DAG yields a task.
- Sensor — A specialized operator that waits for an external condition (file, SQL row, S3 prefix). Runs in poke mode (holds the worker) or reschedule mode (releases between checks).
- MWAA (Managed Workflows for Apache Airflow) — AWS’s fully managed Airflow service running on Fargate inside a customer VPC, with three environment classes and S3-based DAG deployment.
- XCom (Cross-Communication) — Airflow’s mechanism for passing small (~64 KB) payloads between tasks via the metadata database; large data should flow through external storage with only references in XCom.
- Backfill — Running historical DAG runs for execution dates earlier than today, used to apply new logic to past data; safe only when tasks are idempotent.
- Step Functions — AWS’s serverless orchestrator built around the Amazon States Language, optimized for service choreography with native AWS integrations and pay-per-state-transition pricing.
- Idempotency — The property that running a task multiple times for the same logical date produces identical output; achieved by partitioning writes on {{ ds }} and delete-then-insert patterns, and required for safe retries and backfills.
Chapter 10: Zero-ETL and SaaS Integration
Learning Objectives
By the end of this chapter, you will be able to:
- Define zero-ETL and explain how managed integrations change the economics and operations of traditional ingestion pipelines.
- Configure an Aurora-to-Redshift zero-ETL integration end to end, including the prerequisite parameter groups, authorization, and database creation steps.
- Use AWS Glue Zero-ETL connectors to ingest SaaS data from Salesforce, SAP OData, ServiceNow, and advertising platforms.
- Identify when a managed integration is the right tool versus when traditional CDC, batch ETL, or open-source streaming (Debezium) remains the better choice.
What Zero-ETL Means
For most of data warehousing’s history, getting data from an operational database into an analytics warehouse meant building a pipeline. You wrote extraction code, scheduled it with cron or Airflow, parsed change logs with Debezium, paid for AWS DMS instances, or stitched together S3 staging plus a COPY command. Each of these layers had a price: engineering time, infrastructure cost, and—most painfully—latency. Zero-ETL is the industry’s collective decision that for the most common case (operational data → analytical store), the cloud provider should own that pipeline so you don’t have to.
Think of zero-ETL like municipal water service. You used to dig your own well, install your own pump, and maintain your own pipes. The water still arrived, but the operational overhead was enormous. Modern plumbing means turning a tap—the utility handles capture, transport, treatment, and delivery. Zero-ETL is plumbing for data: you connect a source to a destination, and the cloud provider runs the change capture, the network transport, and the apply logic invisibly underneath [Source: https://aws.plainenglish.io/zero-etl-is-the-reality-check-every-data-engineer-needs-in-2026-c4623d7df460].
From Pipeline Code to Managed Integration
The phrase “zero-ETL” is slightly misleading—the Extract, Transform, Load work still happens, it’s just that you don’t write or operate it. The cloud provider replaces three artifacts you used to maintain:
| Traditional Artifact | Zero-ETL Replacement |
|---|---|
| Extraction job (DMS task, Debezium connector, custom script) | Managed CDC stream owned by the source service |
| Staging bucket / Kafka topic | Internal AWS-managed transport (invisible) |
| Load script (COPY, MERGE, dbt incremental model) | Auto-generated apply into Redshift / OpenSearch / Lakehouse |
The result is that an engineer’s mental model collapses from “build and operate a pipeline” to “create an integration.” The integration is a CloudFormation-style declaration: “Replicate Aurora cluster A into Redshift workgroup B, optionally filtering to schemas X and Y.” AWS handles parameter group reads, snapshot exports, ongoing CDC, schema evolution, retries, and backpressure [Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html].
This shift matters because pipeline code has historically been the highest-bug-density part of a data platform. Schema drift breaks parsers; network blips orphan offsets; clock skew corrupts watermarks. Moving that code into a managed service eliminates an entire category of pages.
Key Takeaway: Zero-ETL doesn’t make ETL disappear—it relocates ownership of the pipeline from your team to the cloud provider, replacing extraction code, staging buckets, and load scripts with a single declarative integration object.
Change Capture Under the Hood
Although AWS markets zero-ETL as “magic,” the underlying mechanism is the same log-based Change Data Capture (CDC) pattern that has powered Debezium and DMS for years. For Aurora MySQL, the integration reads the binary log (binlog) in ROW format, which records every insert, update, and delete as a structured event. For Aurora PostgreSQL, it reads the write-ahead log (WAL) using logical replication. The provider’s internal service consumes those records, batches them, and applies them to the target [Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.setting-up.html].
The flow has two distinct phases:
Phase 1 — Initial Snapshot
Aurora cluster ──snapshot export──► Redshift target tables
(existing rows copied)
Phase 2 — Ongoing CDC
Aurora binlog ──CDC stream──► Apply engine ──MERGE──► Redshift
(every INSERT/UPDATE/DELETE/DDL captured within seconds)
Phase 1 seeds Redshift with whatever data already lives in Aurora at the moment of integration creation. Phase 2 keeps the two systems in sync continuously. Because the source is the binary log rather than a SQL SELECT, the operational database’s CPU and IO impact stays low—log readers consume only the change record, not the full row scan that an extraction SELECT would force [Source: https://aws.amazon.com/blogs/database/transition-from-aws-dms-to-zero-etl-to-simplify-real-time-data-integration-with-amazon-redshift/].
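The two phases can be modeled in a few lines: a snapshot seeds the replica, then ordered change events are applied as upserts and deletes. This is a conceptual sketch only; the event shape (`op`, `pk`, `row`) is invented for illustration and is not the binlog or WAL format, and `apply_changes` is a hypothetical stand-in for the managed apply engine.

```python
def apply_changes(target, events):
    """Apply ordered row-level change events to a table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["pk"]
        if op in ("insert", "update"):
            # Upsert: the same effect a MERGE has on the target table.
            target[key] = {**target.get(key, {}), **event["row"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return target

# Phase 1: snapshot of rows that already exist in the source.
replica = {1: {"status": "new", "amount": 10.0}}

# Phase 2: change events in commit order (a made-up event shape).
events = [
    {"op": "insert", "pk": 2, "row": {"status": "new", "amount": 25.5}},
    {"op": "update", "pk": 1, "row": {"status": "paid"}},
    {"op": "delete", "pk": 2},
]
apply_changes(replica, events)
```

Because events are applied in commit order and keyed by primary key, the replica converges on the source's final state even when a row is inserted and deleted within the same batch.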
Figure 10.1: Aurora-to-Redshift zero-ETL change capture flow (snapshot + ongoing CDC)
sequenceDiagram
participant App as Application
participant Aurora as Aurora Cluster
participant Binlog as Binlog / WAL
participant ZE as Zero-ETL Service
participant RS as Redshift Target
Note over Aurora,RS: Phase 1 - Initial Snapshot
ZE->>Aurora: Trigger snapshot export
Aurora-->>ZE: PITR snapshot
ZE->>RS: Bulk load existing rows
RS-->>ZE: Snapshot complete
Note over Aurora,RS: Phase 2 - Ongoing CDC (seconds)
App->>Aurora: INSERT / UPDATE / DELETE
Aurora->>Binlog: Write ROW-format change record
ZE->>Binlog: Tail change events
Binlog-->>ZE: Stream of row deltas
ZE->>RS: MERGE into target tables
RS-->>App: Available for analytics
For SaaS sources via AWS Glue Zero-ETL, the mechanism varies. Salesforce uses the Bulk API for high-throughput ingestion; SAP OData with ODP entities uses delta links that the SAP server itself produces; non-ODP entities fall back to timestamp-based polling or full-extract-plus-upsert [Source: https://aws.amazon.com/blogs/big-data/sap-data-ingestion-and-replication-with-aws-glue-zero-etl/].
Key Takeaway: Zero-ETL integrations are still log-based CDC under the hood—the cloud provider just owns the binlog reader, the network transport, and the apply engine, exposing only a “source plus target” abstraction to the user.
Latency and Consistency Guarantees
The defining performance claim of zero-ETL is near real-time, with replicated rows typically appearing in the analytics target within seconds of a commit on the source [Source: https://aws.amazon.com/rds/aurora/zero-etl/]. That’s several orders of magnitude faster than batch ETL (hours to days) and competitive with self-managed Debezium or DMS pipelines (seconds to minutes), without the operational overhead of either.
Two consistency caveats matter for real systems:
- Eventual consistency, not synchronous replication. A transaction committed on Aurora is not immediately visible in Redshift. There is a propagation window—usually under 15 seconds, occasionally longer under load. Applications that need strict read-your-writes semantics must read from Aurora; Redshift is for analytics, not operational lookups.
- DDL replication is supported but not always instant. Adding columns and renaming tables on the source propagate to the target; index changes have no counterpart to propagate to, because Redshift has no conventional indexes. More aggressive schema changes (incompatible type alterations, dropping primary keys) can pause or break the integration and require manual intervention.
Because the latency budget is dominated by binlog flush plus apply time, the most common cause of a lagging integration is high write volume on the source. A workload that bursts to 50,000 inserts per second can leave Redshift a minute or more behind for a while. That is still vastly better than the daily batch it replaces, but it is worth monitoring with CloudWatch metrics [Source: https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.html].
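The burst arithmetic can be made concrete with a back-of-the-envelope model. This is a simplified sketch: it assumes a constant apply rate and that writes stop when the burst ends, and the apply rate of 40,000 rows per second is a made-up figure, not a published Redshift number.

```python
def estimate_lag_seconds(write_rate: float, apply_rate: float, burst_seconds: float) -> float:
    """Estimate replication lag at the end of a write burst.

    The backlog grows at (write_rate - apply_rate) rows per second during the
    burst; the lag is the time the apply side needs to drain that backlog.
    """
    if apply_rate <= 0:
        raise ValueError("apply_rate must be positive")
    backlog_rows = max(0.0, write_rate - apply_rate) * burst_seconds
    return backlog_rows / apply_rate


# A five-minute burst of 50,000 inserts/s against a hypothetical 40,000 rows/s apply rate.
burst_lag = estimate_lag_seconds(write_rate=50_000, apply_rate=40_000, burst_seconds=300)

# Below the apply rate, no backlog builds up in this model.
steady_lag = estimate_lag_seconds(write_rate=10_000, apply_rate=40_000, burst_seconds=300)
```

Under these assumptions the burst leaves a backlog of 3,000,000 rows, which takes 75 seconds to drain.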
Key Takeaway: Plan for second-scale, eventually consistent replication—zero-ETL is excellent for analytics and ML feature pipelines but not a substitute for synchronous reads in operational paths.
Database Zero-ETL Integrations
Aurora MySQL/PostgreSQL → Redshift
Aurora-to-Redshift is the flagship zero-ETL integration and the most thoroughly documented. The setup involves two consoles—RDS for the source, Redshift for the target—and a small number of prerequisites that, if missed, cause cryptic failures during integration creation.
Aurora source prerequisites:
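The Aurora cluster must use a custom DB cluster parameter group (the default group cannot be modified). For Aurora MySQL, the binary-log parameters must be set as follows:
aurora_enhanced_binlog = 1
binlog_backup = 0
binlog_format = ROW
binlog_replication_globaldb = 0
binlog_row_image = full
binlog_row_metadata = full
binlog_format = ROW ensures that each individual row change is logged rather than the statement that caused it; statement-based or mixed-format binlogs cannot be safely replicated. binlog_row_image = full records complete before/after row values, and aurora_enhanced_binlog = 1 turns on Aurora's enhanced binlog, which the integration depends on. Aurora PostgreSQL uses a different set of logical-replication parameters, so check the setup guide for the engine you run [Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.setting-up.html].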
The Aurora MySQL or PostgreSQL engine version must also be on a supported list. AWS regularly expands compatibility, so check the current documentation before assuming a legacy version qualifies.
Redshift target prerequisites:
The target Redshift cluster (or serverless workgroup) must have case sensitivity enabled through the enable_case_sensitive_identifier parameter. On a provisioned cluster this is set in a custom parameter group; on a serverless workgroup it is a workgroup configuration parameter. It can be turned on for an existing data warehouse, so creating a new cluster is not required. The target also needs an authorization policy identifying the Aurora source ARN as an allowed integration writer [Source: https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-setting-up.create-integration-aurora.html].
Two-console workflow:
RDS Console ────────────────────────► Redshift Console
"Create zero-ETL integration" "Create database from integration"
├── Source: Aurora cluster ├── Select integration
├── Target: Redshift workgroup ├── Source database name
├── (Optional) data filters └── Target database name
└── Create └── Now queryable!
In RDS, you select the source cluster, the target Redshift namespace, and (optionally) data filters that scope replication to specific tables or schemas. Filtering is a critical cost-and-compliance lever: if your Aurora cluster has 800 tables but only 40 are interesting for analytics, replicating just those 40 reduces Redshift storage cost and avoids accidentally exporting PII tables [Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.creating.html].
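To reason about what a filter set will replicate before creating the integration, a small simulation can help. The sketch below is a simplified model, not the service's evaluation logic: the `(action, pattern)` rule shape, the glob matching over `database.table` names, and the "last matching rule wins" behavior are assumptions for illustration, and the table names are invented.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

# Each rule is ("include" | "exclude", glob pattern over "database.table").
FilterRule = Tuple[str, str]


def select_tables(tables: Iterable[str], rules: List[FilterRule]) -> List[str]:
    """Return the tables that remain after applying the rules in order.

    Simplified model: nothing is replicated unless a rule includes it, and the
    last rule that matches a table decides whether it is replicated.
    """
    selected = []
    for table in tables:
        replicate = False
        for action, pattern in rules:
            if action not in ("include", "exclude"):
                raise ValueError(f"unknown action: {action}")
            if fnmatchcase(table, pattern):
                replicate = action == "include"
        if replicate:
            selected.append(table)
    return selected


tables = ["shop.orders", "shop.order_items", "shop.customers_pii", "audit.events"]
rules = [("include", "shop.*"), ("exclude", "shop.*_pii")]
replicated = select_tables(tables, rules)
```

Running a dry evaluation like this against the real table list is a cheap way to confirm that PII tables are excluded before any data leaves the source.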
After RDS creates the integration (which takes a few minutes for the snapshot phase), you switch to Redshift’s Query Editor v2 and run “Create database from integration.” This produces a queryable Redshift database backed by the replicated Aurora data. From that point, you can SELECT against it, build materialized views, join it with other Redshift schemas, and feed it into dbt models or Redshift ML.
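The Redshift-side step comes down to one SQL statement. The helper below is a minimal sketch that builds it for the simplest case; the function name, the validation rules, and the example identifiers are hypothetical, and it assumes the integration ID is a UUID. Sources that require a source database name use an additional clause that is not shown here, so check the Redshift SQL reference for the exact form.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_UUID = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def create_database_sql(target_db: str, integration_id: str) -> str:
    """Return the statement that maps an integration to a new Redshift database."""
    # Validate both inputs so the statement cannot be altered by stray characters.
    if not _IDENTIFIER.fullmatch(target_db):
        raise ValueError(f"unsafe database name: {target_db!r}")
    if not _UUID.fullmatch(integration_id):
        raise ValueError(f"integration ID is not a UUID: {integration_id!r}")
    return f"CREATE DATABASE {target_db} FROM INTEGRATION '{integration_id}';"


# Both arguments are placeholders; use the integration ID reported by Redshift.
sql = create_database_sql("orders_analytics", "a1b2c3d4-5678-90ab-cdef-1234567890ab")
```

Generating the statement from code rather than typing it keeps the step repeatable when the same integration pattern is rolled out across environments.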
Pricing model:
There is no separate fee for the integration itself. You pay only for the Aurora storage/compute you already pay for, the Redshift compute that ingests and stores the replicated data, and any cross-region or cross-account data transfer charges. Compared to running a DMS replication instance ($150–$1,500+ per month per instance), this is often the cheapest path [Source: https://aws.amazon.com/rds/aurora/zero-etl/].
Concrete example: An e-commerce company runs orders, inventory, and customer profiles in Aurora MySQL. Their analytics team needs hourly dashboards plus a daily ML model that predicts churn. Pre-zero-ETL, they ran:
- A Debezium connector on a self-managed Kafka cluster ($600/month EC2)
- An S3 sink connector (about $200/month of engineer time to maintain)
- A Redshift COPY job orchestrated by Airflow ($300/month MWAA)
- Approximately 40 hours/month of pipeline incident response
Post-zero-ETL, all four artifacts collapse into a single integration. Operating cost drops to roughly the marginal Redshift compute, and incident response on the ingestion path drops to near zero because there is no longer infrastructure to break.
Key Takeaway: Aurora-to-Redshift zero-ETL replaces DMS/Debezium pipelines feeding analytics with no separate integration fee (you still pay for Aurora and Redshift usage), requiring only a parameter group, case-sensitive Redshift, and a two-console workflow.
DynamoDB → Redshift / OpenSearch
DynamoDB is AWS’s key-value/document database, optimized for single-digit-millisecond reads at any scale. Its weakness is analytics: DynamoDB cannot run group-by queries, complex joins, or full-text search efficiently. Historically, teams worked around this with DynamoDB Streams plus Lambda plus Firehose pipelines—expensive in code and operations.
Zero-ETL collapses this with two integration targets:
- DynamoDB → Redshift for SQL analytics, BI dashboards, and ML training data
- DynamoDB → OpenSearch for full-text search and log-style queries
Under the hood, DynamoDB zero-ETL uses point-in-time recovery (PITR) for the initial export plus DynamoDB Streams (or the underlying change log) for ongoing CDC. Documents are flattened into Redshift tables or OpenSearch indices according to schema mapping rules.
Example scenario: A SaaS product stores per-user feature flags and event counters in DynamoDB. Product managers want a dashboard showing feature adoption by cohort. With zero-ETL to Redshift, the latest DynamoDB document for each user becomes a row in a Redshift table, joinable with billing and signup tables. The dashboard refreshes near-real-time, and the team writes zero pipeline code.
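The idea of turning a nested document into joinable columns can be sketched in a few lines. This illustrates the concept only: the real schema mapping rules belong to the service, and the underscore-joined column names and the sample item are assumptions.

```python
from typing import Any, Dict


def flatten_document(doc: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested maps into a single level of column-like keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested maps, carrying the parent name as a prefix.
            flat.update(flatten_document(value, column, sep))
        else:
            # Scalars and lists are kept as-is; lists would need their own rule.
            flat[column] = value
    return flat


item = {
    "user_id": "u-42",
    "flags": {"dark_mode": True, "beta_search": False},
    "counters": {"logins": 17},
}
row = flatten_document(item)
```

Each flattened key becomes a candidate column, so a query can filter on `flags_dark_mode` or join on `user_id` without unpacking the document at read time.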
RDS → Redshift
Beyond Aurora, AWS has progressively extended zero-ETL to standard RDS engines, including RDS for MySQL. The mechanics mirror Aurora: parameter group, binlog enable, target authorization, console-based integration creation. The supported engine matrix expands every few quarters, so the practical advice is to consult the current AWS documentation before designing around assumed support [Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html].
The architectural pattern is identical regardless of source engine:
┌─────────────────────────────┐
│ Redshift Target │
│ (case-sensitive cluster) │
└──────────────▲──────────────┘
│ replicated
│ (seconds)
┌──────────────────────────┴──────────────────────────┐
│ │
┌────┴─────┐ ┌──────────┐ ┌──────────────┐ ┌──────┴───┐
│ Aurora │ │ Aurora │ │ DynamoDB │ │ RDS for │
│ MySQL │ │ Postgres │ │ │ │ MySQL │
└──────────┘ └──────────┘ └──────────────┘ └──────────┘
Key Takeaway: Database zero-ETL is the same pattern instantiated for many engines—Aurora MySQL, Aurora PostgreSQL, DynamoDB, and RDS—giving Redshift (and sometimes OpenSearch) a unified, near-real-time view of operational data.
Figure 10.2: Database zero-ETL fan-in across heterogeneous AWS sources
flowchart BT
AM[Aurora MySQL<br/>binlog ROW] -->|zero-ETL| RS[(Redshift Target<br/>case-sensitive cluster)]
AP[Aurora PostgreSQL<br/>logical WAL] -->|zero-ETL| RS
DDB[DynamoDB<br/>PITR + Streams] -->|zero-ETL| RS
DDB -->|zero-ETL| OS[(OpenSearch<br/>full-text index)]
RDS[RDS for MySQL<br/>binlog ROW] -->|zero-ETL| RS
RS --> BI[BI / dbt / Redshift ML]
OS --> SRCH[Search & log analytics]
SaaS Zero-ETL
The most disruptive recent expansion of zero-ETL is its application to Software-as-a-Service sources, where data has historically lived behind REST or GraphQL APIs that required custom connectors. AWS Glue Zero-ETL extends the managed-integration pattern to the SaaS world.
Salesforce, SAP, ServiceNow Connectors
AWS Glue Zero-ETL currently supports the following SaaS sources [Source: https://docs.aws.amazon.com/glue/latest/dg/zero-etl-sources.html]:
| Source | Auth Mechanism | Incremental Support |
|---|---|---|
| Salesforce | OAuth / Authorization Code | Yes (automatic, Bulk API) |
| Salesforce Marketing Cloud Account Engagement | OAuth | Yes |
| SAP OData (ODP_SAP, ODP_CDS) | Service URL + credentials | Yes (delta links) |
| SAP OData (non-ODP) | Service URL + credentials | Timestamp field, or full + upserts |
| ServiceNow | Instance URL + credentials | Yes |
| Zendesk | OAuth | Yes |
| Zoho CRM | OAuth | Yes |
| Facebook Ads | OAuth | Yes |
| Instagram Ads | OAuth | Yes |
Configuration is a two-step process:
Step 1 — Create a Glue Data Catalog connection. This is a reusable artifact storing the SaaS instance URL, IAM service role, and authentication credentials (typically OAuth tokens via authorization code flow for Salesforce or a static credential set for SAP). Always use the “Test connection” button before proceeding—a misconfigured OAuth scope is the most common silent failure.
Step 2 — Create a zero-ETL integration. From the AWS Glue console, pick the SaaS connector type, choose the existing connection, select the entities/tables to replicate, and pick a destination (Redshift or SageMaker Lakehouse) [Source: https://docs.aws.amazon.com/glue/latest/dg/zero-etl-configuring-integration.html].
AWS Glue Zero-ETL — SaaS Pattern
Salesforce / SAP / ServiceNow
│
│ (Bulk API / delta link / timestamp poll)
▼
┌──────────────────────┐
│ Glue Connection │ ←─ OAuth or username/password
│ (Data Catalog) │ ←─ IAM role with Lake Formation perms
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Glue Zero-ETL │ ←─ Entity selection
│ Integration │ ←─ Snapshot + incremental
└──────────┬───────────┘
│
┌──────┴──────┐
▼ ▼
Amazon Redshift        SageMaker Lakehouse
Critical permission requirement: The IAM service role used by the integration must have AWS Lake Formation permissions on the target Glue database. The most common Salesforce zero-ETL failure is IngestionFailed in CloudWatch logs, almost always caused by missing Lake Formation grants on the destination database. The standard pattern is a role like zero_etl_bulk_demo_role with DESCRIBE, CREATE_TABLE, and ALTER permissions on the target Glue database [Source: https://aws.amazon.com/blogs/big-data/accelerate-aws-glue-zero-etl-data-ingestion-using-salesforce-bulk-api/].
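The grant described above can be expressed as a request to the Lake Formation GrantPermissions API. The sketch below only builds the request arguments so it runs without AWS credentials; the helper name, the account ID, and the database name are placeholders, and the commented-out call assumes boto3 is installed and configured.

```python
from typing import Any, Dict, List


def build_database_grant(role_arn: str, database_name: str, permissions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation GrantPermissions API."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": list(permissions),
    }


grant = build_database_grant(
    role_arn="arn:aws:iam::123456789012:role/zero_etl_bulk_demo_role",  # placeholder account ID
    database_name="salesforce_raw",  # hypothetical target Glue database
    permissions=["DESCRIBE", "CREATE_TABLE", "ALTER"],
)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Keeping the grant in code makes it reviewable and repeatable, which helps when an IngestionFailed error needs to be traced back to a missing permission.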
Salesforce specifics: Glue uses the Salesforce Bulk API rather than the standard REST API for zero-ETL. Bulk API is asynchronous, optimized for large datasets, and consumes far fewer API call quota units per row than REST. For an org pulling 50 million Account records, the Bulk API path can be 10x cheaper in API calls—an important cost lever given Salesforce’s per-call governance limits.
SAP specifics: SAP’s Operational Data Provisioning (ODP) framework is the preferred path because it produces native delta tokens (delta links) that ODP-aware extractors consume directly. For non-ODP entities, you must designate a timestamp field (e.g., LastChangeDateTime); rows where that field advances become incremental updates. Without a timestamp field, you fall back to full extracts plus upserts, which means deletes are not detected: a real correctness pitfall for datasets where rows are occasionally deleted [Source: https://aws.amazon.com/blogs/big-data/sap-data-ingestion-and-replication-with-aws-glue-zero-etl/]. AWS expanded supported SAP entity types in late 2025, broadening practical coverage [Source: https://aws.amazon.com/about-aws/whats-new/2025/11/aws-glue-additional-sap-entities-zero-etl-integration-sources/].
ServiceNow specifics: Configuration is straightforward with a connection URL and credentials [Source: https://docs.aws.amazon.com/glue/latest/dg/connecting-to-data-servicenow.html]. Incremental sync works automatically. Common use case: pulling incident, change, and CMDB tables into Redshift to join with infrastructure cost data for ITSM analytics.
Marketing Data Sources (Facebook Ads, Google Ads)
Marketing analytics has historically been the wild west of pipelines—every advertising platform has a different rate-limited API, a different attribution window, and a different schema-per-account. Facebook Ads and Instagram Ads are now first-class Glue zero-ETL sources, replacing custom Singer/Airbyte/Fivetran taps for AWS-native shops.
The configuration follows the same two-step Glue pattern. Once running, the integration replicates campaigns, ad sets, ads, and insights tables on a configurable cadence, landing them in Redshift where they can be joined with first-party customer data for attribution modeling. Note that Google Ads is not yet in the official Glue zero-ETL list at the time of writing—teams needing Google Ads data still use Fivetran, Airbyte, or custom Cloud Functions, which is one practical limit to watch.
Schema Mapping and Incremental Sync
When zero-ETL pulls Salesforce Account records or SAP material master entities into Redshift, schema mapping happens automatically:
- Source field types are mapped to Redshift-compatible types (VARCHAR, TIMESTAMP, DECIMAL, etc.).
- Composite types (Salesforce reference fields, SAP nested OData properties) are flattened into joinable columns.
- New columns added on the source are propagated to the target on the next sync.
Incremental sync is the property that makes zero-ETL economical. A nightly full-table reload of 50 million Salesforce rows is impossibly expensive in API quota; an incremental sync that pulls only the changes since the last sync (typically a few thousand rows) is trivial.
The mechanism varies by source:
- Salesforce/ServiceNow/Zendesk: automatic—the source platform exposes change tracking, and Glue consumes it.
- SAP ODP: delta links provided by the SAP server tell Glue exactly which rows changed since the last token.
- SAP non-ODP with timestamp: Glue queries WHERE LastChangeDate > :watermark, captures changes, advances the watermark.
- SAP non-ODP without timestamp: full reload + upsert. Inserts and updates land correctly, but deletes are silently lost, a correctness footgun.
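The last two mechanisms in the list can be compared with a small simulation. This is an illustrative model rather than Glue's implementation: the integer `last_changed` timestamps, the dict-backed target, and the sample rows are assumptions.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def incremental_sync(source: List[Row], target: Dict[int, Row], watermark: int) -> int:
    """Upsert rows changed after the watermark and return the advanced watermark."""
    new_watermark = watermark
    for row in source:
        if row["last_changed"] > watermark:
            target[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["last_changed"])
    return new_watermark


def full_reload_upsert(source: List[Row], target: Dict[int, Row]) -> None:
    """Upsert every source row; rows missing from the source are never removed."""
    for row in source:
        target[row["id"]] = dict(row)


# First sync: both rows are newer than the initial watermark of 0.
source = [
    {"id": 1, "name": "bolt", "last_changed": 100},
    {"id": 2, "name": "nut", "last_changed": 105},
]
target: Dict[int, Row] = {}
watermark = incremental_sync(source, target, watermark=0)

# The source then updates row 1 and hard-deletes row 2.
source = [{"id": 1, "name": "bolt M8", "last_changed": 110}]
watermark = incremental_sync(source, target, watermark)
full_reload_upsert(source, target)

# Row 2 still exists in the target because neither strategy saw the delete.
stale_ids = sorted(set(target) - {row["id"] for row in source})
```

Neither a timestamp watermark nor a reload-plus-upsert observes a hard delete, so the target keeps row 2. Detecting it requires a change log, a delta token, or a separate reconciliation pass that compares keys.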
Key Takeaway: Glue zero-ETL turns previously bespoke SaaS pipelines into a connection-plus-integration declaration, but you must still understand each source’s incremental mechanism and Lake Formation permissions or face IngestionFailed errors.
Figure 10.3: SaaS connector schema mapping and incremental sync flow
flowchart LR
subgraph SaaS[SaaS Sources]
SF[Salesforce<br/>Bulk API]
SAP[SAP OData<br/>ODP delta links]
SN[ServiceNow<br/>change tracking]
end
subgraph Glue[AWS Glue Zero-ETL]
CONN[Glue Data Catalog<br/>Connection<br/>OAuth / creds]
INT[Zero-ETL Integration<br/>entity selection]
MAP[Schema Mapper<br/>type coercion<br/>flatten composites]
WM[Watermark / Delta Token]
end
subgraph Target[Targets with Lake Formation grants]
RS[(Redshift)]
LH[(SageMaker Lakehouse)]
end
SF --> CONN
SAP --> CONN
SN --> CONN
CONN --> INT
INT --> MAP
INT <--> WM
MAP --> RS
MAP --> LH
Trade-offs and Limits
When You Still Need Pipelines
Despite its appeal, zero-ETL is not a universal solution. The clearest decision points come from comparing it head-to-head with the alternatives [Source: https://www.birjob.com/blog/cdc-replaced-etl-debezium-postgres]:
| Factor | Zero-ETL | Debezium | AWS DMS | Batch ETL |
|---|---|---|---|---|
| Latency | Seconds | Seconds | Minutes | Hours/days |
| Cost | No integration fee (source and target usage still billed) | Infra + engineering | Per-instance fees | Compute + orchestration |
| Infrastructure | Zero management | Self-managed Kafka | Managed instance | Scheduled compute |
| Flexibility | AWS ecosystem only | Any DB + any sink | AWS-focused | Any source/target |
| Transformation | Replication only | Stream processing possible | Limited | Full SQL/Python in-flight |
Choose Debezium when:
- You need multi-cloud flexibility or have non-AWS source/target databases.
- You need true streaming into Kafka for microservices, not just analytics replication.
- You require fine-grained, application-level control over the change stream (filtering, enrichment, transformation in-flight).
Choose AWS DMS when:
- You’re performing a one-time migration with minimal ongoing CDC.
- You need replication between heterogeneous databases not yet supported by zero-ETL.
- You accept vendor lock-in for a managed but separately-billed service.
Choose batch ETL when:
- Latency genuinely doesn’t matter—weekly executive reports, monthly financial reconciliation.
- The source lacks log-based CDC (legacy databases, ad-hoc REST APIs, flat-file FTP feeds).
- You need heavy analytical transformations during ingestion: complex multi-table joins, full-dataset aggregations, ML feature engineering. Zero-ETL is replication-focused; if your ingestion logic includes a 12-table join, you still need a batch pipeline behind it.
The brutal truth is that zero-ETL is excellent at one job—replicating operational data into an analytics store—and bad at everything else. Try to use it for transformation, multi-cloud federation, or fine-grained event filtering, and you’ll quickly hit walls.
Figure 10.4: Decision tree comparing zero-ETL, Debezium, DMS, and batch ETL
flowchart TD
Start[Need to move data<br/>from source to analytics] --> Q1{Latency budget?}
Q1 -->|Hours / days OK| Batch[Batch ETL<br/>scheduled compute]
Q1 -->|Seconds to minutes| Q2{Heavy in-flight<br/>transformation?}
Q2 -->|Yes - joins, aggs, ML feats| Batch
Q2 -->|No - replication only| Q3{Multi-cloud or<br/>non-AWS sink?}
Q3 -->|Yes| Deb[Debezium + Kafka<br/>self-managed CDC]
Q3 -->|No - all AWS| Q4{One-time<br/>migration?}
Q4 -->|Yes| DMS[AWS DMS<br/>per-instance fee]
Q4 -->|No - ongoing| Q5{Source supported<br/>by zero-ETL?}
Q5 -->|Yes| ZE[Zero-ETL<br/>managed, no integration fee]
Q5 -->|No| DMS
Key Takeaway: Zero-ETL is a replication tool, not a transformation tool—choose Debezium for multi-cloud streaming, DMS for migrations, and batch for heavy in-flight transformation or non-CDC sources.
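The decision tree in Figure 10.4 can be written down as a small function, which makes the order of the questions explicit. The `Requirements` fields and the returned labels are names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    batch_latency_ok: bool      # hours or days of delay are acceptable
    heavy_transformation: bool  # joins, aggregations, or feature engineering in flight
    non_aws_endpoint: bool      # multi-cloud or non-AWS source or sink
    one_time_migration: bool    # no long-lived replication needed
    zero_etl_supported: bool    # source is on the current zero-ETL support list


def choose_ingestion(req: Requirements) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if req.batch_latency_ok or req.heavy_transformation:
        return "batch ETL"
    if req.non_aws_endpoint:
        return "Debezium + Kafka"
    if req.one_time_migration:
        return "AWS DMS"
    return "zero-ETL" if req.zero_etl_supported else "AWS DMS"


# Ongoing, replication-only feed from a supported AWS source.
choice = choose_ingestion(
    Requirements(
        batch_latency_ok=False,
        heavy_transformation=False,
        non_aws_endpoint=False,
        one_time_migration=False,
        zero_etl_supported=True,
    )
)
```

Because the checks run in order, a workload that needs heavy in-flight transformation is routed to batch ETL even when its source is supported by zero-ETL, which matches the figure.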
Cost and Quota Considerations
Although zero-ETL itself is free of integration charges, the hidden costs are real:
- Redshift storage and compute. Replicated data takes Redshift space, and queries against it consume RPU-hours on a serverless workgroup or cluster capacity on a provisioned cluster. For a large, write-heavy Aurora cluster, the Redshift bill can grow non-trivially.
- Cross-region transfer. If your Aurora cluster is in us-east-1 and your Redshift workgroup is in us-west-2, every replicated byte incurs cross-region transfer fees. Co-locating dramatically reduces this.
- Source-side API quotas (SaaS). Salesforce Bulk API is cheaper than REST but not free; each org has API governor limits. SAP gateway throughput, ServiceNow rate limits, and Facebook Ads insights API tokens all impose ceilings. A misconfigured “sync every 5 minutes” against Salesforce can drain a daily API quota by lunch.
- Filtering hygiene. Replicating 800 Aurora tables when you need 40 means up to 20x the Redshift storage and 20x the cross-region transfer if the tables are similar in size. Use data filters aggressively at integration creation time; it’s much harder to remove tables later.
The general rule: zero-ETL eliminates pipeline-engineering cost but does not eliminate data-volume cost. Treat the integration like any other production data flow with an appropriate cost-monitoring dashboard.
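A rough estimate shows how data-volume cost scales with the number of replicated tables. The sketch below is a toy model: every price and size in it is a placeholder rather than an AWS list price, and it assumes all tables are the same size.

```python
def monthly_data_cost(
    tables: int,
    gb_per_table: float,
    changed_gb_per_table: float,
    storage_price_per_gb: float,
    transfer_price_per_gb: float,
) -> float:
    """Estimate monthly storage plus cross-region transfer cost for replicated tables."""
    storage_cost = tables * gb_per_table * storage_price_per_gb
    transfer_cost = tables * changed_gb_per_table * transfer_price_per_gb
    return storage_cost + transfer_cost


# Placeholder inputs: 5 GB stored and 1 GB of changes per table per month.
assumptions = dict(
    gb_per_table=5.0,
    changed_gb_per_table=1.0,
    storage_price_per_gb=0.024,
    transfer_price_per_gb=0.02,
)
all_tables_cost = monthly_data_cost(tables=800, **assumptions)
filtered_cost = monthly_data_cost(tables=40, **assumptions)
cost_ratio = all_tables_cost / filtered_cost
```

With equal-sized tables the ratio is simply 800 / 40 = 20, whatever the prices are. Real clusters have skewed table sizes, so an estimate built from actual per-table sizes is more useful than the table count alone.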
Key Takeaway: Free integration does not mean free data—Redshift storage, cross-region transfer, and SaaS API quotas remain real costs that engineers must filter and monitor.
Hybrid Zero-ETL Plus Transformation
In real architectures, zero-ETL rarely stands alone. The dominant industry pattern is hybrid: use zero-ETL to land raw operational data into a Redshift “raw” or “bronze” schema, then use dbt, Redshift materialized views, or scheduled batch jobs to transform that raw data into clean, modeled “silver” and “gold” layers consumable by BI and ML.
Aurora ───zero-ETL───► Redshift "raw" schema
│
│ (dbt models, materialized views, scheduled SQL)
▼
"silver" / "gold" marts
│
├──► Tableau / QuickSight
├──► Redshift ML / SageMaker
└──► Reverse ETL to Salesforce
This pattern preserves the strengths of both approaches: zero-ETL eliminates the brittle ingestion layer, while batch transformations handle the joins, aggregations, and quality checks that zero-ETL was never designed to do.
Figure 10.5: Hybrid zero-ETL plus transformation architecture (bronze/silver/gold)
flowchart LR
subgraph Sources[Operational Sources]
AU[Aurora]
DD[DynamoDB]
SAAS[SaaS via Glue]
end
subgraph Bronze[Raw / Bronze]
RAW[(Redshift raw schema<br/>zero-ETL landing)]
end
subgraph Silver[Silver - cleaned/joined]
DBT[dbt models<br/>Materialized views<br/>Scheduled SQL]
CLEAN[(Conformed marts)]
end
subgraph Gold[Gold - business-ready]
MARTS[(Domain marts)]
end
subgraph Consumers
BI[Tableau / QuickSight]
ML[Redshift ML / SageMaker]
REV[Reverse ETL to Salesforce]
end
AU -->|zero-ETL| RAW
DD -->|zero-ETL| RAW
SAAS -->|zero-ETL| RAW
RAW --> DBT --> CLEAN --> MARTS
MARTS --> BI
MARTS --> ML
MARTS --> REV
A practical migration strategy for moving an existing pipeline-based architecture to zero-ETL follows the 9-week phased pattern documented by the BirJob case study [Source: https://www.birjob.com/blog/cdc-replaced-etl-debezium-postgres]:
Week 1-2: Deploy zero-ETL in shadow mode (alongside existing pipeline)
Week 3-4: Dual-write comparison; validate row counts, schemas, latencies
Week 5: Cut over non-critical consumers (internal dashboards) first
Week 6-8: Migrate primary warehouse feeds to zero-ETL backed tables
Week 9+: Decommission legacy DMS/Debezium pipeline; archive code
The structure preserves the legacy pipeline as a fallback for at least two weeks of dual-write validation, dramatically reducing migration risk.
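The validation in weeks 3-4 amounts to comparing what the legacy pipeline and the zero-ETL integration each delivered. A minimal sketch of that comparison follows, assuming both result sets fit in memory and share a primary-key column; the function name and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compare_tables(legacy: Iterable[Row], candidate: Iterable[Row], key: str) -> Dict[str, List[Any]]:
    """Report keys that are missing, extra, or different in the candidate table."""
    legacy_by_key = {row[key]: row for row in legacy}
    candidate_by_key = {row[key]: row for row in candidate}
    shared = legacy_by_key.keys() & candidate_by_key.keys()
    return {
        "missing_in_candidate": sorted(legacy_by_key.keys() - candidate_by_key.keys()),
        "extra_in_candidate": sorted(candidate_by_key.keys() - legacy_by_key.keys()),
        "mismatched": sorted(k for k in shared if legacy_by_key[k] != candidate_by_key[k]),
    }


legacy_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
zero_etl_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 4, "total": 40}]
report = compare_tables(legacy_rows, zero_etl_rows, key="id")
```

An empty report across several consecutive runs is a reasonable gate for cutting consumers over. For large tables the same comparison is usually pushed into SQL with per-key hashes rather than pulled into memory.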
Key Takeaway: The right way to use zero-ETL in production is hybrid—land raw data via the managed integration, then transform it downstream with dbt or scheduled SQL, and migrate from legacy pipelines via a phased shadow-mode strategy.
Chapter Summary
Zero-ETL represents a structural shift in data engineering: the pipeline code that historically dominated platform-team workload is increasingly absorbed by the cloud provider’s managed services. Aurora-to-Redshift integration delivers second-scale CDC with a parameter-group-and-console workflow, no separate fees, and dramatic reductions in operational toil compared to DMS or self-managed Debezium. DynamoDB and RDS extensions generalize the pattern across operational data stores, and AWS Glue Zero-ETL brings the same managed-integration model to SaaS sources—Salesforce via Bulk API, SAP via ODP delta links, ServiceNow, Zendesk, and the major advertising platforms.
The pattern is not universal. Zero-ETL is replication-focused; transformation-heavy ingestion still belongs in batch pipelines, multi-cloud streaming still belongs in Debezium plus Kafka, and one-time migrations still suit DMS. The dominant production pattern is hybrid: zero-ETL for raw landing, dbt or scheduled SQL for transformation, with phased shadow-mode migrations from legacy pipelines. Properly applied, zero-ETL converts ingestion from a high-bug-density pipeline-engineering problem into a declarative integration object—freeing teams to focus on the analytics, ML, and product work that actually creates business value.
Key Terms
- zero-ETL — A managed data-integration pattern where the cloud provider owns the extraction, transport, and load of operational data into an analytics store, eliminating user-managed pipeline code and infrastructure.
- Aurora — Amazon’s cloud-native relational database service compatible with MySQL and PostgreSQL; the flagship source for AWS zero-ETL integrations into Redshift.
- DynamoDB — Amazon’s managed key-value/document NoSQL database; supports zero-ETL replication into Redshift for analytics and OpenSearch for full-text search.
- Redshift integration — A declarative AWS object that ties a source (Aurora, RDS, DynamoDB, or a Glue SaaS connector) to a target Redshift cluster, automatically handling snapshot loading and ongoing CDC.
- SaaS connector — In AWS Glue Zero-ETL, a preconfigured integration adapter for a specific SaaS platform (Salesforce, SAP OData, ServiceNow, Zendesk, Zoho CRM, Facebook Ads, Instagram Ads) that handles authentication, schema mapping, and incremental sync.
- managed integration — A pipeline whose extraction, transport, and load logic is operated by the cloud provider rather than the customer; zero-ETL is the canonical example.
- incremental sync — The mechanism by which only changed rows since the last sync are pulled from a source, using log-based CDC (binlog/WAL), platform-native change tracking (Salesforce, ServiceNow), delta tokens (SAP ODP), or timestamp watermarks.
Chapter 11: Data Governance, Catalog, and Access Control
Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate technical metadata, business metadata, and lineage, and explain why each matters for compliance and discovery.
- Use the AWS Glue Data Catalog and AWS Lake Formation together to govern lake data, register schemas, and enforce permissions across query engines.
- Configure Amazon DataZone for enterprise data publishing and subscription, including domains, projects, business glossaries, and data products.
- Implement fine-grained access control using row filters, column masks, and LF-TBAC (Lake Formation tag-based access control), and reason about cross-account sharing.
Why Governance Matters
A modern data platform that no one trusts is worse than no platform at all. When analysts cannot tell which customers table is the “real” one, when auditors cannot prove who accessed personally identifiable information last quarter, or when a data scientist accidentally pulls a column containing Social Security numbers into a notebook, the engineering work that came before — ingestion, transformation, partitioning, optimization — collapses into liability. Governance is the discipline that turns a pile of well-engineered storage into an enterprise asset.
Think of governance as the building code for a city of data. Anyone can stack bricks; the building code ensures the wiring is up to spec, the exits are marked, and the inspector can verify both. Without it, every new floor adds risk. With it, every new floor adds value.
Compliance: GDPR, HIPAA, SOC 2
Regulations dictate that personally identifiable information (PII), protected health information (PHI), and financial records be tracked, restricted, and auditable end-to-end. The EU’s General Data Protection Regulation (GDPR) demands that organizations know where every European user’s data lives and can delete it on request. The U.S. Health Insurance Portability and Accountability Act (HIPAA) constrains who may view PHI and requires audit trails. SOC 2 attestations require evidence that controls — including data access controls — are operating effectively over time.
In AWS, these requirements translate into three concrete needs:
- A canonical inventory of every table, column, and partition (the AWS Glue Data Catalog plays this role).
- A policy engine that decides who can see what (Lake Formation grants, IAM, and row/column filters).
- An audit trail that proves who accessed which row when (CloudTrail logs of Lake Formation events). Lake Formation tracks “source events: original CloudTrail API calls; transformation time: the eventDay partition showing when events were processed; query history: which accounts/principals accessed which data” [Source: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake-queries.html].
Key Takeaway: Governance is not a feature you bolt on — it is the connective tissue between regulation and engineering. If your catalog, your policy engine, and your audit log do not agree, your compliance story is fiction.
Data Quality and Trust
Trust is the currency of analytics. A dashboard whose numbers contradict last week’s report drives users away from the platform and back to private spreadsheets. Governance enforces trust by exposing provenance (where this data came from), freshness (when it last updated), and quality metrics (what fraction of rows pass validation). Glue Data Quality scores, embedded in Glue Data Catalog tables, become a visible attribute consumers see before they query.
Consider the analogy of a grocery store. You trust the milk because it has a sell-by date, a producer name, and a USDA stamp. Without those, you would never buy it — and a data product without lineage, owner, and freshness is the same carton with the labels peeled off.
Key Takeaway: Quality and trust are observable properties. If a consumer cannot see who owns a table, when it last loaded, and what quality checks it passed, they will assume the worst.
Cost and Discovery
Ungoverned lakes balloon in cost. Engineers re-derive the same daily_active_users metric in five projects because they cannot find the canonical version. Storage tiers go unmanaged because no one knows which datasets are stale. Query costs spike because partition strategies are invisible to analysts who write SELECT *.
A catalog drives down both costs simultaneously. Discovery — through search, glossaries, and tags — prevents duplication. Metadata-aware tooling (partition pruning, column-level masking) prevents wasteful scans. Lake Formation enforces partition-level access so that “by partitioning your data, you can restrict the amount of data scanned by each query, thereby improving performance and reducing cost” [Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples2.html].
Key Takeaway: Discovery is cost control. The cheapest query is the one you do not have to re-derive because someone already published a trusted data product.
Catalog and Metadata
Metadata is data about data. Without it, a Parquet file in S3 is an opaque blob. With it, that same file becomes a queryable table, a discoverable asset, and a governed resource. AWS organizes metadata in three layers:
| Layer | What it Describes | Where it Lives |
|---|---|---|
| Technical metadata | Schema, types, partitions, formats, locations | Glue Data Catalog |
| Business metadata | Glossary terms, owners, descriptions, classifications | DataZone, Glue tags |
| Lineage metadata | Source-to-target mappings, transformation history | Glue, DataZone, OpenLineage |
Figure 11.1: Three-tier metadata model and the AWS services that host each layer.
flowchart TD
subgraph Tech["Technical Metadata"]
T1[Schemas & Types]
T2[Partitions & Formats]
T3[S3 Locations & Serdes]
end
subgraph Biz["Business Metadata"]
B1[Glossary Terms]
B2[Owners & Descriptions]
B3[Classifications & Tags]
end
subgraph Lin["Lineage Metadata"]
L1[Source -> Target Maps]
L2[Transformation History]
L3[Impact Graph]
end
Tech --> Glue[(AWS Glue Data Catalog)]
Biz --> DZ[(Amazon DataZone)]
Lin --> Both[Glue + DataZone + OpenLineage]
Glue --> Engines[Athena / Redshift / EMR]
DZ --> Portal[DataZone Portal Search]
Both --> Graph[Lineage Graph View]
Glue Data Catalog as Hive Metastore
The AWS Glue Data Catalog is a Hive-compatible metastore: a managed registry of databases, tables, columns, types, partitions, and storage locations. Any tool that speaks the Hive metastore protocol — Athena, Redshift Spectrum, EMR (Spark, Presto, Hive), and Lake Formation itself — can read it. This is the single most important integration point in the AWS analytics stack: register a table once, query it from anywhere.
A Glue table entry typically captures:
- Database and table names, e.g., amazon_security_lake_glue_db_us_east_1.amazon_security_lake_table_us_east_1_eks_audit_2_0 [Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples2.html].
- Column definitions, including nested structures (api.request, http_request.url.path) and array types that require UNNEST.
- Partition columns, such as time_dt, region, accountid. The same source notes that “Security Lake now implements partitioning through time_dt, region, and accountid. Whereas, Security Lake 1.0 implemented partitioning through eventDay, region, and accountid parameters”, illustrating how the catalog tracks schema evolution across versions [Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples2.html].
- Storage descriptor: input format (Parquet, ORC, JSON), serde, and S3 location.
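To make the shape of a table entry concrete, here is a rough sketch of a table definition laid out like the TableInput structure of the Glue CreateTable API. The column types, the nested struct layouts, and the S3 location are assumptions made for illustration; only the table and partition names come from the text above.

```python
# Sketch of a Glue table definition. Types and the S3 path are hypothetical.
table_input = {
    "Name": "amazon_security_lake_table_us_east_1_eks_audit_2_0",
    "TableType": "EXTERNAL_TABLE",
    # Partition columns live outside the regular column list (Hive convention).
    "PartitionKeys": [
        {"Name": "time_dt", "Type": "timestamp"},
        {"Name": "region", "Type": "string"},
        {"Name": "accountid", "Type": "string"},
    ],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "activity_name", "Type": "string"},
            {"Name": "src_endpoint", "Type": "struct<ip:string>"},
            {"Name": "api", "Type": "struct<request:struct<uid:string>>"},
        ],
        "Location": "s3://example-security-lake-bucket/eks_audit/2.0/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}


def summarize(table: dict) -> dict:
    """Pull out the pieces a query engine needs to plan a scan."""
    descriptor = table["StorageDescriptor"]
    return {
        "columns": [col["Name"] for col in descriptor["Columns"]],
        "partitions": [key["Name"] for key in table["PartitionKeys"]],
        "location": descriptor["Location"],
    }


summary = summarize(table_input)
print(summary)
```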
Below is a typical query flow once metadata is registered:
SELECT activity_name, time_dt, src_endpoint.ip
FROM "amazon_security_lake_glue_db_us_east_1"
."amazon_security_lake_table_us_east_1_eks_audit_2_0"
WHERE time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '7' DAY
AND CURRENT_TIMESTAMP;
Athena does not know the file layout — it asks Glue. Glue says “those files are Parquet, partitioned by time_dt, here are the S3 prefixes,” and Athena prunes the scan to one week. The catalog is doing real work in every query.
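The pruning step can be pictured with a small simulation. This is a simplified sketch, assuming one partition per day and a hypothetical bucket name; a real engine receives the partition list from the catalog rather than building it locally.

```python
from datetime import date, timedelta


def build_partitions(last_day: date, days: int) -> dict[date, str]:
    """Hypothetical catalog content: one S3 prefix per daily partition."""
    partitions = {}
    for offset in range(days):
        day = last_day - timedelta(days=offset)
        partitions[day] = f"s3://example-bucket/eks_audit/time_dt={day}/"
    return partitions


def prune(partitions: dict[date, str], start: date, end: date) -> list[str]:
    """Keep only prefixes whose partition value falls inside [start, end]."""
    return [prefix for day, prefix in sorted(partitions.items()) if start <= day <= end]


today = date(2025, 1, 31)
partitions = build_partitions(today, days=90)
to_scan = prune(partitions, today - timedelta(days=7), today)
print(f"{len(to_scan)} of {len(partitions)} partitions scanned")
```

Because the window is inclusive on both ends, the seven-day filter touches eight daily prefixes here and skips the other 82 without opening a single file.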
Schema evolution deserves special attention. As producers add columns or rename fields, the catalog tracks versioned schemas so older partitions can still be queried with their original layout while newer partitions adopt the new schema. The OCSF version progression from _1_0 to _2_0 is an example: tables coexist under the same database, and consumers choose the version that matches their query.
Key Takeaway: The Glue Data Catalog is the Rosetta Stone of AWS analytics. Whatever engine you use, it reads the same definitions, which means schema changes propagate everywhere — and so do governance decisions made on top of those definitions.
Business Glossary in DataZone
Technical metadata tells you a column is a VARCHAR(36). It does not tell you the column represents a “Customer Identifier — the canonical UUID issued by Identity Service v3, considered PII under GDPR Article 4.” That second sentence is business metadata, and Amazon DataZone is the AWS service designed to manage it.
DataZone’s business glossary is a hierarchical collection of standardized terms. A term like Customer Identifier can sit under a parent category Customer Domain, link to a metadata form describing data sensitivity, and be attached to assets, schemas, and individual columns across many Glue tables. “The business glossary is a collection of standardized business terms applied to assets (e.g., tables, S3 objects) and schemas/columns for classification, search, and consistency” [Source: https://docs.aws.amazon.com/datazone/latest/userguide/create-maintain-business-glossary.html].
Why glossaries matter:
- Search. Analysts find data by business meaning (“revenue”) instead of guessing table names.
- Consistency. A term has one definition company-wide; “Active Customer” cannot mean two things in two domains.
- Compliance. Tagging columns with glossary terms like PII or Restricted enables policy automation: any column tagged PII automatically inherits masking rules.
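The policy-automation idea can be sketched as a lookup from glossary terms to actions. This is an illustration only, not a DataZone API; the term-to-action table, the column assignments, and the "stricter action wins" rule are all assumptions.

```python
# Hypothetical policy table: glossary term -> enforcement action.
TERM_POLICIES = {"PII": "mask", "Restricted": "deny"}

# Hypothetical glossary terms attached to each column.
COLUMN_TERMS = {
    "customer_id": {"Customer Identifier", "PII"},
    "email": {"PII"},
    "signup_date": set(),
    "credit_limit": {"Restricted"},
}

# Assumed tie-break: when a column carries several terms, the stricter action wins.
SEVERITY = {"allow": 0, "mask": 1, "deny": 2}


def action_for(terms: set[str]) -> str:
    """Derive the enforcement action for a column from its glossary terms."""
    actions = [TERM_POLICIES[term] for term in terms if term in TERM_POLICIES]
    return max(actions, key=SEVERITY.__getitem__, default="allow")


column_actions = {column: action_for(terms) for column, terms in COLUMN_TERMS.items()}
print(column_actions)
```

The point of the design is that nobody writes a rule per column: tagging a new column with PII is enough for it to pick up the masking action.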
| Metadata Type | Purpose | Example |
|---|---|---|
| Technical | Engine optimization | time_dt: timestamp, partition_key=true |
| Business | Discovery and meaning | Customer Identifier: PII, owner=Identity Team |
| Operational | Freshness and quality | last_loaded: 2026-05-07T03:00Z, quality_score: 0.987 |
| Lineage | Provenance | derived_from: raw_events.cloudtrail_v2 |
Key Takeaway: Technical metadata is for engines; business metadata is for humans. A platform that captures only one will be either unsearchable or unqueryable.
Lineage and Impact Analysis
Lineage is the directed graph that connects sources to derived datasets. Each edge represents a transformation: an ETL job, a SQL view, a Glue crawler, a Spark step. Lineage answers two complementary questions:
- Provenance (“upstream”): “Where did the values in this dashboard come from?”
- Impact (“downstream”): “If I change this source column, what breaks?”
In AWS, lineage is captured at multiple layers. Lake Formation records source-to-query chains; the Glue Data Catalog stores ETL job references that produced each table; DataZone consolidates lineage into a single visual graph for data products. The integrated chain looks like this:
AWS Services (CloudTrail, VPC Flow, EKS)
-> Lake Formation (ingestion control)
-> Glue Data Catalog (metadata registration)
-> Query Execution (Athena/EMR/Redshift Spectrum)
-> Results & Audit Trail
[Source: https://docs.aws.amazon.com/security-lake/latest/userguide/cloudtrail-query-examples.html]
A practical analogy: lineage is the airline route map. If a storm grounds a hub airport, the impact map shows every connecting flight that will be disrupted. If a passenger asks how their bag traveled, the provenance map shows every leg. Both are projections of the same graph, used by different roles.
Impact analysis is especially valuable when retiring or refactoring a column. Without lineage, deprecating customer_uuid is a roll-the-dice exercise; with lineage, you know the four downstream models and three Looker dashboards that depend on it before you make the change.
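Both lineage questions reduce to graph traversal over the same edge list. The sketch below uses breadth-first search on a small hypothetical lineage graph; the dataset names are invented for illustration and do not come from any AWS service.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, derived dataset).
EDGES = [
    ("raw.customers", "stg.customers"),
    ("stg.customers", "mart.active_customers"),
    ("raw.orders", "stg.orders"),
    ("stg.orders", "mart.active_customers"),
    ("mart.active_customers", "dash.retention"),
]


def adjacency(edges, reverse=False):
    """Build an adjacency map; reverse=True flips every edge."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            source, target = target, source
        graph[source].add(target)
    return graph


def reachable(edges, start, reverse=False):
    """Breadth-first search. reverse=False walks downstream (impact),
    reverse=True walks upstream (provenance)."""
    graph = adjacency(edges, reverse)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


impact = reachable(EDGES, "stg.customers")
provenance = reachable(EDGES, "dash.retention", reverse=True)
print("impact:", sorted(impact))
print("provenance:", sorted(provenance))
```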
Key Takeaway: Lineage gives the data platform memory. Without it, every change is a guess; with it, change becomes a controllable engineering activity.
Lake Formation
AWS Lake Formation layers governance on top of the Glue Data Catalog. Where Glue describes your data, Lake Formation decides who can use it. It introduces a permission model that is simpler, finer-grained, and more portable than raw IAM policies — and it integrates with every Glue-aware query engine.
Permissions vs IAM
Pure IAM policies on S3 prefixes are clumsy for analytics. They cannot express “Alice can see all columns of customers except ssn,” and they cannot scale to hundreds of tables without a thicket of statements. Lake Formation introduces a relational permission model — SELECT, INSERT, ALTER, DROP, DESCRIBE — granted on databases, tables, and columns.
The hand-off works like this. Lake Formation registers an S3 location and takes over its access decisions. When Athena runs a query, it asks Lake Formation for a vended credential scoped to exactly the rows and columns the user is allowed to see. The user’s underlying IAM identity gets just enough permission to call Lake Formation; Lake Formation then does the heavy lifting.
“The Lake Formation data lake administrator must grant SELECT permissions on the relevant databases and tables to the IAM identity that queries the data” [Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples.html]. This grant is the entry point for most analytics workflows.
A three-tier permission model emerges:
| Tier | Mechanism | Example |
|---|---|---|
| Catalog-level | Glue database/table grants | GRANT DESCRIBE ON DATABASE finance_db TO IAM:role/Analyst |
| Column/row-level | Lake Formation filters | Mask ssn; row filter region = 'EU' |
| Principal-based | IAM roles, cross-account | Third-party subscribers via IAM roles |
[Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples.html]
Compared to raw IAM, Lake Formation grants are easier to audit (one row per principal/resource pair), easier to revoke (a single REVOKE), and easier to reason about (no JSON policy gymnastics).
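A simple example shows the column-level filter at work. A customers table contains customer_id, email, address, ssn. In illustrative SQL-style shorthand (in practice, grants are typically issued through the Lake Formation console, the AWS CLI, or the GrantPermissions API), the marketing analyst’s grant looks like: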
GRANT SELECT (customer_id, email)
ON TABLE customers TO IAM:role/MarketingAnalyst;
Even if the analyst writes SELECT *, Athena returns only the two permitted columns. The query engine, not the policy author, enforces the filter at runtime.
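For readers who script their grants, the same column-level grant can be expressed as a request to the Lake Formation GrantPermissions API. The sketch below only builds the request arguments and simulates the effect on a row, so it runs without AWS credentials. The role ARN, the database name crm_db, and the sample row are hypothetical.

```python
def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: list[str]) -> dict:
    """Build keyword arguments for a column-scoped SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def project(row: dict, granted_columns: list[str]) -> dict:
    """Mimic what the analyst gets back from SELECT *: permitted columns only."""
    return {column: row[column] for column in granted_columns if column in row}


request = column_grant_request(
    "arn:aws:iam::111111111111:role/MarketingAnalyst",  # hypothetical ARN
    "crm_db",                                           # hypothetical database
    "customers",
    ["customer_id", "email"],
)
# With credentials configured, the request could be sent with boto3:
#   boto3.client("lakeformation").grant_permissions(**request)

row = {"customer_id": "c-1", "email": "ada@example.com",
       "address": "1 Main St", "ssn": "000-00-0000"}
visible = project(row, request["Resource"]["TableWithColumns"]["ColumnNames"])
print(visible)
```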
Key Takeaway: Lake Formation replaces S3 IAM acrobatics with a database-style permission model. If a permission feels hard to express in IAM, it is probably a one-line Lake Formation grant.
Tag-Based Access Control (LF-TBAC)
Named-resource grants do not scale. Imagine a company with 5,000 tables and 200 roles: that is a million potential cell entries to maintain. LF-TBAC (Lake Formation tag-based access control) inverts the model. Instead of writing grants per resource, you tag resources with attributes and write grants that match those attributes.
An LF-Tag is a key/value pair attached to a database, table, or column. Common patterns include domain=finance, classification=restricted, pii=true. “LF-Tags are attributes attached to Data Catalog resources (databases, tables, and columns) that define permissions based on characteristics rather than individual resource names. Each LF-Tag consists of a key and one or more values” [Source: https://docs.aws.amazon.com/lake-formation/latest/dg/tag-based-access-control.html].
Tags inherit hierarchically: a database tag flows to its tables, a table tag flows to its columns, and lower levels can override. So a database might be tagged classification=internal while a single column inside is tagged classification=restricted.
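The inheritance rule can be modeled as a layered dictionary merge in which the most specific level wins for a given key. This is a simplified mental model, assuming one value per key per resource, rather than a description of how Lake Formation stores tags internally.

```python
def effective_tags(database_tags: dict, table_tags: dict, column_tags: dict) -> dict:
    """Merge tags from database to column; later (lower) levels override."""
    return {**database_tags, **table_tags, **column_tags}


database_tags = {"domain": "customer", "classification": "internal"}
table_tags = {}  # the table adds nothing of its own in this example

effective = {
    "email": effective_tags(database_tags, table_tags, {}),
    "ssn": effective_tags(database_tags, table_tags, {"classification": "restricted"}),
}
print(effective)
```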
A grant is then expressed as an LF-Tag expression, shown here in illustrative shorthand rather than literal Lake Formation syntax:
GRANT SELECT ON
LF-TAG-EXPRESSION (
domain IN ('customer'),
classification IN ('internal')
)
TO IAM:role/MarketingAnalyst;
This means “Marketing can see anything tagged domain=customer AND classification=internal.” If you later add 50 new customer tables, all you do is tag them domain=customer, classification=internal — Marketing automatically gets access. If a new column is sensitive, you tag it classification=restricted and Marketing automatically loses access to that column.
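A tag expression is a conjunction: every key in the expression must match one of its allowed values. The sketch below evaluates that rule against hypothetical per-column tags and also shows the expression in the LFTagPolicy resource shape accepted by the GrantPermissions API. The column names and their tags are assumptions for illustration.

```python
# Grant expression: key -> allowed values, combined with AND across keys.
GRANT_EXPRESSION = {"domain": {"customer"}, "classification": {"internal"}}

# Hypothetical effective tags per column (after inheritance).
COLUMN_TAGS = {
    "customer_id": {"domain": "customer", "classification": "internal"},
    "email": {"domain": "customer", "classification": "internal"},
    "ssn": {"domain": "customer", "classification": "restricted"},
}


def matches(tags: dict[str, str], expression: dict[str, set[str]]) -> bool:
    """True when every expression key has a matching tag value."""
    return all(tags.get(key) in allowed for key, allowed in expression.items())


allowed_columns = [col for col, tags in COLUMN_TAGS.items()
                   if matches(tags, GRANT_EXPRESSION)]
print(allowed_columns)

# The same expression as a GrantPermissions resource.
lf_tag_policy = {
    "LFTagPolicy": {
        "ResourceType": "TABLE",
        "Expression": [
            {"TagKey": key, "TagValues": sorted(values)}
            for key, values in GRANT_EXPRESSION.items()
        ],
    }
}
```

Retagging ssn as classification=internal would add it to the result without touching the grant, which is the whole appeal of the attribute-driven model.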
Figure 11.2: LF-TBAC evaluation flow from query to vended credentials.
flowchart LR
User[IAM:role/MarketingAnalyst] -->|SELECT * FROM customers| Athena
Athena -->|GetTable + Authorize| LF[Lake Formation]
LF -->|Lookup tags on table/columns| Cat[(Glue Data Catalog)]
Cat -->|domain=customer<br/>classification=internal<br/>pii=ssn on column| LF
LF -->|Match grant expression<br/>domain IN customer<br/>classification IN internal| Decide{Tag<br/>match?}
Decide -->|Yes| Vend[Vend scoped<br/>credentials]
Decide -->|No / column tagged pii| Mask[Drop or mask column]
Vend --> Result[Filtered rows + columns]
Mask --> Result
Result --> User
The analogy is a museum’s security badging. You do not list every painting each guard can stand near; you give the guard a badge level (Bronze, Silver, Gold) and tag each painting with a required level. Adding new paintings is just sticking on a tag, not rewriting every guard’s job description. “LF-TBAC is more scalable than named resource methods and requires less permission management overhead, especially with hundreds or thousands of tables” [Source: https://docs.aws.amazon.com/lake-formation/latest/dg/tag-based-access-control.html].
| Granularity | Tag Example | Effect |
|---|---|---|
| Database | domain=finance | All tables inherit |
| Table | dataset=transactions | All columns inherit; a table-level value overrides the database value for the same key |
| Column | pii=ssn | Specific column-level restriction |
LF-TBAC also composes with row filters. A grant expression of domain=customer, classification=internal plus a row filter region = 'EU' gives the EU marketing analyst exactly the slice they need — no more, no less.
Key Takeaway: Tags scale where named resources do not. With LF-TBAC, governance becomes attribute-driven: tag once, grant once, and your permissions follow your data automatically.
Cross-Account and Cross-Region Sharing
Most enterprises are multi-account by design — a Producer Account owns the data, a Consumer Account does the analytics, and a Central Governance Account holds the catalog. Lake Formation makes this model viable by supporting cross-account grants on both named resources and LF-Tags.
The mechanics are straightforward. The producer grants DESCRIBE and ASSOCIATE on LF-Tags to the consumer account ID, then grants resource permissions through the tag expression. “Lake Formation supports granting DESCRIBE and ASSOCIATE permissions on LF-Tags across accounts; granting permissions on Data Catalog resources across accounts using LF-TBAC (principal = AWS account ID)” [Source: https://docs.aws.amazon.com/lake-formation/latest/dg/tag-based-access-control.html]. The consumer’s Lake Formation administrator then sub-grants those permissions to specific roles inside their account.
Producer Account (111111111111)
↳ GRANTS LF-Tag(domain=customer) DESCRIBE/ASSOCIATE
to Consumer Account (222222222222)
↳ GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
to Consumer Account (222222222222)
Consumer Account (222222222222)
↳ GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
to IAM:role/MarketingAnalyst
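The two-hop structure means a role's access depends on both links in the chain. The sketch below models grants as plain records and checks that both exist. It is a deliberately simplified model with hypothetical names: it omits the DESCRIBE/ASSOCIATE grants on the LF-Tags themselves and everything else a real deployment needs.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    grantor: str
    grantee: str
    permission: str
    tags: frozenset  # set of (key, value) pairs forming the tag expression


CUSTOMER_INTERNAL = frozenset({("domain", "customer"), ("classification", "internal")})
PRODUCER, CONSUMER = "111111111111", "222222222222"
ROLE = "IAM:role/MarketingAnalyst"

GRANTS = [
    Grant(PRODUCER, CONSUMER, "SELECT", CUSTOMER_INTERNAL),  # producer -> account
    Grant(CONSUMER, ROLE, "SELECT", CUSTOMER_INTERNAL),      # account admin -> role
]


def role_can_select(grants: list[Grant], role: str, tags: frozenset) -> bool:
    """Access requires the cross-account grant AND the in-account sub-grant."""
    account_granted = Grant(PRODUCER, CONSUMER, "SELECT", tags) in grants
    role_granted = Grant(CONSUMER, role, "SELECT", tags) in grants
    return account_granted and role_granted


print(role_can_select(GRANTS, ROLE, CUSTOMER_INTERNAL))
```

If the producer revokes the account-level grant, the role loses access even though its sub-grant record is untouched in this model.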
Cross-region works similarly through Lake Formation cross-region resource shares, often combined with Glue Data Catalog cross-region replication for low-latency reads. The key insight is that only the catalog metadata crosses boundaries — the underlying S3 data stays put — so consumers see a federated view rather than copies of the data.
Real-world example: a security operations team in us-west-2 queries a Security Lake catalog hosted in us-east-1. They never copy CloudTrail data. They consume metadata, then query in place. “You can also create third-party subscribers in the Security Lake console, API, or AWS CLI. Third-party subscribers can also query Lake Formation data from the sources that you specify” [Source: https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-query-examples.html].
Figure 11.3: Cross-account governance topology — metadata crosses, S3 data stays put.
graph TD
subgraph Producer["Producer Account 111111111111"]
PS[(S3 Data Lake)]
PG[Glue Catalog]
PLF[Lake Formation Admin]
end
subgraph Central["Central Governance Account"]
CC[Shared Catalog View]
Tags[LF-Tags<br/>domain, classification]
end
subgraph Consumer["Consumer Account 222222222222"]
CLF[Lake Formation Admin]
CRole[IAM:role/MarketingAnalyst]
CAthena[Athena Workgroup]
end
PG -->|register| Tags
PLF -->|GRANT DESCRIBE/ASSOCIATE<br/>on LF-Tag| CC
PLF -->|GRANT SELECT<br/>on tag expression| CLF
CLF -->|sub-grant| CRole
CRole -->|query| CAthena
CAthena -.->|read in place<br/>vended credentials| PS
style PS fill:#1f3a5f,stroke:#58a6ff
style Tags fill:#1f3a5f,stroke:#58a6ff
Key Takeaway: Cross-account sharing in Lake Formation is metadata-first. Data does not move; access does. This is what makes a real data mesh feasible at AWS scale.
Data Mesh with DataZone
A data mesh is an organizational and technical pattern that decentralizes data ownership. Instead of one central team controlling a monolithic warehouse, individual business domains (Customer, Finance, Logistics) own and publish their data as data products, while a central platform team supplies the governance, discovery, and policy machinery. Amazon DataZone is AWS’s managed implementation of this pattern.
Domains and Data Products
DataZone organizes the world into domains and projects. A domain is a top-level container, typically deployed in a central governance account, that holds the catalog, the glossary, the projects, and the policy rules. A project is a use-case workspace inside a domain — a place where a small group collaborates with a defined set of tools (Athena, Redshift, SageMaker) and data assets.
“Domains are top-level containers hosting catalogs, projects, and governance rules. Producer/consumer AWS accounts are associated with domains (e.g., central governance account hosts the domain and data portal)” [Source: https://docs.aws.amazon.com/datazone/latest/userguide/datazone-concepts.html].
A data product is the unit of consumption. Concretely, it is a curated bundle of one or more assets — a Glue table, a Redshift view, an S3 prefix — wrapped in business metadata: name, description, owner, sensitivity, lineage, freshness, glossary terms, sample queries. A consumer browsing the catalog does not see “table glue_db_x.t_098”; they see “Active Customers (Daily) — owned by Customer Identity, refreshed at 03:00 UTC, certified, contains PII.”
| DataZone Concept | What it is | Example |
|---|---|---|
| Domain | Org-wide container | acme-corp |
| Project (Producer) | Publishes data | Customer Identity Team |
| Project (Consumer) | Subscribes to data | Marketing Analytics |
| Asset | Underlying technical object | Glue table customers_v3 |
| Data Product | Curated, published bundle | Active Customers (Daily) |
Think of data products as books in a public library. The library (DataZone) does not own the books; it lets independent publishers (domain teams) shelve them with consistent metadata (Dewey decimal, author, summary) so any reader (consumer) can find and check them out.
Key Takeaway: A data mesh is not a tool — it is a contract that domains own their data products and a platform supplies the governance fabric. DataZone is the fabric.
Publish/Subscribe Model
DataZone implements a producer/consumer workflow that mirrors how application teams consume APIs:
- Producers in a project create assets — often by attaching a Glue crawler that catalogs S3 data — add business metadata, glossary terms, and sample queries, and publish the assets as a data product to the domain catalog [Source: https://docs.aws.amazon.com/datazone/latest/userguide/datazone-concepts.html].
- Consumers discover the data product via search powered by metadata and glossary terms.
- The consumer subscribes on behalf of their project, attaching a justification (purpose, intended use, retention).
- The data owner receives a notification in the data portal and approves or rejects the request.
- On approval, fulfillment workflows automatically grant access. For Glue tables, DataZone calls Lake Formation to create the appropriate grants. For Redshift, it adjusts data sharing. For non-native assets, DataZone publishes an EventBridge event so custom fulfillment can take over [Source: https://aws.amazon.com/blogs/big-data/unlock-data-across-organizational-boundaries-using-amazon-datazone-now-generally-available/].
- The consumer’s project environment is automatically wired so analysts can query immediately via Athena or Redshift, with no manual IAM or Lake Formation configuration.
[Producer Project] [Domain Catalog] [Consumer Project]
curate assets discover
| |
|---- publish data product ---> |
| <---- search/browse -------- |
| |
| <---- subscription request --- |
approve |
|---- fulfill: LF grant + env wiring --------------------> |
query
The analogy: DataZone treats data like an internal app store. Producers ship versioned products, consumers request installs, owners approve, and the OS (Lake Formation) installs the permissions. Nobody emails JSON policies anymore.
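The subscription lifecycle behaves like a small state machine. The sketch below models the states and legal transitions described in the steps above. The class, state names, and transition table are hypothetical and are not the DataZone API.

```python
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"
    FULFILLED = "fulfilled"


# Legal moves: the owner decides, then fulfillment wires access.
TRANSITIONS = {
    State.REQUESTED: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.FULFILLED},
    State.REJECTED: set(),
    State.FULFILLED: set(),
}


class Subscription:
    def __init__(self, product: str, project: str, justification: str):
        if not justification.strip():
            raise ValueError("a justification is required to subscribe")
        self.product = product
        self.project = project
        self.justification = justification
        self.state = State.REQUESTED
        self.history = [State.REQUESTED]

    def allowed(self, new_state: State) -> bool:
        return new_state in TRANSITIONS[self.state]

    def advance(self, new_state: State) -> None:
        if not self.allowed(new_state):
            raise ValueError(f"cannot move from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.history.append(new_state)


sub = Subscription("Active Customers (Daily)", "Marketing Analytics",
                   "campaign audience sizing")
sub.advance(State.APPROVED)   # data owner approves in the portal
sub.advance(State.FULFILLED)  # fulfillment creates the grants
print([state.value for state in sub.history])
```

Keeping the full history on the object is what makes the workflow auditable: every subscription carries its own trail of who asked, why, and what was decided.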
Figure 11.4: DataZone publish/subscribe lifecycle as a sequence of producer, domain, owner, and consumer interactions.
sequenceDiagram
participant P as Producer Project
participant D as Domain Catalog
participant O as Data Owner
participant LF as Lake Formation
participant C as Consumer Project
P->>P: Curate asset + add glossary terms
P->>D: Publish data product
C->>D: Search / browse catalog
D-->>C: Discover data product
C->>D: Subscribe (with justification)
D->>O: Notify approval request
O->>D: Approve subscription
D->>LF: Fulfillment: create grants
LF-->>C: Wire env (Athena/Redshift)
C->>LF: Run query
LF-->>C: Return governed results
Figure 11.5: Lineage flow from raw producer sources through governed transformations to consumer outputs.
flowchart LR
A[CloudTrail / VPC Flow / EKS] --> B[Lake Formation<br/>Ingestion Control]
B --> C[(Glue Data Catalog<br/>Metadata Registration)]
C --> D{Query Engines}
D --> D1[Athena]
D --> D2[Redshift Spectrum]
D --> D3[EMR / Spark]
D1 --> E[Curated Data Product]
D2 --> E
D3 --> E
E --> F[DataZone Lineage View]
E --> G[Dashboards / ML]
F -.upstream provenance.-> A
G -.downstream impact.-> E
Key Takeaway: Publish/subscribe converts ad-hoc access requests into a versioned, auditable, automated workflow. The producer never sees a Lake Formation console; the consumer never sees an IAM role ARN.
SageMaker Catalog Integration
In late 2024, AWS unified DataZone with SageMaker under the SageMaker Catalog umbrella. This integration lets approved DataZone subscriptions appear directly in SageMaker Studio, where data scientists can query them via Athena, load them into Spark sessions, or pipe them into training jobs — all without manual data movement and with end-to-end governance preserved.
“DataZone integration with SageMaker / analytics: approved subscriptions automatically wire Lake Formation grants into the consumer project’s environments so users can query immediately via Athena, Redshift, QuickSight, and SageMaker, supporting end-to-end self-service analytics and ML” [Source: https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html].
A practical scenario:
- A data scientist in the Churn Modeling project searches the catalog for “active customers.”
- They subscribe to the Active Customers (Daily) data product, citing “customer churn prediction.”
- The Identity team owner approves overnight.
- The next morning, the scientist opens SageMaker Studio. The dataset is already accessible via the Athena workgroup pre-wired into their project: no IAM tickets, no S3 paths to memorize.
- Lake Formation enforces column masking: the ssn column is hidden, the email column is hashed.
- Every query is logged in CloudTrail and visible in the DataZone lineage view.
This loop turns governance from a roadblock into a runway. The scientist gets data faster because governance is automated, not despite it.
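The effect of the masking step in this scenario can be imitated in a few lines. The sketch illustrates the outcome the scientist observes (one column hidden, one hashed), not the mechanism the governance layer uses to produce it. The policy table and sample row are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib

# Hypothetical masking policy: column -> action.
MASKING_POLICY = {"ssn": "hide", "email": "hash"}


def mask_row(row: dict, policy: dict) -> dict:
    """Return the row as a governed consumer would see it."""
    masked = {}
    for column, value in row.items():
        action = policy.get(column, "pass")
        if action == "hide":
            continue  # the column is dropped from the result entirely
        if action == "hash":
            value = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        masked[column] = value
    return masked


row = {"customer_id": "c-1", "email": "ada@example.com", "ssn": "000-00-0000"}
masked = mask_row(row, MASKING_POLICY)
print(masked)
```

Hashing rather than hiding the email keeps it usable as a join key across datasets while withholding the address itself.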
Key Takeaway: When DataZone, Lake Formation, and SageMaker Catalog work together, governance becomes invisible: consumers receive permissions as a side effect of subscription, and producers retain full control without writing a single grant by hand.
Chapter Summary
Governance is the layer that makes a modern data platform trustworthy, compliant, and discoverable. We started with why governance matters: regulatory obligations (GDPR, HIPAA, SOC 2), the trust analysts place in published numbers, and the operational cost of duplicated, undiscoverable data. We saw that governance has three interlocking concerns — inventory, policy, and audit — each with a specific AWS service.
The AWS Glue Data Catalog is the technical metadata backbone, holding databases, tables, schemas, partitions, and storage descriptors that every Glue-aware query engine consumes. Amazon DataZone layers business metadata — glossaries, owners, descriptions, classifications — on top, so humans can find data by meaning rather than by table name. Lineage stitches both layers into a graph that supports both upstream provenance (where did this come from?) and downstream impact analysis (what breaks if I change this?).
AWS Lake Formation transforms the catalog into a governed resource. It replaces tangled IAM policies on S3 with a database-style grant model that supports column masking, row filters, cross-account sharing, and LF-TBAC (tag-based access control). LF-TBAC is the scalability lever: tag your data once with attributes like domain and classification, and grants follow your data automatically as it grows.
Finally, Amazon DataZone orchestrates a data mesh by organizing the platform into domains, projects, and data products with a publish/subscribe workflow. Producers publish curated data products; consumers subscribe via the portal; fulfillment automatically wires Lake Formation grants and environments. With SageMaker Catalog integration, those grants extend cleanly into ML workflows without manual configuration.
The central lesson of this chapter: governance is not a barrier to analytics — it is the substrate that makes self-service analytics safe at enterprise scale. When the catalog, the policy engine, the lineage graph, and the publishing workflow agree, your data platform stops being a liability and starts being an asset.
Key Terms
- AWS Glue Data Catalog — A managed Hive-compatible metastore that stores technical metadata (databases, tables, columns, partitions, storage formats) for every Glue-aware query engine on AWS, including Athena, Redshift Spectrum, and EMR. The single source of truth for schema and storage layout.
- Lake Formation — AWS’s governance service that adds permissions, audit, and lifecycle controls on top of the Glue Data Catalog. Supports column masking, row filtering, cross-account sharing, and tag-based access control.
- Amazon DataZone — AWS’s enterprise data catalog and data mesh platform that organizes data into domains and projects, hosts business glossaries, and supports a publish/subscribe workflow for data products with automatic Lake Formation fulfillment.
- Data mesh — An organizational and technical pattern that decentralizes data ownership to business domains while a central platform team supplies governance, discovery, and policy fabric. Producers own data products; consumers subscribe through a self-service catalog.
- Lineage — The directed graph of how data flows from source systems through transformations to final consumers. Used for provenance (“where did this come from?”) and impact analysis (“what breaks if I change this?”).
- LF-TBAC — Lake Formation tag-based access control. Permissions granted via expressions over LF-Tags (key/value attributes attached to databases, tables, or columns), enabling attribute-driven, scalable governance over thousands of resources.
- Data product — In DataZone, a curated, published bundle of one or more assets enriched with business metadata (owner, description, sensitivity, lineage, glossary terms). The unit of publishing and subscription in a data mesh.
- SageMaker Catalog — The unified catalog experience that connects DataZone to SageMaker Studio, allowing approved data product subscriptions to appear automatically in ML environments with Lake Formation grants already wired in.
Chapter 12: Search, Logs, and Observability with OpenSearch
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how Amazon OpenSearch Service indexes documents using inverted indexes, shards, and replicas to enable fast search across log corpora.
- Build a log ingestion pipeline using Amazon OpenSearch Ingestion (the managed Data Prepper service), including sources, processors, and sinks defined in YAML.
- Use OpenSearch Dashboards to explore logs, configure alerts, detect anomalies, and analyze distributed traces for application performance monitoring.
- Apply hot, UltraWarm, and cold tiering through Index State Management (ISM) policies to dramatically reduce log retention costs while preserving queryability.
Every Lambda function, Glue job, Kinesis consumer, and microservice produces operational telemetry (logs, metrics, and traces) that must be searched, alerted on, and retained. This chapter examines the search and observability stack built around Amazon OpenSearch Service, the AWS-managed service for OpenSearch (the open-source fork of Elasticsearch and Kibana), and shows how to control the cost curve as log volume grows from megabytes to petabytes.
OpenSearch Fundamentals
OpenSearch is a distributed search and analytics engine optimized for full-text search, structured filtering, aggregations, and time-series log analytics. Amazon OpenSearch Service is the managed offering: AWS provisions the cluster, applies patches, manages snapshots, and exposes the API behind a domain endpoint. To use it effectively for log analytics, you need a clear mental model of how documents flow into shards and how shards distribute across nodes.
Inverted Indexes and Shards
The core data structure inside OpenSearch is the inverted index. A normal (forward) index maps a document ID to its text content; an inverted index flips that relationship and maps each term to the list of document IDs that contain it. Consider two log lines: “Beauty is in the eye of the beholder” and “Beauty and the beast” [Source: https://docs.opensearch.org/latest/getting-started/intro/]. The inverted index stores entries like beauty -> [1, 2], beholder -> [1], and beast -> [2]. When you query for “beauty”, OpenSearch performs a single dictionary lookup rather than scanning every document. This is what makes a search across billions of log lines feel interactive.
An analogy: the index at the back of a textbook. To find every page mentioning “shard,” you look up the term and jump to the listed pages rather than reading the whole book. The inverted index is that book index, automatically maintained with additional structures for term frequency, position, and ranking.
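A toy inverted index for the two example documents can be built in a few lines. This is a minimal sketch: the tokenizer simply lowercases and splits on non-alphanumeric characters, which is an assumption, and a real analyzer also tracks term frequencies and positions.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Beauty is in the eye of the beholder",
    2: "Beauty and the beast",
}
index = build_inverted_index(docs)
print(index["beauty"], index["beholder"], index["beast"])  # [1, 2] [1] [2]
```

A query for "beauty" is now a single dictionary lookup, independent of how many documents were indexed.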
Documents in OpenSearch are JSON objects, indexes are collections of documents, and the mapping defines field types (text, keyword, date, integer, geo_point). The vocabulary maps cleanly onto a relational mental model:
| Relational Concept | OpenSearch Concept |
|---|---|
| Database | Cluster |
| Table | Index |
| Row | Document |
| Column | Field |
| Schema | Mapping |
| Primary Key | _id field |
[Source: https://docs.opensearch.org/latest/getting-started/intro/]
A single index quickly outgrows a single machine, so OpenSearch partitions each index into shards. A shard is an independent Lucene index: a self-contained inverted index plus the documents that produced it. Indexes have primary shards (the authoritative copies of partitioned data) and replica shards (full copies of primaries, placed on different nodes for fault tolerance and read scaling). Open-source OpenSearch defaults to 1 primary and 1 replica per index, while Amazon OpenSearch Service defaults to 5 primaries and 1 replica; many production log clusters set the primary count explicitly (often 3-5 per index) based on expected size [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html].
Routing works as follows. When you index a document, OpenSearch hashes the document ID (or a custom routing key) modulo the primary shard count, picks a primary shard, writes the document, then replicates to each replica. With 5 primaries and 1 replica, a single document is written to 2 shard copies, and a bulk request whose documents hash to every primary touches all 10. A search request, by contrast, goes to either the primary or a replica copy of each shard, so only 5 shard copies are queried per request, in parallel, and OpenSearch coordinates the result merge [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html].
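The routing rule is easy to sketch. The example below uses zlib.crc32 as a stand-in hash so the result is deterministic; this is an assumption for illustration, since OpenSearch uses its own hash function internally. The document IDs are hypothetical.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a primary shard number in [0, num_primary_shards)."""
    # crc32 is a stand-in; the real engine uses a different hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def shard_copies_written(doc_ids: list[str], primaries: int, replicas: int) -> int:
    """Count shard copies (each primary plus its replicas) touched by a write."""
    touched_primaries = {shard_for(doc_id, primaries) for doc_id in doc_ids}
    return len(touched_primaries) * (1 + replicas)


single = shard_copies_written(["doc-42"], primaries=5, replicas=1)
bulk = shard_copies_written([f"doc-{i}" for i in range(1000)], primaries=5, replicas=1)
print(f"single document: {single} copies, bulk request: {bulk} copies")
```

One document always lands on exactly one primary and its replicas, while a large bulk request fans out across as many primaries as its document IDs hash to.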
The crucial constraint: primary shard count is immutable after index creation. You cannot simply add primary shards to a running index without reindexing into a new index or using the split index API on a read-only copy [Source: https://aws.amazon.com/blogs/big-data/patterns-for-updating-amazon-opensearch-service-index-settings-and-mappings/]. Replica count, however, is dynamic — you can scale replicas up before a known traffic spike and back down afterwards.
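The routing rule also explains why the primary count cannot change in place. The sketch below is an illustration under stated assumptions: zlib.crc32 stands in for the hash function OpenSearch uses internally, and the document IDs are made up. It counts how many documents would map to a different shard if the primary count went from 5 to 6.

```python
import zlib


def pick_shard(doc_id: str, num_primaries: int) -> int:
    """Return the primary shard number for a document ID."""
    # Stand-in hash: crc32 is used only because it is deterministic and in the
    # standard library. It is NOT the hash OpenSearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primaries


doc_ids = [f"log-{i}" for i in range(1000)]  # hypothetical document IDs
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Documents whose shard assignment changes when the modulus changes.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because most documents would land on a different shard under the new modulus, lookups by ID would miss data that was written under the old one. Reindexing (or a split, which multiplies the count by a whole factor) rewrites the data so that placement and routing agree again.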
Key Takeaway: OpenSearch uses inverted indexes to make text search fast, and it scales horizontally by partitioning each index into immutable primary shards plus dynamic replicas. Plan primary shard count carefully at index-creation time because changing it later requires a reindex.
Cluster, Node, and Replica Concepts
A cluster is the unit of management in OpenSearch — a group of nodes that share the same cluster.name and coordinate to host indexes. Within the cluster, nodes play one or more roles. Dedicated cluster manager nodes (formerly called master nodes) maintain cluster state — which indexes exist, where shards live, which nodes are healthy. Data nodes hold shards and serve indexing and search traffic. Coordinator nodes (sometimes split out for large clusters) accept client requests, fan them out to data nodes, and merge responses. Amazon OpenSearch Service abstracts most of this — you choose instance types and counts, and AWS picks the role assignments — but understanding the roles helps you reason about failure modes.
Replicas serve two purposes. First, they are a high-availability mechanism: if a node holding primary shard 0 fails, the cluster manager promotes one of its replicas (which lives on a different node by design) to primary, and indexing continues. Second, replicas scale read throughput because search queries can be served by either the primary or any replica. With segment replication on O-series instances, writes wait for replica acknowledgements by default, which provides strong durability but couples primary and replica latency [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html].
Shard sizing has practical bounds. AWS recommends 10-30 GB per shard for search-dominated workloads and up to 50 GB for log workloads where queries scan large time ranges [Source: https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-amazon-opensearch-service-domains/]. Each open shard consumes file handles, JVM heap, and cluster-state metadata, so AWS suggests fewer than 20 shards per GB of heap, or roughly fewer than 1,000 shards per node depending on instance size [Source: https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-how-many-shards-do-i-need/]. Shard count is a Goldilocks problem: too few and individual shards bloat past 50 GB and slow down; too many and the cluster manager spends all its time tracking metadata.
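The sizing guidance above reduces to simple arithmetic. The following sketch assumes a hypothetical 120 GB daily index and a node with 32 GB of JVM heap; the function names and those inputs are illustrative, while the 20-shards-per-GB figure is the guideline quoted in the text.

```python
import math


def primaries_for(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest primary count that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_for_heap(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Upper bound on shards for a node, using the shards-per-GB-of-heap guideline."""
    return int(heap_gb * shards_per_heap_gb)


daily_index_gb = 120  # hypothetical daily log volume
primaries = primaries_for(daily_index_gb, target_shard_gb=40)
shard_copies = primaries * (1 + 1)  # primaries plus one replica of each
node_budget = max_shards_for_heap(32)  # hypothetical 32 GB heap

print(primaries, shard_copies, node_budget)
```

Multiplying the per-day shard copies by the number of days kept in the hot tier, then dividing by the node count, gives a quick check of whether a retention plan stays inside the per-node budget.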
For log workloads, the best practice is time-based rolling indexes managed by an index template. Daily indexes named logs-2026.05.07, logs-2026.05.08 let you delete old indexes wholesale (cheap) instead of deleting documents from a huge single index (expensive). An index template applies settings automatically when a matching index is created:
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }
  }
}
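The benefit of date-named indexes is that retention becomes a matter of comparing names to a cutoff. The sketch below is a hedged illustration: the helper names are hypothetical, and it only computes which index names have aged out, leaving the actual delete call to whichever client or policy engine you use.

```python
from datetime import date, datetime, timedelta


def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build a daily index name such as logs-2026.05.07."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indexes(names, today: date, retention_days: int, prefix: str = "logs"):
    """Return the daily indexes whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            day = datetime.strptime(name, f"{prefix}-%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index with this prefix; leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["logs-2026.05.07", "logs-2026.05.08", "logs-2026.04.01", "other-index"]
print(expired_indexes(names, today=date(2026, 5, 8), retention_days=30))
```

Dropping a whole index removes its shards in one metadata operation, whereas deleting documents inside one large index only marks them as deleted until segments merge.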
The shard topology then looks like this for a typical 6-data-node cluster:
Index logs-2026.05.07 (3 primaries, 1 replica = 6 shards)
Node A Node B Node C Node D Node E Node F
+------+ +------+ +------+ +------+ +------+ +------+
| P0 | | P1 | | P2 | | R0 | | R1 | | R2 |
+------+ +------+ +------+ +------+ +------+ +------+
(primary 0) (primary 1) (primary 2) (replica 0) (replica 1) (replica 2)
Rule: a primary and its replica never share a node.
If Node A fails, the cluster manager promotes R0 on Node D to primary, then schedules a new replica copy on a remaining node to restore the redundancy invariant.
Figure 12.1: Primary and replica shard topology across a 6-data-node cluster
flowchart TB
subgraph Cluster["OpenSearch Cluster: logs-2026.05.07"]
direction LR
subgraph Primaries["Primary Shards"]
direction LR
NA["Node A<br/>P0"]
NB["Node B<br/>P1"]
NC["Node C<br/>P2"]
end
subgraph Replicas["Replica Shards"]
direction LR
ND["Node D<br/>R0"]
NE["Node E<br/>R1"]
NF["Node F<br/>R2"]
end
end
NA -. replicates to .-> ND
NB -. replicates to .-> NE
NC -. replicates to .-> NF
Client["Indexing<br/>Request"] -->|"hash(_id) mod 3"| NA
Client -->|"hash(_id) mod 3"| NB
Client -->|"hash(_id) mod 3"| NC
Search["Search<br/>Request"] --> NA
Search --> NE
Search --> NF
Key Takeaway: A cluster is a set of cooperating nodes; data nodes hold shards, cluster manager nodes track state. Replicas live on different nodes than their primaries to provide failover and read scaling. For log workloads, use time-rolling indexes with templates and target shards in the 10-50 GB range with fewer than ~20 shards per GB of JVM heap on each node.
OpenSearch vs. Elasticsearch Fork History
OpenSearch began life as Elasticsearch and Kibana, the open-source search stack from Elastic NV. In January 2021, Elastic changed the licensing from Apache 2.0 to a dual SSPL/Elastic License model that restricted use by certain managed-service providers. AWS forked the last Apache-2.0 versions and renamed them OpenSearch and OpenSearch Dashboards. OpenSearch 1.0 GA shipped in mid-2021, and the project is governed by an open-source community under the Linux Foundation [Source: https://docs.opensearch.org/latest/getting-started/intro/].
Practical implications:
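- The query DSL, document model, REST API, and most plugins are compatible with the Elasticsearch 7.10 era. Client libraries from that era work against either engine with minor changes; Elasticsearch clients from 7.14 onward perform a product check and reject OpenSearch clusters, so new code should use the OpenSearch client libraries.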
- The codebases have diverged since 2021. Elasticsearch added ESQL and proprietary features; OpenSearch added neural search, ML Commons, security analytics, and observability plugins independently.
- Amazon OpenSearch Service supports legacy Elasticsearch (up to 7.10) and modern OpenSearch (1.x, 2.x). New deployments should choose OpenSearch unless a legacy app pins a specific Elasticsearch version.
- OpenSearch remains fully Apache-2.0; Elastic added AGPLv3 as a further license option for Elasticsearch in 2024, alongside SSPL and the Elastic License [Source: https://cloudchipr.com/blog/aws-opensearch].
Key Takeaway: OpenSearch is the Apache-2.0 fork of Elasticsearch 7.10 maintained by AWS and a broader community. It is API-compatible with that era of Elasticsearch but has evolved independently since 2021. New AWS deployments should use OpenSearch.
Log Ingestion
Production log pipelines need backpressure, retries, parsing, enrichment, and routing. AWS offers three mechanisms to feed OpenSearch: Amazon OpenSearch Ingestion (managed Data Prepper), self-hosted Data Prepper, and Amazon Data Firehose.
OpenSearch Ingestion Service
Amazon OpenSearch Ingestion (OSIS) is the AWS-managed Data Prepper service. You upload a YAML pipeline definition, choose a capacity range in OpenSearch Compute Units (OCUs), and AWS runs the pipeline as a serverless service that auto-scales between your minimum and maximum OCU bounds [Source: https://docs.aws.amazon.com/aws/opensearch-service/latest/developerguide/ingestion-process.html].
A pipeline has four logical stages. The source is where data enters: HTTP push endpoints, S3 buckets, OpenTelemetry trace/metric/log endpoints, Kafka, or existing OpenSearch indexes for migrations [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-features-overview.html]. A buffer holds events between stages; OSIS supports in-memory (default) and persistent buffering for at-least-once durability. Processors transform events: grok parses unstructured log lines using regex patterns, date parses timestamps, the mutate-event processors (such as add_entries, delete_entries, and rename_keys) add, remove, or rename fields, otel_trace_raw flattens spans, service_map_stateful builds service graphs, and conditional routing forks events. The sink writes to OpenSearch, OpenSearch Serverless, S3, or another pipeline [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html].
Capacity is measured in OCUs. Stateless pipelines scale to 96 OCUs (384 with persistent buffering); stateful pipelines top out at 48 OCUs (192 with buffering) [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-features-overview.html]. One OCU handles a few thousand events per second of typical log data.
A worked example: ingest application logs from S3, parse with grok, ship to OpenSearch.
log-pipeline:
  source:
    s3:
      acknowledgments: true
      notification_type: sqs
      compression: gzip
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/log-events"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/osis-pipeline-role"
  processor:
    - grok:
        match:
          message: ['%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}']
    - date:
        match:
          - key: timestamp
            patterns: ["ISO8601"]
        destination: "@timestamp"
  sink:
    - opensearch:
        hosts: ["https://search-prod-abc123.us-east-1.es.amazonaws.com"]
        index: "logs-app-%{yyyy.MM.dd}"
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/osis-pipeline-role"
          region: "us-east-1"
[Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html]
OSIS assumes the sts_role_arn to read from SQS/S3 and write to OpenSearch — the role must be the same across all sinks in a pipeline [Source: https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sinks/opensearch/]. The index pattern logs-app-%{yyyy.MM.dd} produces daily rolling indexes; OSIS creates them on demand.
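To see what the grok and date stages do to a single event, the sketch below approximates them in plain Python. The regular expression is a hand-written approximation of the TIMESTAMP_ISO8601, LOGLEVEL, and GREEDYDATA grok patterns, not their exact definitions, and the _index key is an illustrative stand-in for the sink's date-pattern index naming.

```python
import re
from datetime import datetime

# Approximation of: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|WARN|WARNING|ERROR|FATAL)"
    r" (?P<msg>.*)$"
)


def parse_line(line: str):
    """Return a structured event for a matching log line, or None if it does not parse."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Equivalent of the date stage: parse the captured string into a real timestamp.
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    event["@timestamp"] = ts.isoformat()
    # Equivalent of the sink's logs-app-%{yyyy.MM.dd} naming (illustrative key).
    event["_index"] = f"logs-app-{ts:%Y.%m.%d}"
    return event


event = parse_line("2026-05-07T12:34:56Z ERROR payment failed")
print(event["level"], event["_index"])
```

Lines that do not match return None here; in a real pipeline you would decide whether such events are dropped, tagged, or routed to a separate index.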
Figure 12.2: Data Prepper / OSIS source-buffer-processor-sink pipeline structure
flowchart LR
subgraph Sources["Source Plugins"]
S3["S3 + SQS"]
HTTP["HTTP Push"]
OTEL["OTel Endpoint"]
KAFKA["Kafka"]
end
Buffer["Buffer<br/>(in-memory or<br/>persistent)"]
subgraph Processors["Processor Chain"]
direction LR
P1["grok<br/>(parse log lines)"] --> P2["date<br/>(parse @timestamp)"] --> P3["mutate<br/>(enrich fields)"] --> P4["conditional<br/>routing"]
end
subgraph Sinks["Sink Plugins"]
OS["OpenSearch<br/>Domain"]
OSS["OpenSearch<br/>Serverless"]
S3OUT["S3 Archive"]
PIPE["Another<br/>Pipeline"]
end
Sources --> Buffer --> Processors --> Sinks
Key Takeaway: OpenSearch Ingestion is a managed Data Prepper that runs YAML-defined source/processor/sink pipelines on auto-scaling OCUs. It eliminates the need to operate Logstash, Fluentd, or self-hosted Data Prepper for most AWS-resident log workloads.
Data Prepper Pipelines
Data Prepper is the open-source upstream that powers OSIS. Run it yourself for on-premises sources, custom Java processors, or colocation with another tool. The YAML syntax is identical to OSIS, easing migration in either direction.
Self-hosted Data Prepper runs as a JVM container or pod and supports the same source, processor, and sink plugins as OSIS [Source: https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/README.md]. A pipeline can chain into another via the pipeline source/sink — useful for fan-in deduplication followed by fan-out routing. Up to 10 sub-pipelines can be chained per file [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html].
A common pattern: stage one ingests raw OpenTelemetry traces, stage two computes a service map and ships raw spans and graph aggregates to separate indexes [Source: https://aws.amazon.com/blogs/big-data/top-strategies-for-high-volume-tracing-with-amazon-opensearch-ingestion/].
otel-trace-pipeline:
  source:
    otel_trace_source:
      ssl: false
  processor:
    - otel_trace_raw:
  sink:
    - pipeline:
        name: "raw-trace-pipeline"
    - pipeline:
        name: "service-map-pipeline"
raw-trace-pipeline:
  source:
    pipeline:
      name: "otel-trace-pipeline"
  sink:
    - opensearch:
        hosts: ["https://search-prod-abc123.us-east-1.es.amazonaws.com"]
        index_type: "trace-analytics-raw"
service-map-pipeline:
  source:
    pipeline:
      name: "otel-trace-pipeline"
  processor:
    - service_map_stateful:
  sink:
    - opensearch:
        hosts: ["https://search-prod-abc123.us-east-1.es.amazonaws.com"]
        index_type: "trace-analytics-service-map"
This pipeline-of-pipelines pattern is also how OSIS supports complex topologies inside a single pipeline definition.
Key Takeaway: Data Prepper is the open-source pipeline engine — same YAML, same plugins as OSIS — that you self-host when you need on-premises ingestion, custom processors, or hybrid networking. Sub-pipelines let you fan out and aggregate.
Firehose-to-OpenSearch Delivery
When logs already flow through Amazon Data Firehose (Chapter 5), you can deliver them straight to OpenSearch without a separate pipeline. A Firehose delivery stream buffers records by size or time, optionally invokes a Lambda transformation, and bulk-indexes the result. This is the lowest-friction option when producers already write to Firehose for S3 archival, when you do not need stateful processing, and when you want managed delivery without OCU pipeline ops.
Firehose-to-OpenSearch supports VPC and public domains, plus Serverless collections. It supports automatic index rotation by hour, day, week, or month — logs becomes logs-2026-05-07 for daily rotation. Records that fail after retries are written to an S3 backup bucket for replay.
Trade-off versus OSIS: Firehose is simpler but less expressive. Grok parsing requires a Lambda transformer; conditional routing is unavailable (one stream maps to one index pattern). For pass-through, Firehose wins; for parsing and fan-out, OSIS wins.
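For the case where parsing is done in a Lambda transformer, a minimal handler might look like the sketch below. The record envelope (recordId, result, base64-encoded data) follows the Firehose data-transformation contract; the regular expression and the choice to mark unparseable lines as ProcessingFailed are assumptions made for illustration.

```python
import base64
import json
import re

# Simplified pattern: timestamp token, uppercase level, free-text message.
LINE_RE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def lambda_handler(event, context):
    """Turn raw log lines into JSON documents for Firehose to bulk-index."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8").rstrip("\n")
        match = LINE_RE.match(raw)
        if match is None:
            # Return the original payload and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        doc = json.dumps(match.groupdict()).encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(doc).decode("ascii"),
        })
    return {"records": output}
```

Every input recordId must appear exactly once in the response, which is why the failure branch still emits a record instead of skipping it.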
Decision matrix:
| Need | Choose |
|---|---|
| Pass-through indexing of structured records | Firehose |
| Grok parsing of unstructured logs | OSIS or Data Prepper |
| Distributed-trace ingestion (OTel) | OSIS or Data Prepper |
| On-premises collector reachability | Self-hosted Data Prepper |
| At-least-once with persistent buffering | OSIS with persistent buffering |
| Single AWS-resident, simple, cheap path | Firehose |
Key Takeaway: Use Firehose for simple structured log delivery, OpenSearch Ingestion for parsing and routing, and self-hosted Data Prepper when you need on-premises networking or custom code. All three converge on the same OpenSearch domain.
Visualization and Alerting
Indexed logs are inert until something queries them. OpenSearch ships OpenSearch Dashboards for ad-hoc exploration plus alerting, anomaly detection, and trace-analytics plugins for production observability.
OpenSearch Dashboards
OpenSearch Dashboards is the Kibana fork, renamed in the 2021 split. It runs as a Node.js application that talks to an OpenSearch domain and provides a browser-based UI for several core workflows.
The Discover view is the day-to-day SRE workspace. You select an index pattern (e.g., logs-app-*), pick a time range, and run queries in Dashboards Query Language (DQL), Lucene syntax, or, where the query tooling is enabled, the pipe-based Piped Processing Language (PPL). Discover shows raw documents, a histogram of event counts, and per-field breakdowns. For incident response, the most useful habit is time-range comparison: shift the time picker from “what is traffic now” to “what was traffic at this time last week” and compare the two histograms.
Visualize builds line, bar, heatmap, geo-map, and tag-cloud charts against aggregations. Aggregations come in two flavors: bucket aggregations (terms, date histogram, range) and metric aggregations (count, average, percentiles, cardinality). Visualizations compose into Dashboards that teams pin during incidents.
The Dev Tools console is a REST playground for raw API calls (GET _cluster/health, PUT /_index_template/...), invaluable for debugging mappings or testing ISM policies.
For multi-tenant environments, fine-grained access control restricts users to specific index patterns, fields (field-level security), or documents (document-level security). Application teams typically own a dashboard tenant; platform teams own a global tenant for cross-cutting dashboards.
Key Takeaway: OpenSearch Dashboards is the UI layer for exploration (Discover), visualization (charts and dashboards), and administration (Dev Tools). Fine-grained access control supports multi-tenant log lakes where teams own their own indexes and dashboards.
Alerting and Anomaly Detection
A dashboard you have to look at is an outage you missed. The Alerting plugin turns saved queries into scheduled monitors. A monitor has:
- Inputs: a query against indexes, returning documents or aggregations.
- Triggers: per-document, per-bucket, or query-level conditions.
- Actions: Slack, Teams, email, SNS, webhook, or PagerDuty notifications.
Example: a monitor that fires when level: ERROR events from any single service exceed 50 in 5 minutes. A per-bucket trigger over a terms aggregation on service.name produces a separate alert per offending service.
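The per-bucket logic in that example can be sketched in a few lines of Python. This is an in-memory illustration of the trigger condition only, not the Alerting plugin's implementation; the event field names mirror the ones used in this chapter, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def services_over_threshold(events, now, window=timedelta(minutes=5), threshold=50):
    """Return services whose ERROR count inside the window exceeds the threshold."""
    start = now - window
    # Equivalent of a terms aggregation on service.name over filtered events.
    counts = Counter(
        e["service.name"]
        for e in events
        if e["level"] == "ERROR" and start <= e["@timestamp"] <= now
    )
    # One alert per offending bucket, as a per-bucket trigger would produce.
    return sorted(service for service, n in counts.items() if n > threshold)
```

A query-level trigger would instead compare a single total against the threshold, which is why it cannot tell you which service is misbehaving.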
The Anomaly Detection plugin layers ML on top. It runs Random Cut Forest, an unsupervised ensemble algorithm for time-series anomalies, over a feature (count, average, sum) bucketed by a category (host, region, customer). Detectors learn normal patterns, score new buckets, and emit anomaly grades. Grades above a threshold drive alerts, and because the model adapts to seasonality and drift, it suits “every service has its own normal traffic” cases where a static threshold would flap or miss.
Production hierarchy: static threshold alerts for SLO violations (latency p99 > 500 ms, errors > 1%); anomaly alerts for novel failures; composite monitors combining multiple inputs (errors AND latency) to suppress flapping.
Key Takeaway: OpenSearch Alerting turns saved queries into scheduled monitors with rich notification destinations. Anomaly Detection adds ML-based seasonal-aware thresholds, ideal for catching weird-but-not-yet-broken behavior.
Trace Analytics and APM
OpenSearch supports distributed trace analytics through OpenTelemetry (OTel) ingestion and a Dashboards plugin. Spans flow from instrumented services — usually via the OTel Collector or AWS Distro for OpenTelemetry (ADOT) — into an OSIS or Data Prepper pipeline. The pipeline runs otel_trace_raw to flatten spans, service_map_stateful to compute a service graph with edge metrics like latency and error rate, and writes to otel-v1-apm-span-* for raw spans and otel-v1-apm-service-map for graph aggregates [Source: https://aws.amazon.com/blogs/big-data/top-strategies-for-high-volume-tracing-with-amazon-opensearch-ingestion/].
The Trace Analytics view offers a service map with node color for error rate and edge thickness for call volume; a traces table with end-to-end latency and span waterfalls; and dashboards with throughput, latency percentiles, and error rate over time.
The trace data flow:
+----------------+ +-----------------+ +-----------------+ +----------------+
| App with OTel |--->| ADOT Collector |--->| OSIS pipeline |--->| OpenSearch |
| instrumentation| | (gRPC/HTTP) | | otel_trace_raw | | trace indexes |
+----------------+ +-----------------+ | service_map | +----------------+
+-----------------+
Analogy: traces are flight-tracking radar. Each request is an aircraft; each span is a leg (taxi, climb, cruise, descent). The trace view shows one flight path; the service map is the airport-pair graph aggregating all flights — needed to see which routes are systemically slow.
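The idea of aggregating individual spans into service-to-service edges can be sketched as follows. This is a deliberately simplified, in-memory illustration and not the algorithm used by service_map_stateful; the span field names (span_id, parent_id, service, duration_ms, error) are assumptions chosen for readability.

```python
from collections import defaultdict


def service_edges(spans):
    """Aggregate parent-to-child spans into caller/callee edges with basic metrics."""
    by_id = {s["span_id"]: s for s in spans}
    stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Skip root spans and calls that stay inside one service.
        if parent is None or parent["service"] == span["service"]:
            continue
        edge = stats[(parent["service"], span["service"])]
        edge["calls"] += 1
        edge["errors"] += 1 if span["error"] else 0
        edge["total_ms"] += span["duration_ms"]
    return {
        key: {
            "calls": v["calls"],
            "error_rate": v["errors"] / v["calls"],
            "avg_ms": v["total_ms"] / v["calls"],
        }
        for key, v in stats.items()
    }
```

In the airport analogy, each dictionary key is one route and the values are the aggregate statistics for every flight that flew it.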
Figure 12.3: OpenTelemetry trace ingestion flow into OpenSearch trace analytics
sequenceDiagram
participant App as Instrumented App<br/>(OTel SDK)
participant ADOT as ADOT Collector<br/>(gRPC/HTTP)
participant OSIS as OSIS Pipeline<br/>(otel_trace_raw +<br/>service_map_stateful)
participant Raw as OpenSearch<br/>otel-v1-apm-span-*
participant Map as OpenSearch<br/>otel-v1-apm-service-map
participant Dash as Trace Analytics<br/>Dashboard
App->>ADOT: Emit spans (trace_id, span_id, parent_id)
ADOT->>OSIS: Batch OTLP export
OSIS->>OSIS: Flatten spans (otel_trace_raw)
OSIS->>OSIS: Compute edges (service_map_stateful)
par Fan-out to two indexes
OSIS->>Raw: Index raw span documents
and
OSIS->>Map: Index service-graph aggregates
end
Dash->>Raw: Query waterfall by trace_id
Dash->>Map: Query service map (latency/error rate)
Dash-->>App: Display APM dashboards
For log-trace correlation, instrument services to inject trace_id and span_id into every log line. With logs and traces in the same domain, a Discover query for a trace ID surfaces both, giving full context in one tool.
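One way to inject the identifiers is a logging filter, sketched below with Python's standard logging module. The current_context function is a hypothetical stand-in that returns fixed IDs; a real service would read the active trace and span IDs from its tracing SDK instead.

```python
import io
import json
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record passing through the logger."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context

    def filter(self, record):
        record.trace_id, record.span_id = self._get_context()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the IDs become indexed fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def current_context():
    # Hypothetical stand-in: replace with a lookup in your tracing SDK.
    return ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addFilter(TraceContextFilter(current_context))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("payment failed")
line = json.loads(stream.getvalue())
print(line["trace_id"], line["msg"])
```

Emitting the IDs as separate JSON fields, rather than embedding them in the message text, lets them be mapped as keyword fields and matched exactly.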
Key Takeaway: OpenSearch Trace Analytics ingests OpenTelemetry spans, computes service maps automatically, and presents APM-style dashboards. Correlating trace IDs into application logs unifies logs and traces in a single observability surface.
Cost and Tiering
Log retention is where naive deployments turn into runaway bills. Recent logs need fast queries; quarter-old logs must exist for compliance but rarely get touched. OpenSearch addresses this with three storage tiers and an automation engine.
UltraWarm and Cold Storage Tiers
The hot tier is the default. Data lives on instance-attached EBS volumes (or local NVMe storage on instance types that provide it), with full IOPS and memory caching. Hot is fast but expensive: about $0.169/GB-month for EBS plus data-node cost [Source: https://aws.amazon.com/opensearch-service/pricing/].
UltraWarm backs indexes with S3 plus an LRU cache on local SSD and in memory. Migrated indexes become read-only, are optimized by force_merge to one segment per shard, and their segment files are uploaded to S3. Queries pull segments from S3 into cache on demand: the first query against uncached data can take seconds to tens of seconds; subsequent queries are interactive. UltraWarm storage is $0.024/GB-month, about 85% cheaper than hot, plus UltraWarm node cost (~$0.238/hr for a medium node with ~1.5 TB; $2.68/hr for a large node with ~20 TB) [Source: https://aws.amazon.com/opensearch-service/pricing/].
Cold storage detaches indexes entirely — only metadata stays in the cluster. You pay S3 standard rates (~$0.0125/GB-month) [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html]. To query a cold index you must first attach it back to UltraWarm. Cold is for once-a-year data: security audits, regulatory archives, long-tail debugging.
The pricing tiers compared:
| Tier | Use Case (typical) | Storage Cost | Compute Cost | Query Performance |
|---|---|---|---|---|
| Hot | Recent logs (0-7 days) | ~$0.169/GB-mo + IOPS | Full data-node instances | Fastest, sub-second |
| UltraWarm | Historical (7-90 days) | $0.024/GB-mo | $0.238-$2.68/hr per node | Interactive (S3 + cache) |
| Cold | Archive (>90 days) | ~$0.0125/GB-mo (S3) | Pay-per-attach | Slow first query (attach), then UltraWarm-like |
[Source: https://aws.amazon.com/opensearch-service/pricing/, https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html]
Worked cost example: 100 TB retained 365 days with 7 days hot, 83 days UltraWarm, 275 days cold.
- Hot: 100 TB × (7/365) × $0.169/GB-mo × 12 mo ≈ $3,890/yr (plus instances)
- UltraWarm: 100 TB × (83/365) × $0.024/GB-mo × 12 mo ≈ $6,550/yr (plus warm nodes)
- Cold: 100 TB × (275/365) × $0.0125/GB-mo × 12 mo ≈ $11,300/yr
Total storage about $21,700/yr versus $202,800/yr for a naive all-hot retention — roughly a 9x cost reduction before counting the larger hot-only instance fleet you would also need.
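The arithmetic in this example can be reproduced with a short script. The sketch uses the per-GB prices quoted in this section and treats 100 TB as 100,000 GB; the variable and function names are illustrative, and instance and node costs are left out, as they are in the text.

```python
TOTAL_GB = 100 * 1000  # 100 TB expressed in decimal GB
RETENTION_DAYS = 365

# Storage prices per GB-month as quoted in this section.
PRICE_PER_GB_MONTH = {"hot": 0.169, "ultrawarm": 0.024, "cold": 0.0125}
DAYS_IN_TIER = {"hot": 7, "ultrawarm": 83, "cold": 275}


def annual_storage_cost(tier: str) -> float:
    """Yearly storage cost for the share of data that sits in one tier."""
    share = DAYS_IN_TIER[tier] / RETENTION_DAYS
    return TOTAL_GB * share * PRICE_PER_GB_MONTH[tier] * 12


tiered = {tier: annual_storage_cost(tier) for tier in DAYS_IN_TIER}
tiered_total = sum(tiered.values())
all_hot = TOTAL_GB * PRICE_PER_GB_MONTH["hot"] * 12

for tier, cost in tiered.items():
    print(f"{tier:>9}: ${cost:,.0f}/yr")
print(f"   tiered: ${tiered_total:,.0f}/yr")
print(f"  all-hot: ${all_hot:,.0f}/yr ({all_hot / tiered_total:.1f}x)")
```

Changing DAYS_IN_TIER is the quickest way to see how sensitive the bill is to the hot window: each extra day kept hot costs roughly seven times what the same day costs in UltraWarm.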
Key Takeaway: Hot storage is fast and expensive; UltraWarm is S3-backed with caching at ~$0.024/GB-month; cold is S3-archive at ~$0.0125/GB-month with attach-on-query semantics. Tiering routinely saves 80-90% on log retention bills.
Index State Management (ISM)
Manually moving indexes between tiers does not scale. Index State Management (ISM) is OpenSearch’s built-in policy engine. An ISM policy describes a finite-state machine of states (hot, warm, cold, delete), actions performed on state entry, and conditions that trigger transitions [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ism.html].
A complete log-retention policy:
PUT _plugins/_ism/policies/log-lifecycle
{
"policy": {
"description": "Log retention: hot->UltraWarm->cold->delete",
"default_state": "hot",
"states": [
{
"name": "hot",
"actions": [
{ "rollover": {
"min_size": "50GB",
"min_index_age": "1d",
"min_doc_count": 50000000
}}
],
"transitions": [
{ "state_name": "warm",
"conditions": { "min_index_age": "7d" }
}
]
},
{
"name": "warm",
"actions": [
{ "force_merge": { "max_num_segments": 1 } },
{ "replica_count": { "number_of_replicas": 1 } },
{ "warm_migration": {} }
],
"transitions": [
{ "state_name": "cold",
"conditions": { "min_index_age": "30d" }
}
]
},
{
"name": "cold",
"actions": [
{ "cold_migration": { "timestamp_field": "timestamp" } }
],
"transitions": [
{ "state_name": "delete",
"conditions": { "min_index_age": "90d" }
}
]
},
{
"name": "delete",
"actions": [
{ "cold_delete": {} }
]
}
],
"ism_template": [
{ "index_patterns": ["logs-*"], "priority": 100 }
]
}
}
[Source: https://oneuptime.com/blog/post/2026-02-12-opensearch-index-state-management-ism/view]
The lifecycle visualized:
day 0 day 7 day 30 day 90
| | | |
v v v v
+------+ +----------+ +------+ +--------+
| HOT | ---> | ULTRAWARM| ------> | COLD | ---------> | DELETE |
+------+ +----------+ +------+ +--------+
(writes) (read-only, (S3 only, (cold_delete)
cached) attach to query)
rollover at 50GB, 1d, or 50M docs
force_merge to 1 segment before warm migration
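The ism_template block auto-attaches the policy to any new index matching logs-*. Combined with rolling daily indexes from Firehose or OSIS, the lifecycle runs without operator intervention. One caveat: the rollover action only works on an index that is written through an alias and carries the plugins.index_state_management.rollover_alias setting; for date-named daily indexes, which already roll by name, omit the rollover action from the hot state.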
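The transition conditions in the policy amount to a lookup on index age. The sketch below mirrors the 7-, 30-, and 90-day thresholds from the policy; the function name is hypothetical, and it ignores the fact that ISM evaluates conditions on a periodic job interval, so real transitions lag the thresholds slightly.

```python
def expected_state(index_age_days: float) -> str:
    """Return the ISM state an index should have reached at a given age."""
    # Checked from oldest to youngest so the furthest-along state wins.
    transitions = [(90, "delete"), (30, "cold"), (7, "warm")]
    for min_age_days, state in transitions:
        if index_age_days >= min_age_days:
            return state
    return "hot"


for age in (0, 7, 29, 30, 90):
    print(age, expected_state(age))
```

A check like this is handy in a test suite: compare the expected state against what the ISM explain API reports for each index and flag any index that is stuck.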
Figure 12.4: ISM hot/UltraWarm/cold/delete lifecycle state machine
stateDiagram-v2
[*] --> hot: Index created<br/>(ism_template auto-attach)
hot --> hot: rollover<br/>(50GB / 1d / 50M docs)
hot --> warm: min_index_age >= 7d<br/>actions: force_merge to 1 segment,<br/>replica_count = 1,<br/>warm_migration
warm --> cold: min_index_age >= 30d<br/>action: cold_migration<br/>(detach to S3, metadata only)
cold --> delete: min_index_age >= 90d
delete --> [*]: cold_delete<br/>(remove metadata)
note right of hot
EBS-backed
~$0.169/GB-month
Sub-second queries
end note
note right of warm
S3 + LRU cache
~$0.024/GB-month
Read-only, interactive
end note
note right of cold
S3 archive
~$0.0125/GB-month
Attach to query
end note
The force_merge before warm migration is a critical optimization. Lucene segments accumulate as documents are written; merging to one segment per shard removes deleted documents and consolidates layout, shrinking storage and speeding up cold-cache queries on UltraWarm [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html]. Skipping it leaves hundreds of small segments per shard, each requiring a separate S3 fetch.
ISM also supports notification actions (Slack, webhook), replica_count to drop replicas before migration, and index_priority to control recovery ordering after restart.
Key Takeaway: ISM automates the hot/warm/cold/delete lifecycle through a state machine of actions and conditions. Auto-attach via ism_template paired with rolling indexes makes log retention a configure-once policy decision rather than a daily ops chore.
Serverless OpenSearch Collections
For workloads where you do not want to size a cluster, Amazon OpenSearch Serverless offers collections — managed, auto-scaling OpenSearch endpoints typed as time-series (logs), search, or vector search. AWS provisions and scales OpenSearch Compute Units (OCUs) behind the scenes, splits indexing and search compute, and stores data on S3 by default [Source: https://aws.amazon.com/opensearch-service/pricing/].
Key differences from provisioned:
- No node sizing. You set minimum/maximum OCU bounds; AWS scales between them.
- Storage on S3 by default. There is no UltraWarm or cold tier because storage is already S3 — but you also lose explicit hot/warm control.
- Pricing. Serverless bills per OCU-hour for indexing and search separately, plus storage. Cheaper for spiky/small workloads, often costlier than a well-sized provisioned domain at steady high volume.
- Feature subset. Trace analytics works; ISM is unsupported because the storage model differs; some plugins and security features are unavailable.
- Sink configuration. OSIS pipelines targeting Serverless need serverless: true plus index_type: management_disabled [Source: https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sinks/opensearch/].
Choose Serverless for variable/bursty volumes, new workloads, or vector search. Stay provisioned for steady high-volume ingest where ISM tiering is the cost win, plugin-heavy workloads, or strict latency targets where you want to choose instance types yourself.
Key Takeaway: OpenSearch Serverless collections eliminate cluster sizing for spiky workloads but trade away ISM tiering and some plugins. Use Serverless for variable or new workloads; use provisioned domains with ISM for steady-state high-volume log retention where tiering economics dominate.
Chapter Summary
OpenSearch is the search and observability backbone of an AWS data platform. The engine is built on inverted indexes and shards: the primary shard count is fixed at index creation, and replicas live on different nodes for HA and read scaling. Shard sizing should target 10-50 GB with fewer than ~20 shards per GB of JVM heap. OpenSearch is the Apache-2.0 fork of Elasticsearch 7.10 from 2021, API-compatible with that era but evolving independently.
Log ingestion is most often handled by Amazon OpenSearch Ingestion (OSIS), the managed Data Prepper, which runs YAML pipelines of source -> processor -> sink stages on auto-scaling OCUs. Self-hosted Data Prepper covers on-premises and custom-code cases. Amazon Data Firehose offers a simpler pass-through path for structured records.
Visualization and alerting come from OpenSearch Dashboards (the Kibana fork), the Alerting plugin, the Anomaly Detection plugin for ML-based seasonal alerts, and Trace Analytics for OpenTelemetry APM with service maps.
Cost is governed by tiered storage: hot is fast and expensive; UltraWarm is S3-backed at ~85% lower cost; cold is detached S3 archive at S3 rates. ISM policies automate the lifecycle and auto-attach via ism_template. OpenSearch Serverless eliminates cluster sizing for spiky workloads but trades away ISM tiering. Together these primitives ingest, search, alert on, and retain petabyte-scale log data at a fraction of an all-hot cluster’s storage cost.
Key Terms
- OpenSearch: AWS’s Apache-2.0 fork of Elasticsearch 7.10 and Kibana, released in 2021 and now maintained by an open-source community under the Linux Foundation. Amazon OpenSearch Service is the AWS managed offering [Source: https://docs.opensearch.org/latest/getting-started/intro/].
- Elasticsearch: The original distributed search engine from Elastic NV, whose 2021 license change to SSPL/Elastic License triggered the OpenSearch fork. Elastic added AGPLv3 as a further license option in 2024 [Source: https://cloudchipr.com/blog/aws-opensearch].
- Inverted index: The core data structure mapping each term to the list of document IDs that contain it; enables fast full-text search without scanning every document [Source: https://docs.opensearch.org/latest/getting-started/intro/].
- Shard: An independent Lucene index that stores a partition of an OpenSearch index. Primary shards are authoritative and immutable in count; replica shards are dynamic copies on different nodes for HA and read scaling [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html].
- Data Prepper: The open-source pipeline engine for OpenSearch that runs YAML-defined source/processor/sink pipelines. OpenSearch Ingestion is the AWS-managed version [Source: https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/README.md].
- OpenSearch Dashboards: The fork of Kibana, providing Discover, Visualize, Dashboard, and Dev Tools UIs for searching and administering OpenSearch [Source: https://docs.opensearch.org/latest/getting-started/intro/].
- UltraWarm: A read-only OpenSearch storage tier backed by S3 plus a local SSD/memory cache, priced at $0.024/GB-month for storage — roughly 85% cheaper than hot — for historical log data with interactive query latency after warm-up [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html].
- ISM (Index State Management): The OpenSearch policy engine that automates index lifecycles via a finite-state machine of states, actions (rollover, warm_migration, force_merge, cold_migration, cold_delete), and transitions (gated by min_index_age and similar conditions) [Source: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ism.html].
- Serverless collection: A managed, auto-scaling OpenSearch endpoint typed as time-series, search, or vector search. Storage is S3-by-default; pricing is per OCU-hour. ISM tiering is not supported because the storage model is already serverless [Source: https://aws.amazon.com/opensearch-service/pricing/].
Chapter 13: BI, ML, and Cost Optimization in Production
Learning Objectives
By the end of this chapter, you will be able to:
- Connect Amazon QuickSight to lakehouse and warehouse sources, configuring SPICE refreshes, VPC connectors, and IAM roles for production dashboards.
- Integrate machine learning pipelines with analytics platforms via SageMaker Feature Store, building point-in-time-correct training datasets and serving real-time inference from warehouse-resident features.
- Apply cost optimization patterns including right-sizing, storage tiering, and query economics to reduce platform spend by 50-75 percent while preserving SLA targets.
- Synthesize a reference architecture combining ingestion, lakehouse storage, transformation, governance, BI, and ML services from prior chapters into an end-to-end production design.
Business Intelligence with QuickSight
Business intelligence is the layer where engineered data finally meets the people who make decisions with it. A pipeline that ingests terabytes of clickstream and conforms it to a star schema delivers zero business value until a regional sales lead can open a dashboard at 7 AM and see whether yesterday’s promotion worked. Amazon QuickSight is AWS’s native managed BI service, designed to plug directly into the Redshift warehouses, S3 data lakes, and Athena query layers that previous chapters built. It differs from traditional BI tools by being serverless, per-session billed, and tightly integrated with AWS identity and networking.
QuickSight has three distinct capability layers: a connector and security layer that talks to your data sources, a SPICE in-memory acceleration layer that decouples dashboard latency from source-system load, and an end-user experience layer that includes dashboards, embedded analytics, and the natural-language assistant Q. Each layer has its own cost model and failure modes, and effective production deployment requires deliberate choices at all three.
Figure 13.1: QuickSight three-layer architecture with SPICE acceleration and Q natural-language interface.
flowchart TD
subgraph Sources["Data Sources"]
RS[Redshift]
S3[S3 / Athena]
RDS[RDS / Aurora]
SaaS[SaaS Connectors]
end
subgraph Connect["Connector and Security Layer"]
VPC[VPC Connector]
IAM[IAM Roles]
RLS[Row-Level Security]
end
subgraph SPICE["SPICE Acceleration Layer"]
Ingest[Scheduled Ingestion]
Mem[(Columnar In-Memory Store<br/>4:1-6:1 compression)]
Refresh[Full / Incremental / On-Demand]
end
subgraph Experience["End-User Experience"]
Dash[Dashboards]
Embed[Embedded Analytics]
Q[Q Natural-Language]
end
Sources --> Connect
Connect --> SPICE
Ingest --> Mem
Refresh --> Ingest
SPICE --> Experience
Connect -. Direct Query .-> Experience
SPICE In-Memory Engine
SPICE stands for Super-fast, Parallel, In-memory Calculation Engine, and it is the proprietary columnar in-memory store that makes QuickSight feel responsive even when the underlying source is a 50-terabyte Redshift cluster or a partitioned Parquet lake [Source: https://aws.amazon.com/blogs/big-data/amazon-redshift-out-of-the-box-performance-innovations-for-data-lake-queries/]. When you ingest a dataset into SPICE, QuickSight extracts the rows from the source, compresses them column by column (typical ratios are 4:1 to 6:1), and stores them in a distributed memory tier that the dashboard renderer queries directly. The result is aggregation latency in the 10-100 millisecond range, regardless of how loaded the source warehouse happens to be at that moment.
The analogy here is a coffee thermos. The espresso machine (your warehouse) is expensive, slow to warm up, and shared with the whole building. Instead of walking back for every sip, you brew a thermos in the morning (SPICE ingestion), keep it on your desk, and pour from it instantly all day. The thermos is smaller than the machine’s output so you choose what goes in; the contents go stale unless you refill on a schedule.
SPICE capacity is allocated per account: Standard edition gives 10 GB total; Enterprise gives 10 GB base plus 0.5 GB per provisioned user up to 512 GB. That capacity is shared across all datasets in the account, so a careless team can fill it with overlapping snapshots and starve the organization. Production deployments treat SPICE capacity as a finite shared resource, monitor usage through the admin console, and partition large datasets into smaller ones covering tighter time windows or business units.
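The Enterprise allocation rule is simple arithmetic, and writing it down makes capacity planning concrete. The sketch below is illustrative only: the function names are hypothetical, and the constants are the figures quoted in this section rather than values read from any API.

```python
def enterprise_spice_capacity_gb(provisioned_users: int) -> float:
    """Account-wide SPICE capacity in GB under the rule quoted in this section."""
    if provisioned_users < 0:
        raise ValueError("provisioned_users must be non-negative")
    base_gb, per_user_gb, cap_gb = 10.0, 0.5, 512.0  # figures from the text
    return min(base_gb + per_user_gb * provisioned_users, cap_gb)


def users_needed_for(target_gb: float) -> int:
    """Smallest user count whose allocation reaches target_gb (hypothetical helper)."""
    if target_gb > 512.0:
        raise ValueError("target exceeds the 512 GB account cap")
    users = 0
    while enterprise_spice_capacity_gb(users) < target_gb:
        users += 1
    return users


print(enterprise_spice_capacity_gb(100))  # capacity for 100 provisioned users
print(users_needed_for(60))               # users required to reach 60 GB
```

Because the pool is shared, a number like this is a ceiling for the whole account, not a per-team budget.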
Refresh strategies are the second SPICE design lever. There are three modes:
| Refresh Mode | Behavior | When to Use |
|---|---|---|
| Full refresh | Re-ingests entire dataset from source | Small datasets, dimension tables, schema changes |
| Incremental refresh | Loads only rows newer than a timestamp watermark | Large append-only fact tables with monotonic event time |
| On-demand refresh | API or console-triggered manual run | Post-ETL completion via EventBridge, ad-hoc fixes |
Standard edition caps refreshes at four per day; Enterprise allows up to 32 (roughly every 45 minutes). Schedule Redshift refreshes during low-query windows so SPICE ingestion does not contend with interactive analytics. The trade-off is always freshness versus cost: a dashboard accurate to the minute is dramatically more expensive than one accurate to the hour.
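The daily caps translate directly into a minimum refresh interval. As a rough sketch, assuming refreshes are spread evenly across 24 hours (the names below are hypothetical and the caps are the ones quoted above):

```python
REFRESH_CAP_PER_DAY = {"standard": 4, "enterprise": 32}  # caps quoted in the text


def min_refresh_interval_minutes(edition: str) -> float:
    """Shortest evenly spaced refresh interval the edition's daily cap allows."""
    return 24 * 60 / REFRESH_CAP_PER_DAY[edition]


def schedule_is_allowed(edition: str, interval_minutes: float) -> bool:
    """True if refreshing every interval_minutes stays within the daily cap."""
    return interval_minutes >= min_refresh_interval_minutes(edition)


print(min_refresh_interval_minutes("enterprise"))  # 45.0
print(schedule_is_allowed("standard", 60))         # False: hourly needs 24 per day
```

An hourly schedule therefore fits Enterprise but not Standard, which is often the deciding factor when choosing an edition for operational dashboards.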
Key Takeaway: SPICE is QuickSight’s in-memory accelerator that turns slow source queries into 10-100 ms dashboard interactions; treat its 10 GB to 512 GB account capacity as a finite shared resource and pick refresh schedules that match the freshness your business actually needs, not the maximum your edition allows.
Q Natural Language Queries
Q in QuickSight is the natural-language layer that lets a business user type “what were our top five products by revenue last quarter in EMEA” and receive a generated chart instead of a SQL prompt [Source: https://aws.amazon.com/blogs/machine-learning/dynamic-text-to-sql-for-enterprise-workloads-with-amazon-bedrock-agents/]. Under the hood, Q is a text-to-SQL pipeline: the user’s question is parsed by an NLP model, mapped against the dataset’s schema and metadata, translated to SQL, executed against SPICE or the source, and rendered as a visualization with the chart type selected automatically.
Q is an Enterprise-only feature and its accuracy is almost entirely determined by the quality of dataset metadata. A column literally named cust_acct_typ_cd_v2 will not be matched against the phrase “customer type”; a column named customer_account_type with a description “Indicates retail, wholesale, or partner customer category” will be. Production Q deployments invest in:
- Business-friendly column names and descriptions added in the QuickSight dataset editor, not just inherited from source-system DDL.
- Hierarchies declared explicitly (Date -> Quarter -> Month -> Day; Region -> Country -> City) so Q can answer “month-over-month” or “drill into California” correctly.
- Custom synonyms and grouping definitions so “EMEA” maps to the set of country codes your finance team actually uses, not Wikipedia’s.
- Data category tags (currency, percentage, date) so Q knows that gross_margin should be formatted as a percentage and aggregated as a weighted average, not a sum.
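To see why curated metadata matters, consider a deliberately simplified matcher. This is a toy illustration and not how Q resolves questions internally; the class, the function, and the exact-match rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ColumnMetadata:
    name: str
    description: str = ""
    synonyms: Set[str] = field(default_factory=set)


def resolve_phrase(phrase: str, columns: List[ColumnMetadata]) -> Optional[str]:
    """Return the first column whose friendly name or synonyms equal the phrase."""
    wanted = phrase.strip().lower()
    for col in columns:
        friendly = col.name.replace("_", " ").lower()
        if wanted == friendly or wanted in {s.lower() for s in col.synonyms}:
            return col.name
    return None


raw = [ColumnMetadata("cust_acct_typ_cd_v2")]
curated = [
    ColumnMetadata(
        "customer_account_type",
        "Indicates retail, wholesale, or partner customer category",
        {"customer type"},
    )
]

print(resolve_phrase("customer type", raw))      # None: nothing to match against
print(resolve_phrase("customer type", curated))  # customer_account_type
```

Even this naive matcher succeeds once the column carries a readable name and a synonym, which is the same investment a production Q deployment makes at much larger scale.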
Q is a productivity multiplier for analysts, not a replacement. Ambiguous questions still need clarification; cross-dataset queries often need manual SQL refinement. Performance varies by source: SPICE-backed feels instant; partitioned Athena is fast; unpartitioned S3 can be brutally slow.
Embedded Analytics
The third experience layer is embedded analytics: dashboards or the full Q bar surfaced inside another application via a signed embed URL. This is how a SaaS company exposes per-customer reporting inside its own product, and how an internal tool integrates BI into an operational workflow.
Embedding requires careful design across three dimensions. The identity dimension uses QuickSight’s row-level security and namespace isolation to ensure tenant A cannot see tenant B’s data. The session dimension generates short-lived embed URLs server-side via GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser, never embedding long-lived credentials. The theming dimension uses QuickSight themes and custom CSS so users perceive a single product, not a bolted-on iframe.
For multi-tenant scenarios, the anonymous-user embed pattern with session-tags is standard: each session is tagged with the tenant ID, row-level security rules filter datasets on that tag, and the host application controls which dashboard a user can request.
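A minimal sketch of the server-side step follows. It only assembles the request parameters for GenerateEmbedUrlForAnonymousUser and does not call AWS. The parameter names follow the API as commonly documented and should be checked against the current API reference; the tag key tenant_id, the helper name, and the example ARN are assumptions.

```python
from typing import Any, Dict


def anonymous_embed_request(
    account_id: str,
    namespace: str,
    dashboard_arn: str,
    dashboard_id: str,
    tenant_id: str,
    lifetime_minutes: int = 15,
) -> Dict[str, Any]:
    """Build keyword arguments for a tenant-scoped anonymous embed URL request."""
    if not 15 <= lifetime_minutes <= 600:
        raise ValueError("session lifetime must be between 15 and 600 minutes")
    return {
        "AwsAccountId": account_id,
        "Namespace": namespace,
        "SessionLifetimeInMinutes": lifetime_minutes,
        "AuthorizedResourceArns": [dashboard_arn],
        "ExperienceConfiguration": {"Dashboard": {"InitialDashboardId": dashboard_id}},
        # Row-level security rules on the dataset filter on this tag (assumed key).
        "SessionTags": [{"Key": "tenant_id", "Value": tenant_id}],
    }


params = anonymous_embed_request(
    account_id="111122223333",
    namespace="default",
    dashboard_arn="arn:aws:quicksight:us-east-1:111122223333:dashboard/sales",
    dashboard_id="sales",
    tenant_id="tenant-a",
)
# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("quicksight").generate_embed_url_for_anonymous_user(**params)
print(params["SessionTags"])
```

The tenant ID must come from the host application's authenticated session, never from a value the browser supplies, or the row-level filter can be bypassed.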
Key Takeaway: Q makes BI conversational but only when metadata is curated, and embedded analytics turns dashboards into product features when paired with disciplined session-level identity and row-level security.
ML and Analytics Convergence
For most of BI’s history, analytics and ML lived in separate tools with separate data copies. The convergence story of the past few years is that those worlds are merging onto the same lakehouse storage, and the bridge between them is the SageMaker Feature Store and a new generation of generative-BI assistants that operate over warehouse data.
SageMaker Integration with the Lakehouse
SageMaker is AWS’s managed ML platform covering notebooks, training, model registry, deployment, and monitoring. Its critical integration point is that SageMaker training jobs, batch-transform jobs, and Studio notebooks read directly from the same S3 lakehouse and Glue Data Catalog as analytics workloads. Training a churn model no longer requires an export to a separate ML store; it reads the same Iceberg or Hudi table the dashboards read.
Concrete integration patterns include:
- Glue ETL into SageMaker training: a scheduled Glue job lands curated features in S3 as partitioned Parquet, and a SageMaker training job uses the corresponding Glue Data Catalog table as its input dataset.
- Athena queries in notebooks: a SageMaker Studio notebook runs Athena queries against lake tables to build training samples, with partition filters pushed down so only the relevant partitions are scanned.
- EMR + Spark connector: large-scale feature engineering runs on EMR Spark clusters and writes directly to feature groups via the sagemaker-feature-store-pyspark connector for historical backfills.
- Redshift ML: in-warehouse SQL training of certain model classes (XGBoost, linear learner) where the warehouse is the source of truth and exporting to S3 would add latency.
The architectural shift this enables is that ML stops being a downstream consumer with its own data copy and becomes a peer workload on the same governed lakehouse, with the same Lake Formation row and column controls applied uniformly.
Feature Stores on Warehouse Data
A feature store is a system that manages the inputs to ML models with two non-negotiable properties: low-latency online lookups for inference, and point-in-time-correct historical retrieval for training. SageMaker Feature Store implements both as paired stores per feature group:
| Store | Backend | Latency | Primary Use |
|---|---|---|---|
| Online | Managed in-memory KV | Single-digit ms | Real-time inference via GetRecord / BatchGetRecord |
| Offline | S3 Parquet (Glue cataloged) | Seconds to minutes | Training set creation, batch inference, analytics |
The offline store is literally just S3 Parquet partitioned by event time, registered in the Glue Data Catalog, and queryable from Athena, Redshift Spectrum, or EMR. That detail matters because it means feature data is part of the lakehouse, not a sibling to it. An analyst can query feature group history with the same Athena workgroup they use for fact tables; a Lake Formation administrator can grant or revoke access using the same policy primitives.
Feature ingestion follows four canonical patterns:
- Batch ingestion: scheduled Glue or EMR jobs read from S3 or Redshift, transform, and call PutRecord (or use the Spark connector). The connector dual-writes to online and offline stores in one operation.
- Stream ingestion: Kinesis Data Streams or Managed Service for Apache Flink push records through Lambda handlers that call PutRecord continuously, producing real-time feature updates.
- Backfill: historical Parquet files are written directly to the offline-store S3 path and registered, bypassing per-record API overhead for large historical loads.
- CDC ingestion: DMS or Debezium captures change events, pushes them through Kinesis or Kafka, and a stream ingestion handler updates features as upstream rows change.
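Whichever path delivers the data, a PutRecord call ultimately receives a list of name/value pairs. The sketch below only builds that payload and does not call AWS. The FeatureName and ValueAsString keys follow the Feature Store runtime API as commonly documented; the helper name and the event-time feature name EventTime are assumptions, since each feature group declares its own.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def to_put_record_payload(
    features: Dict[str, Any],
    event_time: datetime,
    event_time_feature: str = "EventTime",  # assumed name; set per feature group
) -> List[Dict[str, str]]:
    """Convert a feature dict into the Record list that PutRecord expects."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    stamp = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record = [{"FeatureName": n, "ValueAsString": str(v)} for n, v in features.items()]
    record.append({"FeatureName": event_time_feature, "ValueAsString": stamp})
    return record


payload = to_put_record_payload(
    {"customer_id": 42, "orders_30d": 3},
    datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
# With boto3 available, a stream handler would then call:
#   boto3.client("sagemaker-featurestore-runtime").put_record(
#       FeatureGroupName="customer-features", Record=payload)
print(payload)
```

Rejecting naive timestamps up front avoids silently recording local time as UTC, which would corrupt the event-time ordering that training queries depend on.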
For upsert-heavy workloads where the same record key is updated repeatedly (a customer profile that changes daily), the Iceberg-format offline store option enables ACID upserts, schema evolution, and time travel. For append-only event histories (every transaction, every clickstream event), the Glue/Standard offline store is simpler and cheaper.
Figure 13.2: SageMaker Feature Store paired online/offline architecture with multiple ingestion paths.
flowchart LR
subgraph Ingestion["Ingestion Patterns"]
Batch[Batch: Glue / EMR]
Stream[Stream: Kinesis / Flink]
Backfill[Backfill: Direct Parquet]
CDC[CDC: DMS / Debezium]
end
subgraph FS["SageMaker Feature Store"]
FG[Feature Group]
Online[(Online Store<br/>In-Memory KV<br/>~ms latency)]
Offline[(Offline Store<br/>S3 Parquet / Iceberg<br/>Glue Cataloged)]
end
subgraph Consumers["Consumers"]
Inf[Real-Time Inference<br/>GetRecord / BatchGetRecord]
Train[Training Sets<br/>ASOF Joins on EventTime]
Ana[Athena / Redshift Spectrum<br/>Analytics]
end
Batch --> FG
Stream --> FG
Backfill --> Offline
CDC --> FG
FG --> Online
FG --> Offline
Online --> Inf
Offline --> Train
Offline --> Ana
The discipline separating production feature stores from prototypes is point-in-time correctness. Every offline-store record carries both EventTime (when the value was true in the real world) and WriteTime (when it was written). Training queries use ASOF joins to retrieve the feature value current at prediction time, not the latest value. Without this, a churn model is poisoned by post-cancellation profile updates — classic label leakage that produces 99% dev accuracy and 60% production accuracy.
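The as-of lookup at the heart of point-in-time correctness is small enough to sketch in plain Python. This is an illustration of the technique under simplifying assumptions (one feature per entity, ISO date strings as timestamps, in-memory data), not the Feature Store's own implementation; all names are hypothetical.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Optional, Tuple

History = List[Tuple[str, Any]]  # (event_time, value), sorted ascending by event_time


def asof_lookup(history: History, as_of: str) -> Optional[Any]:
    """Return the latest value whose event_time is <= as_of, or None if none exists."""
    times = [t for t, _ in history]
    i = bisect_right(times, as_of)
    return history[i - 1][1] if i else None


def build_training_rows(
    labels: List[Tuple[str, str, int]],  # (entity_id, prediction_time, label)
    feature_history: Dict[str, History],
) -> List[Tuple[str, Optional[Any], int]]:
    """Join each label to the feature value that was current at prediction time."""
    rows = []
    for entity_id, prediction_time, label in labels:
        value = asof_lookup(feature_history.get(entity_id, []), prediction_time)
        rows.append((entity_id, value, label))
    return rows


history = {"c1": [("2025-01-01", "premium"), ("2025-03-10", "cancelled")]}
labels = [("c1", "2025-03-01", 1)]  # customer churned; prediction made on March 1
print(build_training_rows(labels, history))
```

The training row carries "premium", the plan the customer held on March 1. A naive latest-value join would have returned "cancelled", handing the model the answer it is supposed to predict.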
Generative BI Assistants
The third leg of the analytics-ML convergence is the rise of generative-BI assistants: LLM-powered agents that take a user’s natural-language question, plan a sequence of warehouse queries, and synthesize a narrative answer with charts and citations. Q in QuickSight sits at the simpler end of this spectrum (single-question text-to-SQL); Bedrock Agents and similar frameworks at the more complex end orchestrate multi-step reasoning across catalogs, semantic layers, and external knowledge bases [Source: https://aws.amazon.com/blogs/machine-learning/dynamic-text-to-sql-for-enterprise-workloads-with-amazon-bedrock-agents/].
The data engineering work behind a successful generative-BI deployment is largely metadata work:
- Semantic layer: a published set of curated metrics (“monthly active users,” “gross margin”) with explicit definitions, owners, and grain, so the LLM does not have to reinvent business logic from raw schemas.
- Glossary and synonyms: terms specific to your industry mapped to the columns and tables that implement them.
- Lineage: tracked from raw source through transformations to the metric the assistant cites, so the answer can include “computed from fct_orders joined to dim_customer, refreshed at 06:00 UTC.”
- Guardrails: query cost quotas, schema-restriction policies, and result-size caps to prevent a single poorly-worded prompt from scanning a petabyte.
The pattern that fails is pointing an LLM at a raw lake and expecting it to figure out what “active customer” means. The pattern that succeeds is publishing a curated semantic layer first and treating the LLM as a translator into that layer.
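The "translator into a curated layer" idea can be sketched as a registry lookup: the assistant may only request published metrics, and anything else is refused instead of being improvised from raw schemas. The class names, the metric definition, and the table and column names in the SQL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Metric:
    name: str
    sql: str    # governed SQL; table and column names here are hypothetical
    owner: str
    grain: str


class SemanticLayer:
    def __init__(self, metrics: Iterable[Metric]) -> None:
        self._metrics: Dict[str, Metric] = {m.name.lower(): m for m in metrics}

    def translate(self, requested_metric: str) -> str:
        """Return the governed SQL for a published metric, or refuse."""
        key = requested_metric.strip().lower()
        if key not in self._metrics:
            raise KeyError(f"no published metric named {requested_metric!r}")
        return self._metrics[key].sql


layer = SemanticLayer([
    Metric(
        name="monthly active users",
        sql="SELECT COUNT(DISTINCT user_id) FROM fct_sessions WHERE session_month = :month",
        owner="growth-analytics",
        grain="month",
    )
])

print(layer.translate("Monthly Active Users"))
```

Refusing unknown metrics is the design choice that matters: a loud failure prompts someone to publish a definition, whereas a guessed query produces a plausible number nobody can defend.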
Key Takeaway: SageMaker Feature Store unifies analytics and ML on the same governed lakehouse with paired online/offline stores; point-in-time correctness via ASOF joins and curated semantic metadata are the engineering disciplines that turn this unification from theoretical to reliable.
Cost Optimization Playbook
Cost optimization in modern data platforms is about shaping workloads to the billing model of each service. Snowflake bills credits per second (60s minimum); BigQuery bills per terabyte scanned or by reserved slots; Redshift bills per provisioned hour or per serverless second; Athena bills per terabyte scanned; S3 bills per gigabyte-month with tier-dependent rates [Source: https://www.finops.org/wg/finops-data-cloud-platforms/] [Source: https://datavidhya.com/blog/snowflake-vs-bigquery-vs-redshift/]. Cost optimization means understanding unit economics and applying levers that reduce cost (compute hours, bytes scanned, storage tier) without harming value (query latency, freshness, completeness).
Storage Class and Lifecycle Tuning
S3 storage tiering is the highest-leverage and lowest-risk cost lever in most data platforms because storage typically dominates lakehouse bills and the access patterns are heavily skewed: a small fraction of data drives most queries.
| Tier | Approx. Cost vs Standard | Retrieval Latency | Best For |
|---|---|---|---|
| S3 Standard | 100% | ms | Hot, frequently queried data |
| S3 Intelligent-Tiering | ~60% | ms (auto-managed) | Mixed/unknown access patterns |
| S3 Standard-IA | ~55% | ms (per-GB retrieval fee) | Known infrequent access, > 30 days |
| S3 Glacier Instant Retrieval | ~25% | ms | Rarely accessed, instant needed |
| S3 Glacier Flexible Retrieval | ~15% | minutes-hours | Archive with occasional restore |
| S3 Glacier Deep Archive | ~5% | 12+ hours | Compliance archive |
S3 Intelligent-Tiering is a sound default for lakehouses with unpredictable access: data is automatically moved between Frequent, Infrequent, and (optionally) Archive tiers based on observed access patterns, typically saving 40-50 percent versus pure Standard with no engineering work. Glacier tiers save roughly 75 to 95 percent depending on the tier (see the table above) for data that must be retained but is essentially never accessed: compliance archives, raw event logs older than the retention SLA, and point-in-time backups [Source: https://motherduck.com/learn/data-warehouse-tco/].
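The relative costs in the table combine into a blended figure once you know how your data is distributed across tiers. The sketch below uses the table's approximate percentages and ignores retrieval and request fees; the distribution in the example is an invented illustration, not a measured workload.

```python
from typing import Dict

# Approximate storage cost relative to S3 Standard, taken from the table above.
RELATIVE_COST = {
    "standard": 1.00,
    "intelligent_tiering": 0.60,
    "standard_ia": 0.55,
    "glacier_ir": 0.25,
    "glacier_flexible": 0.15,
    "deep_archive": 0.05,
}


def blended_relative_cost(share_by_tier: Dict[str, float]) -> float:
    """Storage cost as a fraction of keeping everything in Standard."""
    if abs(sum(share_by_tier.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1.0")
    return sum(RELATIVE_COST[tier] * share for tier, share in share_by_tier.items())


# Hypothetical mix: 20% hot, 30% auto-tiered, 50% archived.
mix = {"standard": 0.2, "intelligent_tiering": 0.3, "glacier_flexible": 0.5}
cost = blended_relative_cost(mix)
print(f"{cost:.3f} of the all-Standard bill, {1 - cost:.1%} saved")
```

Because most lakehouse data is cold, the archived share dominates the result, so the first question in any tiering review is how much data is genuinely never read.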
Lifecycle policies are the mechanism that moves data automatically. A typical lakehouse policy keeps raw bronze data in Standard for the first 30 days, then holds it in Intelligent-Tiering, transitions it to Glacier Flexible Retrieval at 180 days, and expires it at 7 years. Curated gold tables remain in Standard or Intelligent-Tiering indefinitely because they are the primary query target.
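Expressed as configuration, a bronze-zone policy of this kind is a single lifecycle rule. The sketch below builds the rule as a Python dictionary in the shape the S3 lifecycle API accepts and does not call AWS. The rule ID and the bronze/ prefix are assumptions, and GLACIER is the storage-class value that corresponds to Glacier Flexible Retrieval.

```python
from typing import Any, Dict


def bronze_lifecycle_rule(prefix: str = "bronze/") -> Dict[str, Any]:
    """One S3 lifecycle rule: tier at day 30, archive at day 180, expire at 7 years."""
    return {
        "ID": "bronze-tiering",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 7 * 365},
    }


rule = bronze_lifecycle_rule()
# With boto3 available, the rule would be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-lake-bucket", LifecycleConfiguration={"Rules": [rule]})
print(rule["Transitions"])
```

Scoping the rule to a prefix keeps gold tables out of the archive path, since they share the bucket but not the access pattern.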
The corresponding lever for warehouse-internal storage is columnar format and compression. Storing fact tables as Parquet or ORC instead of CSV reduces storage cost by 75+ percent and reduces scan cost by even more because predicate pushdown and column pruning skip irrelevant data entirely. Snowflake and BigQuery do this transparently; Redshift, Athena, and EMR require deliberate format choices in the ETL.
Query Economics: Scan Reduction
Query economics is the discipline of minimizing what each query reads, because in scan-billed services every byte read is a billed byte and in compute-billed services every byte read is a millisecond of compute paid for [Source: https://aws.amazon.com/blogs/big-data/amazon-redshift-out-of-the-box-performance-innovations-for-data-lake-queries/].
The four scan-reduction levers, in order of typical impact:
- Partitioning: physically separates data so that filters on the partition key skip entire directories. A clickstream table partitioned by event_date lets a query for “yesterday’s events” read 1 day out of 730 instead of all 730. BigQuery, Athena, Redshift Spectrum, Snowflake, and Iceberg all support partitioning natively.
- Clustering / sort keys: within a partition, sorts rows so range filters touch contiguous blocks. BigQuery clustering can reduce scans by 80+ percent on range filters; Redshift sort keys serve the same purpose. Z-ordering in Delta Lake and Iceberg is the lakehouse equivalent.
- Materialized views: pre-compute expensive aggregations once, query them many times. A daily revenue rollup materialized view turns a 10-billion-row scan into a 365-row scan for any query that filters by date.
- BI Engine / result caching: BigQuery BI Engine caches hot data in memory and can cut scans by 90 percent for dashboard workloads; Snowflake’s result cache returns identical-query results for free; Redshift’s result cache and SPICE in QuickSight serve the same role at different layers.
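The effect of partitioning, the first lever in the list, can be shown with a toy model of a date-partitioned table. This simulates the byte accounting only; the layout, the sizes, and the function names are invented for the example.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Set


def build_layout(start: date, days: int, bytes_per_day: int) -> Dict[date, int]:
    """Model a table as one partition per event_date with a fixed size."""
    return {start + timedelta(days=i): bytes_per_day for i in range(days)}


def bytes_scanned(layout: Dict[date, int], wanted: Optional[Set[date]] = None) -> int:
    """Bytes read: every partition without a filter, only matching ones with it."""
    if wanted is None:
        return sum(layout.values())
    return sum(size for day, size in layout.items() if day in wanted)


layout = build_layout(date(2023, 1, 1), days=730, bytes_per_day=10**9)
full = bytes_scanned(layout)
pruned = bytes_scanned(layout, {date(2024, 12, 30)})
print(f"full scan: {full} bytes, pruned scan: {pruned} bytes")
```

The pruned query touches one partition out of 730, and in a scan-billed service that ratio carries straight through to the invoice.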
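A classic anti-pattern is SELECT * FROM big_table WHERE LOWER(country) = 'us' on a table partitioned by country. Wrapping the partition column in LOWER can defeat partition pruning because many engines cannot map the wrapped predicate back onto partition values, and SELECT * gives up column pruning as well. Normalizing country codes at ingestion, so that the filter becomes a plain country = 'us', turns a full-table scan into a pruned one.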
For Athena and BigQuery, the cost-per-byte model means a runaway query is a runaway invoice. Production deployments enforce per-query scan quotas (BigQuery custom quotas, Athena workgroup limits) so a typo cannot become a five-figure bill. Snowflake resource monitors play the same role for credit-based billing.
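A scan quota is conceptually a pre-flight check on estimated bytes. The sketch below illustrates the idea and is not how Athena workgroup limits or BigQuery quotas are implemented. The rate of 5 USD per terabyte and the use of decimal terabytes are assumptions for the example; check current pricing before relying on any figure.

```python
TB = 10**12  # decimal terabyte, assumed for illustration


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Cost of a query under per-byte billing at an assumed rate."""
    return bytes_scanned / TB * usd_per_tb


def check_scan_limit(estimated_bytes: int, limit_bytes: int) -> float:
    """Return the estimated cost, or refuse the query if it exceeds the limit."""
    if estimated_bytes > limit_bytes:
        raise RuntimeError(
            f"query would scan {estimated_bytes / TB:.1f} TB, "
            f"limit is {limit_bytes / TB:.1f} TB"
        )
    return scan_cost_usd(estimated_bytes)


print(check_scan_limit(2 * TB, limit_bytes=5 * TB))  # allowed
try:
    check_scan_limit(40 * TB, limit_bytes=5 * TB)     # refused
except RuntimeError as err:
    print(err)
```

Setting the limit per workgroup or per project, not globally, lets a data-science sandbox run with a tight cap while scheduled pipelines keep the headroom they need.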
Figure 13.3: Cost optimization decision flow across storage, scan, and compute levers.
flowchart TD
Start([Workload Cost Review]) --> Q1{Storage<br/>dominates bill?}
Q1 -->|Yes| Tier[Apply S3 Lifecycle:<br/>Standard 30d to IT 90d to<br/>Glacier IR 1y to Deep Archive]
Q1 -->|No| Q2{Per-byte scan<br/>billing?}
Tier --> Q2
Q2 -->|Yes| Scan[Add Partitioning<br/>+ Clustering / Sort Keys<br/>+ Materialized Views<br/>+ Result Cache]
Q2 -->|No| Q3{Compute<br/>under-utilized?}
Scan --> Quota[Enforce Per-Query<br/>Scan Quotas]
Quota --> Q3
Q3 -->|Yes| Right[Right-Size:<br/>Auto-Suspend<br/>Serverless Mode<br/>Smaller Warehouse]
Q3 -->|No| Q4{Steady<br/>baseline load?}
Right --> Q4
Q4 -->|Yes| Reserve[Reserved Capacity:<br/>RIs / Slot Reservations<br/>up to 75% savings]
Q4 -->|No| Tag[Per-Query Tagging<br/>+ FinOps Chargeback]
Reserve --> Tag
Tag --> End([50-75% Savings])
Right-Sizing Compute and Reserved Capacity
Right-sizing means matching provisioned compute to actual workload demand, neither over nor under, with as little manual tuning as possible. Each platform has its own levers:
- Snowflake: choose virtual warehouse size per workload (XS for ad-hoc, M for ETL, L for heavy analytics) and set auto-suspend to 60-300 seconds so idle warehouses stop billing. Multi-cluster warehouses scale out for concurrency but can quadruple peak costs if min-cluster is set too high. Monitor WAREHOUSE_LOAD_HISTORY weekly to identify oversized warehouses.
- Redshift: switch to Redshift Serverless for variable workloads (per-second billing after 60-second minimum), use elastic resize for predictable diurnal patterns, and adopt provisioned clusters with Reserved Instances for steady 24/7 workloads where the savings (up to 75 percent versus on-demand) outweigh the commitment risk.
- BigQuery: by default fully serverless with on-demand pricing; for steady high-volume workloads, slot reservations stabilize cost and isolate workloads from noisy neighbors. Edition-based pricing (Standard, Enterprise, Enterprise Plus) lets you align features and commitment levels.
- Athena: there is no cluster to size, so cost control comes from workgroup configuration combined with Apache Iceberg and columnar formats; the primary right-sizing lever is data layout, not compute.
Reserved capacity — Snowflake Capacity Units, BigQuery slot reservations, Redshift RIs — yields up to 75 percent savings versus on-demand for stable workloads. The trap is overcommitment: a one-year RI for a workload migrated off-platform after six months locks in waste. Size reservations from 90 days of historical burn and cover only the steady baseline; let on-demand absorb peaks [Source: https://www.revefi.com/blog/data-warehouse-optimization-comparison].
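Sizing a reservation to the steady baseline can be sketched as follows. Picking a low percentile of daily usage as "the baseline" is an assumption of this example, not a rule from the source, and the usage figures, rate, and discount are invented.

```python
from typing import List


def reservation_baseline(daily_usage: List[float], percentile: float = 0.10) -> float:
    """Pick a low percentile of daily usage as the steady load worth reserving."""
    ordered = sorted(daily_usage)
    return ordered[int(percentile * (len(ordered) - 1))]


def blended_cost(
    daily_usage: List[float],
    reserved_units: float,
    on_demand_rate: float,
    reserved_discount: float,
) -> float:
    """Total cost when reserved units are always paid and overflow is on-demand."""
    reserved_rate = on_demand_rate * (1 - reserved_discount)
    total = 0.0
    for used in daily_usage:
        overflow = max(used - reserved_units, 0)
        total += reserved_units * reserved_rate + overflow * on_demand_rate
    return total


usage = [100.0] * 60 + [150.0] * 30  # 90 days of hypothetical daily compute units
baseline = reservation_baseline(usage)
on_demand_only = blended_cost(usage, 0, on_demand_rate=1.0, reserved_discount=0.75)
with_reservation = blended_cost(usage, baseline, on_demand_rate=1.0, reserved_discount=0.75)
print(baseline, on_demand_only, with_reservation)
```

Note that reserved units are charged every day whether used or not, which is exactly the overcommitment risk: if the workload disappears, the reserved term in the sum keeps accruing.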
The FinOps practice that makes this sustainable is per-query tagging mapped to teams. Snowflake QUERY_TAG, BigQuery labels, and AWS resource tags let you allocate spend back to the incurring team — the precondition for any chargeback or showback model. With tags, each team sees its own bill and self-optimizes.
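Once queries carry a team tag, showback is a group-by. The sketch below assumes a query log already exported as dictionaries with a cost_usd field and an optional team field; both field names and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def spend_by_team(query_log: Iterable[Mapping], untagged: str = "untagged") -> Dict[str, float]:
    """Sum query cost per team tag, collecting untagged spend in its own bucket."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in query_log:
        totals[entry.get("team") or untagged] += entry["cost_usd"]
    return dict(totals)


log = [
    {"team": "marketing", "cost_usd": 12.5},
    {"team": "fraud", "cost_usd": 3.0},
    {"cost_usd": 1.5},  # no tag: surfaces as "untagged"
    {"team": "marketing", "cost_usd": 2.5},
]
print(spend_by_team(log))
```

Reporting the untagged bucket explicitly is deliberate: its size is the best single measure of how far tagging discipline still has to go.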
Key Takeaway: Cost optimization is workload shaping to billing models — tier storage by access pattern, reduce scans through partitioning and clustering, right-size compute with auto-suspend and reservations, and tag every query so spend is attributable; combined techniques typically deliver 50-75 percent savings without harming SLA.
Reference Architectures
Reference architectures are the synthesis exercise that pulls together the services and patterns from prior chapters into deployable shapes. A reference architecture is not a product; it is a documented, tested combination of components that solves a class of problems with known trade-offs. The point of having one is that new projects start from a known-good template instead of relitigating fundamental decisions on every greenfield build.
End-to-End Batch and Streaming Reference
The canonical AWS reference architecture for a modern data platform combines the services from Chapters 4-12 into a single end-to-end flow:
Sources (operational DBs, SaaS, events, IoT)
|
v
Ingestion: DMS (CDC) | Kinesis Data Streams (events) | AppFlow (SaaS) | Glue (batch)
|
v
Bronze: S3 raw zone (Parquet, partitioned by event_date), Glue Catalog
|
v
Silver: Glue / EMR transforms -> S3 conformed zone (Iceberg / Delta), Lake Formation governance
|
v
Gold: dbt on Redshift or Snowflake -> star/snowflake schemas, materialized views
|
+--> Analytics: QuickSight (SPICE) <- Redshift / Athena
+--> ML: SageMaker Feature Store (online + offline) <- Glue / Spark
+--> Reverse ETL: Hightouch / Census -> CRM, marketing tools
|
v
Observability: CloudWatch + Datadog metrics, OpenLineage, Monte Carlo data quality
Governance: Lake Formation (RBAC), AWS Glue Data Catalog, Atlan / Collibra for business catalog
Security: IAM roles, KMS encryption, VPC endpoints, audit via CloudTrail
The key design decisions baked into this template:
- Streaming and batch share storage, not just landing zones. Both write to the same S3-backed Iceberg silver tables; downstream consumers see a unified view regardless of whether the data arrived via Kinesis or DMS.
- Schema-on-read at bronze, schema-enforced at silver. Raw zones tolerate sloppy upstream schemas; conformed zones reject anything that does not match the contract.
- Governance flows through Lake Formation, applied uniformly to analytics, ML, and reverse-ETL consumers. There is one access-control surface, not three.
- Observability is cross-cutting, not bolted on per service. Lineage flows through OpenLineage events; data quality through Monte Carlo or Great Expectations; pipeline health through Airflow or Step Functions metadata.
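The bronze-to-silver contract check in the second decision can be sketched as a simple validator that quarantines rejects with a reason. This is a minimal illustration, not a replacement for a schema registry or table-format enforcement; the contract, field names, and sample records are hypothetical.

```python
from typing import Dict, List, Tuple

ORDER_CONTRACT: Dict[str, type] = {  # hypothetical silver-zone contract
    "order_id": str,
    "amount_cents": int,
    "event_date": str,
}


def contract_violations(record: dict, contract: Dict[str, type]) -> List[str]:
    """List every way a record deviates from the contract (empty list means valid)."""
    problems = []
    for field_name, expected in contract.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected):
            problems.append(f"wrong type for {field_name}")
    for extra in sorted(set(record) - set(contract)):
        problems.append(f"unexpected field: {extra}")
    return problems


def promote_to_silver(
    bronze_records: List[dict], contract: Dict[str, type]
) -> Tuple[List[dict], List[Tuple[dict, List[str]]]]:
    """Split bronze records into accepted rows and rejects annotated with reasons."""
    accepted, rejected = [], []
    for record in bronze_records:
        problems = contract_violations(record, contract)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


bronze = [
    {"order_id": "o1", "amount_cents": 1999, "event_date": "2025-01-02"},
    {"order_id": "o2", "amount_cents": "19.99", "event_date": "2025-01-02"},
    {"order_id": "o3", "event_date": "2025-01-02", "coupon": "X"},
]
accepted, rejected = promote_to_silver(bronze, ORDER_CONTRACT)
print(len(accepted), [reasons for _, reasons in rejected])
```

Keeping rejects alongside their reasons, instead of dropping them, preserves the bronze promise that nothing is lost while still keeping silver clean.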
This reference is not the only valid shape — a Snowflake-centric stack, a Databricks lakehouse stack, or a hybrid multi-cloud stack would draw the same diagram with different boxes — but it captures the structural pattern that virtually every production design follows: ingestion, layered storage with progressive curation, transformation, analytics-and-ML branches, and orthogonal governance/observability/security planes.
Figure 13.4: End-to-end batch + streaming reference architecture with parallel BI and ML branches under unified governance.
flowchart TD
subgraph SRC["Sources"]
DB[Operational DBs]
EV[Events / IoT]
SAAS[SaaS]
end
subgraph ING["Ingestion"]
DMS[DMS CDC]
KDS[Kinesis Streams]
AF[AppFlow]
GL[Glue Batch]
end
subgraph LAKE["Layered Lakehouse Storage"]
BR[Bronze: S3 Raw<br/>Parquet partitioned]
SI[Silver: Iceberg / Delta<br/>Conformed Zone]
GO[Gold: Redshift / Snowflake<br/>Star Schemas + MVs]
end
subgraph BI["Analytics Branch"]
QS[QuickSight + SPICE]
ATH[Athena Ad-hoc]
end
subgraph ML["ML Branch"]
FS[Feature Store<br/>Online + Offline]
SM[SageMaker Training<br/>+ Endpoints]
end
subgraph RE["Reverse ETL"]
HT[Hightouch / Census]
end
subgraph CROSS["Cross-Cutting Planes"]
GOV[Lake Formation Governance]
OBS[Observability:<br/>OpenLineage + Monte Carlo]
SEC[Security: IAM + KMS + VPC]
end
DB --> DMS
EV --> KDS
SAAS --> AF
DB --> GL
DMS --> BR
KDS --> BR
AF --> BR
GL --> BR
BR --> SI
SI --> GO
GO --> QS
GO --> ATH
SI --> FS
FS --> SM
GO --> HT
GOV -.-> LAKE
GOV -.-> BI
GOV -.-> ML
OBS -.-> LAKE
SEC -.-> ING
SEC -.-> LAKE
Multi-Tenant Data Platform Pattern
Multi-tenancy adds a second axis to every component above. Whether you are an ISV serving external customers, a platform team serving internal business units, or a public-sector agency serving multiple departments, the engineering question is the same: how do you achieve cost-efficient sharing of infrastructure with airtight isolation of data and workloads?
The three canonical isolation models, with trade-offs:
| Model | Isolation Boundary | Cost Efficiency | Blast Radius | Use Case |
|---|---|---|---|---|
| Pool (shared) | Row-level (tenant_id column) | Highest | Largest | Many small tenants, similar workloads |
| Bridge (silo within shared cluster) | Schema or database | Medium | Medium | Mid-size tenants, regulatory variation |
| Silo (dedicated) | Account or cluster | Lowest | Smallest | Few large tenants, strict compliance |
The pool model uses row-level security (Lake Formation row filters, Redshift RLS, Snowflake row access policies) and session tags that propagate from the calling user through every query. The bridge model gives each tenant its own schema in a shared warehouse, simplifying RLS at the cost of less efficient compute sharing. The silo model gives each tenant its own cluster or even its own AWS account, which is the only model that satisfies the strictest sovereignty and isolation requirements but multiplies infrastructure cost.
Most production multi-tenant platforms end up hybrid: pool for the long tail of small tenants, silo for the handful of enterprise customers whose contracts demand dedicated infrastructure, with a clear migration path between the two as customers grow. The data engineering work is mostly in the propagation layer — making sure that tenant identity flows from authentication through ingestion through transformation through queries through dashboards, and that no code path forgets to apply it.
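In the pool model, the propagation layer reduces to one invariant: no rows are returned unless a tenant identity is present and applied. A minimal fail-closed sketch follows, with a hypothetical session-tag key and in-memory rows standing in for a governed table.

```python
from typing import Dict, List


def apply_row_filter(rows: List[dict], session_tags: Dict[str, str]) -> List[dict]:
    """Return only the caller's rows; refuse outright if no tenant is identified."""
    tenant = session_tags.get("tenant_id")  # assumed tag key
    if not tenant:
        raise PermissionError("no tenant_id in session; refusing to return rows")
    return [row for row in rows if row["tenant_id"] == tenant]


orders = [
    {"tenant_id": "a", "order_id": 1},
    {"tenant_id": "b", "order_id": 2},
    {"tenant_id": "a", "order_id": 3},
]
print(apply_row_filter(orders, {"tenant_id": "a"}))
```

Raising on a missing tag, instead of returning everything or nothing silently, is what makes a forgotten code path show up in testing and not in a customer's dashboard.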
Figure 13.5: Multi-tenant isolation models compared across pool, bridge, and silo patterns.
graph TD
subgraph Pool["Pool Model: Shared Everything"]
P1[Shared Cluster]
P2[Shared Schema]
P3[Tables with tenant_id column]
P4[Row-Level Security + Session Tags]
P1 --> P2 --> P3 --> P4
end
subgraph Bridge["Bridge Model: Schema-Per-Tenant"]
B1[Shared Cluster]
B2[Tenant A Schema]
B3[Tenant B Schema]
B4[Tenant C Schema]
B1 --> B2
B1 --> B3
B1 --> B4
end
subgraph Silo["Silo Model: Dedicated Infrastructure"]
S1[Tenant A Account / Cluster]
S2[Tenant B Account / Cluster]
S3[Tenant C Account / Cluster]
end
Cost[Cost Efficiency:<br/>Pool > Bridge > Silo] -.-> Pool
Iso[Isolation Strength:<br/>Silo > Bridge > Pool] -.-> Silo
Hybrid([Hybrid in Practice:<br/>Pool small tenants<br/>Silo enterprise tenants]) --> Pool
Hybrid --> Silo
Capstone: Design Your Own Pipeline
The final synthesis exercise of this textbook is to design a complete pipeline for a realistic scenario, justifying every component choice against the trade-offs covered in prior chapters. A worked example follows.
Scenario: A growing e-commerce company (50M monthly orders, 5M active customers, 15 markets) needs a unified data platform supporting (a) executive dashboards refreshed hourly, (b) marketing reverse-ETL to seven SaaS tools, (c) a real-time fraud detection model, and (d) data-science self-service for ad-hoc analysis. Annual data platform budget: $1.5M. Compliance: GDPR, PCI-DSS for payment data.
Design:
- Ingestion: DMS for the operational PostgreSQL (CDC to Kinesis Data Streams); Kinesis for clickstream and order events; AppFlow for Salesforce, Stripe, and Zendesk; Glue scheduled jobs for partner CSV drops.
- Bronze: S3 with Intelligent-Tiering, partitioned by `source` and `event_date`, Glue Catalog. Lake Formation tag-based access control with PCI tag on the payment-related columns.
- Silver: Glue Spark jobs for normalization and PII tokenization; Iceberg tables for upsert-heavy entities (customer, product), Hudi for the order event log. Schema enforcement via dbt tests at the bronze->silver boundary.
- Gold: Redshift Serverless for analytical marts (using Redshift Managed Storage, the same storage layer behind RA3 nodes, to keep the lakehouse / warehouse separation thin); dbt for transformations with daily and hourly refreshes; materialized views for executive dashboards.
- BI: QuickSight Enterprise with SPICE for executive dashboards (hourly refresh, per requirement (a) in the scenario), direct-query Redshift for ad-hoc analyst exploration, Q enabled on a curated semantic-layer dataset with hierarchies and synonyms configured.
- ML: SageMaker Feature Store (online + Iceberg offline) for fraud features; Kinesis-fed stream ingestion for real-time card and session features; SageMaker real-time endpoint behind API Gateway for fraud scoring. Training data via Athena ASOF joins.
- Reverse ETL: Hightouch reading Redshift gold marts, syncing to the seven marketing SaaS tools, including Salesforce, Braze, and Iterable.
- Governance: Lake Formation centrally; Atlan as the business catalog; OpenLineage for lineage; Monte Carlo for data observability on the top 50 critical tables.
- Cost controls: S3 lifecycle (Standard 30d -> Intelligent-Tiering 90d -> Glacier IR 1y -> Deep Archive 7y); Redshift Serverless with weekly review of `SYS_QUERY_HISTORY`; Athena workgroup with 1 TB per-query scan limit; QuickSight SPICE capacity allocated per business unit; per-team tags on every query and Glue job.
- Security: KMS CMKs per environment; VPC endpoints for Redshift, S3, Glue; QuickSight VPC connector; IAM roles with least-privilege per workload; CloudTrail to a dedicated logging account.
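The S3 lifecycle in the cost-controls item can be written down as a rule and checked against object age. The sketch below is one reading of that lifecycle, not a definitive configuration: the rule ID and prefix are hypothetical, and treating "7y" as the expiration age is an assumption. The dictionary follows the rule shape used by S3 lifecycle configuration (for example, as passed to boto3's `put_bucket_lifecycle_configuration`), but nothing here calls AWS; `storage_class_at` is a hypothetical helper for reasoning about where an object of a given age would sit.

```python
from __future__ import annotations

from typing import Optional

# One reading of the capstone lifecycle; "7y" as expiration is an assumption.
LIFECYCLE_RULE = {
    "ID": "bronze-tiering",            # hypothetical rule name
    "Filter": {"Prefix": "bronze/"},   # hypothetical prefix
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
    ],
    "Expiration": {"Days": 7 * 365},
}


def storage_class_at(age_days: int, rule: dict = LIFECYCLE_RULE) -> Optional[str]:
    """Return the storage class for an object of this age, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    # Walk transitions in age order and keep the last one the object has passed.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

A helper like this is mainly useful in cost reviews: given an age histogram of the bronze layer, it estimates how many bytes sit in each tier. Actual transitions are applied by S3 on its own schedule, so real objects may move somewhat later than the rule's day counts suggest.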
This design intentionally trades some per-service efficiency for architectural coherence. A Snowflake-centric design might be marginally faster for some queries, and a Databricks-centric design might be marginally cheaper for some ML training runs, but staying largely within AWS-native services reduces integration cost, simplifies IAM, and lets a single platform team operate the whole stack. The capstone exercise is not to find the theoretically optimal stack; it is to make every component choice deliberately, document the trade-off, and design for the team that will actually operate the system at 3 AM on a Tuesday when something breaks.
Key Takeaway: Reference architectures are tested templates, not products; the canonical AWS shape combines layered S3/Iceberg storage with parallel BI and ML branches under unified Lake Formation governance, and the capstone discipline is to pick services deliberately against your team, budget, and compliance constraints rather than chasing per-service local optima.
Chapter Summary
This chapter closed the loop on the data engineering pipeline by connecting the lakehouse and warehouse to the people and models that consume the data. Amazon QuickSight provides a managed BI surface with three layers: connectors and security to data sources, the SPICE in-memory engine that achieves 10-100 ms aggregation latency through columnar compression and scheduled refreshes, and an experience layer including dashboards, embedded analytics, and the natural-language assistant Q. Effective deployment treats SPICE capacity as a finite shared resource and invests heavily in metadata for Q.
ML and analytics convergence is mediated by SageMaker Feature Store, whose paired online (single-digit ms KV) and offline (S3 Parquet on the lakehouse) stores let one feature definition serve both real-time inference and point-in-time-correct training. Ingestion patterns span batch, stream, backfill, and CDC; ASOF joins on EventTime and WriteTime prevent label leakage; generative-BI assistants extend the same lakehouse with curated semantic layers and guardrails.
Cost optimization means shaping workloads to fit billing models. Storage tiering with S3 Intelligent-Tiering and Glacier saves 40-90 percent versus Standard with minimal engineering effort. Query economics (partitioning, clustering, materialized views, and result caching) reduce the bytes scanned, which directly lowers the bill on per-byte-billed services. Right-sizing through auto-suspend, serverless modes, and reserved capacity matches compute to actual demand; per-query tagging makes optimization a distributed practice rather than a central crusade. Combined techniques routinely deliver 50-75 percent savings.
Reference architectures synthesize prior chapters into deployable templates: layered S3 bronze/silver/gold, parallel BI and ML branches, unified Lake Formation governance, and orthogonal observability and security planes. Multi-tenant patterns trade isolation against cost across pool, bridge, and silo models. The capstone discipline is to design deliberately for your team, budget, and compliance posture rather than chasing per-service local optima.
Key Terms
- Amazon QuickSight: AWS’s managed serverless BI service that connects to Redshift, S3, Athena, and other sources to deliver dashboards, embedded analytics, and natural-language analytics through Q.
- SPICE (Super-fast, Parallel, In-memory Calculation Engine): QuickSight’s proprietary columnar in-memory engine that caches data with 4:1-6:1 compression to deliver 10-100 ms aggregation latency, decoupling dashboard performance from source-system load.
- Q in QuickSight: Enterprise-edition natural-language analytics feature that translates user questions into SQL and visualizations; accuracy is determined primarily by the quality of dataset metadata, hierarchies, synonyms, and data-category tags.
- SageMaker: AWS’s managed ML platform covering notebook development, training, model registry, deployment, and monitoring; integrates with the lakehouse through Glue, Athena, Spark, and the Feature Store.
- Feature store: A system that manages ML feature inputs with paired online (low-latency KV for inference) and offline (S3 Parquet for training and analytics) stores, enforcing point-in-time correctness through `EventTime` and `WriteTime` ASOF joins.
- Reference architecture: A documented, tested combination of components that solves a class of problems with known trade-offs; the canonical AWS data platform pattern combines ingestion, layered S3/Iceberg storage, transformation, parallel BI and ML branches, and unified governance.
- Cost optimization: The practice of shaping workloads to billing models through storage tiering, scan reduction, right-sizing, reserved capacity, and per-query tagging; combined techniques typically deliver 50-75 percent savings without harming SLA.
- Right-sizing: Matching provisioned compute to actual workload demand through auto-suspend, serverless modes, elastic resize, and reservation-based commitments scaled from historical burn data; per-platform levers include Snowflake warehouse sizing, Redshift Serverless or reserved nodes, and BigQuery slot reservations.