Study Guide: Chapter 2 — The Lakehouse Paradigm

Pre-Quiz: Warehouse vs Lake vs Lakehouse

1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?

A) Warehouses are cheaper per gigabyte than transactional databases B) OLTP databases are tuned for short, row-oriented transactions and suffer when long OLAP scans lock tables and blow out caches C) Warehouses can ingest unstructured video and JSON, which OLTP databases cannot D) Warehouses use schema-on-read, allowing analysts to skip schema design

2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?

A) Lakes provided stronger ACID guarantees than warehouses B) Warehouses had finally added native support for video and image data C) Lakes accepted any format on cheap object storage and applied schema only at read time, making them ideal for unstructured data and ML training D) Lakes used proprietary columnar formats faster than warehouse storage

3. An engineer notices that two pipelines writing Parquet files to the same S3 directory are producing duplicate rows and partial files. The team starts calling the lake a "data swamp." What core capability is missing that would have prevented this?

A) Columnar compression B) ACID transaction guarantees on writes to the shared storage layer C) A faster network between S3 and the compute cluster D) An OLTP database fronting S3

4. Which statement best captures the central insight of the lakehouse paradigm?

A) Replace S3 with a proprietary columnar store to gain warehouse performance B) Add a transaction layer (Delta Lake, Iceberg, or Hudi) on top of Parquet on object storage to gain warehouse-grade ACID and schema enforcement at lake economics C) Run both a warehouse and a lake permanently and copy data between them with ETL D) Store all data in a single OLTP database that does both transactions and analytics

5. An architect must choose a storage substrate for a workload requiring full ACID, fast MERGE/UPDATE/DELETE, and engine interoperability across Spark, Athena, and Redshift. Based on the comparison table in the chapter, which option fits best?

A) A pure data lake on S3 with raw Parquet only B) A traditional proprietary data warehouse with coupled storage and compute C) A lakehouse using object storage plus an open table format such as Iceberg or Delta Lake D) An OLTP database with a star schema added on top

1. Warehouse vs Lake vs Lakehouse

Key Points

Warehouses exist because OLTP databases cannot safely host long analytical scans — analytics is a read-heavy, scan-wide workload that locks tables in OLTP systems.
Warehouses deliver ACID and SQL but at proprietary-storage cost, schema-on-write rigidity, and structured-data-only reach.
Data lakes (S3 / ADLS / GCS) gave cheap storage and schema-on-read flexibility, but their lack of ACID transactions led to "data swamps."
Three open table formats — Delta Lake, Apache Iceberg, Apache Hudi — provide ACID, time travel, and engine interoperability on top of Parquet.
The lakehouse keeps the lake's substrate and adds a transaction layer — warehouse reliability at lake economics, with one copy of data instead of two parallel stacks.

OLTP vs OLAP: why warehouses exist

Operational systems power transactions: each touches a few rows, completes in milliseconds, and demands ACID so two customers don't book the same airline seat. Analytics is the inverse — scanning millions of rows across a few columns. Running that scan on the OLTP database would lock tables and degrade the customer experience, so organizations built the data warehouse: a separate, OLAP-optimized store with proprietary columnar formats and star/snowflake schemas tuned for BI.

Warehouses excel at structured, repeatable reporting, but pay three costs: dollars (proprietary storage is expensive), rigidity (schema-on-write — every column must be defined before loading), and narrowness (structured tabular data only — no images, JSON blobs, or sensor logs).

Data lake origin story

Around 2010, three forces collided: storage got dramatically cheaper (S3 commoditized durable bytes), new data sources exploded (mobile clickstreams, IoT, JSON events), and ML practitioners needed raw, unsampled data. The data lake was the answer — dump everything into S3, ADLS, or GCS, and worry about structure later. This is schema-on-read: define a schema when you query, not when you write.

The economics were transformative — pennies per gigabyte versus dollars in proprietary storage. But lakes lacked ACID guarantees: two pipelines writing the same Parquet directory could produce partial files, duplicates, or corrupt partitions. There was no atomic update, no easy GDPR delete. Lakes that accumulated undocumented, inconsistent data became known as data swamps.

Analogy: a warehouse is a bonded archive where every box is labeled before entry; a lake is a self-storage unit where you can throw anything in any order — and the unit next door might leak into yours overnight.

Why lakehouses emerged

By the late 2010s, organizations were running two parallel stacks: a lake for ML and exploration, and a warehouse for BI, with brittle ETL between them. This duplicated storage and let definitions drift ("revenue" meant different things in each system).

The lakehouse keeps the cheap object-storage substrate and open Parquet format, but adds a transaction layer that gives ACID semantics, schema enforcement, time travel, and fast row-level updates. Three open table formats now dominate this layer: Delta Lake (Databricks), Apache Iceberg (Netflix; used by Snowflake and AWS), and Apache Hudi (Uber). All three provide optimistic concurrency control, snapshot isolation, and engine interoperability.

Figure 2.1: Evolution from warehouse to lake to lakehouse

timeline title Evolution of Analytical Data Architectures 1990s : Data Warehouse : Proprietary columnar storage : Schema-on-write, ACID, BI only 2010s : Data Lake : Cheap object storage (S3/ADLS/GCS) : Schema-on-read, ML-friendly : No ACID, risk of data swamps 2020s : Data Lakehouse : Object storage + transaction layer : Delta Lake / Iceberg / Hudi : ACID + open formats + plural engines

Sequential reveal: 1990s warehouse, 2010s lake, 2020s lakehouse with ACID + open formats.

Comparison across architectures

Dimension	Warehouse	Lake	Lakehouse
Storage substrate	Proprietary, expensive	S3 / ADLS / GCS, cheap	Object storage + transaction layer
Data formats	Tabular only	Any format	Any; tabular wrapped in open table format
Schema model	Schema-on-write (strict)	Schema-on-read (loose)	Schema-on-write enforced via metastore
ACID	Native, full	Minimal or none	Full ACID via Delta / Iceberg / Hudi
Update / delete	Full DML	Often rewrite partition	Fast MERGE / UPDATE / DELETE
Workloads	OLAP, BI	Exploration, ML	BI + SQL + ML + streaming, unified
Cost profile	High (coupled)	Lowest (storage only)	Low storage, optimized compute

Post-Quiz: Warehouse vs Lake vs Lakehouse

1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?

2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?

A) Columnar compression B) ACID transaction guarantees on writes to the shared storage layer C) A faster network between S3 and the compute cluster D) An OLTP database fronting S3

4. Which statement best captures the central insight of the lakehouse paradigm?

Pre-Quiz: Lakehouse Architecture and SageMaker Lakehouse

1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?

A) It physically stores the Parquet bytes that hold table data B) It is the query engine that executes SQL against tables C) It turns directories of Parquet files into named, schema'd, transactional tables and tracks snapshots, partitions, and ACLs D) It compresses data in flight between engines

2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?

A) The metastore atomically swaps the snapshot pointer that lists which files constitute the current table state B) The metastore rewrites every Parquet file on every commit C) S3 itself supports cross-object multi-row transactions D) The query engine holds a global table lock for the duration of every read

3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?

A) A nightly job copies data from S3 into Redshift Managed Storage so dashboards can query it B) A Spark job in EMR finishes ETL and shuts down; an analyst then queries the same Iceberg tables with Athena, paying only for bytes scanned, with no copies made C) Each engine maintains its own copy of every table to avoid lock contention D) All queries must run on a single permanent Spark cluster shared by everyone

4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?

A) AWS forces all tables into a proprietary Redshift binary format B) AWS Glue Data Catalog acts as the unified metastore across S3 and Redshift, while Apache Iceberg is the open, engine-neutral table format any compatible engine can read C) Every query is routed through a single Athena gateway D) Customers must replicate every table into all three storage locations

5. A retail company has orders in Redshift, clickstream JSON in S3, and customer master data in Salesforce. They want a unified "customer lifetime value" feature for an ML model. Why is the SageMaker Lakehouse approach materially better than the pre-lakehouse alternative?

A) Pre-lakehouse, the same single-engine SQL query already could join in-place across Redshift, S3, and Salesforce, so the difference is mostly cosmetic B) Lakehouse joins are slower but require less storage C) A single Spark job in Unified Studio joins all three sources in place via the Glue catalog and writes an Iceberg table immediately queryable by Athena and Redshift — eliminating multiple ETL pipelines and duplicated copies D) Lakehouses remove the need for any access controls between engines

2. Lakehouse Architecture and SageMaker Lakehouse

Key Points

Every lakehouse decomposes into three layers: object storage at the bottom, a metastore (catalog) in the middle, and plural query engines on top.
Apache Parquet's columnar layout gives lakehouses warehouse-class scan performance — read only the columns you query.
The metastore is the keystone — it converts Parquet directories into transactional tables via atomic snapshot pointer swaps.
SageMaker Lakehouse (re:Invent 2024, GA) uses Apache Iceberg + AWS Glue Data Catalog and presents a single copy of data across S3, Redshift Managed Storage, and federated SaaS sources via zero-ETL.
Fine-grained access (row/column rules) is enforced consistently across Athena, Redshift, and EMR through SageMaker Data and AI Governance on Amazon DataZone.

Object storage as the foundation

Every lakehouse rests on a cloud object store: Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. These provide effectively infinite, eleven-nines-durable, pay-per-byte storage that decouples completely from compute. Files are typically Apache Parquet — a columnar format that compresses well, supports predicate pushdown, and is readable by virtually every analytic engine.

Analogy: object storage is like a city's water supply. Every restaurant draws from the same pipes; each cooks its own cuisine. The utility doesn't care what you cook — it guarantees clean water at scale and low cost.

Metastore and catalog layer

Raw Parquet files in S3 are just files. The metastore turns them into tables — things with names, schemas, partitions, statistics, and access control. When you commit a write, the metastore atomically swaps the snapshot pointer; that is what gives lakehouses ACID guarantees on top of immutable object-store files.

Production lakehouses use one of: Unity Catalog (Databricks), AWS Glue Data Catalog, Apache Nessie (Git-like branching), or the older Hive Metastore. Without a central catalog, every engine maintains its own picture of the data and they drift apart.

Query engines accessing shared data

The top layer is whatever compute you bring: Apache Spark for ETL/ML, Trino/Presto for interactive SQL, Amazon Athena for serverless ad-hoc, Amazon Redshift for BI, EMR/Glue for managed Spark/Flink, Databricks Photon, and DuckDB for local analytics. The architectural shift: storage is shared, compute is plural and disposable — no copies, no ETL between engines.

Figure 2.2: Three-layer lakehouse architecture

flowchart TD subgraph Engines["Query Engines (plural, disposable)"] Spark["Apache Spark"] Trino["Trino / Presto"] Athena["Amazon Athena"] Redshift["Amazon Redshift"] EMR["EMR / Glue"] Duck["DuckDB"] end subgraph Catalog["Metastore / Catalog Layer"] Meta["Unity Catalog / Glue Data Catalog / Nessie
schemas, snapshots, partitions, ACLs"] end subgraph Storage["Object Storage Foundation"] S3["Amazon S3 / ADLS / GCS
Parquet files (open, columnar)"] end Spark --> Meta Trino --> Meta Athena --> Meta Redshift --> Meta EMR --> Meta Duck --> Meta Meta --> S3

AWS SageMaker Lakehouse

Announced at AWS re:Invent 2024 and now generally available, Amazon SageMaker Lakehouse presents a single copy of data spanning S3 lakes, Redshift Managed Storage, and federated SaaS sources (Salesforce, SAP, ServiceNow, Zendesk, DynamoDB, Snowflake). A Spark job in EMR can read a Redshift table in place — without unloading it to S3 — and a Redshift query can join that table with an S3-resident Iceberg table in the same SQL statement. Zero-ETL integrations pull from SaaS sources directly into the lakehouse.

Built on Apache Iceberg as the open table format and AWS Glue Data Catalog as the unified metastore. Fine-grained, identity-based access (row/column rules) is enforced consistently across Athena, Redshift, and EMR through SageMaker Data and AI Governance on Amazon DataZone. SageMaker Unified Studio bundles EMR Studio, Glue Studio, the Redshift Query Editor, SageMaker Studio, and Bedrock tools into one workspace.

Figure 2.3: SageMaker Lakehouse unified access

flowchart LR subgraph Sources["Data Sources"] S3["S3 Data Lake
(Iceberg / Parquet)"] RMS["Redshift Managed Storage
(warehouse tables)"] SaaS["SaaS / Federated
Salesforce, SAP,
ServiceNow, Zendesk,
DynamoDB, Snowflake"] end Glue["AWS Glue Data Catalog
(unified metastore)"] subgraph Studio["SageMaker Unified Studio"] EMR["EMR / Spark"] Athena["Athena"] RS["Redshift Query Editor"] SM["SageMaker Studio (ML)"] end S3 -- "in-place" --> Glue RMS -- "in-place" --> Glue SaaS -- "zero-ETL" --> Glue Glue --> EMR Glue --> Athena Glue --> RS Glue --> SM

Fan-in: sources illuminate, packets flow to the Glue catalog, the catalog glows, then plural engines light up.

SageMaker Lakehouse to generic-layer mapping

Generic Layer	SageMaker Lakehouse
Object storage	Amazon S3 + Redshift Managed Storage
Open table format	Apache Iceberg
Metastore / catalog	AWS Glue Data Catalog
Query engines	EMR, Glue, Redshift, Athena, Spark
Governance	SageMaker Data and AI Governance (DataZone)
User workspace	SageMaker Unified Studio

Post-Quiz: Lakehouse Architecture and SageMaker Lakehouse

1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?

2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?

3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?

4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?

Pre-Quiz: Medallion Architecture

1. Which statement best describes the Bronze layer's purpose in medallion architecture?

A) Bronze contains business-aggregated KPIs ready for executive dashboards B) Bronze is the deduplicated, schema-enforced, record-level layer for analytics engineers C) Bronze is the append-only, raw single-source-of-truth landing zone preserving full payload and ingestion timestamps for replay and audit D) Bronze is the OLTP database that powers production checkout

2. An e-commerce Bronze table contains duplicates from client retries, malformed addresses, and currency codes in mixed cases ("USD", "Usd", "usd"). Where in the medallion architecture should this be cleaned up, and what changes happen there?

A) Gold layer — by aggregating the bad rows together so they cancel out B) Silver layer — deduplication on order ID + timestamp, rejecting rows with invalid postal codes, normalizing currency to upper-case, joining the customer dimension, and enforcing a strict schema with non-nullable order ID C) Bronze layer — by overwriting the raw source data in place D) The OLTP source — by changing the production database before ingestion

3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?

A) It guarantees Bronze tables run faster than Silver or Gold tables B) When the definition of a metric like "active customer" changes, you can change the Silver/Gold logic and rerun to get a consistent rebuild instead of hand-patching every downstream table C) It allows Bronze to be deleted at any time without consequence D) It lets the OLTP source be queried directly for BI dashboards

4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?

A) Auditors investigating "what did our data look like on March 14?" B) Executives reading a daily revenue-by-region dashboard C) Data scientists and analytics engineers building features and models who need fine-grained data without source-system inconsistencies D) The web checkout API serving live customers

5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?

A) Gold tables are aggregated, dimensional, business-defined, and optimized for query performance — exactly what an executive dashboard needs, while Silver remains record-level and Bronze is raw B) Bronze always pre-computes all aggregates by default, so Gold would duplicate work C) Silver tables are append-only and cannot be partitioned for performance D) Gold tables are stored outside the lakehouse in a different system

3. Medallion Architecture

Key Points

Medallion architecture organizes data into three quality tiers — Bronze (raw), Silver (cleaned and validated), Gold (business-ready aggregates).
Bronze is append-only and immutable: full payload, ingestion timestamp, source identifier — never deduplicate or transform business meaning here.
Silver enforces strict schemas after cleansing, deduplication, normalization, and reference-data joins; it remains record-level (one row per real-world event).
Gold is aggregated and dimensional — partitioned, clustered, often pre-joined for BI dashboards and ML feature tables.
Each layer is rebuildable from Bronze, so when a definition changes you rerun rather than hand-patch every downstream table.

Bronze: raw landing zone

Bronze is the single source of truth for ingested data. Tables are append-only, immutable, and as close to the source format as possible. Schema is applied loosely (schema-on-read) so ingestion stays robust against upstream changes. Bronze captures the full payload, an ingestion timestamp, a source-system identifier, and optional metadata such as file path or Kafka offset.

Bronze deliberately does not deduplicate, join, or drop columns. Because everything is append-only and timestamped, you can replay any downstream pipeline against any historical snapshot. Analogy: Bronze is a hospital's intake record — you write down everything the patient says, contradictions and all. Diagnosis happens later.

Silver: cleaned and conformed

Silver is where data becomes trustworthy. Transformations include: cleansing (repair malformed rows), deduplication (one record per business event), normalization (standard units, dates, country codes), enrichment via joins with reference data, and schema enforcement (every Silver table has a strict, documented schema; non-conforming writes fail).

Crucially, Silver stays non-aggregated: each row is a real-world event or entity (an order, a session, a sensor reading). This makes Silver the natural home for data scientists and analytics engineers.

Gold: business-ready aggregates

Gold materializes business meaning. Tables are aggregated, dimensional, or feature-engineered for direct consumption by BI dashboards, executive reports, and ML models. They are optimized for query performance — partitioned, clustered, and often pre-joined.

Typical Gold tables: daily/hourly aggregates ("revenue by region by day"), star-schema dimensional models, KPIs aligned to business definitions, and ML feature tables ready for training and inference.

Worked example: e-commerce orders

Bronze (`orders_raw`): every API payload, including duplicates from client retries, malformed addresses, mixed-case currency codes ("USD", "Usd", "usd").
Silver (`orders`): deduplicate on order ID + timestamp; reject rows with invalid postal codes (sent to a quarantine table); upper-case all currency codes; join the customer dimension to add region; enforce a strict schema with non-nullable order ID and amount.
Gold (`daily_revenue_by_region`): aggregate the Silver `orders` table grouped by region and order date — powers the executive dashboard, refreshes nightly, fully rebuildable from Bronze if the definition changes.

Figure 2.4: Medallion architecture flow

flowchart LR Source["Source Systems
APIs, Kafka, files, CDC"] subgraph Bronze["Bronze Layer"] B["Raw, append-only
Schema-on-read
Full payload + ingest timestamp"] end subgraph Silver["Silver Layer"] S["Cleansed, deduped, normalized
Schema-enforced
Record-level events"] end subgraph Gold["Gold Layer"] G["Aggregates, dimensional models
ML feature tables
Business definitions"] end Source --> B B --> S S --> G B -. "auditors / replay" .-> Auditor(["Auditor"]) S -. "data scientists" .-> DS(["Data Scientist"]) G -. "BI / executives / ML" .-> BI(["Dashboard / ML"]) G -. "rebuildable from Bronze" .-> B

Bronze, Silver, then Gold reveal in turn; the green dashed arrow shows that any layer can be rebuilt from Bronze.

Layer comparison

Layer	Purpose	Schema	Transformations	Typical Consumers
Bronze	Raw single source of truth	Schema-on-read, loose	Ingestion only; append-only	Pipeline operators, auditors, replay jobs
Silver	Cleaned, validated record-level data	Schema-enforced, strict	Cleanse, dedupe, normalize, join, validate	Data scientists, analytics engineers
Gold	Business-ready aggregates and models	Schema-enforced, dimensional	Aggregate, curate, feature-engineer	BI dashboards, executives, ML inference

A common deployment shape is a hub-and-spoke arrangement: a central Data Hub owns canonical Bronze/Silver/Gold tables for cross-organization concepts (customer, employee, product), and each domain (Sales, Marketing, Finance) maintains its own Bronze/Silver/Gold extensions that join to the hub. This avoids both central bottlenecks and uncoordinated domain duplication.

Post-Quiz: Medallion Architecture

1. Which statement best describes the Bronze layer's purpose in medallion architecture?

3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?

4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?

5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?

Chapter 2: The Lakehouse Paradigm

Learning Objectives

1. Warehouse vs Lake vs Lakehouse

Key Points

OLTP vs OLAP: why warehouses exist

Data lake origin story

Why lakehouses emerged

Figure 2.1: Evolution from warehouse to lake to lakehouse

Comparison across architectures

2. Lakehouse Architecture and SageMaker Lakehouse

Key Points

Object storage as the foundation

Metastore and catalog layer

Query engines accessing shared data

Figure 2.2: Three-layer lakehouse architecture

AWS SageMaker Lakehouse

Figure 2.3: SageMaker Lakehouse unified access

SageMaker Lakehouse to generic-layer mapping

3. Medallion Architecture

Key Points

Bronze: raw landing zone

Silver: cleaned and conformed

Gold: business-ready aggregates

Worked example: e-commerce orders

Figure 2.4: Medallion architecture flow

Layer comparison

Your Progress

Answer Explanations