Chapter 2: The Lakehouse Paradigm

Learning Objectives

Pre-Quiz: Warehouse vs Lake vs Lakehouse

1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?

2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?

3. An engineer notices that two pipelines writing Parquet files to the same S3 directory are producing duplicate rows and partial files. The team starts calling the lake a "data swamp." What core capability is missing that would have prevented this?

4. Which statement best captures the central insight of the lakehouse paradigm?

5. An architect must choose a storage substrate for a workload requiring full ACID, fast MERGE/UPDATE/DELETE, and engine interoperability across Spark, Athena, and Redshift. Based on the comparison table in the chapter, which option fits best?

1. Warehouse vs Lake vs Lakehouse

Key Points

OLTP vs OLAP: why warehouses exist

Operational systems power transactions: each touches a few rows, completes in milliseconds, and demands ACID so two customers don't book the same airline seat. Analytics is the inverse — scanning millions of rows across a few columns. Running that scan on the OLTP database would lock tables and degrade the customer experience, so organizations built the data warehouse: a separate, OLAP-optimized store with proprietary columnar formats and star/snowflake schemas tuned for BI.

Warehouses excel at structured, repeatable reporting, but pay three costs: dollars (proprietary storage is expensive), rigidity (schema-on-write — every column must be defined before loading), and narrowness (structured tabular data only — no images, JSON blobs, or sensor logs).

Data lake origin story

Around 2010, three forces collided: storage got dramatically cheaper (S3 commoditized durable bytes), new data sources exploded (mobile clickstreams, IoT, JSON events), and ML practitioners needed raw, unsampled data. The data lake was the answer — dump everything into S3, ADLS, or GCS, and worry about structure later. This is schema-on-read: define a schema when you query, not when you write.

The economics were transformative — pennies per gigabyte versus dollars in proprietary storage. But lakes lacked ACID guarantees: two pipelines writing the same Parquet directory could produce partial files, duplicates, or corrupt partitions. There was no atomic update, no easy GDPR delete. Lakes that accumulated undocumented, inconsistent data became known as data swamps.

Analogy: a warehouse is a bonded archive where every box is labeled before entry; a lake is a self-storage unit where you can throw anything in any order — and the unit next door might leak into yours overnight.

Why lakehouses emerged

By the late 2010s, organizations were running two parallel stacks: a lake for ML and exploration, and a warehouse for BI, with brittle ETL between them. This duplicated storage and let definitions drift ("revenue" meant different things in each system).

The lakehouse keeps the cheap object-storage substrate and open Parquet format, but adds a transaction layer that gives ACID semantics, schema enforcement, time travel, and fast row-level updates. Three open table formats now dominate this layer: Delta Lake (Databricks), Apache Iceberg (Netflix; used by Snowflake and AWS), and Apache Hudi (Uber). All three provide optimistic concurrency control, snapshot isolation, and engine interoperability.

Figure 2.1: Evolution from warehouse to lake to lakehouse

timeline title Evolution of Analytical Data Architectures 1990s : Data Warehouse : Proprietary columnar storage : Schema-on-write, ACID, BI only 2010s : Data Lake : Cheap object storage (S3/ADLS/GCS) : Schema-on-read, ML-friendly : No ACID, risk of data swamps 2020s : Data Lakehouse : Object storage + transaction layer : Delta Lake / Iceberg / Hudi : ACID + open formats + plural engines
1990s Data Warehouse Schema-on-write ACID, BI only 2010s Data Lake Schema-on-read No ACID, risk of swamps 2020s Lakehouse ACID + open formats Plural engines, one copy From Warehouse to Lake to Lakehouse Each era added a capability while keeping (or returning to) cheap storage
Sequential reveal: 1990s warehouse, 2010s lake, 2020s lakehouse with ACID + open formats.

Comparison across architectures

DimensionWarehouseLakeLakehouse
Storage substrateProprietary, expensiveS3 / ADLS / GCS, cheapObject storage + transaction layer
Data formatsTabular onlyAny formatAny; tabular wrapped in open table format
Schema modelSchema-on-write (strict)Schema-on-read (loose)Schema-on-write enforced via metastore
ACIDNative, fullMinimal or noneFull ACID via Delta / Iceberg / Hudi
Update / deleteFull DMLOften rewrite partitionFast MERGE / UPDATE / DELETE
WorkloadsOLAP, BIExploration, MLBI + SQL + ML + streaming, unified
Cost profileHigh (coupled)Lowest (storage only)Low storage, optimized compute
Post-Quiz: Warehouse vs Lake vs Lakehouse

1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?

2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?

3. An engineer notices that two pipelines writing Parquet files to the same S3 directory are producing duplicate rows and partial files. The team starts calling the lake a "data swamp." What core capability is missing that would have prevented this?

4. Which statement best captures the central insight of the lakehouse paradigm?

5. An architect must choose a storage substrate for a workload requiring full ACID, fast MERGE/UPDATE/DELETE, and engine interoperability across Spark, Athena, and Redshift. Based on the comparison table in the chapter, which option fits best?

Pre-Quiz: Lakehouse Architecture and SageMaker Lakehouse

1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?

2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?

3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?

4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?

5. A retail company has orders in Redshift, clickstream JSON in S3, and customer master data in Salesforce. They want a unified "customer lifetime value" feature for an ML model. Why is the SageMaker Lakehouse approach materially better than the pre-lakehouse alternative?

2. Lakehouse Architecture and SageMaker Lakehouse

Key Points

Object storage as the foundation

Every lakehouse rests on a cloud object store: Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. These provide effectively infinite, eleven-nines-durable, pay-per-byte storage that decouples completely from compute. Files are typically Apache Parquet — a columnar format that compresses well, supports predicate pushdown, and is readable by virtually every analytic engine.

Analogy: object storage is like a city's water supply. Every restaurant draws from the same pipes; each cooks its own cuisine. The utility doesn't care what you cook — it guarantees clean water at scale and low cost.

Metastore and catalog layer

Raw Parquet files in S3 are just files. The metastore turns them into tables — things with names, schemas, partitions, statistics, and access control. When you commit a write, the metastore atomically swaps the snapshot pointer; that is what gives lakehouses ACID guarantees on top of immutable object-store files.

Production lakehouses use one of: Unity Catalog (Databricks), AWS Glue Data Catalog, Apache Nessie (Git-like branching), or the older Hive Metastore. Without a central catalog, every engine maintains its own picture of the data and they drift apart.

Query engines accessing shared data

The top layer is whatever compute you bring: Apache Spark for ETL/ML, Trino/Presto for interactive SQL, Amazon Athena for serverless ad-hoc, Amazon Redshift for BI, EMR/Glue for managed Spark/Flink, Databricks Photon, and DuckDB for local analytics. The architectural shift: storage is shared, compute is plural and disposable — no copies, no ETL between engines.

Figure 2.2: Three-layer lakehouse architecture

flowchart TD subgraph Engines["Query Engines (plural, disposable)"] Spark["Apache Spark"] Trino["Trino / Presto"] Athena["Amazon Athena"] Redshift["Amazon Redshift"] EMR["EMR / Glue"] Duck["DuckDB"] end subgraph Catalog["Metastore / Catalog Layer"] Meta["Unity Catalog / Glue Data Catalog / Nessie
schemas, snapshots, partitions, ACLs"] end subgraph Storage["Object Storage Foundation"] S3["Amazon S3 / ADLS / GCS
Parquet files (open, columnar)"] end Spark --> Meta Trino --> Meta Athena --> Meta Redshift --> Meta EMR --> Meta Duck --> Meta Meta --> S3

AWS SageMaker Lakehouse

Announced at AWS re:Invent 2024 and now generally available, Amazon SageMaker Lakehouse presents a single copy of data spanning S3 lakes, Redshift Managed Storage, and federated SaaS sources (Salesforce, SAP, ServiceNow, Zendesk, DynamoDB, Snowflake). A Spark job in EMR can read a Redshift table in place — without unloading it to S3 — and a Redshift query can join that table with an S3-resident Iceberg table in the same SQL statement. Zero-ETL integrations pull from SaaS sources directly into the lakehouse.

Built on Apache Iceberg as the open table format and AWS Glue Data Catalog as the unified metastore. Fine-grained, identity-based access (row/column rules) is enforced consistently across Athena, Redshift, and EMR through SageMaker Data and AI Governance on Amazon DataZone. SageMaker Unified Studio bundles EMR Studio, Glue Studio, the Redshift Query Editor, SageMaker Studio, and Bedrock tools into one workspace.

Figure 2.3: SageMaker Lakehouse unified access

flowchart LR subgraph Sources["Data Sources"] S3["S3 Data Lake
(Iceberg / Parquet)"] RMS["Redshift Managed Storage
(warehouse tables)"] SaaS["SaaS / Federated
Salesforce, SAP,
ServiceNow, Zendesk,
DynamoDB, Snowflake"] end Glue["AWS Glue Data Catalog
(unified metastore)"] subgraph Studio["SageMaker Unified Studio"] EMR["EMR / Spark"] Athena["Athena"] RS["Redshift Query Editor"] SM["SageMaker Studio (ML)"] end S3 -- "in-place" --> Glue RMS -- "in-place" --> Glue SaaS -- "zero-ETL" --> Glue Glue --> EMR Glue --> Athena Glue --> RS Glue --> SM
SageMaker Lakehouse: One Catalog, One Copy of Data S3 Data Lake Iceberg / Parquet Redshift Managed warehouse tables SaaS / Federated SFDC, SAP, Zendesk AWS Glue Catalog Apache Iceberg single copy of data EMR / Spark Athena Redshift SageMaker Studio
Fan-in: sources illuminate, packets flow to the Glue catalog, the catalog glows, then plural engines light up.

SageMaker Lakehouse to generic-layer mapping

Generic LayerSageMaker Lakehouse
Object storageAmazon S3 + Redshift Managed Storage
Open table formatApache Iceberg
Metastore / catalogAWS Glue Data Catalog
Query enginesEMR, Glue, Redshift, Athena, Spark
GovernanceSageMaker Data and AI Governance (DataZone)
User workspaceSageMaker Unified Studio
Post-Quiz: Lakehouse Architecture and SageMaker Lakehouse

1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?

2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?

3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?

4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?

5. A retail company has orders in Redshift, clickstream JSON in S3, and customer master data in Salesforce. They want a unified "customer lifetime value" feature for an ML model. Why is the SageMaker Lakehouse approach materially better than the pre-lakehouse alternative?

Pre-Quiz: Medallion Architecture

1. Which statement best describes the Bronze layer's purpose in medallion architecture?

2. An e-commerce Bronze table contains duplicates from client retries, malformed addresses, and currency codes in mixed cases ("USD", "Usd", "usd"). Where in the medallion architecture should this be cleaned up, and what changes happen there?

3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?

4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?

5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?

3. Medallion Architecture

Key Points

Bronze: raw landing zone

Bronze is the single source of truth for ingested data. Tables are append-only, immutable, and as close to the source format as possible. Schema is applied loosely (schema-on-read) so ingestion stays robust against upstream changes. Bronze captures the full payload, an ingestion timestamp, a source-system identifier, and optional metadata such as file path or Kafka offset.

Bronze deliberately does not deduplicate, join, or drop columns. Because everything is append-only and timestamped, you can replay any downstream pipeline against any historical snapshot. Analogy: Bronze is a hospital's intake record — you write down everything the patient says, contradictions and all. Diagnosis happens later.

Silver: cleaned and conformed

Silver is where data becomes trustworthy. Transformations include: cleansing (repair malformed rows), deduplication (one record per business event), normalization (standard units, dates, country codes), enrichment via joins with reference data, and schema enforcement (every Silver table has a strict, documented schema; non-conforming writes fail).

Crucially, Silver stays non-aggregated: each row is a real-world event or entity (an order, a session, a sensor reading). This makes Silver the natural home for data scientists and analytics engineers.

Gold: business-ready aggregates

Gold materializes business meaning. Tables are aggregated, dimensional, or feature-engineered for direct consumption by BI dashboards, executive reports, and ML models. They are optimized for query performance — partitioned, clustered, and often pre-joined.

Typical Gold tables: daily/hourly aggregates ("revenue by region by day"), star-schema dimensional models, KPIs aligned to business definitions, and ML feature tables ready for training and inference.

Worked example: e-commerce orders

Figure 2.4: Medallion architecture flow

flowchart LR Source["Source Systems
APIs, Kafka, files, CDC"] subgraph Bronze["Bronze Layer"] B["Raw, append-only
Schema-on-read
Full payload + ingest timestamp"] end subgraph Silver["Silver Layer"] S["Cleansed, deduped, normalized
Schema-enforced
Record-level events"] end subgraph Gold["Gold Layer"] G["Aggregates, dimensional models
ML feature tables
Business definitions"] end Source --> B B --> S S --> G B -. "auditors / replay" .-> Auditor(["Auditor"]) S -. "data scientists" .-> DS(["Data Scientist"]) G -. "BI / executives / ML" .-> BI(["Dashboard / ML"]) G -. "rebuildable from Bronze" .-> B
Medallion: Bronze → Silver → Gold Each layer rebuildable from Bronze B Bronze - Raw, append-only - Full payload - Ingestion timestamp - Schema-on-read S Silver - Cleansed, deduped - Normalized, joined - Schema-enforced - Record-level events G Gold - Aggregates, KPIs - Dimensional models - ML feature tables - Business-defined Auditors / Replay Data Scientists BI / Execs / ML rebuildable from Bronze
Bronze, Silver, then Gold reveal in turn; the green dashed arrow shows that any layer can be rebuilt from Bronze.

Layer comparison

LayerPurposeSchemaTransformationsTypical Consumers
BronzeRaw single source of truthSchema-on-read, looseIngestion only; append-onlyPipeline operators, auditors, replay jobs
SilverCleaned, validated record-level dataSchema-enforced, strictCleanse, dedupe, normalize, join, validateData scientists, analytics engineers
GoldBusiness-ready aggregates and modelsSchema-enforced, dimensionalAggregate, curate, feature-engineerBI dashboards, executives, ML inference

A common deployment shape is a hub-and-spoke arrangement: a central Data Hub owns canonical Bronze/Silver/Gold tables for cross-organization concepts (customer, employee, product), and each domain (Sales, Marketing, Finance) maintains its own Bronze/Silver/Gold extensions that join to the hub. This avoids both central bottlenecks and uncoordinated domain duplication.

Post-Quiz: Medallion Architecture

1. Which statement best describes the Bronze layer's purpose in medallion architecture?

2. An e-commerce Bronze table contains duplicates from client retries, malformed addresses, and currency codes in mixed cases ("USD", "Usd", "usd"). Where in the medallion architecture should this be cleaned up, and what changes happen there?

3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?

4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?

5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?

Your Progress

Answer Explanations