Contrast data warehouses, data lakes, and lakehouses across schema, cost, and workload axes
Explain how SageMaker Lakehouse and similar platforms unify access across S3 and Redshift
Identify when a lakehouse is the right choice versus a pure warehouse or pure lake
Describe the medallion (bronze/silver/gold) layering pattern
Pre-Quiz: Warehouse vs Lake vs Lakehouse
1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?
2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?
3. An engineer notices that two pipelines writing Parquet files to the same S3 directory are producing duplicate rows and partial files. The team starts calling the lake a "data swamp." What core capability is missing that would have prevented this?
4. Which statement best captures the central insight of the lakehouse paradigm?
5. An architect must choose a storage substrate for a workload requiring full ACID, fast MERGE/UPDATE/DELETE, and engine interoperability across Spark, Athena, and Redshift. Based on the comparison table in the chapter, which option fits best?
1. Warehouse vs Lake vs Lakehouse
Key Points
Warehouses exist because OLTP databases cannot safely host long analytical scans — analytics is a read-heavy, scan-wide workload that locks tables in OLTP systems.
Warehouses deliver ACID and SQL but at proprietary-storage cost, schema-on-write rigidity, and structured-data-only reach.
Data lakes (S3 / ADLS / GCS) gave cheap storage and schema-on-read flexibility, but their lack of ACID transactions led to "data swamps."
Three open table formats — Delta Lake, Apache Iceberg, Apache Hudi — provide ACID, time travel, and engine interoperability on top of Parquet.
The lakehouse keeps the lake's substrate and adds a transaction layer — warehouse reliability at lake economics, with one copy of data instead of two parallel stacks.
OLTP vs OLAP: why warehouses exist
Operational systems power transactions: each touches a few rows, completes in milliseconds, and demands ACID so two customers don't book the same airline seat. Analytics is the inverse — scanning millions of rows across a few columns. Running that scan on the OLTP database would lock tables and degrade the customer experience, so organizations built the data warehouse: a separate, OLAP-optimized store with proprietary columnar formats and star/snowflake schemas tuned for BI.
Warehouses excel at structured, repeatable reporting, but pay three costs: dollars (proprietary storage is expensive), rigidity (schema-on-write — every column must be defined before loading), and narrowness (structured tabular data only — no images, JSON blobs, or sensor logs).
Data lake origin story
Around 2010, three forces collided: storage got dramatically cheaper (S3 commoditized durable bytes), new data sources exploded (mobile clickstreams, IoT, JSON events), and ML practitioners needed raw, unsampled data. The data lake was the answer — dump everything into S3, ADLS, or GCS, and worry about structure later. This is schema-on-read: define a schema when you query, not when you write.
The economics were transformative — pennies per gigabyte versus dollars in proprietary storage. But lakes lacked ACID guarantees: two pipelines writing the same Parquet directory could produce partial files, duplicates, or corrupt partitions. There was no atomic update, no easy GDPR delete. Lakes that accumulated undocumented, inconsistent data became known as data swamps.
Analogy: a warehouse is a bonded archive where every box is labeled before entry; a lake is a self-storage unit where you can throw anything in any order — and the unit next door might leak into yours overnight.
Why lakehouses emerged
By the late 2010s, organizations were running two parallel stacks: a lake for ML and exploration, and a warehouse for BI, with brittle ETL between them. This duplicated storage and let definitions drift ("revenue" meant different things in each system).
The lakehouse keeps the cheap object-storage substrate and open Parquet format, but adds a transaction layer that gives ACID semantics, schema enforcement, time travel, and fast row-level updates. Three open table formats now dominate this layer: Delta Lake (Databricks), Apache Iceberg (Netflix; used by Snowflake and AWS), and Apache Hudi (Uber). All three provide optimistic concurrency control, snapshot isolation, and engine interoperability.
Figure 2.1: Evolution from warehouse to lake to lakehouse
timeline
title Evolution of Analytical Data Architectures
1990s : Data Warehouse
: Proprietary columnar storage
: Schema-on-write, ACID, BI only
2010s : Data Lake
: Cheap object storage (S3/ADLS/GCS)
: Schema-on-read, ML-friendly
: No ACID, risk of data swamps
2020s : Data Lakehouse
: Object storage + transaction layer
: Delta Lake / Iceberg / Hudi
: ACID + open formats + plural engines
Sequential reveal: 1990s warehouse, 2010s lake, 2020s lakehouse with ACID + open formats.
Comparison across architectures
Dimension
Warehouse
Lake
Lakehouse
Storage substrate
Proprietary, expensive
S3 / ADLS / GCS, cheap
Object storage + transaction layer
Data formats
Tabular only
Any format
Any; tabular wrapped in open table format
Schema model
Schema-on-write (strict)
Schema-on-read (loose)
Schema-on-write enforced via metastore
ACID
Native, full
Minimal or none
Full ACID via Delta / Iceberg / Hudi
Update / delete
Full DML
Often rewrite partition
Fast MERGE / UPDATE / DELETE
Workloads
OLAP, BI
Exploration, ML
BI + SQL + ML + streaming, unified
Cost profile
High (coupled)
Lowest (storage only)
Low storage, optimized compute
Post-Quiz: Warehouse vs Lake vs Lakehouse
1. A finance team complains that running an analyst's quarterly revenue scan against the production e-commerce database slows down checkout. Why does this motivate building a separate data warehouse?
2. Why did the early-2010s combination of cheap S3 storage, JSON/clickstream data, and machine learning push organizations toward data lakes rather than warehouses?
3. An engineer notices that two pipelines writing Parquet files to the same S3 directory are producing duplicate rows and partial files. The team starts calling the lake a "data swamp." What core capability is missing that would have prevented this?
4. Which statement best captures the central insight of the lakehouse paradigm?
5. An architect must choose a storage substrate for a workload requiring full ACID, fast MERGE/UPDATE/DELETE, and engine interoperability across Spark, Athena, and Redshift. Based on the comparison table in the chapter, which option fits best?
Pre-Quiz: Lakehouse Architecture and SageMaker Lakehouse
1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?
2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?
3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?
4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?
5. A retail company has orders in Redshift, clickstream JSON in S3, and customer master data in Salesforce. They want a unified "customer lifetime value" feature for an ML model. Why is the SageMaker Lakehouse approach materially better than the pre-lakehouse alternative?
2. Lakehouse Architecture and SageMaker Lakehouse
Key Points
Every lakehouse decomposes into three layers: object storage at the bottom, a metastore (catalog) in the middle, and plural query engines on top.
Apache Parquet's columnar layout gives lakehouses warehouse-class scan performance — read only the columns you query.
The metastore is the keystone — it converts Parquet directories into transactional tables via atomic snapshot pointer swaps.
SageMaker Lakehouse (re:Invent 2024, GA) uses Apache Iceberg + AWS Glue Data Catalog and presents a single copy of data across S3, Redshift Managed Storage, and federated SaaS sources via zero-ETL.
Fine-grained access (row/column rules) is enforced consistently across Athena, Redshift, and EMR through SageMaker Data and AI Governance on Amazon DataZone.
Object storage as the foundation
Every lakehouse rests on a cloud object store: Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. These provide effectively infinite, eleven-nines-durable, pay-per-byte storage that decouples completely from compute. Files are typically Apache Parquet — a columnar format that compresses well, supports predicate pushdown, and is readable by virtually every analytic engine.
Analogy: object storage is like a city's water supply. Every restaurant draws from the same pipes; each cooks its own cuisine. The utility doesn't care what you cook — it guarantees clean water at scale and low cost.
Metastore and catalog layer
Raw Parquet files in S3 are just files. The metastore turns them into tables — things with names, schemas, partitions, statistics, and access control. When you commit a write, the metastore atomically swaps the snapshot pointer; that is what gives lakehouses ACID guarantees on top of immutable object-store files.
Production lakehouses use one of: Unity Catalog (Databricks), AWS Glue Data Catalog, Apache Nessie (Git-like branching), or the older Hive Metastore. Without a central catalog, every engine maintains its own picture of the data and they drift apart.
Query engines accessing shared data
The top layer is whatever compute you bring: Apache Spark for ETL/ML, Trino/Presto for interactive SQL, Amazon Athena for serverless ad-hoc, Amazon Redshift for BI, EMR/Glue for managed Spark/Flink, Databricks Photon, and DuckDB for local analytics. The architectural shift: storage is shared, compute is plural and disposable — no copies, no ETL between engines.
Figure 2.2: Three-layer lakehouse architecture
flowchart TD
subgraph Engines["Query Engines (plural, disposable)"]
Spark["Apache Spark"]
Trino["Trino / Presto"]
Athena["Amazon Athena"]
Redshift["Amazon Redshift"]
EMR["EMR / Glue"]
Duck["DuckDB"]
end
subgraph Catalog["Metastore / Catalog Layer"]
Meta["Unity Catalog / Glue Data Catalog / Nessie schemas, snapshots, partitions, ACLs"]
end
subgraph Storage["Object Storage Foundation"]
S3["Amazon S3 / ADLS / GCS Parquet files (open, columnar)"]
end
Spark --> Meta
Trino --> Meta
Athena --> Meta
Redshift --> Meta
EMR --> Meta
Duck --> Meta
Meta --> S3
AWS SageMaker Lakehouse
Announced at AWS re:Invent 2024 and now generally available, Amazon SageMaker Lakehouse presents a single copy of data spanning S3 lakes, Redshift Managed Storage, and federated SaaS sources (Salesforce, SAP, ServiceNow, Zendesk, DynamoDB, Snowflake). A Spark job in EMR can read a Redshift table in place — without unloading it to S3 — and a Redshift query can join that table with an S3-resident Iceberg table in the same SQL statement. Zero-ETL integrations pull from SaaS sources directly into the lakehouse.
Built on Apache Iceberg as the open table format and AWS Glue Data Catalog as the unified metastore. Fine-grained, identity-based access (row/column rules) is enforced consistently across Athena, Redshift, and EMR through SageMaker Data and AI Governance on Amazon DataZone. SageMaker Unified Studio bundles EMR Studio, Glue Studio, the Redshift Query Editor, SageMaker Studio, and Bedrock tools into one workspace.
Figure 2.3: SageMaker Lakehouse unified access
flowchart LR
subgraph Sources["Data Sources"]
S3["S3 Data Lake (Iceberg / Parquet)"]
RMS["Redshift Managed Storage (warehouse tables)"]
SaaS["SaaS / Federated Salesforce, SAP, ServiceNow, Zendesk, DynamoDB, Snowflake"]
end
Glue["AWS Glue Data Catalog (unified metastore)"]
subgraph Studio["SageMaker Unified Studio"]
EMR["EMR / Spark"]
Athena["Athena"]
RS["Redshift Query Editor"]
SM["SageMaker Studio (ML)"]
end
S3 -- "in-place" --> Glue
RMS -- "in-place" --> Glue
SaaS -- "zero-ETL" --> Glue
Glue --> EMR
Glue --> Athena
Glue --> RS
Glue --> SM
Fan-in: sources illuminate, packets flow to the Glue catalog, the catalog glows, then plural engines light up.
SageMaker Lakehouse to generic-layer mapping
Generic Layer
SageMaker Lakehouse
Object storage
Amazon S3 + Redshift Managed Storage
Open table format
Apache Iceberg
Metastore / catalog
AWS Glue Data Catalog
Query engines
EMR, Glue, Redshift, Athena, Spark
Governance
SageMaker Data and AI Governance (DataZone)
User workspace
SageMaker Unified Studio
Post-Quiz: Lakehouse Architecture and SageMaker Lakehouse
1. In the three-layer lakehouse architecture, what is the role of the metastore (e.g., Unity Catalog, Glue Data Catalog)?
2. The chapter says the metastore is the "keystone" that enables ACID guarantees on top of immutable object-store files. What mechanism makes this possible?
3. The chapter summarizes the architectural shift as "storage is shared, compute is plural and disposable." Which scenario best illustrates this property?
4. SageMaker Lakehouse advertises a "single copy of data" spanning S3, Redshift Managed Storage, and SaaS sources. What technical choice makes this feasible?
5. A retail company has orders in Redshift, clickstream JSON in S3, and customer master data in Salesforce. They want a unified "customer lifetime value" feature for an ML model. Why is the SageMaker Lakehouse approach materially better than the pre-lakehouse alternative?
Pre-Quiz: Medallion Architecture
1. Which statement best describes the Bronze layer's purpose in medallion architecture?
2. An e-commerce Bronze table contains duplicates from client retries, malformed addresses, and currency codes in mixed cases ("USD", "Usd", "usd"). Where in the medallion architecture should this be cleaned up, and what changes happen there?
3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?
4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?
5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?
3. Medallion Architecture
Key Points
Medallion architecture organizes data into three quality tiers — Bronze (raw), Silver (cleaned and validated), Gold (business-ready aggregates).
Bronze is append-only and immutable: full payload, ingestion timestamp, source identifier — never deduplicate or transform business meaning here.
Silver enforces strict schemas after cleansing, deduplication, normalization, and reference-data joins; it remains record-level (one row per real-world event).
Gold is aggregated and dimensional — partitioned, clustered, often pre-joined for BI dashboards and ML feature tables.
Each layer is rebuildable from Bronze, so when a definition changes you rerun rather than hand-patch every downstream table.
Bronze: raw landing zone
Bronze is the single source of truth for ingested data. Tables are append-only, immutable, and as close to the source format as possible. Schema is applied loosely (schema-on-read) so ingestion stays robust against upstream changes. Bronze captures the full payload, an ingestion timestamp, a source-system identifier, and optional metadata such as file path or Kafka offset.
Bronze deliberately does not deduplicate, join, or drop columns. Because everything is append-only and timestamped, you can replay any downstream pipeline against any historical snapshot. Analogy: Bronze is a hospital's intake record — you write down everything the patient says, contradictions and all. Diagnosis happens later.
Silver: cleaned and conformed
Silver is where data becomes trustworthy. Transformations include: cleansing (repair malformed rows), deduplication (one record per business event), normalization (standard units, dates, country codes), enrichment via joins with reference data, and schema enforcement (every Silver table has a strict, documented schema; non-conforming writes fail).
Crucially, Silver stays non-aggregated: each row is a real-world event or entity (an order, a session, a sensor reading). This makes Silver the natural home for data scientists and analytics engineers.
Gold: business-ready aggregates
Gold materializes business meaning. Tables are aggregated, dimensional, or feature-engineered for direct consumption by BI dashboards, executive reports, and ML models. They are optimized for query performance — partitioned, clustered, and often pre-joined.
Typical Gold tables: daily/hourly aggregates ("revenue by region by day"), star-schema dimensional models, KPIs aligned to business definitions, and ML feature tables ready for training and inference.
Worked example: e-commerce orders
Bronze (`orders_raw`): every API payload, including duplicates from client retries, malformed addresses, mixed-case currency codes ("USD", "Usd", "usd").
Silver (`orders`): deduplicate on order ID + timestamp; reject rows with invalid postal codes (sent to a quarantine table); upper-case all currency codes; join the customer dimension to add region; enforce a strict schema with non-nullable order ID and amount.
Gold (`daily_revenue_by_region`): aggregate the Silver `orders` table grouped by region and order date — powers the executive dashboard, refreshes nightly, fully rebuildable from Bronze if the definition changes.
Figure 2.4: Medallion architecture flow
flowchart LR
Source["Source Systems APIs, Kafka, files, CDC"]
subgraph Bronze["Bronze Layer"]
B["Raw, append-only Schema-on-read Full payload + ingest timestamp"]
end
subgraph Silver["Silver Layer"]
S["Cleansed, deduped, normalized Schema-enforced Record-level events"]
end
subgraph Gold["Gold Layer"]
G["Aggregates, dimensional models ML feature tables Business definitions"]
end
Source --> B
B --> S
S --> G
B -. "auditors / replay" .-> Auditor(["Auditor"])
S -. "data scientists" .-> DS(["Data Scientist"])
G -. "BI / executives / ML" .-> BI(["Dashboard / ML"])
G -. "rebuildable from Bronze" .-> B
Bronze, Silver, then Gold reveal in turn; the green dashed arrow shows that any layer can be rebuilt from Bronze.
Layer comparison
Layer
Purpose
Schema
Transformations
Typical Consumers
Bronze
Raw single source of truth
Schema-on-read, loose
Ingestion only; append-only
Pipeline operators, auditors, replay jobs
Silver
Cleaned, validated record-level data
Schema-enforced, strict
Cleanse, dedupe, normalize, join, validate
Data scientists, analytics engineers
Gold
Business-ready aggregates and models
Schema-enforced, dimensional
Aggregate, curate, feature-engineer
BI dashboards, executives, ML inference
A common deployment shape is a hub-and-spoke arrangement: a central Data Hub owns canonical Bronze/Silver/Gold tables for cross-organization concepts (customer, employee, product), and each domain (Sales, Marketing, Finance) maintains its own Bronze/Silver/Gold extensions that join to the hub. This avoids both central bottlenecks and uncoordinated domain duplication.
Post-Quiz: Medallion Architecture
1. Which statement best describes the Bronze layer's purpose in medallion architecture?
2. An e-commerce Bronze table contains duplicates from client retries, malformed addresses, and currency codes in mixed cases ("USD", "Usd", "usd"). Where in the medallion architecture should this be cleaned up, and what changes happen there?
3. The chapter emphasizes that each medallion layer is "rebuildable from Bronze." Why is this property a superpower for evolving systems?
4. Which consumer aligns best with the Silver layer, given its design as cleansed, deduplicated, schema-enforced, record-level data?
5. Why is Gold the appropriate destination for a `daily_revenue_by_region` table that powers an executive dashboard, rather than Silver or Bronze?