Define the core responsibilities of a data engineer and how the discipline differs from analytics engineering and ML engineering
Compare batch, micro-batch, and streaming processing models with appropriate use cases
Trace the historical evolution from on-premises ETL to modern cloud-native data platforms
Identify the key drivers of modern pipeline architectures: latency, scale, cost, and governance
Pre-Quiz: Roles and the Modern Stack
1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?
Data engineer (pipeline engineer)
Analytics engineer
Platform engineer
ML data engineer
2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?
SQL engines could not yet express joins or aggregates
Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later
Regulators required transformations to occur outside the warehouse
Object storage like S3 had not yet been invented and there was no place to land raw data
3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?
ELT instead of ETL
Use of dbt for transformations
Decoupled storage and compute, where independent compute clusters read the same single copy of data
Monolithic vendor tooling that owns ingest, transform, and serve
4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?
Move all data into Snowflake's proprietary native tables
Store data as raw CSV files and rely on application code for consistency
Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables
Run a single self-hosted Teradata cluster to centralize storage and compute
5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?
Self-host both the warehouse and fraud detection to maximize control
Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer
Buy managed services for everything, including fraud detection logic
Self-host the warehouse but buy a managed fraud-detection SaaS
Roles and the Modern Stack
Key Points
A modern data engineer owns the path from ingestion to query-ready storage and is judged on reliability, scalability, and cost-efficiency, with tools like Airflow, Kafka, and Spark.
The data team has fractured into specialized roles: data engineer (pipelines), analytics engineer (dbt/SQL modeling), platform engineer (shared infra), and ML data engineer (feature pipelines).
The industry shifted from monolithic ETL (one vendor, transform-before-load, on-prem) to composable ELT (best-of-breed, load-then-transform-in-place) once cloud storage became cheap and compute became elastic.
Decoupled storage and compute is the defining architectural shift: a single copy of data on object storage, queried by many independent compute engines.
Open table formats (Iceberg, Delta Lake, Hudi) over object storage produce the lakehouse pattern, blending lake economics with warehouse semantics and reducing vendor lock-in.
Specialization Across the Pipeline
The "data team" of 2015 was usually one or two generalists. Today, work specializes by layer of the stack. The data engineer wires Airflow, Kafka, and Spark to land reliable raw data in the warehouse. The analytics engineer writes dbt models to transform raw tables into business-friendly objects like dim_customer. The platform engineer runs the shared Kubernetes, Terraform, IAM, and CI/CD that everyone above them depends on. The ML data engineer branches off the warehouse to feed feature stores and model training.
If data is electricity, the platform engineer builds the grid, the data engineer wires the houses, the analytics engineer designs the appliances, and the ML data engineer powers a specialized factory next door.
Figure 1.1: Modern data team roles across the pipeline stack
flowchart LR
    Sources["Source Systems (APIs, DBs, Logs)"]
    DE["Data Engineer: Airflow / Kafka / Spark"]
    WH["Warehouse / Lakehouse: Snowflake / Databricks"]
    AE["Analytics Engineer: dbt / SQL / Git"]
    BI["BI Dashboards: Looker / Hex"]
    MLE["ML Data Engineer: Feast / MLflow"]
    FS["Feature Store"]
    Models["ML Models"]
    PE["Platform Engineer: Terraform, Kubernetes, CI/CD, IAM, Observability"]
    Sources --> DE --> WH --> AE --> BI
    WH --> MLE --> FS --> Models
    PE -.supports.-> DE
    PE -.supports.-> WH
    PE -.supports.-> AE
    PE -.supports.-> MLE
| Role | Primary Concern | Tooling | Output Consumer |
| --- | --- | --- | --- |
| Data Engineer | Ingest-to-storage reliability | Airflow, Kafka, Spark | Warehouse/lakehouse tables |
| Analytics Engineer | Business-logic modeling | dbt, SQL, Git | Analysts, dashboards |
| Platform Engineer | Shared infra & DevOps | Terraform, K8s, CI/CD | Other data teams |
| ML Data Engineer | Feature pipelines | Python, MLflow, feature stores | Data scientists, models |
From Monolithic ETL to Composable ELT
The original architecture was monolithic ETL: one vendor (Informatica, Ab Initio, IBM DataStage) extracted, transformed in a proprietary engine, and loaded an on-prem warehouse like Teradata. Transformation had to happen before loading because warehouse storage was expensive and compute was scarce.
The modern pattern inverts this. ELT loads raw data into a cloud warehouse first, then transforms in place using SQL. Cheap object storage (pennies per GB-month) and elastic compute (a thousand cores for ten minutes) erased the economic constraint that forced ETL.
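The load-then-transform order can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for a cloud warehouse; the `raw_orders` rows and the cleaning rules are hypothetical, and a real stack would typically use an extract-and-load tool plus dbt rather than hand-written functions.

```python
import sqlite3

# Hypothetical raw rows, landed as-is: untyped strings, one duplicate, one missing amount.
RAW_ORDERS = [
    ("1001", "c-1", "19.99", "2025-01-03"),
    ("1002", "c-2", "5.00", "2025-01-03"),
    ("1002", "c-2", "5.00", "2025-01-03"),  # delivered twice by the source
    ("1003", "c-1", None, "2025-01-04"),    # amount missing
]


def load_raw(conn: sqlite3.Connection) -> None:
    """The 'L' step: land source rows unchanged in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer_id TEXT, amount TEXT, order_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)


def transform_in_place(conn: sqlite3.Connection) -> None:
    """The 'T' step: build a modeled table with SQL inside the same engine."""
    conn.execute(
        """
        CREATE TABLE fct_orders AS
        SELECT DISTINCT
            CAST(order_id AS INTEGER) AS order_id,
            customer_id,
            CAST(amount AS REAL)      AS amount,
            order_date
        FROM raw_orders
        WHERE amount IS NOT NULL
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn)
transform_in_place(conn)

fct_rows = conn.execute(
    "SELECT order_id, amount FROM fct_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(fct_rows, raw_count)
```

Because the raw table is kept, the transformation can be rewritten and re-run later without re-extracting from the source, which is the practical benefit of landing data before cleaning it.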
Figure 1.2: Monolithic ETL versus composable ELT
flowchart TB
subgraph Monolithic["Monolithic ETL (2005)"]
S1["Oracle Source"] --> I1["Informatica Extract + Transform"]
I1 --> T1["Teradata Warehouse"]
end
subgraph Composable["Composable ELT (2025)"]
S2["Postgres Source"] --> F["Fivetran (Extract + Load)"]
F --> SF["Snowflake (Raw Tables)"]
SF --> D["dbt (Transform in SQL)"]
D --> SF2["Snowflake (fct_orders, dim_product)"]
A["Airflow (Orchestration)"] -.triggers.-> F
A -.triggers.-> D
SF2 --> L["Looker Dashboards"]
end
Decoupled Storage and Compute
In a traditional warehouse like Teradata, storage and compute lived on the same physical nodes, so storing more data meant buying more compute too. Decoupling storage from compute is the defining architectural shift of the modern stack. A single copy of data sits on cheap, durable object storage (Amazon S3, Google Cloud Storage or the Colossus file system behind BigQuery, Azure Data Lake Storage), while many independent compute engines read it: Snowflake virtual warehouses, BigQuery slot pools, Databricks Spark and Photon clusters, and Trino clusters.
A traditional warehouse is a hotel where every guest must rent the entire floor including the kitchen and the gym. Decoupled compute/storage is a public library — books sit on shared shelves, and any reader can show up, check out what they need, and leave without the others noticing.
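As a rough illustration of the idea, the sketch below models two compute clusters that own only their worker pools while reading the same single copy of data. `ComputeCluster`, `SHARED_STORAGE`, and the two queries are hypothetical names for this example; real engines add caching, scheduling, and network I/O that a thread pool does not capture.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared, read-only copy of the data (stands in for files on object storage).
SHARED_STORAGE = {
    "orders": [
        {"region": "EU", "amount": 120.0},
        {"region": "US", "amount": 80.0},
        {"region": "EU", "amount": 40.0},
    ]
}


class ComputeCluster:
    """Hypothetical compute engine: owns a worker pool, but no data of its own."""

    def __init__(self, name: str, workers: int) -> None:
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def run(self, query):
        # The query reads shared storage; only the workers are private to this cluster.
        return self._pool.submit(query, SHARED_STORAGE).result()

    def shutdown(self) -> None:
        self._pool.shutdown()


def revenue_by_region(storage):
    totals = {}
    for row in storage["orders"]:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def average_order_value(storage):
    amounts = [row["amount"] for row in storage["orders"]]
    return sum(amounts) / len(amounts)


# Each team sizes its own cluster; neither queues behind the other's work.
marketing = ComputeCluster("marketing_bi", workers=2)
data_science = ComputeCluster("data_science", workers=8)

bi_result = marketing.run(revenue_by_region)
ds_result = data_science.run(average_order_value)

marketing.shutdown()
data_science.shutdown()
print(bi_result, ds_result)
```

Resizing or shutting down one cluster changes nothing for the other, because the only thing they share is the stored data.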
Open Table Formats and the Lakehouse
Object storage alone gives you only files. To get warehouse behavior — ACID, schema enforcement, time travel — you layer an open table format (Iceberg, Delta Lake, Hudi) on top. The lakehouse is what you get when you combine open table formats over object storage with warehouse-style query engines: cheap scalable storage with ACID guarantees, plus interoperability so Snowflake, Databricks, Trino, Athena, Spark, and Flink can all read the same tables.
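To make "table semantics on top of files" more concrete, here is a deliberately simplified toy model. It is not the Iceberg, Delta Lake, or Hudi specification; `ToyTable` and `Snapshot` are hypothetical names, and the only ideas it borrows are that data files are immutable, that a commit is a single metadata append, and that older snapshots stay readable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of the immutable data files visible in this snapshot


@dataclass
class ToyTable:
    schema: tuple                                   # column names the table enforces
    files: dict = field(default_factory=dict)       # file name -> list of rows
    snapshots: list = field(default_factory=list)   # append-only metadata log

    def commit_append(self, file_name: str, rows: list) -> Snapshot:
        # Schema enforcement: reject the whole write if any row has the wrong columns.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"row {row} does not match schema {self.schema}")
        self.files[file_name] = list(rows)
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + (file_name,))
        # The commit is one metadata append: readers see all of the new file or none of it.
        self.snapshots.append(snapshot)
        return snapshot

    def scan(self, snapshot_id=None) -> list:
        # Time travel: use the file list of an older snapshot instead of the latest one.
        if not self.snapshots:
            return []
        snapshot = (
            self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        )
        return [row for name in snapshot.data_files for row in self.files[name]]


table = ToyTable(schema=("order_id", "amount"))
table.commit_append("part-0001", [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}])
table.commit_append("part-0002", [{"order_id": 3, "amount": 7.5}])

latest_rows = table.scan()
rows_as_of_first_commit = table.scan(snapshot_id=1)
print(len(latest_rows), len(rows_as_of_first_commit))
```

Any engine that understands the metadata log can read the same table, which is where the interoperability of open formats comes from.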
Specialized warehouses with proprietary storage and tightly integrated caches still often outperform open-format engines that fetch from remote object storage, but the gap is closing. Even closed-platform vendors (Snowflake, BigQuery, Synapse) have added support for open table formats such as Iceberg to ease customer fears of lock-in.
Figure 1.5: Decoupled storage and compute with multiple independent engines
Modern stack decisions sit on a spectrum from fully managed SaaS (Snowflake, BigQuery, Databricks, Fivetran, dbt Cloud) to self-hosted open source (Spark on EKS, Airflow on K8s, raw Kafka). Managed trades money for engineering time; self-hosting trades engineering time for cost flexibility and control. The pragmatic heuristic for teams under ~50 engineers: buy managed for layers that aren't your differentiator; self-host only where you have a specific reason — cost at scale, regulatory, or genuine differentiation.
| Dimension | Managed (Snowflake) | Self-Hosted (Spark on K8s) |
| --- | --- | --- |
| Time to first value | Days | Weeks to months |
| Ops burden | Vendor handles | You handle (upgrades, scaling, on-call) |
| Cost at small scale | Often cheaper (no idle infra) | Often more expensive |
| Cost at large scale | Premium per unit | Can be much cheaper if optimized |
| Customization | Limited | Full |
| Lock-in risk | Higher | Lower |
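The buy-versus-self-host heuristic above can be written down as a small function. This is only a sketch of the rule as stated for teams under roughly 50 engineers; the function name, its parameters, and the returned strings are hypothetical, and it does not attempt to model larger organizations.

```python
def recommend_hosting(is_differentiator: bool,
                      cost_at_scale_problem: bool = False,
                      regulatory_requirement: bool = False):
    """Return (recommendation, reasons) for one layer of the stack."""
    reasons = []
    if is_differentiator:
        reasons.append("genuine differentiation")
    if cost_at_scale_problem:
        reasons.append("cost at scale")
    if regulatory_requirement:
        reasons.append("regulatory")
    # Default to managed; self-hosting needs at least one specific reason.
    if reasons:
        return "consider self-hosting", reasons
    return "buy managed", reasons


# A fintech whose differentiator is fraud detection, not warehousing.
warehouse_choice = recommend_hosting(is_differentiator=False)
fraud_choice = recommend_hosting(is_differentiator=True)
print(warehouse_choice, fraud_choice)
```

The asymmetry is the point: managed is the default, and self-hosting has to be argued for layer by layer.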
Post-Quiz: Roles and the Modern Stack
1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?
Data engineer (pipeline engineer)
Analytics engineer
Platform engineer
ML data engineer
2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?
SQL engines could not yet express joins or aggregates
Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later
Regulators required transformations to occur outside the warehouse
Object storage like S3 had not yet been invented and there was no place to land raw data
3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?
ELT instead of ETL
Use of dbt for transformations
Decoupled storage and compute, where independent compute clusters read the same single copy of data
Monolithic vendor tooling that owns ingest, transform, and serve
4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?
Move all data into Snowflake's proprietary native tables
Store data as raw CSV files and rely on application code for consistency
Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables
Run a single self-hosted Teradata cluster to centralize storage and compute
5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?
Self-host both the warehouse and fraud detection to maximize control
Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer
Buy managed services for everything, including fraud detection logic
Self-host the warehouse but buy a managed fraud-detection SaaS
Pre-Quiz: Processing Models and Architectures
1. A finance team needs a daily P&L close that must be auditable and accurate but can run overnight. Which processing model is the best fit and why?
Streaming, because modern stacks should default to streaming
Batch, because the SLA tolerates hours of latency and batch is simpler, cheaper, and easy to debug via re-runs
Micro-batch, because it gives sub-second latency at low cost
Lambda, because P&L always requires both batch accuracy and stream freshness
2. Which statement best describes the "two armies" problem of Lambda architecture?
Two clusters are required, one for storage and one for compute
The same business logic must be implemented and maintained twice — once in the batch framework and once in the stream framework — keeping them in sync forever
The serving layer must run on two different cloud providers
Two competing stream engines (Flink and Kafka Streams) must be deployed side by side
3. A team adopts Kappa architecture. The metric definition for "trending now" changes, so they need to reprocess the last 30 days of data. How is historical reprocessing accomplished in Kappa?
Run a separate batch job over a Hadoop snapshot of the data
Replay the durable, partitioned Kafka log from an earlier offset through the same Flink pipeline
Manually export the warehouse, transform with dbt, and re-import
Pause the stream and delete the existing serving-layer indexes so they will rebuild themselves
4. A fraud-scoring model must complete within 200 milliseconds during card authorization. Which processing model is mandatory, and why?
Batch, because batch is the most reliable model for financial use cases
Micro-batch, because it provides sub-second latency at lower complexity than full streaming
Streaming with stateful operators, because the consumer SLA requires per-event sub-second processing
Lambda, because fraud always needs both historical batch accuracy and real-time scoring in the same response
5. A team currently runs a daily batch dashboard refresh and is asked to deliver "near-real-time" data. They estimate going from daily to hourly is cheap, but going from one minute to one second would require introducing Druid or a key-value store. Which insight does this illustrate?
Batch and streaming have identical cost curves
Latency improvements scale linearly with cost
Each order-of-magnitude latency improvement is a step-function increase in operational complexity, so streaming should be reserved for use cases that genuinely require it
Modern dashboards always need sub-second latency by default
Processing Models and Architectures
Key Points
Three canonical processing models live on a latency-vs-complexity spectrum: batch (hours-to-days, cheap, simple), micro-batch (seconds-to-minutes, medium), streaming (sub-second, complex, most expensive).
Lambda architecture combines a batch layer (accurate historical), a speed layer (real-time approximate), and a serving layer (merged) — at the cost of duplicating business logic across two engines (the "two armies" problem).
Kappa eliminates the batch layer: a single streaming pipeline backed by a durable replayable log (Kafka) and a unified engine (Flink). Historical reprocessing means replaying the log from an earlier offset.
The right processing model is determined by the consumer's SLA, not the team's enthusiasm. Latency accumulates across pipeline hops, so the slowest hop sets the floor on end-to-end freshness.
Freshness has a cost curve that bends sharply upward: each order-of-magnitude latency improvement is a step-function increase in operational complexity.
Batch, Micro-Batch, Streaming
Batch is like sending a daily mail truck — efficient and predictable, but slow to deliver any single letter. Micro-batch is a courier on a five-minute circuit. Streaming is a pneumatic tube — every letter shoots through the moment it's dropped in.
| Model | Typical Latency | Complexity | Cost Profile | Example Use Case |
| --- | --- | --- | --- | --- |
| Batch | Hours to days | Low | Lowest (idle between runs) | Nightly financial reconciliation |
| Micro-batch | Seconds to minutes | Medium | Medium (continuous compute) | Near-real-time dashboards |
| Streaming | Sub-second | High | Highest (always-on, stateful) | Fraud detection, live personalization |
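The difference between the three models is mostly about when results are emitted. The sketch below runs the same hypothetical event list through each one; the timestamps, amounts, and the five-second window are made up for illustration, and real engines add scheduling, state management, and fault tolerance on top.

```python
from collections import defaultdict

# Hypothetical events: (arrival time in seconds, amount).
EVENTS = [(0.2, 10), (1.7, 5), (4.9, 20), (5.1, 1), (9.8, 4)]


def batch_total(events):
    """Batch: wait for all the data, then produce one result."""
    return sum(amount for _, amount in events)


def micro_batch_totals(events, window_seconds):
    """Micro-batch: group events into fixed windows and emit one result per window."""
    windows = defaultdict(int)
    for timestamp, amount in events:
        windows[int(timestamp // window_seconds)] += amount
    return [windows[index] for index in sorted(windows)]


def streaming_totals(events):
    """Streaming: update state and emit a result after every single event."""
    running, emitted = 0, []
    for _, amount in events:
        running += amount
        emitted.append(running)
    return emitted


batch_result = batch_total(EVENTS)
micro_batch_result = micro_batch_totals(EVENTS, window_seconds=5)
streaming_result = streaming_totals(EVENTS)
print(batch_result, micro_batch_result, streaming_result)
```

All three arrive at the same final total; what changes is how many intermediate results exist and how soon after an event the first one appears.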
Lambda vs Kappa
Lambda (Nathan Marz, ~2011) splits the pipeline into a batch layer (Hadoop, Spark) for accurate historical views, a speed layer (Flink, Kafka Streams) for low-latency recent data, and a serving layer (Druid, Pinot, Cassandra) that merges them. Its fatal weakness is the two-armies problem: every metric must be implemented twice and kept in sync forever.
Kappa (Jay Kreps, ~2014) eliminates the batch layer. All data flows through a single streaming pipeline backed by a durable, replayable log (Kafka) and processed by a unified engine (Flink). Historical reprocessing is done by replaying the log from an earlier offset, so there is one codebase and one set of semantics. Migrating from Lambda to a Kappa-style unified pipeline removes the duplicated batch codebase; the operational simplification is sometimes put at roughly 50-70%, though that figure is a rough estimate and depends on how "simpler" is measured.
Figure 1.3: Lambda versus Kappa architectures
flowchart LR
subgraph Lambda["Lambda Architecture"]
E1["Event Source"] --> B["Batch Layer (Spark / Hadoop)"]
E1 --> SP["Speed Layer (Flink / Kafka Streams)"]
B --> SV1["Serving Layer (Druid / Pinot)"]
SP --> SV1
SV1 --> Q1["Query / Dashboard"]
end
subgraph Kappa["Kappa Architecture"]
E2["Event Source"] --> K["Kafka (Durable Replayable Log)"]
K --> FL["Flink (Unified Stream Engine)"]
FL --> SV2["Serving Layer"]
SV2 --> Q2["Query / Dashboard"]
K -.replay from offset.-> FL
end
Lambda forks every event into batch + speed paths; Kappa runs one pipeline and replays the log when needed.
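Replay-based reprocessing can be sketched with a toy log. `ReplayableLog` and `trending_counts` are hypothetical stand-ins for Kafka and a Flink job; the example only shows the core idea that the same pipeline code is run again from an earlier offset when a metric definition changes.

```python
class ReplayableLog:
    """Toy stand-in for a durable, append-only log; an offset is just a list index."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read_from(self, offset: int):
        # Consumers choose where to start; reading never deletes records.
        return iter(self._records[offset:])


def trending_counts(log: ReplayableLog, start_offset: int, min_views: int) -> dict:
    """The single pipeline: count qualifying events per item from a given offset."""
    counts = {}
    for event in log.read_from(start_offset):
        if event["views"] >= min_views:
            counts[event["item"]] = counts.get(event["item"], 0) + 1
    return counts


log = ReplayableLog()
for item, views in [("a", 120), ("b", 60), ("a", 30), ("b", 150), ("c", 55)]:
    log.append({"item": item, "views": views})

# Original definition of "trending": at least 100 views.
original = trending_counts(log, start_offset=0, min_views=100)
# The definition changes to 50 views: replay the same log through the same code.
reprocessed = trending_counts(log, start_offset=0, min_views=50)
print(original, reprocessed)
```

There is no second batch implementation to keep in sync; the cost is that the log must retain enough history to cover the period being reprocessed.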
SLA-Driven Design
Three rules of thumb. First, work backwards from the user: a 6 a.m. dashboard does not need streaming; a 200 ms fraud check does. Second, latency accumulates across the pipeline: a streaming source that lands in a warehouse loaded hourly still serves data that can be up to an hour old. Third, freshness has a cost curve that bends sharply upward: daily-to-hourly is cheap, hourly-to-one-minute usually requires re-architecting toward streaming, and one-minute-to-one-second often requires changing your storage layer.
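The second rule can be checked with simple arithmetic. The sketch below assumes a simplified model in which each hop adds at most its own interval to the age of the data; the hop names and intervals are hypothetical.

```python
def worst_case_staleness(hops):
    """hops: list of (name, seconds). Returns (total_seconds, slowest_hop_name)."""
    if not hops:
        raise ValueError("at least one hop is required")
    total = sum(seconds for _, seconds in hops)
    slowest = max(hops, key=lambda hop: hop[1])[0]
    return total, slowest


# A streaming source feeding an hourly warehouse load and a 5-minute dashboard refresh.
PIPELINE = [
    ("stream_ingest", 1),
    ("hourly_warehouse_load", 3600),
    ("dashboard_refresh", 300),
]

total_seconds, bottleneck = worst_case_staleness(PIPELINE)
print(total_seconds, bottleneck)
```

Under this model, speeding up the ingest hop barely changes the total; only shortening the warehouse load interval moves the end-to-end number.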
Figure 1.4: Decision tree for choosing a processing model from consumer SLA
flowchart TD
Start["What is the consumer's latency SLA?"]
Start --> Q1{"Sub-second required?"}
Q1 -->|Yes| Stream["Streaming (Flink, Kafka Streams) + stateful operators"]
Q1 -->|No| Q2{"Seconds to minutes?"}
Q2 -->|Yes| Micro["Micro-batch (Spark Structured Streaming)"]
Q2 -->|No| Q3{"Hours acceptable?"}
Q3 -->|Yes| Batch["Batch (Airflow + Spark / SQL)"]
Q3 -->|No| Q4{"Need both batch accuracy and stream freshness?"}
Q4 -->|Yes, two codebases OK| Lambda["Lambda Architecture"]
Q4 -->|Prefer single codebase| Kappa["Kappa Architecture"]
Animated figure: the cost curve rises in steps as freshness improves.
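The latency branches of the decision tree in Figure 1.4 can be expressed as a small function. The numeric thresholds are assumptions chosen for illustration (the chapter only says "sub-second", "seconds to minutes", and "hours"), and the Lambda-versus-Kappa branch is left out because it depends on team judgment rather than on latency alone.

```python
SUB_SECOND = 1.0                # assumed boundary for "sub-second"
MICRO_BATCH_CEILING = 15 * 60   # assumed upper end of "seconds to minutes"


def choose_processing_model(sla_seconds: float) -> str:
    """Map a consumer's latency SLA to a processing model."""
    if sla_seconds <= 0:
        raise ValueError("the SLA must be a positive number of seconds")
    if sla_seconds < SUB_SECOND:
        return "streaming"
    if sla_seconds <= MICRO_BATCH_CEILING:
        return "micro-batch"
    return "batch"


fraud_check = choose_processing_model(0.2)           # 200 ms authorization check
ops_dashboard = choose_processing_model(60)          # refreshed every minute
morning_report = choose_processing_model(8 * 3600)   # ready by 6 a.m.
print(fraud_check, ops_dashboard, morning_report)
```

The function takes the consumer's SLA as its only input, which mirrors the rule that the requirement, not the tooling preference, drives the choice.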
Post-Quiz: Processing Models and Architectures
1. A finance team needs a daily P&L close that must be auditable and accurate but can run overnight. Which processing model is the best fit and why?
Streaming, because modern stacks should default to streaming
Batch, because the SLA tolerates hours of latency and batch is simpler, cheaper, and easy to debug via re-runs
Micro-batch, because it gives sub-second latency at low cost
Lambda, because P&L always requires both batch accuracy and stream freshness
2. Which statement best describes the "two armies" problem of Lambda architecture?
Two clusters are required, one for storage and one for compute
The same business logic must be implemented and maintained twice — once in the batch framework and once in the stream framework — keeping them in sync forever
The serving layer must run on two different cloud providers
Two competing stream engines (Flink and Kafka Streams) must be deployed side by side
3. A team adopts Kappa architecture. The metric definition for "trending now" changes, so they need to reprocess the last 30 days of data. How is historical reprocessing accomplished in Kappa?
Run a separate batch job over a Hadoop snapshot of the data
Replay the durable, partitioned Kafka log from an earlier offset through the same Flink pipeline
Manually export the warehouse, transform with dbt, and re-import
Pause the stream and delete the existing serving-layer indexes so they will rebuild themselves
4. A fraud-scoring model must complete within 200 milliseconds during card authorization. Which processing model is mandatory, and why?
Batch, because batch is the most reliable model for financial use cases
Micro-batch, because it provides sub-second latency at lower complexity than full streaming
Streaming with stateful operators, because the consumer SLA requires per-event sub-second processing
Lambda, because fraud always needs both historical batch accuracy and real-time scoring in the same response
5. A team currently runs a daily batch dashboard refresh and is asked to deliver "near-real-time" data. They estimate going from daily to hourly is cheap, but going from one minute to one second would require introducing Druid or a key-value store. Which insight does this illustrate?
Batch and streaming have identical cost curves
Latency improvements scale linearly with cost
Each order-of-magnitude latency improvement is a step-function increase in operational complexity, so streaming should be reserved for use cases that genuinely require it
Modern dashboards always need sub-second latency by default
Pre-Quiz: Drivers and Trade-offs
1. A team's Snowflake bill triples in a month. Investigation shows a "primary" warehouse running 24/7, even though it sits idle most evenings and weekends. Which lever is most likely to fix this?
Switch to Databricks because it is always cheaper
Enable auto-suspend / auto-resume so idle compute releases automatically, and right-size the warehouse to the workload
Materialize every dbt model regardless of usage to reduce future scans
Move the warehouse to an on-prem cluster sized for peak load
2. A BigQuery user notices their query bill is much higher than expected. Their queries do SELECT * against a 10 TB table and lack partition filters. Which architecture-level fix maps best to BigQuery's pricing model?
Use only columnar projection (SELECT specific columns) and partition pruning, because BigQuery charges by bytes scanned
Switch to Snowflake because Snowflake never charges for scans
Use spot instances, since BigQuery's bill is dominated by per-second compute
Disable BigQuery's slot pool to force serial query execution
3. A HIPAA-regulated healthcare company has analysts, ML engineers, and auditors all needing access to patient data. Which architecture choice best satisfies governance pressure?
Each team copies production data into its own bucket so it can move quickly
Store raw records in S3 with Iceberg tables and a centralized catalog (e.g., Unity Catalog) for fine-grained access, so all users share one governed view
Email CSV extracts to teams who request data and audit via inboxes
Encrypt the warehouse with a single shared password rotated annually
4. A multinational bank acquires a competitor running entirely on a different cloud. The architecture team wants flexibility without paying full multi-cloud overhead. Which combination of tools best supports that?
Rebuild the acquired cloud's data on the primary cloud immediately and forbid future cross-cloud reads
Open table formats for storage portability, cloud-portable platforms like Snowflake/Databricks, and federated query engines like Trino — while still picking a primary cloud
Run identical full data stacks on every cloud the bank touches
Adopt only proprietary single-cloud-native services everywhere
5. The chapter argues governance "shapes architecture" rather than being an afterthought. Which architectural choice most directly reflects this claim?
Deploying separate data copies for ML, BI, and ad-hoc analysis to maximize team autonomy
Using a single governed lakehouse layer (open table formats + centralized catalog) that both BI and ML workloads consume, so policy applies uniformly
Documenting access rules in a wiki page rather than enforcing them in the catalog
Allowing ML teams to copy production data into side buckets where governance does not apply
Drivers and Trade-offs
Key Points
Cost in the modern stack is a function of architecture, not just usage: right-sized compute, aggressive auto-suspend, and matching workload patterns to the platform's pricing model are the levers that separate efficient teams from expensive ones.
Pricing models matter — BigQuery rewards columnar projection and partition pruning; Snowflake rewards short bursts on right-sized warehouses; Databricks rewards spot instances and job clusters over interactive clusters.
Governance is no longer an afterthought: open table formats with ACID semantics and centralized catalogs (Unity Catalog, Snowflake RBAC, Lake Formation, Polaris) make compliance and access control tractable across BI and ML.
The lakehouse architecture is governance-friendly because a single governed layer serves both BI and ML, eliminating the side-bucket pattern that breaks compliance.
Multi-cloud and hybrid are operational realities. Open formats, cloud-portable platforms, and federated query engines (Trino, Starburst) keep architectures flexible — most pragmatic teams pick a primary cloud and treat secondaries as exceptions.
Cost-Performance Trade-offs
Decoupled storage and compute reframed cost. In the cloud, you pay for what you use — but only if your architecture lets you scale down. The fastest way to lose money is provisioning an always-on Snowflake warehouse or Databricks cluster and never turning it off. Three disciplines:
Right-size compute to workload. A nightly ETL job needs a large warehouse for 30 minutes, not a medium warehouse for 4 hours.
Use auto-suspend and auto-scale aggressively. Idle compute is the single biggest source of waste.
Push transformations into cheaper layers. Materializing a heavily-used aggregate once is cheaper than scanning the raw table thousands of times — but materializing data nobody queries wastes both storage and pipeline runtime.
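The first two disciplines come down to arithmetic on compute-hours. The sketch below works in Snowflake-style credits; the credits-per-hour table reflects the commonly published doubling per warehouse size and should be treated as an assumption to verify against current pricing. The job durations are the illustrative ones from the list above, and the working-hours schedule is hypothetical.

```python
# Assumed credits per hour by warehouse size (doubling with each size step).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def credits_used(size: str, hours: float) -> float:
    return CREDITS_PER_HOUR[size] * hours


# Right-sizing: a large warehouse for 30 minutes vs a medium one for 4 hours.
large_burst = credits_used("L", 0.5)
medium_long = credits_used("M", 4)

# Auto-suspend: a medium warehouse left on all month vs running only when used
# (hypothetically 8 hours a day on 22 working days).
always_on = credits_used("M", 24 * 30)
with_auto_suspend = credits_used("M", 8 * 22)
savings_fraction = 1 - with_auto_suspend / always_on

print(large_burst, medium_long, always_on, with_auto_suspend, round(savings_fraction, 2))
```

The larger warehouse wins only if the job really does finish that much faster on it, so the comparison is worth re-running with measured durations rather than assumed ones.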
Governance and Compliance
A decade ago governance meant a wiki page describing the schema. Today it means lineage, access control, classification, retention, and audit — often under GDPR, CCPA, HIPAA, SOX, or the EU AI Act. The modern stack responds in two ways: open table formats (Iceberg, Delta) make schema enforcement, time travel, and audit trails first-class features, and catalog/access-control layers (Unity Catalog, Snowflake RBAC, AWS Lake Formation, Apache Polaris) sit above storage and centralize permissions, masking, and lineage across engines.
The lakehouse pattern is particularly attractive from a governance perspective because a single governed layer serves both BI and ML, eliminating the common anti-pattern of ML teams copying production data into a side bucket where governance does not apply.
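A centralized catalog can be pictured as one choke point that every consumer reads through. The sketch below is a toy model, not the API of Unity Catalog, Lake Formation, or Polaris; `ToyCatalog`, `Policy`, the roles, and the sample patient row are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_columns: frozenset
    masked_columns: frozenset = frozenset()


class ToyCatalog:
    """Single governed entry point: every read is checked, masked, and audited."""

    def __init__(self, tables: dict, policies: dict) -> None:
        self._tables = tables
        self._policies = policies   # (role, table) -> Policy
        self.audit_log = []         # (role, table, granted)

    def read(self, role: str, table: str) -> list:
        policy = self._policies.get((role, table))
        self.audit_log.append((role, table, policy is not None))
        if policy is None:
            raise PermissionError(f"{role} has no access to {table}")
        return [
            {
                column: ("***" if column in policy.masked_columns else value)
                for column, value in row.items()
                if column in policy.allowed_columns
            }
            for row in self._tables[table]
        ]


catalog = ToyCatalog(
    tables={"patients": [{"patient_id": 1, "name": "Ada", "ssn": "123-45-6789", "age": 54}]},
    policies={
        ("analyst", "patients"): Policy(
            allowed_columns=frozenset({"patient_id", "name", "ssn", "age"}),
            masked_columns=frozenset({"ssn"}),
        ),
        ("ml_engineer", "patients"): Policy(allowed_columns=frozenset({"patient_id", "age"})),
    },
)

analyst_view = catalog.read("analyst", "patients")
ml_view = catalog.read("ml_engineer", "patients")
print(analyst_view, ml_view)
```

BI and ML consumers see different projections of the same rows, and an auditor can reconstruct who asked for what from one log, which is what a side bucket cannot offer.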
Multi-Cloud and Hybrid Realities
Acquisitions, regulatory data-residency requirements, vendor pricing leverage, and pre-existing investments produce multi-cloud and hybrid (cloud + on-prem) realities whether the architecture team wants them or not. Three patterns recur:
Cloud-portable platforms (Databricks, Snowflake) run on AWS, Azure, and GCP, providing one consistent data platform across clouds.
Open table formats let data physically live in any object store while remaining queryable from engines elsewhere — Iceberg tables in S3 can be read by Trino in GCP.
Federated query engines (Trino, Starburst Galaxy) span multiple underlying systems with one SQL query, without moving data.
Most pragmatic teams pick a primary cloud and treat secondaries as exceptions, using open formats and portable platforms to keep the door open without paying full multi-cloud overhead.
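Federated querying can be approximated locally as an analogy. The sketch below uses SQLite's `ATTACH DATABASE` to join tables that live in two separate databases with one SQL statement; it is not Trino or Starburst, both databases sit in the same process, and the `customers` and `accounts` tables are hypothetical.

```python
import sqlite3

# "main" plays the primary cloud; "acquired" plays the system inherited in an acquisition.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS acquired")

conn.execute("CREATE TABLE main.customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE acquired.accounts (customer_id INTEGER, balance REAL)")

conn.executemany("INSERT INTO main.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO acquired.accounts VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 25.0)],
)

# One query spans both systems; no table was copied from one database to the other.
balances = conn.execute(
    """
    SELECT c.name, SUM(a.balance) AS total_balance
    FROM main.customers AS c
    JOIN acquired.accounts AS a ON a.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(balances)
```

In a real federated engine the join still has to pull rows across a network boundary, so the convenience of one query does not remove cross-cloud transfer cost or latency.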
Post-Quiz: Drivers and Trade-offs
1. A team's Snowflake bill triples in a month. Investigation shows a "primary" warehouse running 24/7, even though it sits idle most evenings and weekends. Which lever is most likely to fix this?
Switch to Databricks because it is always cheaper
Enable auto-suspend / auto-resume so idle compute releases automatically, and right-size the warehouse to the workload
Materialize every dbt model regardless of usage to reduce future scans
Move the warehouse to an on-prem cluster sized for peak load
2. A BigQuery user notices their query bill is much higher than expected. Their queries do SELECT * against a 10 TB table and lack partition filters. Which architecture-level fix maps best to BigQuery's pricing model?
Use only columnar projection (SELECT specific columns) and partition pruning, because BigQuery charges by bytes scanned
Switch to Snowflake because Snowflake never charges for scans
Use spot instances, since BigQuery's bill is dominated by per-second compute
Disable BigQuery's slot pool to force serial query execution
3. A HIPAA-regulated healthcare company has analysts, ML engineers, and auditors all needing access to patient data. Which architecture choice best satisfies governance pressure?
Each team copies production data into its own bucket so it can move quickly
Store raw records in S3 with Iceberg tables and a centralized catalog (e.g., Unity Catalog) for fine-grained access, so all users share one governed view
Email CSV extracts to teams who request data and audit via inboxes
Encrypt the warehouse with a single shared password rotated annually
4. A multinational bank acquires a competitor running entirely on a different cloud. The architecture team wants flexibility without paying full multi-cloud overhead. Which combination of tools best supports that?
Rebuild the acquired cloud's data on the primary cloud immediately and forbid future cross-cloud reads
Open table formats for storage portability, cloud-portable platforms like Snowflake/Databricks, and federated query engines like Trino — while still picking a primary cloud
Run identical full data stacks on every cloud the bank touches
Adopt only proprietary single-cloud-native services everywhere
5. The chapter argues governance "shapes architecture" rather than being an afterthought. Which architectural choice most directly reflects this claim?
Deploying separate data copies for ML, BI, and ad-hoc analysis to maximize team autonomy
Using a single governed lakehouse layer (open table formats + centralized catalog) that both BI and ML workloads consume, so policy applies uniformly
Documenting access rules in a wiki page rather than enforcing them in the catalog
Allowing ML teams to copy production data into side buckets where governance does not apply