Chapter 1: Foundations of Modern Data Engineering

Learning Objectives

Pre-Quiz: Roles and the Modern Stack

1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?

Data engineer (pipeline engineer)
Analytics engineer
Platform engineer
ML data engineer

2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?

SQL engines could not yet express joins or aggregates
Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later
Regulators required transformations to occur outside the warehouse
Object storage like S3 had not yet been invented and there was no place to land raw data

3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?

ELT instead of ETL
Use of dbt for transformations
Decoupled storage and compute, where independent compute clusters read the same single copy of data
Monolithic vendor tooling that owns ingest, transform, and serve

4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?

Move all data into Snowflake's proprietary native tables
Store data as raw CSV files and rely on application code for consistency
Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables
Run a single self-hosted Teradata cluster to centralize storage and compute

5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?

Self-host both the warehouse and fraud detection to maximize control
Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer
Buy managed services for everything, including fraud detection logic
Self-host the warehouse but buy a managed fraud-detection SaaS

Roles and the Modern Stack

Key Points

Specialization Across the Pipeline

The "data team" of 2015 was usually one or two generalists. Today, work specializes by layer of the stack. The data engineer wires Airflow, Kafka, and Spark to land reliable raw data in the warehouse. The analytics engineer writes dbt models to transform raw tables into business-friendly objects like dim_customer. The platform engineer runs the shared Kubernetes, Terraform, IAM, and CI/CD that everyone above them depends on. The ML data engineer branches off the warehouse to feed feature stores and model training.

If data is electricity, the platform engineer builds the grid, the data engineer wires the houses, the analytics engineer designs the appliances, and the ML data engineer powers a specialized factory next door.

Figure 1.1: Modern data team roles across the pipeline stack

flowchart LR
    Sources["Source Systems<br/>(APIs, DBs, Logs)"]
    DE["Data Engineer<br/>Airflow / Kafka / Spark"]
    WH["Warehouse / Lakehouse<br/>Snowflake / Databricks"]
    AE["Analytics Engineer<br/>dbt / SQL / Git"]
    BI["BI Dashboards<br/>Looker / Hex"]
    MLE["ML Data Engineer<br/>Feast / MLflow"]
    FS["Feature Store"]
    Models["ML Models"]
    PE["Platform Engineer: Terraform, Kubernetes, CI/CD, IAM, Observability"]
    Sources --> DE --> WH --> AE --> BI
    WH --> MLE --> FS --> Models
    PE -.supports.-> DE
    PE -.supports.-> WH
    PE -.supports.-> AE
    PE -.supports.-> MLE

| Role | Primary Concern | Tooling | Output Consumer |
| --- | --- | --- | --- |
| Data Engineer | Ingest -> store reliability | Airflow, Kafka, Spark | Warehouse/lakehouse tables |
| Analytics Engineer | Business-logic modeling | dbt, SQL, Git | Analysts, dashboards |
| Platform Engineer | Shared infra & DevOps | Terraform, K8s, CI/CD | Other data teams |
| ML Data Engineer | Feature pipelines | Python, MLflow, feature stores | Data scientists, models |

From Monolithic ETL to Composable ELT

The original architecture was monolithic ETL: one vendor (Informatica, Ab Initio, IBM DataStage) extracted, transformed in a proprietary engine, and loaded an on-prem warehouse like Teradata. Transformation had to happen before loading because warehouse storage was expensive and compute was scarce.

The modern pattern inverts this. ELT loads raw data into a cloud warehouse first, then transforms in place using SQL. Cheap object storage (pennies per GB-month) and elastic compute (a thousand cores for ten minutes) erased the economic constraint that forced ETL.
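The load-then-transform order can be sketched in a few lines. This is a minimal illustration, not a production pipeline: an in-memory SQLite database stands in for the cloud warehouse, and the `raw_orders` rows and column names are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a cloud warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, messy values included.
raw_orders = [
    (1, "alice", "19.99", "completed"),
    (2, "bob", "5.00", "CANCELLED"),
    (3, "alice", "30.01", "Completed"),
]
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: clean and model in place with SQL, after the data has landed.
conn.execute(
    """
    CREATE TABLE fct_orders AS
    SELECT order_id, customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE LOWER(status) = 'completed'
    """
)

revenue_by_customer = conn.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM fct_orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(revenue_by_customer)
```

Because the raw table is kept, a change to the business rule (for example, counting cancelled orders) only requires rerunning the SQL, not re-extracting from the source.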

Figure 1.2: Monolithic ETL versus composable ELT

flowchart TB
    subgraph Monolithic["Monolithic ETL (2005)"]
        S1["Oracle Source"] --> I1["Informatica<br/>Extract + Transform"]
        I1 --> T1["Teradata Warehouse"]
    end
    subgraph Composable["Composable ELT (2025)"]
        S2["Postgres Source"] --> F["Fivetran<br/>(Extract + Load)"]
        F --> SF["Snowflake<br/>(Raw Tables)"]
        SF --> D["dbt<br/>(Transform in SQL)"]
        D --> SF2["Snowflake<br/>(fct_orders, dim_product)"]
        A["Airflow<br/>(Orchestration)"] -.triggers.-> F
        A -.triggers.-> D
        SF2 --> L["Looker Dashboards"]
    end

Decoupled Storage and Compute

In a traditional warehouse like Teradata, storage and compute lived on the same physical nodes — to store more, you had to buy more compute too. Decoupled storage and compute is the defining architectural shift of the modern stack. A single copy of data sits on cheap, durable object storage (S3, Google Colossus, Azure ADLS), while many independent compute engines read it: Snowflake Virtual Warehouses, BigQuery slot pools, Databricks Spark + Photon clusters, Trino federations.

A traditional warehouse is a hotel where every guest must rent the entire floor including the kitchen and the gym. Decoupled compute/storage is a public library — books sit on shared shelves, and any reader can show up, check out what they need, and leave without the others noticing.
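The provisioning consequence of coupling can be worked through with simple arithmetic. The sketch below assumes a hypothetical node shape (10 TB and 16 cores per node) and a hypothetical workload; the numbers are illustrative only.

```python
import math


def coupled_nodes(storage_tb, cores, node_tb, node_cores):
    """Nodes needed when every node bundles fixed storage and fixed compute."""
    return max(math.ceil(storage_tb / node_tb), math.ceil(cores / node_cores))


def decoupled_units(storage_tb, cores):
    """With decoupling, each resource is provisioned to its own requirement."""
    return {"storage_tb": storage_tb, "cores": cores}


# Hypothetical workload: a lot of data, a modest query load.
storage_tb, cores = 500, 64
node_tb, node_cores = 10, 16  # hypothetical node shape

nodes = coupled_nodes(storage_tb, cores, node_tb, node_cores)
idle_cores = nodes * node_cores - cores
print(f"coupled: {nodes} nodes, {idle_cores} cores bought but not needed")
print(f"decoupled: {decoupled_units(storage_tb, cores)}")
```

In the coupled case the storage requirement dictates the node count, so most of the purchased compute sits unused; in the decoupled case the two are sized separately.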

Animation: Decoupled storage and compute. Four compute engines (Snowflake VW for the BI workload, Databricks Spark + Photon for ETL, BigQuery for ad-hoc slots, Trino for federated SQL) each light up and connect to one shared object storage layer (S3 / Google Colossus / Azure ADLS; single copy, Parquet + Iceberg/Delta). Compute is independent, scaled and billed per workload, and all engines read the same governed data.

Open Table Formats and the Lakehouse

Object storage alone gives you only files. To get warehouse behavior — ACID, schema enforcement, time travel — you layer an open table format (Iceberg, Delta Lake, Hudi) on top. The lakehouse is what you get when you combine open table formats over object storage with warehouse-style query engines: cheap scalable storage with ACID guarantees, plus interoperability so Snowflake, Databricks, Trino, Athena, Spark, and Flink can all read the same tables.
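To make "ACID, schema enforcement, time travel" concrete, here is a toy snapshot log. It is a sketch of the general idea only and is not how Iceberg, Delta Lake, or Hudi are implemented; the `ToyTable` class and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy snapshot log; NOT the actual design of any open table format."""

    schema: dict  # column name -> Python type
    snapshots: list = field(default_factory=lambda: [()])  # snapshot 0 is empty

    def commit(self, rows):
        # Schema enforcement: reject the whole batch if any row is malformed.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col} must be {typ.__name__}")
        # All-or-nothing commit: a new immutable snapshot appears only
        # after the entire batch has passed validation.
        self.snapshots.append(self.snapshots[-1] + tuple(rows))
        return len(self.snapshots) - 1  # snapshot id

    def read(self, snapshot_id=None):
        # Time travel: every earlier snapshot id stays readable.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


orders = ToyTable(schema={"order_id": int, "amount": float})
v1 = orders.commit([{"order_id": 1, "amount": 9.5}])
v2 = orders.commit([{"order_id": 2, "amount": 20.0}])
try:
    orders.commit([{"order_id": 3, "amount": "oops"}])
except TypeError:
    pass  # the rejected batch leaves the table unchanged

print(len(orders.read()), len(orders.read(v1)))
```

Real table formats track data files and metadata rather than rows in memory, but the reader-facing contract is similar: writers publish complete snapshots, and readers pick which snapshot to see.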

Specialized warehouses with proprietary storage and tightly-integrated caches still often outperform open-format engines fetching from remote object storage, but the gap is closing — and even closed-platform vendors (Snowflake, BigQuery, Synapse) have added Iceberg compatibility to reduce customer fears of lock-in.

Figure 1.5: Decoupled storage and compute with multiple independent engines

flowchart TD
    subgraph Compute["Independent Compute Engines"]
        VW1["Snowflake VW<br/>(BI Workload)"]
        VW2["Snowflake VW<br/>(Data Science)"]
        DB["Databricks Cluster<br/>(ETL with Photon)"]
        BQ["BigQuery Slots<br/>(Ad-hoc Queries)"]
        TR["Trino<br/>(Federated SQL)"]
    end
    Storage["Object Storage Layer<br/>S3 / Google Colossus / Azure ADLS<br/>(single copy of data, Parquet + Iceberg / Delta)"]
    VW1 --> Storage
    VW2 --> Storage
    DB --> Storage
    BQ --> Storage
    TR --> Storage

Managed vs Self-Hosted

Modern stack decisions sit on a spectrum from fully managed SaaS (Snowflake, BigQuery, Databricks, Fivetran, dbt Cloud) to self-hosted open source (Spark on EKS, Airflow on K8s, raw Kafka). Managed trades money for engineering time; self-hosting trades engineering time for cost flexibility and control. The pragmatic heuristic for teams under ~50 engineers: buy managed for layers that aren't your differentiator; self-host only where you have a specific reason — cost at scale, regulatory, or genuine differentiation.

| Dimension | Managed (Snowflake) | Self-Hosted (Spark on K8s) |
| --- | --- | --- |
| Time to first value | Days | Weeks to months |
| Ops burden | Vendor handles | You handle (upgrades, scaling, on-call) |
| Cost at small scale | Often cheaper (no idle infra) | Often more expensive |
| Cost at large scale | Premium per unit | Can be much cheaper if optimized |
| Customization | Limited | Full |
| Lock-in risk | Higher | Lower |
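The buy-versus-build heuristic above can be written down as a small decision function. This is only one reading of the rule of thumb: the function name, the reason labels, and treating "~50 engineers" as a hard cutoff are all assumptions made for illustration.

```python
# Reasons the chapter accepts for self-hosting (labels are hypothetical).
SELF_HOST_REASONS = {"cost_at_scale", "regulatory", "differentiation"}


def recommend(team_size, reasons=()):
    """Apply the under-~50-engineers heuristic to one layer of the stack."""
    if team_size >= 50:
        # The heuristic is stated only for smaller teams.
        return "no default: evaluate case by case"
    if SELF_HOST_REASONS & set(reasons):
        return "consider self-hosting"
    return "buy managed"


# A 30-engineer fintech: the warehouse is a commodity, fraud detection is not.
print(recommend(30))                       # general analytics warehouse
print(recommend(30, {"differentiation"}))  # fraud-detection stream layer
```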
Post-Quiz: Roles and the Modern Stack

1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?

Data engineer (pipeline engineer)
Analytics engineer
Platform engineer
ML data engineer

2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?

SQL engines could not yet express joins or aggregates
Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later
Regulators required transformations to occur outside the warehouse
Object storage like S3 had not yet been invented and there was no place to land raw data

3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?

ELT instead of ETL
Use of dbt for transformations
Decoupled storage and compute, where independent compute clusters read the same single copy of data
Monolithic vendor tooling that owns ingest, transform, and serve

4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?

Move all data into Snowflake's proprietary native tables
Store data as raw CSV files and rely on application code for consistency
Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables
Run a single self-hosted Teradata cluster to centralize storage and compute

5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?

Self-host both the warehouse and fraud detection to maximize control
Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer
Buy managed services for everything, including fraud detection logic
Self-host the warehouse but buy a managed fraud-detection SaaS
Pre-Quiz: Processing Models and Architectures

1. A finance team needs a daily P&L close that must be auditable and accurate but can run overnight. Which processing model is the best fit and why?

Streaming, because modern stacks should default to streaming
Batch, because the SLA tolerates hours of latency and batch is simpler, cheaper, and easy to debug via re-runs
Micro-batch, because it gives sub-second latency at low cost
Lambda, because P&L always requires both batch accuracy and stream freshness

2. Which statement best describes the "two armies" problem of Lambda architecture?

Two clusters are required, one for storage and one for compute
The same business logic must be implemented and maintained twice — once in the batch framework and once in the stream framework — keeping them in sync forever
The serving layer must run on two different cloud providers
Two competing stream engines (Flink and Kafka Streams) must be deployed side by side

3. A team adopts Kappa architecture. The metric definition for "trending now" changes, so they need to reprocess the last 30 days of data. How is historical reprocessing accomplished in Kappa?

Run a separate batch job over a Hadoop snapshot of the data
Replay the durable, partitioned Kafka log from an earlier offset through the same Flink pipeline
Manually export the warehouse, transform with dbt, and re-import
Pause the stream and delete the existing serving-layer indexes so they will rebuild themselves

4. A fraud-scoring model must complete within 200 milliseconds during card authorization. Which processing model is mandatory, and why?

Batch, because batch is the most reliable model for financial use cases
Micro-batch, because it provides sub-second latency at lower complexity than full streaming
Streaming with stateful operators, because the consumer SLA requires per-event sub-second processing
Lambda, because fraud always needs both historical batch accuracy and real-time scoring in the same response

5. A team currently runs a daily batch dashboard refresh and is asked to deliver "near-real-time" data. They estimate going from daily to hourly is cheap, but going from one minute to one second would require introducing Druid or a key-value store. Which insight does this illustrate?

Batch and streaming have identical cost curves
Latency improvements scale linearly with cost
Each order-of-magnitude latency improvement is a step-function increase in operational complexity, so streaming should be reserved for use cases that genuinely require it
Modern dashboards always need sub-second latency by default

Processing Models and Architectures

Key Points

Batch, Micro-Batch, Streaming

Batch is like sending a daily mail truck — efficient and predictable, but slow to deliver any single letter. Micro-batch is a courier on a five-minute circuit. Streaming is a pneumatic tube — every letter shoots through the moment it's dropped in.

| Model | Typical Latency | Complexity | Cost Profile | Example Use Case |
| --- | --- | --- | --- | --- |
| Batch | Hours to days | Low | Lowest (idle between runs) | Nightly financial reconciliation |
| Micro-batch | Seconds to minutes | Medium | Medium (continuous compute) | Near-real-time dashboards |
| Streaming | Sub-second | High | Highest (always-on, stateful) | Fraud detection, live personalization |
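The three models differ mainly in when a result becomes available for the same input. The sketch below runs one hypothetical list of events through each model using plain Python; real engines add scheduling, fault tolerance, and distributed state, none of which is shown here.

```python
from collections import defaultdict

# Hypothetical events: (seconds since start, amount).
events = [(10, 5.0), (250, 7.5), (310, 2.5), (620, 10.0)]


def batch_total(events):
    """Batch: process the full data set in one run, after it has all arrived."""
    return sum(amount for _, amount in events)


def micro_batch_totals(events, window_s=300):
    """Micro-batch: group events into fixed windows, one result per window."""
    windows = defaultdict(float)
    for ts, amount in events:
        windows[ts // window_s * window_s] += amount
    return dict(windows)


def streaming_totals(events):
    """Streaming: keep running state and emit a result for every event."""
    total = 0.0
    for _, amount in events:
        total += amount
        yield total


print(batch_total(events))
print(micro_batch_totals(events))
print(list(streaming_totals(events)))
```

All three arrive at the same final total; the difference is how soon a consumer can see intermediate answers and how much state must be kept alive to produce them.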

Lambda vs Kappa

Lambda (Nathan Marz, ~2011) splits the pipeline into a batch layer (Hadoop, Spark) for accurate historical views, a speed layer (Flink, Kafka Streams) for low-latency recent data, and a serving layer (Druid, Pinot, Cassandra) that merges them. Its fatal weakness is the two-armies problem: every metric must be implemented twice and kept in sync forever.

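Kappa (Jay Kreps, ~2014) eliminates the batch layer. All data flows through a single streaming pipeline backed by a durable, replayable log (Kafka) and processed by a unified engine (Flink). Historical reprocessing is done by replaying the log from an earlier offset through the same pipeline: one codebase, one set of semantics. Replay reaches only as far back as the log's retention, so the log must keep as much history as you may need to reprocess. Because every metric is implemented once instead of twice, migrating from Lambda to a Kappa-style unified pipeline can simplify operations considerably, by a rough estimate of 50-70%.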
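Replay-based reprocessing can be sketched with an append-only list standing in for a single Kafka partition. The `append` and `process` helpers and the "trending" scoring rules are hypothetical; no Kafka or Flink API is used.

```python
log = []  # assumption: an append-only list stands in for one log partition


def append(event):
    """Append an event and return its offset."""
    log.append(event)
    return len(log) - 1


def process(from_offset, score):
    """Run one pipeline definition over the log, starting at an offset."""
    view = {}
    for event in log[from_offset:]:
        view[event["item"]] = view.get(event["item"], 0) + score(event)
    return view


for item, kind in [("a", "view"), ("b", "view"), ("a", "purchase"), ("b", "view")]:
    append({"item": item, "kind": kind})

# Original "trending now" definition: every event counts once.
trending_v1 = process(0, lambda e: 1)

# The definition changes: purchases now count five times.
# Reprocess history by replaying the same log from offset 0.
trending_v2 = process(0, lambda e: 5 if e["kind"] == "purchase" else 1)

print(trending_v1, trending_v2)
```

The same `process` function serves both live and historical data, which is the property that removes the second codebase.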

Figure 1.3: Lambda versus Kappa architectures

flowchart LR
    subgraph Lambda["Lambda Architecture"]
        E1["Event Source"] --> B["Batch Layer<br/>(Spark / Hadoop)"]
        E1 --> SP["Speed Layer<br/>(Flink / Kafka Streams)"]
        B --> SV1["Serving Layer<br/>(Druid / Pinot)"]
        SP --> SV1
        SV1 --> Q1["Query / Dashboard"]
    end
    subgraph Kappa["Kappa Architecture"]
        E2["Event Source"] --> K["Kafka<br/>(Durable Replayable Log)"]
        K --> FL["Flink<br/>(Unified Stream Engine)"]
        FL --> SV2["Serving Layer"]
        SV2 --> Q2["Query / Dashboard"]
        K -.replay from offset.-> FL
    end

Animation: Lambda vs Kappa data flow. Lambda forks every event into batch and speed paths (two codebases, two armies); Kappa runs one pipeline and replays the log when needed.

SLA-Driven Design

Three rules of thumb. First, work backwards from the user: a 6 a.m. dashboard does not need streaming; a 200 ms fraud check does. Second, latency compounds across the pipeline and the slowest stage dominates: a streaming source that lands in a warehouse loaded hourly still serves data that can be up to an hour old. Third, freshness has a cost curve that bends sharply upward: daily-to-hourly is cheap, hourly-to-one-minute usually requires re-architecting toward streaming, and one-minute-to-one-second often requires changing your storage layer.
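The first two rules can be sketched as two small functions. The thresholds (under one second for streaming, under one hour for micro-batch) are assumptions chosen to match the latency tiers in this chapter, and summing stage intervals is a simplified worst-case model, not a measurement method.

```python
def choose_model(sla_seconds):
    """Map a consumer latency SLA to a processing model (assumed thresholds)."""
    if sla_seconds < 1:
        return "streaming"
    if sla_seconds < 3600:
        return "micro-batch"
    return "batch"


def worst_case_staleness(stage_intervals_s):
    """Each stage can hold data for up to its interval, so the delays add up."""
    return sum(stage_intervals_s)


print(choose_model(0.2))        # 200 ms fraud check
print(choose_model(5 * 60))     # five-minute dashboard
print(choose_model(12 * 3600))  # overnight report

# Streaming source (1 s) -> hourly warehouse load -> 15-minute dashboard refresh.
print(worst_case_staleness([1, 3600, 900]))
```

The second call shows why a fast source alone does not deliver fresh data: the hourly load contributes almost all of the end-to-end delay.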

Figure 1.4: Decision tree for choosing a processing model from consumer SLA

flowchart TD
    Start["What is the consumer's<br/>latency SLA?"]
    Start --> Q1{"Sub-second<br/>required?"}
    Q1 -->|Yes| Stream["Streaming<br/>(Flink, Kafka Streams)<br/>+ stateful operators"]
    Q1 -->|No| Q2{"Seconds to<br/>minutes?"}
    Q2 -->|Yes| Micro["Micro-batch<br/>(Spark Structured Streaming)"]
    Q2 -->|No| Q3{"Hours acceptable?"}
    Q3 -->|Yes| Batch["Batch<br/>(Airflow + Spark / SQL)"]
    Q3 -->|No| Q4{"Need both batch<br/>accuracy and<br/>stream freshness?"}
    Q4 -->|Yes, two codebases OK| Lambda["Lambda Architecture"]
    Q4 -->|Prefer single codebase| Kappa["Kappa Architecture"]

Animation: Latency tiers and the step-function cost curve. Horizontal axis: freshness (days, hours, minutes, seconds, sub-second). Vertical axis: cost and complexity. Tiers: batch (hours to days, cheap), micro-batch (seconds to minutes, medium), streaming (sub-second, highest, plus stateful operators). The curve traces the step-function cost increase as freshness improves.
Post-Quiz: Processing Models and Architectures

1. A finance team needs a daily P&L close that must be auditable and accurate but can run overnight. Which processing model is the best fit and why?

Streaming, because modern stacks should default to streaming
Batch, because the SLA tolerates hours of latency and batch is simpler, cheaper, and easy to debug via re-runs
Micro-batch, because it gives sub-second latency at low cost
Lambda, because P&L always requires both batch accuracy and stream freshness

2. Which statement best describes the "two armies" problem of Lambda architecture?

Two clusters are required, one for storage and one for compute
The same business logic must be implemented and maintained twice — once in the batch framework and once in the stream framework — keeping them in sync forever
The serving layer must run on two different cloud providers
Two competing stream engines (Flink and Kafka Streams) must be deployed side by side

3. A team adopts Kappa architecture. The metric definition for "trending now" changes, so they need to reprocess the last 30 days of data. How is historical reprocessing accomplished in Kappa?

Run a separate batch job over a Hadoop snapshot of the data
Replay the durable, partitioned Kafka log from an earlier offset through the same Flink pipeline
Manually export the warehouse, transform with dbt, and re-import
Pause the stream and delete the existing serving-layer indexes so they will rebuild themselves

4. A fraud-scoring model must complete within 200 milliseconds during card authorization. Which processing model is mandatory, and why?

Batch, because batch is the most reliable model for financial use cases
Micro-batch, because it provides sub-second latency at lower complexity than full streaming
Streaming with stateful operators, because the consumer SLA requires per-event sub-second processing
Lambda, because fraud always needs both historical batch accuracy and real-time scoring in the same response

5. A team currently runs a daily batch dashboard refresh and is asked to deliver "near-real-time" data. They estimate going from daily to hourly is cheap, but going from one minute to one second would require introducing Druid or a key-value store. Which insight does this illustrate?

Batch and streaming have identical cost curves
Latency improvements scale linearly with cost
Each order-of-magnitude latency improvement is a step-function increase in operational complexity, so streaming should be reserved for use cases that genuinely require it
Modern dashboards always need sub-second latency by default
Pre-Quiz: Drivers and Trade-offs

1. A team's Snowflake bill triples in a month. Investigation shows a "primary" warehouse running 24/7, even though it sits idle most evenings and weekends. Which lever is most likely to fix this?

Switch to Databricks because it is always cheaper
Enable auto-suspend / auto-resume so idle compute releases automatically, and right-size the warehouse to the workload
Materialize every dbt model regardless of usage to reduce future scans
Move the warehouse to an on-prem cluster sized for peak load

2. A BigQuery user notices their query bill is much higher than expected. Their queries do SELECT * against a 10 TB table and lack partition filters. Which architecture-level fix maps best to BigQuery's pricing model?

Use only columnar projection (SELECT specific columns) and partition pruning, because BigQuery charges by bytes scanned
Switch to Snowflake because Snowflake never charges for scans
Use spot instances, since BigQuery's bill is dominated by per-second compute
Disable BigQuery's slot pool to force serial query execution

3. A HIPAA-regulated healthcare company has analysts, ML engineers, and auditors all needing access to patient data. Which architecture choice best satisfies governance pressure?

Each team copies production data into its own bucket so it can move quickly
Store raw records in S3 with Iceberg tables and a centralized catalog (e.g., Unity Catalog) for fine-grained access, so all users share one governed view
Email CSV extracts to teams who request data and audit via inboxes
Encrypt the warehouse with a single shared password rotated annually

4. A multinational bank acquires a competitor running entirely on a different cloud. The architecture team wants flexibility without paying full multi-cloud overhead. Which combination of tools best supports that?

Rebuild the acquired cloud's data on the primary cloud immediately and forbid future cross-cloud reads
Open table formats for storage portability, cloud-portable platforms like Snowflake/Databricks, and federated query engines like Trino — while still picking a primary cloud
Run identical full data stacks on every cloud the bank touches
Adopt only proprietary single-cloud-native services everywhere

5. The chapter argues governance "shapes architecture" rather than being an afterthought. Which architectural choice most directly reflects this claim?

Deploying separate data copies for ML, BI, and ad-hoc analysis to maximize team autonomy
Using a single governed lakehouse layer (open table formats + centralized catalog) that both BI and ML workloads consume, so policy applies uniformly
Documenting access rules in a wiki page rather than enforcing them in the catalog
Allowing ML teams to copy production data into side buckets where governance does not apply

Drivers and Trade-offs

Key Points

Cost-Performance Trade-offs

Decoupled storage and compute reframed cost. In the cloud, you pay for what you use — but only if your architecture lets you scale down. The fastest way to lose money is provisioning an always-on Snowflake warehouse or Databricks cluster and never turning it off. Three disciplines:

  1. Right-size compute to workload. A nightly ETL job needs a large warehouse for 30 minutes, not a medium warehouse for 4 hours.
  2. Use auto-suspend and auto-scale aggressively. Idle compute is the single biggest source of waste.
  3. Push transformations into cheaper layers. Materializing a heavily-used aggregate once is cheaper than scanning the raw table thousands of times — but materializing data nobody queries wastes both storage and pipeline runtime.
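The first two disciplines reduce to arithmetic on compute-hours. The sketch below uses the credit rates Snowflake documents for standard warehouses (each size doubles the previous one); treat them as an assumption and check current pricing. The runtimes come from the example above, and the 50 busy hours per week is a hypothetical figure.

```python
# Assumed credits per hour for standard warehouses; verify against current docs.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def credits(size, hours):
    """Credits consumed by a warehouse of a given size running for some hours."""
    return CREDITS_PER_HOUR[size] * hours


# Right-sizing: the nightly job from the text.
large_short = credits("L", 0.5)  # large warehouse for 30 minutes
medium_long = credits("M", 4)    # medium warehouse for 4 hours

# Auto-suspend: one week always-on versus only the (hypothetical) busy hours.
always_on = credits("M", 24 * 7)
busy_only = credits("M", 50)

print(large_short, medium_long)
print(always_on, busy_only)
```

The larger warehouse costs more per hour but less per job when it finishes much sooner, and suspending idle compute removes the hours that produce no work at all.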

Governance and Compliance

A decade ago governance meant a wiki page describing the schema. Today it means lineage, access control, classification, retention, and audit — often under GDPR, CCPA, HIPAA, SOX, or the EU AI Act. The modern stack responds in two ways: open table formats (Iceberg, Delta) make schema enforcement, time travel, and audit trails first-class features, and catalog/access-control layers (Unity Catalog, Snowflake RBAC, AWS Lake Formation, Apache Polaris) sit above storage and centralize permissions, masking, and lineage across engines.

The lakehouse pattern is particularly attractive from a governance perspective because a single governed layer serves both BI and ML, eliminating the common anti-pattern of ML teams copying production data into a side bucket where governance does not apply.
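A single governed layer means every engine goes through the same policy check. The toy catalog below illustrates that idea; the roles, columns, policy shape, and `governed_read` function are hypothetical and do not correspond to the API of Unity Catalog, Lake Formation, or any other product.

```python
# Toy data and policy; all names here are hypothetical.
TABLE = [
    {"patient_id": 1, "ssn": "123-45-6789", "diagnosis": "flu"},
    {"patient_id": 2, "ssn": "987-65-4321", "diagnosis": "asthma"},
]

POLICY = {
    "analyst": {"columns": {"patient_id", "diagnosis"}, "masked": set()},
    "ml_engineer": {"columns": {"patient_id", "ssn", "diagnosis"}, "masked": {"ssn"}},
}

AUDIT_LOG = []


def governed_read(role, engine):
    """Every engine calls the same function, so policy applies uniformly."""
    rule = POLICY.get(role)
    if rule is None:
        raise PermissionError(f"role {role!r} has no grant on this table")
    AUDIT_LOG.append((role, engine))  # audit trail of who read via what
    return [
        {
            col: ("***" if col in rule["masked"] else value)
            for col, value in row.items()
            if col in rule["columns"]
        }
        for row in TABLE
    ]


bi_rows = governed_read("analyst", "bi_dashboard")
ml_rows = governed_read("ml_engineer", "feature_pipeline")
print(bi_rows[0], ml_rows[0])
```

If the ML team instead copied `TABLE` into a side bucket, neither the masking rule nor the audit log would apply to that copy, which is the anti-pattern the paragraph above describes.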

Multi-Cloud and Hybrid Realities

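Acquisitions, regulatory data-residency requirements, vendor pricing leverage, and pre-existing investments produce multi-cloud and hybrid (cloud + on-prem) realities whether the architecture team wants them or not. Three patterns recur: open table formats for storage portability, cloud-portable platforms such as Snowflake and Databricks, and federated query engines such as Trino.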

Most pragmatic teams pick a primary cloud and treat secondaries as exceptions, using open formats and portable platforms to keep the door open without paying full multi-cloud overhead.

Post-Quiz: Drivers and Trade-offs

1. A team's Snowflake bill triples in a month. Investigation shows a "primary" warehouse running 24/7, even though it sits idle most evenings and weekends. Which lever is most likely to fix this?

Switch to Databricks because it is always cheaper
Enable auto-suspend / auto-resume so idle compute releases automatically, and right-size the warehouse to the workload
Materialize every dbt model regardless of usage to reduce future scans
Move the warehouse to an on-prem cluster sized for peak load

2. A BigQuery user notices their query bill is much higher than expected. Their queries do SELECT * against a 10 TB table and lack partition filters. Which architecture-level fix maps best to BigQuery's pricing model?

Use only columnar projection (SELECT specific columns) and partition pruning, because BigQuery charges by bytes scanned
Switch to Snowflake because Snowflake never charges for scans
Use spot instances, since BigQuery's bill is dominated by per-second compute
Disable BigQuery's slot pool to force serial query execution

3. A HIPAA-regulated healthcare company has analysts, ML engineers, and auditors all needing access to patient data. Which architecture choice best satisfies governance pressure?

Each team copies production data into its own bucket so it can move quickly
Store raw records in S3 with Iceberg tables and a centralized catalog (e.g., Unity Catalog) for fine-grained access, so all users share one governed view
Email CSV extracts to teams who request data and audit via inboxes
Encrypt the warehouse with a single shared password rotated annually

4. A multinational bank acquires a competitor running entirely on a different cloud. The architecture team wants flexibility without paying full multi-cloud overhead. Which combination of tools best supports that?

Rebuild the acquired cloud's data on the primary cloud immediately and forbid future cross-cloud reads
Open table formats for storage portability, cloud-portable platforms like Snowflake/Databricks, and federated query engines like Trino — while still picking a primary cloud
Run identical full data stacks on every cloud the bank touches
Adopt only proprietary single-cloud-native services everywhere

5. The chapter argues governance "shapes architecture" rather than being an afterthought. Which architectural choice most directly reflects this claim?

Deploying separate data copies for ML, BI, and ad-hoc analysis to maximize team autonomy
Using a single governed lakehouse layer (open table formats + centralized catalog) that both BI and ML workloads consume, so policy applies uniformly
Documenting access rules in a wiki page rather than enforcing them in the catalog
Allowing ML teams to copy production data into side buckets where governance does not apply
