Study Guide: Foundations of Modern Data Engineering

Pre-Quiz: Roles and the Modern Stack

1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?

Data engineer (pipeline engineer)

Analytics engineer

Platform engineer

ML data engineer

2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?

SQL engines could not yet express joins or aggregates

Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later

Regulators required transformations to occur outside the warehouse

Object storage like S3 had not yet been invented and there was no place to land raw data

3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?

ELT instead of ETL

Use of dbt for transformations

Decoupled storage and compute, where independent compute clusters read the same single copy of data

Monolithic vendor tooling that owns ingest, transform, and serve

4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?

Move all data into Snowflake's proprietary native tables

Store data as raw CSV files and rely on application code for consistency

Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables

Run a single self-hosted Teradata cluster to centralize storage and compute

5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?

Self-host both the warehouse and fraud detection to maximize control

Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer

Buy managed services for everything, including fraud detection logic

Self-host the warehouse but buy a managed fraud-detection SaaS

Roles and the Modern Stack

Key Points

A modern data engineer owns the path from ingestion to query-ready storage and is judged on reliability, scalability, and cost-efficiency, with tools like Airflow, Kafka, and Spark.
The data team has fractured into specialized roles: data engineer (pipelines), analytics engineer (dbt/SQL modeling), platform engineer (shared infra), and ML data engineer (feature pipelines).
The industry shifted from monolithic ETL (one vendor, transform-before-load, on-prem) to composable ELT (best-of-breed, load-then-transform-in-place) once cloud storage became cheap and compute became elastic.
Decoupled storage and compute is the defining architectural shift: a single copy of data on object storage, queried by many independent compute engines.
Open table formats (Iceberg, Delta Lake, Hudi) over object storage produce the lakehouse pattern, blending lake economics with warehouse semantics and reducing vendor lock-in.

Specialization Across the Pipeline

The "data team" of 2015 was usually one or two generalists. Today, work specializes by layer of the stack. The data engineer wires Airflow, Kafka, and Spark to land reliable raw data in the warehouse. The analytics engineer writes dbt models to transform raw tables into business-friendly objects like dim_customer. The platform engineer runs the shared Kubernetes, Terraform, IAM, and CI/CD that everyone above them depends on. The ML data engineer branches off the warehouse to feed feature stores and model training.

If data is electricity, the platform engineer builds the grid, the data engineer wires the houses, the analytics engineer designs the appliances, and the ML data engineer powers a specialized factory next door.

Figure 1.1: Modern data team roles across the pipeline stack

flowchart LR
    Sources["Source Systems<br/>(APIs, DBs, Logs)"]
    DE["Data Engineer<br/>Airflow / Kafka / Spark"]
    WH["Warehouse / Lakehouse<br/>Snowflake / Databricks"]
    AE["Analytics Engineer<br/>dbt / SQL / Git"]
    BI["BI Dashboards<br/>Looker / Hex"]
    MLE["ML Data Engineer<br/>Feast / MLflow"]
    FS["Feature Store"]
    Models["ML Models"]
    PE["Platform Engineer: Terraform, Kubernetes, CI/CD, IAM, Observability"]
    Sources --> DE --> WH --> AE --> BI
    WH --> MLE --> FS --> Models
    PE -.supports.-> DE
    PE -.supports.-> WH
    PE -.supports.-> AE
    PE -.supports.-> MLE

Role	Primary Concern	Tooling	Output Consumer
Data Engineer	Ingest -> store reliability	Airflow, Kafka, Spark	Warehouse/lakehouse tables
Analytics Engineer	Business-logic modeling	dbt, SQL, Git	Analysts, dashboards
Platform Engineer	Shared infra & DevOps	Terraform, K8s, CI/CD	Other data teams
ML Data Engineer	Feature pipelines	Python, MLflow, feature stores	Data scientists, models

From Monolithic ETL to Composable ELT

The original architecture was monolithic ETL: one vendor (Informatica, Ab Initio, IBM DataStage) extracted, transformed in a proprietary engine, and loaded an on-prem warehouse like Teradata. Transformation had to happen before loading because warehouse storage was expensive and compute was scarce.

The modern pattern inverts this. ELT loads raw data into a cloud warehouse first, then transforms in place using SQL. Cheap object storage (pennies per GB-month) and elastic compute (a thousand cores for ten minutes) erased the economic constraint that forced ETL.
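The load-then-transform order can be sketched in a few lines. This is a minimal illustration, not a production pattern: Python's built-in sqlite3 stands in for a cloud warehouse, and the table name raw_orders, the sample rows, and the cleaning rules are all hypothetical. In a real stack a tool such as Fivetran would handle the load and dbt would manage the SQL.

```python
import sqlite3

# In-memory SQLite stands in for the cloud warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw rows untouched, including messy values.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
raw_rows = [
    ("1", " alice ", "19.99"),
    ("2", "BOB", "5.00"),
    ("3", "alice", "not_a_number"),  # bad record stays in raw for later inspection
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean in place with SQL, inside the warehouse.
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
""")

fct_orders = conn.execute(
    "SELECT order_id, customer, amount FROM fct_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(fct_orders, raw_count)
```

The raw table keeps all three rows, so a later change to the cleaning rule only requires re-running the SQL, not re-extracting from the source.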

Figure 1.2: Monolithic ETL versus composable ELT

flowchart TB
    subgraph Monolithic["Monolithic ETL (2005)"]
        S1["Oracle Source"] --> I1["Informatica<br/>Extract + Transform"]
        I1 --> T1["Teradata Warehouse"]
    end
    subgraph Composable["Composable ELT (2025)"]
        S2["Postgres Source"] --> F["Fivetran<br/>(Extract + Load)"]
        F --> SF["Snowflake<br/>(Raw Tables)"]
        SF --> D["dbt<br/>(Transform in SQL)"]
        D --> SF2["Snowflake<br/>(fct_orders, dim_product)"]
        A["Airflow<br/>(Orchestration)"] -.triggers.-> F
        A -.triggers.-> D
        SF2 --> L["Looker Dashboards"]
    end

Decoupled Storage and Compute

In a traditional warehouse like Teradata, storage and compute lived on the same physical nodes — to store more, you had to buy more compute too. Decoupled storage and compute is the defining architectural shift of the modern stack. A single copy of data sits on cheap, durable object storage (S3, Google Colossus, Azure ADLS), while many independent compute engines read it: Snowflake Virtual Warehouses, BigQuery slot pools, Databricks Spark + Photon clusters, Trino federations.

A traditional warehouse is a hotel where every guest must rent the entire floor including the kitchen and the gym. Decoupled compute/storage is a public library — books sit on shared shelves, and any reader can show up, check out what they need, and leave without the others noticing.
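As a rough sketch of the same idea, the toy below keeps one read-only copy of the data and lets two separately sized worker pools query it. The ComputeEngine class, the file names, and the numbers are hypothetical, and thread pools only imitate what real engines do with separate clusters.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared, read-only copy of the data (stands in for files on object storage).
OBJECT_STORE = {
    "orders/part-0": [120, 80, 45],
    "orders/part-1": [300, 15],
}

class ComputeEngine:
    """A hypothetical compute cluster with its own, independently sized worker pool."""

    def __init__(self, name, workers):
        self.name = name
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def run(self, per_file_fn, combine_fn):
        # Each engine scans the same files using only its own workers.
        partials = list(self.pool.map(per_file_fn, OBJECT_STORE.values()))
        return combine_fn(partials)

bi_engine = ComputeEngine("bi", workers=2)
ds_engine = ComputeEngine("data_science", workers=8)  # scaled up without touching storage

total_revenue = bi_engine.run(sum, sum)
largest_order = ds_engine.run(max, max)
print(total_revenue, largest_order)

bi_engine.pool.shutdown()
ds_engine.pool.shutdown()
```

Resizing one engine's pool changes nothing for the other engine or for the stored data, which is the property that keeps the BI workload from being blocked by a heavy experiment.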

Open Table Formats and the Lakehouse

Object storage alone gives you only files. To get warehouse behavior — ACID, schema enforcement, time travel — you layer an open table format (Iceberg, Delta Lake, Hudi) on top. The lakehouse is what you get when you combine open table formats over object storage with warehouse-style query engines: cheap scalable storage with ACID guarantees, plus interoperability so Snowflake, Databricks, Trino, Athena, Spark, and Flink can all read the same tables.
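To make the layering concrete, here is a toy sketch of the general idea behind snapshot-based table formats: data files are immutable, and a metadata log records which files belong to each version of the table. The ToyTable class and its methods are hypothetical and do not mirror the Iceberg, Delta Lake, or Hudi APIs; real formats persist their metadata alongside the data and handle concurrent writers, which this sketch ignores.

```python
import json
import tempfile
from pathlib import Path

class ToyTable:
    """Toy snapshot-based table: immutable data files plus a metadata log."""

    def __init__(self, root):
        self.root = Path(root)
        self.schema = {"order_id": int, "amount": float}
        self.snapshots = [[]]  # snapshot 0 is the empty table

    def append(self, rows):
        # Schema enforcement: reject the whole commit if any row does not conform.
        for row in rows:
            if set(row) != set(self.schema) or not all(
                isinstance(row[col], typ) for col, typ in self.schema.items()
            ):
                raise ValueError(f"row violates schema: {row}")
        # Write a new immutable data file; existing files are never modified.
        data_file = self.root / f"data-{len(self.snapshots)}.json"
        data_file.write_text(json.dumps(rows))
        # Commit: a new snapshot lists the previous files plus the new one.
        self.snapshots.append(self.snapshots[-1] + [data_file.name])
        return len(self.snapshots) - 1  # snapshot id

    def read(self, snapshot_id=None):
        # Time travel: read the file list recorded by an older snapshot.
        files = self.snapshots[-1 if snapshot_id is None else snapshot_id]
        return [row for name in files
                for row in json.loads((self.root / name).read_text())]

rejected = False
with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(tmp)
    first = table.append([{"order_id": 1, "amount": 9.5}])
    table.append([{"order_id": 2, "amount": 20.0}])
    try:
        table.append([{"order_id": "3", "amount": 1.0}])  # wrong type, commit rejected
    except ValueError:
        rejected = True
    latest_rows = table.read()
    rows_at_first = table.read(snapshot_id=first)
```

Because a reader only ever follows one snapshot's file list, it sees either all of a commit or none of it, and older snapshots remain readable for as long as their files are kept.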

Specialized warehouses with proprietary storage and tightly integrated caches still often outperform open-format engines fetching from remote object storage, but the gap is closing, and even closed-platform vendors (Snowflake, BigQuery, Synapse) have added Iceberg compatibility to reduce customer fears of lock-in.

Figure 1.5: Decoupled storage and compute with multiple independent engines

flowchart TD
    subgraph Compute["Independent Compute Engines"]
        VW1["Snowflake VW<br/>(BI Workload)"]
        VW2["Snowflake VW<br/>(Data Science)"]
        DB["Databricks Cluster<br/>(ETL with Photon)"]
        BQ["BigQuery Slots<br/>(Ad-hoc Queries)"]
        TR["Trino<br/>(Federated SQL)"]
    end
    Storage["Object Storage Layer<br/>S3 / Google Colossus / Azure ADLS<br/>(single copy of data, Parquet + Iceberg / Delta)"]
    VW1 --> Storage
    VW2 --> Storage
    DB --> Storage
    BQ --> Storage
    TR --> Storage

Managed vs Self-Hosted

Modern stack decisions sit on a spectrum from fully managed SaaS (Snowflake, BigQuery, Databricks, Fivetran, dbt Cloud) to self-hosted open source (Spark on EKS, Airflow on K8s, raw Kafka). Managed trades money for engineering time; self-hosting trades engineering time for cost flexibility and control. The pragmatic heuristic for teams under ~50 engineers: buy managed for layers that aren't your differentiator, and self-host only where you have a specific reason, such as cost at scale, regulatory requirements, or genuine differentiation.

Dimension	Managed (Snowflake)	Self-Hosted (Spark on K8s)
Time to first value	Days	Weeks to months
Ops burden	Vendor handles	You handle (upgrades, scaling, on-call)
Cost at small scale	Often cheaper (no idle infra)	Often more expensive
Cost at large scale	Premium per unit	Can be much cheaper if optimized
Customization	Limited	Full
Lock-in risk	Higher	Lower
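The two cost rows in the table imply a crossover point, which can be worked out with simple arithmetic. Every number below is invented purely for illustration (hourly rates, the fixed operations cost, the usage levels); the sketch assumes managed cost is purely usage-based and self-hosted cost is a fixed engineering overhead plus a lower per-hour rate.

```python
def monthly_cost_managed(compute_hours, price_per_hour=4.0):
    # Managed: pay per unit of use, no idle infrastructure (hypothetical rate).
    return compute_hours * price_per_hour

def monthly_cost_self_hosted(compute_hours, price_per_hour=1.5, fixed_ops_cost=10_000.0):
    # Self-hosted: cheaper per unit, plus a fixed cost for upgrades, scaling, on-call.
    return fixed_ops_cost + compute_hours * price_per_hour

def break_even_hours(managed_rate=4.0, self_rate=1.5, fixed_ops_cost=10_000.0):
    # Solve managed_rate * h == fixed_ops_cost + self_rate * h for h.
    return fixed_ops_cost / (managed_rate - self_rate)

crossover = break_even_hours()
small_scale = (monthly_cost_managed(1_000), monthly_cost_self_hosted(1_000))
large_scale = (monthly_cost_managed(10_000), monthly_cost_self_hosted(10_000))
print(crossover, small_scale, large_scale)
```

Under these made-up inputs the managed option wins below the crossover and the self-hosted option wins above it; the useful exercise is plugging in a team's own rates and staffing cost rather than trusting these figures.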

Post-Quiz: Roles and the Modern Stack

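1. A startup hires a "data engineer" but its real need is someone to write SQL transformations in dbt to produce dim_customer and fct_orders tables for the BI team. Which role best matches that work?

Data engineer (pipeline engineer)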

Analytics engineer

Platform engineer

ML data engineer

2. A retailer in 2005 transformed data inside Informatica's engine before loading aggregated tables into Teradata. Why was the transform-before-load order economically necessary at the time?

SQL engines could not yet express joins or aggregates

Warehouse storage was expensive and warehouse compute was scarce, so messy raw data could not be landed and cleaned later

Regulators required transformations to occur outside the warehouse

Object storage like S3 had not yet been invented and there was no place to land raw data

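3. A team wants the marketing analytics workload to never block the data science team's heavy experiment, even though both query the same tables. Which architectural property of the modern data stack most directly enables this?

ELT instead of ETL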

Use of dbt for transformations

Decoupled storage and compute, where independent compute clusters read the same single copy of data

Monolithic vendor tooling that owns ingest, transform, and serve

4. A CTO worries about vendor lock-in but still wants ACID transactions, schema enforcement, and time travel on data sitting in S3. Which approach best addresses both goals?

Move all data into Snowflake's proprietary native tables

Store data as raw CSV files and rely on application code for consistency

Adopt an open table format like Iceberg or Delta Lake over S3 so multiple engines can read the same governed tables

Run a single self-hosted Teradata cluster to centralize storage and compute

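5. A 30-engineer fintech is choosing between managed Snowflake and self-hosted Spark for its general analytics warehouse. Fraud detection is their core differentiator; warehousing is not. Which decision best matches the chapter's heuristic?

Self-host both the warehouse and fraud detection to maximize control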

Buy managed Snowflake for the commodity warehouse layer; consider self-hosting only the fraud-detection stream layer

Buy managed services for everything, including fraud detection logic

Self-host the warehouse but buy a managed fraud-detection SaaS

Pre-Quiz: Processing Models and Architectures

1. A finance team needs a daily P&L close that must be auditable and accurate but can run overnight. Which processing model is the best fit and why?

Streaming, because modern stacks should default to streaming

Batch, because the SLA tolerates hours of latency and batch is simpler, cheaper, and easy to debug via re-runs

Micro-batch, because it gives sub-second latency at low cost

Lambda, because P&L always requires both batch accuracy and stream freshness

2. Which statement best describes the "two armies" problem of Lambda architecture?

Two clusters are required, one for storage and one for compute

The same business logic must be implemented and maintained twice — once in the batch framework and once in the stream framework — keeping them in sync forever

The serving layer must run on two different cloud providers

Two competing stream engines (Flink and Kafka Streams) must be deployed side by side

3. A team adopts Kappa architecture. The metric definition for "trending now" changes, so they need to reprocess the last 30 days of data. How is historical reprocessing accomplished in Kappa?

Run a separate batch job over a Hadoop snapshot of the data

Replay the durable, partitioned Kafka log from an earlier offset through the same Flink pipeline

Manually export the warehouse, transform with dbt, and re-import

Pause the stream and delete the existing serving-layer indexes so they will rebuild themselves

4. A fraud-scoring model must complete within 200 milliseconds during card authorization. Which processing model is mandatory, and why?

Batch, because batch is the most reliable model for financial use cases

Micro-batch, because it provides sub-second latency at lower complexity than full streaming

Streaming with stateful operators, because the consumer SLA requires per-event sub-second processing

Lambda, because fraud always needs both historical batch accuracy and real-time scoring in the same response

5. A team currently runs a daily batch dashboard refresh and is asked to deliver "near-real-time" data. They estimate going from daily to hourly is cheap, but going from one minute to one second would require introducing Druid or a key-value store. Which insight does this illustrate?

Batch and streaming have identical cost curves

Latency improvements scale linearly with cost

Each order-of-magnitude latency improvement is a step-function increase in operational complexity, so streaming should be reserved for use cases that genuinely require it

Modern dashboards always need sub-second latency by default

Processing Models and Architectures

Key Points

Three canonical processing models live on a latency-vs-complexity spectrum: batch (hours-to-days, cheap, simple), micro-batch (seconds-to-minutes, medium), streaming (sub-second, complex, most expensive).
Lambda architecture combines a batch layer (accurate historical), a speed layer (real-time approximate), and a serving layer (merged) — at the cost of duplicating business logic across two engines (the "two armies" problem).
Kappa eliminates the batch layer: a single streaming pipeline backed by a durable replayable log (Kafka) and a unified engine (Flink). Historical reprocessing means replaying the log from an earlier offset.
The right processing model is determined by the consumer's SLA, not the team's enthusiasm. Latency accumulates across pipeline hops, and the slowest hop sets the floor on end-to-end freshness.
Freshness has a cost curve that bends sharply upward: each order-of-magnitude latency improvement is a step-function increase in operational complexity.

Batch, Micro-Batch, Streaming

Batch is like sending a daily mail truck — efficient and predictable, but slow to deliver any single letter. Micro-batch is a courier on a five-minute circuit. Streaming is a pneumatic tube — every letter shoots through the moment it's dropped in.

Model	Typical Latency	Complexity	Cost Profile	Example Use Case
Batch	Hours to days	Low	Lowest (idle between runs)	Nightly financial reconciliation
Micro-batch	Seconds to minutes	Medium	Medium (continuous compute)	Near-real-time dashboards
Streaming	Sub-second	High	Highest (always-on, stateful)	Fraud detection, live personalization
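The three models can be contrasted by running the same running-total computation under three different triggers. This is a toy sketch over a hypothetical in-memory event list; it ignores distribution, fault tolerance, and late data, and the one-second interval is an arbitrary choice.

```python
# Hypothetical events: (arrival_time_seconds, amount).
events = [(0.2, 10), (0.9, 5), (1.4, 7), (2.1, 3), (2.8, 20)]

def run_batch(events):
    # One run after all data has arrived: a single result, available last.
    return [sum(amount for _, amount in events)]

def run_micro_batch(events, interval=1.0):
    # Accumulate events and emit an updated total once per interval.
    outputs, total, boundary = [], 0, interval
    for ts, amount in events:
        while ts >= boundary:
            outputs.append(total)
            boundary += interval
        total += amount
    outputs.append(total)  # flush the final partial interval
    return outputs

def run_streaming(events):
    # Update state and emit on every single event.
    outputs, total = [], 0
    for _, amount in events:
        total += amount
        outputs.append(total)
    return outputs

batch_out = run_batch(events)
micro_out = run_micro_batch(events)
stream_out = run_streaming(events)
print(batch_out, micro_out, stream_out)
```

All three end at the same final total; what differs is how often an up-to-date answer is available, and how much continuously running state the pipeline has to keep.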

Lambda vs Kappa

Lambda (Nathan Marz, ~2011) splits the pipeline into a batch layer (Hadoop, Spark) for accurate historical views, a speed layer (Flink, Kafka Streams) for low-latency recent data, and a serving layer (Druid, Pinot, Cassandra) that merges them. Its fatal weakness is the two-armies problem: every metric must be implemented twice and kept in sync forever.
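The two-armies problem is easier to see in code. The sketch below is a hypothetical page-view counter: the metric is written once as a full recomputation for the batch layer and again as an incremental update for the speed layer, and the serving layer merges the two. The event data and all names are invented for illustration.

```python
from collections import Counter

# Hypothetical page-view events as (day, page); day 3 has not been batch-processed yet.
historical = [(1, "home"), (1, "pricing"), (2, "home")]
recent = [(3, "home"), (3, "docs")]

def batch_view(events):
    # Batch layer: full recomputation over the historical data set.
    return Counter(page for _, page in events)

class SpeedView:
    # Speed layer: the SAME metric, re-implemented as incremental per-event updates.
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1

def serve(batch_counts, speed_counts):
    # Serving layer: merge the accurate historical view with the fresh recent view.
    return batch_counts + speed_counts

speed = SpeedView()
for event in recent:
    speed.on_event(event)

merged = dict(serve(batch_view(historical), speed.counts))
print(merged)
```

If the definition of a page view changes (for example, to exclude bots), both batch_view and SpeedView.on_event must change in lockstep, or the merged numbers quietly drift apart.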

Kappa (Jay Kreps, ~2014) eliminates the batch layer. All data flows through a single streaming pipeline backed by a durable, replayable log (Kafka) processed by a unified engine (Flink). Historical reprocessing is done by replaying the log from an earlier offset: one codebase, one set of semantics. Migrating from Lambda to Kappa-style unified pipelines can simplify operations substantially, because there is only one pipeline to deploy, monitor, and debug; the 50-70% figure sometimes quoted for this is a rough estimate, not a measured benchmark.
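Replay-based reprocessing can be sketched without a real broker. Below, an append-only Python list stands in for a single Kafka partition and list indexes play the role of offsets; the run_pipeline function, the click data, and the "trending" thresholds are all hypothetical. A real deployment would reset consumer offsets in Kafka and rerun the Flink job rather than slice a list.

```python
# An append-only list stands in for one Kafka partition; indexes act as offsets.
log = [
    {"item": "a", "clicks": 3},
    {"item": "b", "clicks": 8},
    {"item": "a", "clicks": 8},
    {"item": "c", "clicks": 1},
]

def run_pipeline(log, from_offset, is_trending):
    # The single pipeline: used both for live processing and for reprocessing.
    totals = {}
    for record in log[from_offset:]:
        totals[record["item"]] = totals.get(record["item"], 0) + record["clicks"]
    return sorted(item for item, clicks in totals.items() if is_trending(clicks))

# Original metric definition (hypothetical): trending means at least 9 clicks.
trending_v1 = run_pipeline(log, from_offset=0, is_trending=lambda c: c >= 9)

# The definition changes. Reprocess by replaying the log from offset 0
# through the same pipeline code with the new rule.
trending_v2 = run_pipeline(log, from_offset=0, is_trending=lambda c: c >= 8)

# Replaying from a later offset covers only the more recent records.
trending_recent = run_pipeline(log, from_offset=2, is_trending=lambda c: c >= 8)
print(trending_v1, trending_v2, trending_recent)
```

There is no second implementation to keep in sync: the only inputs that change between live processing and reprocessing are the starting offset and the rule.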

Figure 1.3: Lambda versus Kappa architectures

flowchart LR
    subgraph Lambda["Lambda Architecture"]
        E1["Event Source"] --> B["Batch Layer<br/>(Spark / Hadoop)"]
        E1 --> SP["Speed Layer<br/>(Flink / Kafka Streams)"]
        B --> SV1["Serving Layer<br/>(Druid / Pinot)"]
        SP --> SV1
        SV1 --> Q1["Query / Dashboard"]
    end
    subgraph Kappa["Kappa Architecture"]
        E2["Event Source"] --> K["Kafka<br/>(Durable Replayable Log)"]
        K --> FL["Flink<br/>(Unified Stream Engine)"]
        FL --> SV2["Serving Layer"]
        SV2 --> Q2["Query / Dashboard"]
        K -.replay from offset.-> FL
    end

Lambda forks every event into batch + speed paths; Kappa runs one pipeline and replays the log when needed.

SLA-Driven Design

Three rules of thumb. First, work backwards from the user: a 6 a.m. dashboard does not need streaming; a 200 ms fraud check does. Second, latency accumulates across the pipeline and the slowest hop sets the floor: a streaming source that lands in a warehouse loaded hourly is only ever as fresh as that hourly load. Third, freshness has a cost curve that bends sharply upward: daily-to-hourly is cheap, hourly-to-one-minute usually requires re-architecting toward streaming, and one-minute-to-one-second often requires changing your storage layer.
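The second rule can be checked with a little arithmetic. The sketch below uses a deliberately simple additive model, in which each hop can add up to its full interval to the age of the data; the hop names and delays are hypothetical.

```python
def end_to_end_staleness(hops):
    # Worst-case data age at the consumer: each hop can add up to its full interval.
    return sum(seconds for _, seconds in hops)

def bottleneck(hops):
    # The slowest hop sets the floor; speeding up other hops cannot beat it.
    return max(hops, key=lambda hop: hop[1])

# Hypothetical pipeline: (hop name, worst-case delay in seconds).
pipeline = [
    ("kafka_ingest", 1),
    ("warehouse_load_hourly", 3600),
    ("dashboard_cache", 300),
]

worst_case = end_to_end_staleness(pipeline)
slowest = bottleneck(pipeline)
print(worst_case, slowest)
```

With these numbers, shaving the one-second ingest hop further is wasted effort: the hourly load accounts for almost all of the staleness, so that is the hop to change if the consumer's SLA demands fresher data.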

Figure 1.4: Decision tree for choosing a processing model from consumer SLA

flowchart TD
    Start["What is the consumer's<br/>latency SLA?"]
    Start --> Q1{"Sub-second<br/>required?"}
    Q1 -->|Yes| Stream["Streaming<br/>(Flink, Kafka Streams)<br/>+ stateful operators"]
    Q1 -->|No| Q2{"Seconds to<br/>minutes?"}
    Q2 -->|Yes| Micro["Micro-batch<br/>(Spark Structured Streaming)"]
    Q2 -->|No| Q3{"Hours acceptable?"}
    Q3 -->|Yes| Batch["Batch<br/>(Airflow + Spark / SQL)"]
    Q3 -->|No| Q4{"Need both batch<br/>accuracy and<br/>stream freshness?"}
    Q4 -->|Yes, two codebases OK| Lambda["Lambda Architecture"]
    Q4 -->|Prefer single codebase| Kappa["Kappa Architecture"]

Post-Quiz: Processing Models and Architectures