Study Guide: Chapter 4 — Batch Ingestion and ETL/ELT Pipelines

Pre-Section Quiz: ETL vs ELT and AWS Glue

1. Why does ELT typically outperform ETL when the destination is a modern cloud warehouse?

A. Cloud warehouses can scan more bytes per second than any intermediate ETL server.

B. Cloud warehouses provide elastic, pay-per-use compute that scales up for transforms and back down when idle, while a dedicated ETL server costs the same whether busy or not.

C. ELT pipelines never need a transformation framework because raw data is self-describing.

D. ETL is incompatible with semi-structured formats like Parquet or JSON.

2. Which scenario most strongly justifies keeping an ETL stage in front of a cloud warehouse?

A. The team wants new fields to ride along automatically without redeployment.

B. The warehouse already supports semi-structured types like VARIANT.

C. PII columns must be hashed or tokenized before raw values are ever queryable in the warehouse.

D. The pipeline produces fewer than one million rows per day.

3. What is the primary role of the AWS Glue Data Catalog?

A. It executes Spark jobs and writes their results back to S3.

B. It is a central metadata repository (table definitions, schemas, partitions) that lets Athena, Redshift Spectrum, EMR, and Glue jobs all agree on what a table is.

C. It stores the actual row-level data on managed SSDs operated by AWS.

D. It is a visual ETL builder that generates PySpark code from drag-and-drop diagrams.

4. A bucket of JSON files has the field address.zip stored as a string in some records and as an integer in others. What does a Glue DynamicFrame do that a Spark DataFrame cannot?

A. It rejects every record that does not match a single inferred schema.

B. It silently casts every zip to a string by default with no way to reconsider.

C. It stores both possibilities per record using a "choice type" and lets you resolve the ambiguity later via ResolveChoice.

D. It refuses to load the data and emits a Spark schema-mismatch exception.

5. Which Glue tuning lever directly addresses the driver memory pressure caused by an S3 prefix containing thousands of small files?

A. Switching to G.8X workers and accepting the cost.

B. Disabling Adaptive Query Execution.

C. Setting the connection option groupFiles: 'inPartition' so many small files collapse into a single Spark task.

D. Calling job.commit() twice at the end of the script.

ETL vs ELT and AWS Glue

For three decades "ETL" was synonymous with data integration: a dedicated server pulled rows out of source systems, reshaped them in flight, and wrote the polished result into the warehouse. Cloud warehouses inverted that flow. Today most pipelines load raw data first and transform it inside the warehouse, an arrangement called ELT. Knowing when to use each pattern, and where the boundary lies, is the most consequential architectural decision in a batch pipeline.

Key Points

ELT is the default in cloud warehouses because compute elasticity, pay-per-use billing, and schema-on-read each remove a constraint that ETL used to impose.
ETL still applies when raw data must not land in the warehouse: PII redaction, strict schema enforcement, or crossing trust boundaries.
Hybrid is normal. "ETL the unsafe, ELT the rest" — hash PII in flight; load everything else raw.
Transformation frameworks are non-optional. dbt and SQLMesh treat SQL as code: version-controlled, tested, dependency-aware.
AWS Glue bundles four capabilities for batch pipelines: a metadata catalog, schema crawlers, a visual ETL builder (Glue Studio), and a serverless Spark runtime.

Why ELT dominates in cloud warehouses

Cloud warehouses favor ELT for three reinforcing reasons. First, compute elasticity: Snowflake, BigQuery, and Redshift can scale compute up to handle a transform and back down when it finishes; a dedicated ETL server costs the same whether processing 100 GB or sitting idle. Second, pay-per-use billing: Snowflake charges per second of warehouse compute, BigQuery per byte scanned, Redshift via reserved or serverless capacity — all of which reward "transform when needed" rather than "transform always." Third, schema agility: loading raw data first means new fields land automatically and historical reprocessing is a query, not a redeployment.
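
The idle-cost argument can be made concrete with a little arithmetic. The sketch below compares an always-on server with pay-per-use compute; every rate and hour count in it is a hypothetical placeholder chosen for illustration, not a vendor price.

```python
# All numbers are hypothetical placeholders, not real vendor prices.
HOURS_PER_MONTH = 730


def always_on_cost(hourly_rate: float) -> float:
    """Monthly cost of a dedicated ETL server billed whether busy or idle."""
    return hourly_rate * HOURS_PER_MONTH


def pay_per_use_cost(hourly_rate: float, busy_hours: float) -> float:
    """Monthly cost of elastic compute billed only while transforms run."""
    return hourly_rate * busy_hours


server = always_on_cost(hourly_rate=2.0)
elastic = pay_per_use_cost(hourly_rate=4.0, busy_hours=60.0)
print(f"always-on: {server:.2f}  pay-per-use: {elastic:.2f}")
```

Under these assumed numbers the elastic option costs less even at twice the hourly rate, because it is billed for 60 hours instead of 730. The break-even point moves as the busy hours grow, which is why the comparison depends on how often transforms actually run.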

The performance gap can be dramatic. Loading 100 GB of clickstream data with traditional ETL means an intermediate server has to extract, join, aggregate, and then push the result into the warehouse — often over hours. The ELT version loads the raw 100 GB into Snowflake in minutes and applies the same transformations using massively parallel warehouse compute.

Animation: ETL vs ELT — transform-then-load vs load-then-transform

Both flows start at the source. ETL transforms in an intermediate server before the warehouse sees the data; ELT lands raw and transforms inside the warehouse using elastic compute.

ETL vs ELT comparison

Factor	ETL	ELT
Processing power	Bound by intermediate server	Warehouse auto-scales
Data movement	Multiple hops, repeated I/O	Single load, transform in place
Latency on large datasets	Bottlenecked at transform layer	Transforms run in parallel inside the warehouse
Schema changes	Pipeline redeploy + backfill	New fields ride along automatically
Cost when idle	24/7 server cost	Storage only

Analogy. ETL is a meal-kit company that chops your vegetables in a central kitchen and ships pre-prepped boxes. ELT is a grocery delivery service: raw ingredients arrive in your kitchen and you decide what to make tonight. The grocery model wastes nothing if your menu changes; the meal kit forces an upstream change every time you want a new dish.

When ETL still applies

ELT is the default but not the universal answer. ETL keeps a role wherever raw data must not land in the warehouse in its original form:

PII redaction and minimization. GDPR/HIPAA require that PII be processed only for legitimate purposes; if your warehouse cannot guarantee that no one will query a raw email or ssn, you must hash, tokenize, or mask before the row arrives.
Strict schema enforcement. Regulatory feeds, financial reports, and ML feature stores cannot tolerate optional fields, type coercion, or schema drift. ETL's schema-on-write rejects malformed records at the boundary instead of corrupting downstream tables.
Cross-warehouse or air-gapped destinations. When data must traverse trust boundaries or move between clouds, transformation has to happen in a neutral compute layer (Glue, Spark on EMR, Fivetran).
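
A minimal sketch of the first two cases, assuming a hypothetical three-column schema: rows that break the schema are rejected at the boundary, and the PII column is replaced by a keyed digest before anything is loaded. The field names, the secret, and the helper names are illustrative assumptions; a real pipeline would fetch the key from a secrets manager.

```python
import hashlib
import hmac

# Hypothetical schema and secret, for illustration only.
REQUIRED_FIELDS = {"user_id": int, "email": str, "amount": float}
PII_FIELDS = {"email"}
SECRET_KEY = b"replace-with-a-managed-secret"


def tokenize(value: str) -> str:
    """Keyed SHA-256 digest: stable for joins, not reversible without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_row(row: dict) -> dict:
    """Reject rows that break the schema, then tokenize PII columns."""
    unexpected = set(row) - set(REQUIRED_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            raise ValueError(f"missing field: {name}")
        if not isinstance(row[name], expected):
            raise ValueError(f"bad type for {name}: {type(row[name]).__name__}")
    return {k: tokenize(v) if k in PII_FIELDS else v for k, v in row.items()}


clean = prepare_row({"user_id": 7, "email": "a@example.com", "amount": 19.99})
print(clean["email"])  # a 64-character hex digest, never the raw address
```

Because the digest is deterministic for a given key, the tokenized column can still be used as a join key inside the warehouse even though the raw value never lands there.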

The transformation layer: dbt and SQLMesh

If raw data lands in the warehouse, transformations have to live somewhere. dbt treats SQL as software: each model is a SELECT in a .sql file plus a YAML descriptor; dbt compiles the files into a DAG, runs them in dependency order, and reports test failures. SQLMesh adds virtual data environments, semantic versioning of models, and explicit handling of breaking changes via preview environments. Both treat business logic as code in version control, tested before deployment, with the warehouse as the runtime.
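
The "runs them in dependency order" idea can be sketched with a topological sort. This is not how dbt or SQLMesh are implemented; it is a small illustration using Python's standard library, and the model names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: each name maps to the models it selects from.
MODEL_DEPENDENCIES = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}


def run_order(dependencies: dict) -> list:
    """Return one valid execution order; raises CycleError on circular refs."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(MODEL_DEPENDENCIES)
print(order)
```

Any order the sorter returns places both staging models before orders_enriched and orders_enriched before daily_revenue, which is the guarantee a transformation framework needs before it can run models safely.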

AWS Glue overview

AWS Glue is Amazon's serverless data integration service. It bundles four capabilities you will use repeatedly: a managed metadata catalog, automatic schema crawlers, a visual ETL builder (Glue Studio), and a serverless Spark runtime. Together they cover the lifecycle from "we just got a folder of CSVs in S3" to "production-grade Parquet tables refreshing every hour."

Glue crawlers and the Data Catalog

The Glue Data Catalog is the central metadata repository: table definitions, schemas, partition info, and pointers to underlying data, queryable by Athena, Redshift Spectrum, EMR, and Glue jobs. A crawler populates it. Point a crawler at an S3 prefix or a JDBC source and it will (1) classify objects (JSON, Parquet, CSV, Avro, ORC), (2) infer schema and partition structure from prefixes like year=2026/month=05/day=07/, and (3) register or update a table in a target Glue database.
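
To make step (2) concrete, here is a minimal sketch of reading Hive-style key=value partitions out of an object key. It only illustrates the naming convention and is not the crawler's actual logic; the function name and the example key are assumptions.

```python
def parse_partitions(s3_key: str) -> dict:
    """Extract Hive-style key=value segments from the directory part of a key."""
    partitions = {}
    # Skip the last segment: it is the file name (or empty for a bare prefix).
    for segment in s3_key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


key = "raw/orders/year=2026/month=05/day=07/part-0000.json"  # hypothetical key
print(parse_partitions(key))
```

The values stay as strings here, so month remains "05"; deciding the partition column types is a separate step.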

Figure 4.2: Glue crawler-to-catalog-to-consumer pipeline

flowchart TD
    S3[("S3 raw zone<br/>year=/month=/day=")] -->|scan + classify| C[Glue Crawler]
    C -->|"infer schema<br/>+ partitions"| DC[("Glue Data Catalog<br/>my_database.orders")]
    DC --> A[Athena queries]
    DC --> J[Glue Spark jobs]
    DC --> RS[Redshift Spectrum]
    DC --> EMR[EMR / external tools]
    classDef catalog fill:#1f3a5f,stroke:#58a6ff,color:#fff;
    class DC catalog;

Glue Studio visual ETL

Not every transformation deserves hand-written Spark. Glue Studio is a drag-and-drop interface; you wire source nodes (Catalog, S3, JDBC, Kinesis), transformation nodes (apply mapping, filter, join, drop fields, aggregate, custom SQL), and sink nodes, and Glue Studio generates PySpark or Scala. It is appropriate for column projections, type casts, joins of a fact to a small dimension, and review-friendly artifacts. Hand-written Spark is the right tool when logic exceeds simple flow-graph transformations.

Glue Spark jobs and DynamicFrames

Under the hood, Glue jobs run on managed Apache Spark. AWS provisions executors, scales them within configured limits, and tears them down when the job finishes; you pay per second of DPU consumption. The Glue API introduces an abstraction on top of Spark called the DynamicFrame: similar to a DataFrame but tolerant of schema variance per record. If half your JSON has address.zip as a string and half as an integer, a DataFrame has to settle on a single type for the whole column when it infers the schema, so the ambiguity is decided before you can inspect it; a DynamicFrame stores both possibilities using a "choice type" and lets you resolve later with ResolveChoice.

Aspect	DynamicFrame	Spark DataFrame
Schema strictness	Tolerates variance per record	Requires uniform schema
Native sources	Glue Catalog, S3, JDBC	Standard Spark sources
Built-in transforms	ApplyMapping, ResolveChoice, Relationalize, DropNullFields	Standard Spark API
Best for	Heterogeneous, semi-structured data	Cleaned, conforming data
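
The choice-type idea can be illustrated in plain Python: record every type seen at a field, then resolve the ambiguity with an explicit cast. This is a conceptual sketch with hypothetical helper names, not Glue's implementation; in a Glue job the corresponding step would be a ResolveChoice spec such as ("address.zip", "cast:string").

```python
from copy import deepcopy

# Hypothetical records mirroring the address.zip example.
records = [
    {"name": "a", "address": {"zip": "02139"}},
    {"name": "b", "address": {"zip": 94105}},
]


def observed_types(rows: list, path: tuple) -> set:
    """Collect every Python type name seen at a nested field path."""
    seen = set()
    for row in rows:
        value = row
        for part in path:
            value = value[part]
        seen.add(type(value).__name__)
    return seen


def resolve_by_cast(rows: list, path: tuple, cast) -> list:
    """Return copies of the rows with the field at `path` cast to one type."""
    resolved = []
    for row in rows:
        fixed = deepcopy(row)  # leave the input rows untouched
        parent = fixed
        for part in path[:-1]:
            parent = parent[part]
        parent[path[-1]] = cast(parent[path[-1]])
        resolved.append(fixed)
    return resolved


zip_path = ("address", "zip")
choices = observed_types(records, zip_path)   # both "str" and "int" are present
as_strings = resolve_by_cast(records, zip_path, str)
print(sorted(choices), observed_types(as_strings, zip_path))
```

Casting to string is the safer choice for ZIP codes because an integer cast would drop the leading zero in "02139".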

The Relationalize transform is the killer feature: given a deeply nested JSON structure, it walks the tree and produces a set of relational tables joined by surrogate keys — exactly what you need to load nested data into a relational warehouse. Once data is clean, convert a DynamicFrame to a DataFrame with .toDF() and use Spark SQL.
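
The following is a simplified pure-Python sketch of the idea behind Relationalize, assuming JSON-like input: nested structs become dotted columns on the parent row, and arrays move to a child table linked by a generated key. The function, table, and column names are illustrative assumptions and the output layout is not guaranteed to match what Glue produces.

```python
from itertools import count


def relationalize(records: list, root: str = "root") -> dict:
    """Split nested records into flat tables linked by generated keys."""
    tables = {root: []}
    next_key = count(1)

    def flatten(obj: dict, prefix: str, row: dict, table: str) -> None:
        for name, value in obj.items():
            column = f"{prefix}{name}"
            if isinstance(value, dict):
                # Nested structs become dotted columns on the same row.
                flatten(value, f"{column}.", row, table)
            elif isinstance(value, list):
                # Arrays move to a child table; the parent keeps only a key.
                key = next(next_key)
                row[column] = key
                child = f"{table}_{column}"
                for index, item in enumerate(value):
                    child_row = {"id": key, "index": index}
                    if isinstance(item, dict):
                        flatten(item, "", child_row, child)
                    else:
                        child_row["val"] = item
                    tables.setdefault(child, []).append(child_row)
            else:
                row[column] = value

    for record in records:
        row = {}
        flatten(record, "", row, root)
        tables[root].append(row)
    return tables


orders = [  # hypothetical nested input
    {
        "order_id": 1,
        "customer": {"name": "Ada"},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
    }
]
tables = relationalize(orders)
for table_name, rows in tables.items():
    print(table_name, rows)
```

The root row keeps a key in its items column, and each row of the child table carries the same key in id, so the two tables can be joined back together in the warehouse.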

Three tuning levers matter before the first production deploy:

File grouping. S3 prefixes with thousands of small files create one Spark task per file, putting pressure on the driver. groupFiles: 'inPartition' collapses many small files into a single task.
Adaptive Query Execution (AQE). Default in Glue 4.0+. Converts sort-merge to broadcast joins at runtime when one side is small, and rebalances skewed partitions. Almost always a net win.
Worker sizing. Default G.1X has 4 vCPU / 16 GB. Move to G.2X or G.4X on executor OOMs — but check whether the OOM is caused by skew (one key dominates) before throwing memory at it.
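
As a configuration sketch for the file-grouping lever, the helper below builds S3 connection options with grouping enabled. The helper name, bucket path, and 128 MB target are assumptions for illustration; groupFiles and groupSize are the Glue option names, with the size given in bytes as a string.

```python
def grouped_s3_options(paths: list, group_size_bytes: int) -> dict:
    """Build S3 connection options that ask Glue to group small files."""
    return {
        "paths": paths,
        "recurse": True,
        "groupFiles": "inPartition",
        # The target group size is expressed in bytes and passed as a string.
        "groupSize": str(group_size_bytes),
    }


# Hypothetical bucket and a 128 MB target group size.
options = grouped_s3_options(["s3://example-bucket/raw/events/"], 128 * 1024 * 1024)
print(options)

# Inside a Glue job this dict would be passed as connection_options, e.g.
# glueContext.create_dynamic_frame.from_options(
#     connection_type="s3", connection_options=options, format="json")
```

The right group size depends on file sizes and worker memory, so treat the value above as a starting point to tune rather than a recommendation.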

The job.commit() call is the signal that tells Glue to persist the job bookmark state at the end of a run. Without it, a job with bookmarks enabled does not advance its bookmark, and the next run reprocesses the same data.
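
A toy model makes the consequence of a missing commit easier to see. The class below is not the Glue implementation; it is a hypothetical stand-in in which the bookmark is just an offset into a list of file names.

```python
class ToyBookmarkJob:
    """Toy model of a bookmarked job; not the Glue implementation."""

    def __init__(self, committed_offset: int = 0):
        self.committed_offset = committed_offset  # state persisted across runs
        self._pending_offset = committed_offset

    def read_new(self, source: list) -> list:
        """Return items past the committed bookmark and note how far we read."""
        batch = source[self.committed_offset:]
        self._pending_offset = len(source)
        return batch

    def commit(self) -> None:
        """Persist the bookmark so the next run starts after this batch."""
        self.committed_offset = self._pending_offset


files = ["f1", "f2", "f3"]

run1 = ToyBookmarkJob()
first = run1.read_new(files)       # reads all three files, but never commits

run2 = ToyBookmarkJob(run1.committed_offset)
second = run2.read_new(files)      # sees the same three files again
run2.commit()

run3 = ToyBookmarkJob(run2.committed_offset)
third = run3.read_new(files)       # nothing new after the commit

print(first, second, third)
```

The second run repeats the first run's work because nothing was committed, while the third run starts from the committed position and finds no new input.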

Post-Section Quiz: ETL vs ELT and AWS Glue