Chapter 3: Storage Foundations and Open Table Formats

Learning Objectives

Pre-Section Quiz: File Formats and Columnar Storage

1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?

Parquet uses tighter ASCII encoding than CSV Parquet stores values column-by-column with rich min/max statistics, so engines can read only relevant columns and skip irrelevant row groups Parquet files are stored unencrypted, which removes overhead Parquet files are always smaller because they drop NULL values entirely

2. What is "predicate pushdown" in the context of Parquet?

Pushing the WHERE clause from the application server down into the database A query optimization where filters are evaluated against file/row-group/page statistics so irrelevant data is never read Forcing the query engine to scan every page to verify the predicate A Hive feature that rewrites JOINs as MapReduce jobs

3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?

Parquet, because it has the strongest compression CSV, because it is human-readable Avro, because it is a compact binary row format with an embedded schema designed for streaming ORC, because it was built for streaming workloads

4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?

Page → Column Chunk → Row Group → File File → Row Group → Column Chunk → Page File → Page → Row Group → Column Chunk Row Group → File → Page → Column Chunk

5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?

Bit-packing, because countries are stored as small integers Bloom filters, because they speed up point lookups Dictionary encoding, because it replaces repeated values with compact integer IDs Run-length encoding only, since RLE always beats dictionary encoding

1. File Formats and Columnar Storage

Storage decisions cascade through every layer of a data platform. The bytes you commit to disk determine how much money you spend on storage, how much money you spend on scans, and how fast queries return. This partition focuses on file formats — the lowest layer of the storage stack — with special attention to why columnar Parquet has become the default for analytics.

Key Points

1.1 Row-Oriented vs Columnar: The Core Trade-off

Row-oriented formats — CSV, JSON, Avro — store records the way you'd write them on a notepad: all the fields for one record together, then all the fields for the next. CSV is the lowest common denominator (every spreadsheet and scripting language reads it), JSON layers in nested types (great for API payloads), and Avro is a compact binary row format with an embedded schema designed for high-throughput streaming on Apache Kafka.

Row formats are great when you need to read or write whole records: ingesting webhook events, exporting a customer profile, replaying a Kafka topic. They are terrible for analytics because answering “average order value last quarter?” forces the engine to read every byte of every order — including customer addresses, SKUs, and shipping notes you don't care about — just to grab one column.

Parquet and ORC flip the layout. Instead of grouping rows together, they group values from the same column together. A query that touches three of fifty columns reads roughly 6% of the file rather than 100%. Apache Parquet has become the de facto standard for analytical lakes; ORC is its close cousin, more common in legacy Hive deployments.

1.2 Parquet's Hierarchical Layout

Parquet organizes a file hierarchically: the file is split into row groups of around 128 MB, each row group is sliced into per-column column chunks, and each chunk is broken into pages that hold the actual encoded values. At every level, Parquet stores rich metadata: min/max values, null counts, distinct counts, and optional bloom filters.

flowchart TD File["Parquet File"] File --> RG1["Row Group 1 (~128 MB)"] File --> RG2["Row Group 2 (~128 MB)"] File --> Footer["Footer (file metadata + schema)"] RG1 --> CC1["Column Chunk: order_id"] RG1 --> CC2["Column Chunk: customer_id"] RG1 --> CC3["Column Chunk: amount"] CC1 --> P1["Page 1 (encoded values + stats)"] CC1 --> P2["Page 2 (encoded values + stats)"] CC1 --> P3["Page N ..."] Footer --> Stats["Min/Max, null counts, bloom filters"] Stats -. "predicate pushdown skips chunks/pages" .-> RG2

That metadata enables predicate pushdown — the engine consults statistics and skips data that cannot match the filter without reading the values. If a query asks WHERE shipdate ≤ '1996-09-02' and a row group's max shipdate is '1996-08-15', the engine reads it. If the row group's min shipdate is '1997-01-01', the engine skips the entire row group. On TPC-H SF20 this kind of pruning skips roughly 30% of data on selective filters.

Columnar Predicate Pushdown

Query: SELECT SUM(amount) WHERE amount > 100. Only the amount column is fetched; row groups whose max(amount) is below 100 are skipped via pushdown.

order_id customer_id amount shipping_notes addr RG1 max=850 RG2 max=42 SKIPPED — max < 100 RG3 max=1200 Engine reads only the amount column

1.3 Encodings That Compound the Win

Parquet stacks several encodings on top of compression:

EncodingWhat It DoesBest For
Dictionary encodingReplaces repeated values with integer IDsLow-cardinality columns (status, country)
Run-length encoding (RLE)Compresses runs of identical values into (value, count)Sorted or naturally clustered columns
Bit-packingStores small integers using only the bits they needCounts, IDs, encoded categoricals
Bloom filtersProbabilistic structure for “definitely not present” checksPoint lookups by ID

The numbers are striking. On a 194 GB Allstate dataset, Parquet compressed the data to 4.7 GB — ~97% reduction — and read 3.5x less data on full scans. On TPC-H SF20, Parquet was 5x smaller than CSV (3.2 GB vs 16 GB) and 7-10x faster on joins (2 s versus 20 s). On AWS Athena, the same query scanned 117 KB of Parquet versus 48 MB of CSV, which directly shows up on the bill.

1.4 Choosing a Format by Access Pattern

The choice is rarely “Parquet always wins.” It depends on access pattern.

PatternRecommended FormatWhy
Streaming ingest with schema evolutionAvroCompact binary, embedded schema, row-based writes
Webhook landing zone, debug-friendlyJSON or CSVHuman-readable, no tooling required
Analytical scans, BI dashboardsParquet (or ORC on Hive)Columnar, compressed, predicate pushdown
Small (<100 MB) reference dataCSV is fineTooling overhead exceeds benefit
OLAP at >100 MB scaleParquet5-40x smaller, 7-100x faster queries

A common pattern is to land raw data as JSON or Avro for fidelity, then immediately convert to Parquet for analytics. The raw layer is the audit trail; the Parquet layer is what dashboards actually query.

Post-Section Quiz: File Formats and Columnar Storage

1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?

Parquet uses tighter ASCII encoding than CSV Parquet stores values column-by-column with rich min/max statistics, so engines can read only relevant columns and skip irrelevant row groups Parquet files are stored unencrypted, which removes overhead Parquet files are always smaller because they drop NULL values entirely

2. What is "predicate pushdown" in the context of Parquet?

Pushing the WHERE clause from the application server down into the database A query optimization where filters are evaluated against file/row-group/page statistics so irrelevant data is never read Forcing the query engine to scan every page to verify the predicate A Hive feature that rewrites JOINs as MapReduce jobs

3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?

Parquet, because it has the strongest compression CSV, because it is human-readable Avro, because it is a compact binary row format with an embedded schema designed for streaming ORC, because it was built for streaming workloads

4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?

Page → Column Chunk → Row Group → File File → Row Group → Column Chunk → Page File → Page → Row Group → Column Chunk Row Group → File → Page → Column Chunk

5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?

Bit-packing, because countries are stored as small integers Bloom filters, because they speed up point lookups Dictionary encoding, because it replaces repeated values with compact integer IDs Run-length encoding only, since RLE always beats dictionary encoding
Pre-Section Quiz: Open Table Formats

1. What core problem do open table formats solve that raw Parquet on object storage does not?

They compress files more aggressively than Parquet alone They add ACID transactions, schema evolution, and time travel on top of immutable data files They replace Parquet entirely with a proprietary columnar layout They eliminate the need for a query engine

2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?

Catalog → manifest list → metadata.json → manifest → data files Catalog → metadata.json → snapshot/manifest list → manifest → data files metadata.json → catalog → data files → manifest → snapshot Snapshot → catalog → manifest → metadata.json → data files

3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?

It stores a separate Parquet file per column version Each column has an immutable numeric ID, so name changes only update metadata Iceberg keeps every column as text and re-parses on read Iceberg disables schema enforcement entirely

4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?

Copy-on-Write, because it minimizes read latency Merge-on-Read, because new changes are written as small delta logs and merged at read time Iceberg snapshots, because they are atomic Delta Lake's _delta_log, because it is append-only

5. Your organization wants vendor-neutral lakehouse storage queryable from Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB. Which open table format has the broadest native engine support?

Delta Lake, because it is open source Apache Hudi, because it is the oldest of the three Apache Iceberg, with native support across Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB Plain Parquet, because every engine reads it

2. Open Table Formats: Iceberg, Delta, Hudi

A Parquet file is just a Parquet file. It has no notion of “the current state of the orders table.” If two writers append concurrently, or one writer fails halfway through, you can end up with partial data, duplicates, or queries that see inconsistent snapshots. Open table formats are thin metadata layers on top of Parquet/ORC that provide ACID transactions, schema evolution, and time travel on commodity object storage.

Key Points

2.1 Apache Iceberg: A Tree of Snapshots

Iceberg organizes a table as a tree of metadata files. At the root sits a metadata.json file pointing at the current snapshot. Each snapshot lists manifest files, and each manifest lists the data files (Parquet or ORC) that belong to the table at that moment. Writers append new data files and produce a new snapshot atomically; readers see whatever snapshot was current when their query started. The result is serializable isolation via optimistic concurrency control — two writers can prepare commits in parallel; the second one detects a conflict and retries.

flowchart TD Catalog["Catalog (Glue / Hive / REST)"] Catalog --> MetaJSON["metadata.json (current snapshot pointer)"] MetaJSON --> Snap1["Snapshot S1 (older)"] MetaJSON --> Snap2["Snapshot S2 (current)"] Snap1 --> ML1["Manifest List S1"] Snap2 --> ML2["Manifest List S2"] ML1 --> M1["Manifest File A"] ML2 --> M1 ML2 --> M2["Manifest File B (new)"] M1 --> D1["data-001.parquet"] M1 --> D2["data-002.parquet"] M2 --> D3["data-003.parquet (newly added)"]

Iceberg Metadata Layers Reveal

From the catalog pointer down to data files, each layer dereferences the next. A reader walks this tree once at query start.

Catalog Glue / Hive / REST metadata.json current-snapshot-id Manifest List (snapshot) avro file with manifest pointers Manifest A data file metadata + stats Manifest B data file metadata + stats data-001.parquet data-002.parquet

Because every snapshot is preserved (until expired), Iceberg supports time travel:

-- Query Iceberg as of a specific time
SELECT * FROM iceberg_table
FOR SYSTEM_TIME AS OF '2026-05-07 14:00:00';

-- Or by snapshot ID
SELECT * FROM iceberg_table
FOR SYSTEM_VERSION AS OF 4538291;

Iceberg's biggest superpower is schema evolution backed by stable column IDs. Each column has an immutable numeric ID, so you can rename, reorder, drop, or add columns without rewriting any data files. Partition evolution works the same way: change the partition spec on a go-forward basis — old partitions and new partitions coexist seamlessly.

2.2 Delta Lake: A Transaction Log

Delta Lake takes a different route. Instead of a tree of manifests, it maintains a _delta_log directory containing an ordered series of JSON commit files, each describing one transaction (add file X, remove file Y, update schema, etc.). To compute the current table state, the reader replays the log from the last checkpoint forward — much like a database write-ahead log applied to object storage.

-- Delta time travel by version
SELECT * FROM delta_table VERSION AS OF 5;

-- By timestamp
SELECT * FROM delta_table TIMESTAMP AS OF '2026-05-01';

Delta's default version retention is 30 days. Schema evolution is more limited than Iceberg's: you can add columns at the end and merge schemas during writes, but native rename/drop is constrained. Where Delta shines is the Databricks ecosystem: zero-copy table cloning, deep MERGE optimizations, and tight integration with MLflow and Unity Catalog.

2.3 Apache Hudi: Streaming-First Upserts

Hudi was built at Uber to solve a problem the others didn't initially tackle: efficient row-level upserts and deletes on a lake. Hudi maintains a timeline of instants and offers two table types:

Table TypeHow It WorksBest For
Copy-on-Write (CoW)Each write rewrites affected files in full. Reads are fast and uniform.Read-heavy analytics, batch ETL
Merge-on-Read (MoR)New changes are written as small delta logs and merged at read time (with periodic compaction). Writes are fast; reads pay a small merge cost.Streaming ingestion, CDC, low-latency upserts
flowchart LR Upsert["Incoming upsert batch"] Upsert --> CoW["Copy-on-Write path"] Upsert --> MoR["Merge-on-Read path"] CoW --> Rewrite["Rewrite affected base Parquet files"] Rewrite --> CoWRead["Reader: scan base files (fast, uniform)"] MoR --> Delta["Append small delta log files (Avro)"] Delta --> Compact["Periodic compaction job"] Compact --> Base["Merged base Parquet files"] Delta --> MoRRead["Reader: merge base + deltas at query time"] Base --> MoRRead

Hudi Copy-on-Write vs Merge-on-Read

Top: CoW rewrites affected base files on every upsert — uniform reads. Bottom: MoR appends delta logs and merges at read time — fast writes.

Upsert batch CDC events Copy-on-Write Rewrite base files part-0001.parquet (new) Reader: scan base fast, uniform Merge-on-Read Append delta log (avro) .log.1 .log.2 .log.3 Periodic compaction Reader: merge base + deltas at query

Hudi also brings record-level indexing (Bloom, Hash File, HBase-backed), so updating one row by primary key doesn't require scanning the whole partition. This makes Hudi the format of choice for CDC pipelines from transactional sources. Schema evolution is more limited than Iceberg's, and ecosystem support outside Spark and Flink is thinner.

2.4 Side by Side

FeatureIcebergDelta LakeHudi
MetadataJSON manifest tree_delta_log transaction logTimeline (DeltaLog)
IsolationSerializableSerializableSerializable
Schema evolutionMost advanced (rename, drop, reorder, promote)Moderate (add at end, merge)Basic
Partition evolutionHidden + dynamicExplicit, requires rewriteLargely fixed
Time travelSnapshot-basedVersion & timestampInstant-based
Best forMulti-engine warehousesDatabricks/Spark, MLStreaming, CDC, upserts
Native enginesSpark, Flink, Trino, Athena, BigQuery, SnowflakeSpark, Databricks-centricSpark, Flink

An analogy: Iceberg is like Git for your lake — every commit is a snapshot you can travel back to. Delta is like a single transaction log a la PostgreSQL's WAL, replayed to derive state. Hudi is like a journaling filesystem optimized for many small updates with periodic compaction.

Post-Section Quiz: Open Table Formats

1. What core problem do open table formats solve that raw Parquet on object storage does not?

They compress files more aggressively than Parquet alone They add ACID transactions, schema evolution, and time travel on top of immutable data files They replace Parquet entirely with a proprietary columnar layout They eliminate the need for a query engine

2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?

Catalog → manifest list → metadata.json → manifest → data files Catalog → metadata.json → snapshot/manifest list → manifest → data files metadata.json → catalog → data files → manifest → snapshot Snapshot → catalog → manifest → metadata.json → data files

3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?

It stores a separate Parquet file per column version Each column has an immutable numeric ID, so name changes only update metadata Iceberg keeps every column as text and re-parses on read Iceberg disables schema enforcement entirely

4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?

Copy-on-Write, because it minimizes read latency Merge-on-Read, because new changes are written as small delta logs and merged at read time Iceberg snapshots, because they are atomic Delta Lake's _delta_log, because it is append-only

5. Your organization wants vendor-neutral lakehouse storage queryable from Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB. Which open table format has the broadest native engine support?

Delta Lake, because it is open source Apache Hudi, because it is the oldest of the three Apache Iceberg, with native support across Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB Plain Parquet, because every engine reads it
Pre-Section Quiz: S3 Object Storage and Partitioning

1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?

S3 Standard, to keep retrieval instant S3 Glacier Deep Archive, since you do not query it frequently S3 Glacier Instant Retrieval, which is cheap to store and still returns in milliseconds S3 One Zone-IA, since data is stored in one AZ for durability

2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?

It uses a different file format than Parquet AWS automates compaction, snapshot expiration, and unreferenced file removal as a managed service It eliminates the need for a query engine It removes the per-byte scanned charge from Athena

3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?

Hive partitions only work for time columns Without the partition filter the engine cannot narrow down directories and falls back to a full scan Hive partitions corrupt unless every column is filtered Partition pruning requires a special PRUNE SQL keyword

4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?

It removes partitioning entirely Partition values are derived from source columns via transforms (e.g., days(order_ts)) so users write natural filters and pruning still works It encodes partition values into S3 object tags rather than directories It scans every directory to determine the partition

5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?

Hive-partition by user_id alone Bucket the entire table by country Partition by days(event_ts) and Z-order within partition by (user_id, country) Store as a single 5 TB Parquet file

3. S3 Object Storage and Partitioning

This partition covers the bottom and the connective layer: where the bytes physically live (S3 storage classes, S3 Tables, lifecycle policies) and how they are arranged for fast pruning (Hive partitioning, Iceberg hidden partitioning, bucketing, clustering, Z-ordering).

Key Points

3.1 S3 Storage Classes

Amazon S3 is the de facto storage backbone for AWS-based lakes. What many teams miss is that S3 is not one storage tier — it's a family of classes with very different cost and access profiles. Choosing the right class per object can cut the storage bill 50-95%.

Storage ClassDesigned ForAccess LatencyCost Profile
S3 StandardHot, frequently accessed dataMillisecondsHighest storage, lowest retrieval
S3 Intelligent-TieringUnknown/changing access patternsMillisecondsAuto-moves objects between tiers
S3 Standard-IALess frequent but rapid accessMilliseconds~45% cheaper storage, retrieval fee
S3 One Zone-IARecreatable infrequent dataMillisecondsSingle AZ, ~20% cheaper than Standard-IA
S3 Glacier Instant RetrievalArchive with millisecond accessMillisecondsLow storage, higher retrieval
S3 Glacier Flexible RetrievalArchive, minutes-to-hours retrievalMinutes-hoursVery low storage
S3 Glacier Deep ArchiveLong-term compliance archivesHours (12+)Lowest storage cost

3.2 Amazon S3 Tables: Managed Iceberg

Running Iceberg yourself is workable but operationally heavy: schedule compaction (small file problem), expire old snapshots, clean up unreferenced files, tune metadata layout. Most teams underinvest, and table performance silently degrades.

Amazon S3 Tables, announced at AWS re:Invent 2024, is a fully managed Iceberg service built on a new bucket type called table buckets. Tables are first-class AWS resources with their own ARNs, IAM policies, and dedicated endpoints.

  1. Performance — up to 3x query throughput and 10x TPS versus self-managed Iceberg.
  2. Automatic maintenance — background compaction, snapshot expiration, unreferenced file removal.
  3. Native integrations — auto-registered in AWS Glue Data Catalog; queryable from Athena, EMR, Spark, Redshift, Kinesis Data Firehose, and QuickSight.

3.3 Lifecycle Policies and Intelligent Tiering

For data not in S3 Tables, control cost through lifecycle policies — declarative rules that transition objects between storage classes by age. A common lake lifecycle:

# Conceptual S3 lifecycle policy
- Day 0-30:     S3 Standard (hot ETL output)
- Day 30-90:    S3 Standard-IA (occasional ad-hoc queries)
- Day 90-365:   S3 Glacier Instant Retrieval (audit access)
- Day 365+:     S3 Glacier Deep Archive (compliance only)
- Day 2555:     Delete (7 years)
stateDiagram-v2 [*] --> S3_Standard: Object created (Day 0) S3_Standard --> S3_Standard_IA: Day 30 (infrequent access) S3_Standard_IA --> Glacier_Instant: Day 90 (audit-only access) Glacier_Instant --> Glacier_Deep_Archive: Day 365 (compliance only) Glacier_Deep_Archive --> [*]: Day 2555 (delete after 7 years) S3_Standard --> Intelligent_Tiering: Unknown access pattern Intelligent_Tiering --> Intelligent_Tiering: Auto-move Frequent/Infrequent/Archive

3.4 Hive-Style Partitioning

The classical scheme, inherited from Apache Hive, encodes partition values directly into the directory path:

s3://lake/orders/
  region=us/year=2026/month=05/day=07/part-0001.parquet
  region=us/year=2026/month=05/day=07/part-0002.parquet
  region=eu/year=2026/month=05/day=07/part-0001.parquet

A query with WHERE region='us' AND year=2026 AND month=5 only lists those directories. The query engine never sees the EU files. Downsides: users must include partition columns explicitly or pruning silently fails; partition values are tied to physical layout, so changing the scheme requires rewriting all historical data; too-fine partitioning produces millions of tiny files.

3.5 Hidden Partitioning in Iceberg

Iceberg solves the awkward parts with hidden partitioning. The partition spec lives in metadata, not the directory layout, and Iceberg automatically derives partition values from source columns using transforms:

-- Define partitioning by day, derived from a timestamp column
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  order_ts TIMESTAMP,
  amount DECIMAL(12,2)
) PARTITIONED BY (days(order_ts));

-- Users write natural SQL, Iceberg prunes automatically
SELECT SUM(amount)
FROM orders
WHERE order_ts BETWEEN '2026-05-01' AND '2026-05-07';

Available transforms include years, months, days, hours, bucket(N, col), and truncate(N, col); the partition spec can evolve over time without rewriting old data.

3.6 Bucketing, Clustering, and Z-Ordering

Partitioning works best for low-cardinality columns. For high-cardinality columns like customer_id or order_id, partitioning is impossible — you'd get millions of tiny partitions. The answer is bucketing, clustering, and Z-ordering.

TechniqueBest ForCardinalityMechanism
Hive partitioningCoarse filters on a few columnsLow (10s-1000s)Directory paths
Hidden partitioning (Iceberg)Same as Hive but with evolutionLowMetadata + transforms
BucketingPoint lookups, equi-joinsHighHash to fixed buckets
Clustering / sortRange queries on one dominant columnHighSort within file
Z-orderingMulti-column range filtersHigh, multi-dimInterleaved bit clustering

A worked example: a 5 TB clickstream table queried mostly by event_date, occasionally by user_id, sometimes by country:

  1. Partition by days(event_ts) — coarse pruning on the dominant filter.
  2. Z-order within each partition by (user_id, country) — second-dimension pruning.
  3. Target file size 128-512 MB to keep file count manageable.
flowchart TD Q["Query: WHERE event_ts in last 24h AND user_id = 12345"] Q --> P["Partition prune: keep day=2026-05-07"] P --> Z["Z-order prune: skip files whose (user_id, country) range excludes 12345"] Z --> RG["Parquet row-group prune: skip groups via min/max stats"] RG --> Page["Page-level scan: read only matching pages"] Page --> Result["Return matching rows"] P -. "skips ~99% of partitions" .-> Skip1["Skipped TB"] Z -. "skips most surviving files" .-> Skip2["Skipped GB"] RG -. "skips ~30% on selective filters" .-> Skip3["Skipped MB"]

Layered Pruning: Partition → Z-Order → Row Group

Each layer narrows the search space. Red tiles are skipped without being read; blue tiles survive and are passed down to the next layer.

L1: Partition prune days(event_ts) today L2: Z-order prune (user_id, country) L3: Row group prune min/max stats Final: 2 row groups read of original ~tens of TBs

Queries filtering on date alone hit the partition prune. Queries also filtering on user_id additionally skip most files within the day. The metadata stays small because there are at most a few thousand date partitions, not millions of user partitions.

Post-Section Quiz: S3 Object Storage and Partitioning

1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?

S3 Standard, to keep retrieval instant S3 Glacier Deep Archive, since you do not query it frequently S3 Glacier Instant Retrieval, which is cheap to store and still returns in milliseconds S3 One Zone-IA, since data is stored in one AZ for durability

2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?

It uses a different file format than Parquet AWS automates compaction, snapshot expiration, and unreferenced file removal as a managed service It eliminates the need for a query engine It removes the per-byte scanned charge from Athena

3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?

Hive partitions only work for time columns Without the partition filter the engine cannot narrow down directories and falls back to a full scan Hive partitions corrupt unless every column is filtered Partition pruning requires a special PRUNE SQL keyword

4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?

It removes partitioning entirely Partition values are derived from source columns via transforms (e.g., days(order_ts)) so users write natural filters and pruning still works It encodes partition values into S3 object tags rather than directories It scans every directory to determine the partition

5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?

Hive-partition by user_id alone Bucket the entire table by country Partition by days(event_ts) and Z-order within partition by (user_id, country) Store as a single 5 TB Parquet file

Your Progress

Answer Explanations