Chapter 3: Storage Foundations and Open Table Formats
Learning Objectives
Compare row-oriented, columnar, and hybrid storage formats by use case
Explain how Apache Iceberg, Delta Lake, and Hudi implement ACID on object storage
Use partitioning, clustering, and Z-ordering to accelerate analytical queries
Configure S3 storage classes and S3 Tables for cost-effective lake storage
Pre-Section Quiz: File Formats and Columnar Storage
1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?
Parquet uses tighter ASCII encoding than CSVParquet stores values column-by-column with rich min/max statistics, so engines can read only relevant columns and skip irrelevant row groupsParquet files are stored unencrypted, which removes overheadParquet files are always smaller because they drop NULL values entirely
2. What is "predicate pushdown" in the context of Parquet?
Pushing the WHERE clause from the application server down into the databaseA query optimization where filters are evaluated against file/row-group/page statistics so irrelevant data is never readForcing the query engine to scan every page to verify the predicateA Hive feature that rewrites JOINs as MapReduce jobs
3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?
Parquet, because it has the strongest compressionCSV, because it is human-readableAvro, because it is a compact binary row format with an embedded schema designed for streamingORC, because it was built for streaming workloads
4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?
Page → Column Chunk → Row Group → FileFile → Row Group → Column Chunk → PageFile → Page → Row Group → Column ChunkRow Group → File → Page → Column Chunk
5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?
Bit-packing, because countries are stored as small integersBloom filters, because they speed up point lookupsDictionary encoding, because it replaces repeated values with compact integer IDsRun-length encoding only, since RLE always beats dictionary encoding
1. File Formats and Columnar Storage
Storage decisions cascade through every layer of a data platform. The bytes you commit to disk determine how much money you spend on storage, how much money you spend on scans, and how fast queries return. This partition focuses on file formats — the lowest layer of the storage stack — with special attention to why columnar Parquet has become the default for analytics.
Key Points
Row-oriented formats (CSV, JSON, Avro) store all fields of a record together — ideal for ingestion and whole-record reads, terrible for column-selective analytics.
Parquet and ORC flip the layout to columnar so queries that touch 3 of 50 columns read only ~6% of the file.
Parquet organizes a file as File → Row Group (~128 MB) → Column Chunk → Page, with min/max stats and bloom filters at every level.
Predicate pushdown uses those statistics to skip data that cannot match a filter — the heart of Parquet's 7-100x speedup over CSV on analytical workloads.
Encodings (dictionary, RLE, bit-packing, bloom filters) stack on top of compression, often shrinking files by 95%+.
1.1 Row-Oriented vs Columnar: The Core Trade-off
Row-oriented formats — CSV, JSON, Avro — store records the way you'd write them on a notepad: all the fields for one record together, then all the fields for the next. CSV is the lowest common denominator (every spreadsheet and scripting language reads it), JSON layers in nested types (great for API payloads), and Avro is a compact binary row format with an embedded schema designed for high-throughput streaming on Apache Kafka.
Row formats are great when you need to read or write whole records: ingesting webhook events, exporting a customer profile, replaying a Kafka topic. They are terrible for analytics because answering “average order value last quarter?” forces the engine to read every byte of every order — including customer addresses, SKUs, and shipping notes you don't care about — just to grab one column.
Parquet and ORC flip the layout. Instead of grouping rows together, they group values from the same column together. A query that touches three of fifty columns reads roughly 6% of the file rather than 100%. Apache Parquet has become the de facto standard for analytical lakes; ORC is its close cousin, more common in legacy Hive deployments.
1.2 Parquet's Hierarchical Layout
Parquet organizes a file hierarchically: the file is split into row groups of around 128 MB, each row group is sliced into per-column column chunks, and each chunk is broken into pages that hold the actual encoded values. At every level, Parquet stores rich metadata: min/max values, null counts, distinct counts, and optional bloom filters.
That metadata enables predicate pushdown — the engine consults statistics and skips data that cannot match the filter without reading the values. If a query asks WHERE shipdate ≤ '1996-09-02' and a row group's max shipdate is '1996-08-15', the engine reads it. If the row group's min shipdate is '1997-01-01', the engine skips the entire row group. On TPC-H SF20 this kind of pruning skips roughly 30% of data on selective filters.
Columnar Predicate Pushdown
Query: SELECT SUM(amount) WHERE amount > 100. Only the amount column is fetched; row groups whose max(amount) is below 100 are skipped via pushdown.
1.3 Encodings That Compound the Win
Parquet stacks several encodings on top of compression:
Encoding
What It Does
Best For
Dictionary encoding
Replaces repeated values with integer IDs
Low-cardinality columns (status, country)
Run-length encoding (RLE)
Compresses runs of identical values into (value, count)
Sorted or naturally clustered columns
Bit-packing
Stores small integers using only the bits they need
Counts, IDs, encoded categoricals
Bloom filters
Probabilistic structure for “definitely not present” checks
Point lookups by ID
The numbers are striking. On a 194 GB Allstate dataset, Parquet compressed the data to 4.7 GB — ~97% reduction — and read 3.5x less data on full scans. On TPC-H SF20, Parquet was 5x smaller than CSV (3.2 GB vs 16 GB) and 7-10x faster on joins (2 s versus 20 s). On AWS Athena, the same query scanned 117 KB of Parquet versus 48 MB of CSV, which directly shows up on the bill.
1.4 Choosing a Format by Access Pattern
The choice is rarely “Parquet always wins.” It depends on access pattern.
Pattern
Recommended Format
Why
Streaming ingest with schema evolution
Avro
Compact binary, embedded schema, row-based writes
Webhook landing zone, debug-friendly
JSON or CSV
Human-readable, no tooling required
Analytical scans, BI dashboards
Parquet (or ORC on Hive)
Columnar, compressed, predicate pushdown
Small (<100 MB) reference data
CSV is fine
Tooling overhead exceeds benefit
OLAP at >100 MB scale
Parquet
5-40x smaller, 7-100x faster queries
A common pattern is to land raw data as JSON or Avro for fidelity, then immediately convert to Parquet for analytics. The raw layer is the audit trail; the Parquet layer is what dashboards actually query.
Post-Section Quiz: File Formats and Columnar Storage
1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?
Parquet uses tighter ASCII encoding than CSVParquet stores values column-by-column with rich min/max statistics, so engines can read only relevant columns and skip irrelevant row groupsParquet files are stored unencrypted, which removes overheadParquet files are always smaller because they drop NULL values entirely
2. What is "predicate pushdown" in the context of Parquet?
Pushing the WHERE clause from the application server down into the databaseA query optimization where filters are evaluated against file/row-group/page statistics so irrelevant data is never readForcing the query engine to scan every page to verify the predicateA Hive feature that rewrites JOINs as MapReduce jobs
3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?
Parquet, because it has the strongest compressionCSV, because it is human-readableAvro, because it is a compact binary row format with an embedded schema designed for streamingORC, because it was built for streaming workloads
4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?
Page → Column Chunk → Row Group → FileFile → Row Group → Column Chunk → PageFile → Page → Row Group → Column ChunkRow Group → File → Page → Column Chunk
5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?
Bit-packing, because countries are stored as small integersBloom filters, because they speed up point lookupsDictionary encoding, because it replaces repeated values with compact integer IDsRun-length encoding only, since RLE always beats dictionary encoding
Pre-Section Quiz: Open Table Formats
1. What core problem do open table formats solve that raw Parquet on object storage does not?
They compress files more aggressively than Parquet aloneThey add ACID transactions, schema evolution, and time travel on top of immutable data filesThey replace Parquet entirely with a proprietary columnar layoutThey eliminate the need for a query engine
2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?
Catalog → manifest list → metadata.json → manifest → data filesCatalog → metadata.json → snapshot/manifest list → manifest → data filesmetadata.json → catalog → data files → manifest → snapshotSnapshot → catalog → manifest → metadata.json → data files
3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?
It stores a separate Parquet file per column versionEach column has an immutable numeric ID, so name changes only update metadataIceberg keeps every column as text and re-parses on readIceberg disables schema enforcement entirely
4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?
Copy-on-Write, because it minimizes read latencyMerge-on-Read, because new changes are written as small delta logs and merged at read timeIceberg snapshots, because they are atomicDelta Lake's _delta_log, because it is append-only
5. Your organization wants vendor-neutral lakehouse storage queryable from Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB. Which open table format has the broadest native engine support?
Delta Lake, because it is open sourceApache Hudi, because it is the oldest of the threeApache Iceberg, with native support across Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDBPlain Parquet, because every engine reads it
2. Open Table Formats: Iceberg, Delta, Hudi
A Parquet file is just a Parquet file. It has no notion of “the current state of the orders table.” If two writers append concurrently, or one writer fails halfway through, you can end up with partial data, duplicates, or queries that see inconsistent snapshots. Open table formats are thin metadata layers on top of Parquet/ORC that provide ACID transactions, schema evolution, and time travel on commodity object storage.
Key Points
Apache Iceberg uses a tree of JSON manifests rooted at a catalog pointer; serializable isolation via optimistic concurrency; broadest engine support; aggressive schema and partition evolution backed by stable column IDs.
Delta Lake uses an ordered _delta_log of JSON commits replayed from checkpoints; tightest Databricks/Spark integration; default 30-day version retention.
Apache Hudi is streaming-first with two table types — Copy-on-Write (rewrite affected files on each write) and Merge-on-Read (append delta logs, compact later).
All three deliver serializable isolation, but they differ sharply on schema flexibility, partition evolution, and engine ecosystem.
Pick Iceberg for multi-engine flexibility, Delta for Databricks-centric workloads, Hudi for streaming upserts and CDC.
2.1 Apache Iceberg: A Tree of Snapshots
Iceberg organizes a table as a tree of metadata files. At the root sits a metadata.json file pointing at the current snapshot. Each snapshot lists manifest files, and each manifest lists the data files (Parquet or ORC) that belong to the table at that moment. Writers append new data files and produce a new snapshot atomically; readers see whatever snapshot was current when their query started. The result is serializable isolation via optimistic concurrency control — two writers can prepare commits in parallel; the second one detects a conflict and retries.
From the catalog pointer down to data files, each layer dereferences the next. A reader walks this tree once at query start.
Because every snapshot is preserved (until expired), Iceberg supports time travel:
-- Query Iceberg as of a specific time
SELECT * FROM iceberg_table
FOR SYSTEM_TIME AS OF '2026-05-07 14:00:00';
-- Or by snapshot ID
SELECT * FROM iceberg_table
FOR SYSTEM_VERSION AS OF 4538291;
Iceberg's biggest superpower is schema evolution backed by stable column IDs. Each column has an immutable numeric ID, so you can rename, reorder, drop, or add columns without rewriting any data files. Partition evolution works the same way: change the partition spec on a go-forward basis — old partitions and new partitions coexist seamlessly.
2.2 Delta Lake: A Transaction Log
Delta Lake takes a different route. Instead of a tree of manifests, it maintains a _delta_log directory containing an ordered series of JSON commit files, each describing one transaction (add file X, remove file Y, update schema, etc.). To compute the current table state, the reader replays the log from the last checkpoint forward — much like a database write-ahead log applied to object storage.
-- Delta time travel by version
SELECT * FROM delta_table VERSION AS OF 5;
-- By timestamp
SELECT * FROM delta_table TIMESTAMP AS OF '2026-05-01';
Delta's default version retention is 30 days. Schema evolution is more limited than Iceberg's: you can add columns at the end and merge schemas during writes, but native rename/drop is constrained. Where Delta shines is the Databricks ecosystem: zero-copy table cloning, deep MERGE optimizations, and tight integration with MLflow and Unity Catalog.
2.3 Apache Hudi: Streaming-First Upserts
Hudi was built at Uber to solve a problem the others didn't initially tackle: efficient row-level upserts and deletes on a lake. Hudi maintains a timeline of instants and offers two table types:
Table Type
How It Works
Best For
Copy-on-Write (CoW)
Each write rewrites affected files in full. Reads are fast and uniform.
Read-heavy analytics, batch ETL
Merge-on-Read (MoR)
New changes are written as small delta logs and merged at read time (with periodic compaction). Writes are fast; reads pay a small merge cost.
Streaming ingestion, CDC, low-latency upserts
flowchart LR
Upsert["Incoming upsert batch"]
Upsert --> CoW["Copy-on-Write path"]
Upsert --> MoR["Merge-on-Read path"]
CoW --> Rewrite["Rewrite affected base Parquet files"]
Rewrite --> CoWRead["Reader: scan base files (fast, uniform)"]
MoR --> Delta["Append small delta log files (Avro)"]
Delta --> Compact["Periodic compaction job"]
Compact --> Base["Merged base Parquet files"]
Delta --> MoRRead["Reader: merge base + deltas at query time"]
Base --> MoRRead
Hudi Copy-on-Write vs Merge-on-Read
Top: CoW rewrites affected base files on every upsert — uniform reads. Bottom: MoR appends delta logs and merges at read time — fast writes.
Hudi also brings record-level indexing (Bloom, Hash File, HBase-backed), so updating one row by primary key doesn't require scanning the whole partition. This makes Hudi the format of choice for CDC pipelines from transactional sources. Schema evolution is more limited than Iceberg's, and ecosystem support outside Spark and Flink is thinner.
2.4 Side by Side
Feature
Iceberg
Delta Lake
Hudi
Metadata
JSON manifest tree
_delta_log transaction log
Timeline (DeltaLog)
Isolation
Serializable
Serializable
Serializable
Schema evolution
Most advanced (rename, drop, reorder, promote)
Moderate (add at end, merge)
Basic
Partition evolution
Hidden + dynamic
Explicit, requires rewrite
Largely fixed
Time travel
Snapshot-based
Version & timestamp
Instant-based
Best for
Multi-engine warehouses
Databricks/Spark, ML
Streaming, CDC, upserts
Native engines
Spark, Flink, Trino, Athena, BigQuery, Snowflake
Spark, Databricks-centric
Spark, Flink
An analogy: Iceberg is like Git for your lake — every commit is a snapshot you can travel back to. Delta is like a single transaction log a la PostgreSQL's WAL, replayed to derive state. Hudi is like a journaling filesystem optimized for many small updates with periodic compaction.
Post-Section Quiz: Open Table Formats
1. What core problem do open table formats solve that raw Parquet on object storage does not?
They compress files more aggressively than Parquet aloneThey add ACID transactions, schema evolution, and time travel on top of immutable data filesThey replace Parquet entirely with a proprietary columnar layoutThey eliminate the need for a query engine
2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?
Catalog → manifest list → metadata.json → manifest → data filesCatalog → metadata.json → snapshot/manifest list → manifest → data filesmetadata.json → catalog → data files → manifest → snapshotSnapshot → catalog → manifest → metadata.json → data files
3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?
It stores a separate Parquet file per column versionEach column has an immutable numeric ID, so name changes only update metadataIceberg keeps every column as text and re-parses on readIceberg disables schema enforcement entirely
4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?
Copy-on-Write, because it minimizes read latencyMerge-on-Read, because new changes are written as small delta logs and merged at read timeIceberg snapshots, because they are atomicDelta Lake's _delta_log, because it is append-only
5. Your organization wants vendor-neutral lakehouse storage queryable from Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB. Which open table format has the broadest native engine support?
Delta Lake, because it is open sourceApache Hudi, because it is the oldest of the threeApache Iceberg, with native support across Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDBPlain Parquet, because every engine reads it
Pre-Section Quiz: S3 Object Storage and Partitioning
1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?
S3 Standard, to keep retrieval instantS3 Glacier Deep Archive, since you do not query it frequentlyS3 Glacier Instant Retrieval, which is cheap to store and still returns in millisecondsS3 One Zone-IA, since data is stored in one AZ for durability
2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?
It uses a different file format than ParquetAWS automates compaction, snapshot expiration, and unreferenced file removal as a managed serviceIt eliminates the need for a query engineIt removes the per-byte scanned charge from Athena
3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?
Hive partitions only work for time columnsWithout the partition filter the engine cannot narrow down directories and falls back to a full scanHive partitions corrupt unless every column is filteredPartition pruning requires a special PRUNE SQL keyword
4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?
It removes partitioning entirelyPartition values are derived from source columns via transforms (e.g., days(order_ts)) so users write natural filters and pruning still worksIt encodes partition values into S3 object tags rather than directoriesIt scans every directory to determine the partition
5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?
Hive-partition by user_id aloneBucket the entire table by countryPartition by days(event_ts) and Z-order within partition by (user_id, country)Store as a single 5 TB Parquet file
3. S3 Object Storage and Partitioning
This partition covers the bottom and the connective layer: where the bytes physically live (S3 storage classes, S3 Tables, lifecycle policies) and how they are arranged for fast pruning (Hive partitioning, Iceberg hidden partitioning, bucketing, clustering, Z-ordering).
Key Points
S3 is a family of storage classes — Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant/Flexible/Deep — with cost differences of 50-95% across tiers.
Amazon S3 Tables is a fully managed Iceberg service launched at re:Invent 2024; up to 3x query throughput and 10x TPS versus self-managed Iceberg, with automated compaction and snapshot expiry.
Hive partitioning encodes partition values in directory paths; simple and universal, but requires users to filter on partition columns or pruning silently fails.
Iceberg hidden partitioning derives partition values via transforms (days, hours, bucket, truncate) from source columns — users write natural SQL.
Combine coarse partitioning on the dominant filter (often time) with Z-ordering on high-cardinality columns to make billion-row tables feel indexed.
3.1 S3 Storage Classes
Amazon S3 is the de facto storage backbone for AWS-based lakes. What many teams miss is that S3 is not one storage tier — it's a family of classes with very different cost and access profiles. Choosing the right class per object can cut the storage bill 50-95%.
Storage Class
Designed For
Access Latency
Cost Profile
S3 Standard
Hot, frequently accessed data
Milliseconds
Highest storage, lowest retrieval
S3 Intelligent-Tiering
Unknown/changing access patterns
Milliseconds
Auto-moves objects between tiers
S3 Standard-IA
Less frequent but rapid access
Milliseconds
~45% cheaper storage, retrieval fee
S3 One Zone-IA
Recreatable infrequent data
Milliseconds
Single AZ, ~20% cheaper than Standard-IA
S3 Glacier Instant Retrieval
Archive with millisecond access
Milliseconds
Low storage, higher retrieval
S3 Glacier Flexible Retrieval
Archive, minutes-to-hours retrieval
Minutes-hours
Very low storage
S3 Glacier Deep Archive
Long-term compliance archives
Hours (12+)
Lowest storage cost
3.2 Amazon S3 Tables: Managed Iceberg
Running Iceberg yourself is workable but operationally heavy: schedule compaction (small file problem), expire old snapshots, clean up unreferenced files, tune metadata layout. Most teams underinvest, and table performance silently degrades.
Amazon S3 Tables, announced at AWS re:Invent 2024, is a fully managed Iceberg service built on a new bucket type called table buckets. Tables are first-class AWS resources with their own ARNs, IAM policies, and dedicated endpoints.
Performance — up to 3x query throughput and 10x TPS versus self-managed Iceberg.
Native integrations — auto-registered in AWS Glue Data Catalog; queryable from Athena, EMR, Spark, Redshift, Kinesis Data Firehose, and QuickSight.
3.3 Lifecycle Policies and Intelligent Tiering
For data not in S3 Tables, control cost through lifecycle policies — declarative rules that transition objects between storage classes by age. A common lake lifecycle:
# Conceptual S3 lifecycle policy
- Day 0-30: S3 Standard (hot ETL output)
- Day 30-90: S3 Standard-IA (occasional ad-hoc queries)
- Day 90-365: S3 Glacier Instant Retrieval (audit access)
- Day 365+: S3 Glacier Deep Archive (compliance only)
- Day 2555: Delete (7 years)
stateDiagram-v2
[*] --> S3_Standard: Object created (Day 0)
S3_Standard --> S3_Standard_IA: Day 30 (infrequent access)
S3_Standard_IA --> Glacier_Instant: Day 90 (audit-only access)
Glacier_Instant --> Glacier_Deep_Archive: Day 365 (compliance only)
Glacier_Deep_Archive --> [*]: Day 2555 (delete after 7 years)
S3_Standard --> Intelligent_Tiering: Unknown access pattern
Intelligent_Tiering --> Intelligent_Tiering: Auto-move Frequent/Infrequent/Archive
3.4 Hive-Style Partitioning
The classical scheme, inherited from Apache Hive, encodes partition values directly into the directory path:
A query with WHERE region='us' AND year=2026 AND month=5 only lists those directories. The query engine never sees the EU files. Downsides: users must include partition columns explicitly or pruning silently fails; partition values are tied to physical layout, so changing the scheme requires rewriting all historical data; too-fine partitioning produces millions of tiny files.
3.5 Hidden Partitioning in Iceberg
Iceberg solves the awkward parts with hidden partitioning. The partition spec lives in metadata, not the directory layout, and Iceberg automatically derives partition values from source columns using transforms:
-- Define partitioning by day, derived from a timestamp column
CREATE TABLE orders (
order_id BIGINT,
customer_id BIGINT,
order_ts TIMESTAMP,
amount DECIMAL(12,2)
) PARTITIONED BY (days(order_ts));
-- Users write natural SQL, Iceberg prunes automatically
SELECT SUM(amount)
FROM orders
WHERE order_ts BETWEEN '2026-05-01' AND '2026-05-07';
Available transforms include years, months, days, hours, bucket(N, col), and truncate(N, col); the partition spec can evolve over time without rewriting old data.
3.6 Bucketing, Clustering, and Z-Ordering
Partitioning works best for low-cardinality columns. For high-cardinality columns like customer_id or order_id, partitioning is impossible — you'd get millions of tiny partitions. The answer is bucketing, clustering, and Z-ordering.
Technique
Best For
Cardinality
Mechanism
Hive partitioning
Coarse filters on a few columns
Low (10s-1000s)
Directory paths
Hidden partitioning (Iceberg)
Same as Hive but with evolution
Low
Metadata + transforms
Bucketing
Point lookups, equi-joins
High
Hash to fixed buckets
Clustering / sort
Range queries on one dominant column
High
Sort within file
Z-ordering
Multi-column range filters
High, multi-dim
Interleaved bit clustering
A worked example: a 5 TB clickstream table queried mostly by event_date, occasionally by user_id, sometimes by country:
Partition by days(event_ts) — coarse pruning on the dominant filter.
Z-order within each partition by (user_id, country) — second-dimension pruning.
Target file size 128-512 MB to keep file count manageable.
flowchart TD
Q["Query: WHERE event_ts in last 24h AND user_id = 12345"]
Q --> P["Partition prune: keep day=2026-05-07"]
P --> Z["Z-order prune: skip files whose (user_id, country) range excludes 12345"]
Z --> RG["Parquet row-group prune: skip groups via min/max stats"]
RG --> Page["Page-level scan: read only matching pages"]
Page --> Result["Return matching rows"]
P -. "skips ~99% of partitions" .-> Skip1["Skipped TB"]
Z -. "skips most surviving files" .-> Skip2["Skipped GB"]
RG -. "skips ~30% on selective filters" .-> Skip3["Skipped MB"]
Layered Pruning: Partition → Z-Order → Row Group
Each layer narrows the search space. Red tiles are skipped without being read; blue tiles survive and are passed down to the next layer.
Queries filtering on date alone hit the partition prune. Queries also filtering on user_id additionally skip most files within the day. The metadata stays small because there are at most a few thousand date partitions, not millions of user partitions.
Post-Section Quiz: S3 Object Storage and Partitioning
1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?
S3 Standard, to keep retrieval instantS3 Glacier Deep Archive, since you do not query it frequentlyS3 Glacier Instant Retrieval, which is cheap to store and still returns in millisecondsS3 One Zone-IA, since data is stored in one AZ for durability
2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?
It uses a different file format than ParquetAWS automates compaction, snapshot expiration, and unreferenced file removal as a managed serviceIt eliminates the need for a query engineIt removes the per-byte scanned charge from Athena
3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?
Hive partitions only work for time columnsWithout the partition filter the engine cannot narrow down directories and falls back to a full scanHive partitions corrupt unless every column is filteredPartition pruning requires a special PRUNE SQL keyword
4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?
It removes partitioning entirelyPartition values are derived from source columns via transforms (e.g., days(order_ts)) so users write natural filters and pruning still worksIt encodes partition values into S3 object tags rather than directoriesIt scans every directory to determine the partition
5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?
Hive-partition by user_id aloneBucket the entire table by countryPartition by days(event_ts) and Z-order within partition by (user_id, country)Store as a single 5 TB Parquet file