Study Guide: Storage Foundations and Open Table Formats

Pre-Section Quiz: File Formats and Columnar Storage

1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?

Parquet uses tighter ASCII encoding than CSV Parquet stores values column-by-column with rich min/max statistics, so engines can read only relevant columns and skip irrelevant row groups Parquet files are stored unencrypted, which removes overhead Parquet files are always smaller because they drop NULL values entirely

2. What is "predicate pushdown" in the context of Parquet?

Pushing the WHERE clause from the application server down into the database A query optimization where filters are evaluated against file/row-group/page statistics so irrelevant data is never read Forcing the query engine to scan every page to verify the predicate A Hive feature that rewrites JOINs as MapReduce jobs

3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?

Parquet, because it has the strongest compression CSV, because it is human-readable Avro, because it is a compact binary row format with an embedded schema designed for streaming ORC, because it was built for streaming workloads

4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?

Page → Column Chunk → Row Group → File File → Row Group → Column Chunk → Page File → Page → Row Group → Column Chunk Row Group → File → Page → Column Chunk

5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?

Bit-packing, because countries are stored as small integers Bloom filters, because they speed up point lookups Dictionary encoding, because it replaces repeated values with compact integer IDs Run-length encoding only, since RLE always beats dictionary encoding

1. File Formats and Columnar Storage

Storage decisions cascade through every layer of a data platform. The bytes you commit to disk determine how much money you spend on storage, how much money you spend on scans, and how fast queries return. This partition focuses on file formats — the lowest layer of the storage stack — with special attention to why columnar Parquet has become the default for analytics.

Key Points

Row-oriented formats (CSV, JSON, Avro) store all fields of a record together — ideal for ingestion and whole-record reads, terrible for column-selective analytics.
Parquet and ORC flip the layout to columnar so queries that touch 3 of 50 columns read only ~6% of the file.
Parquet organizes a file as File → Row Group (~128 MB) → Column Chunk → Page, with min/max stats and bloom filters at every level.
Predicate pushdown uses those statistics to skip data that cannot match a filter — the heart of Parquet's 7-100x speedup over CSV on analytical workloads.
Encodings (dictionary, RLE, bit-packing, bloom filters) stack on top of compression, often shrinking files by 95%+.

1.1 Row-Oriented vs Columnar: The Core Trade-off

Row-oriented formats — CSV, JSON, Avro — store records the way you'd write them on a notepad: all the fields for one record together, then all the fields for the next. CSV is the lowest common denominator (every spreadsheet and scripting language reads it), JSON layers in nested types (great for API payloads), and Avro is a compact binary row format with an embedded schema designed for high-throughput streaming on Apache Kafka.

Row formats are great when you need to read or write whole records: ingesting webhook events, exporting a customer profile, replaying a Kafka topic. They are terrible for analytics because answering “average order value last quarter?” forces the engine to read every byte of every order — including customer addresses, SKUs, and shipping notes you don't care about — just to grab one column.

Parquet and ORC flip the layout. Instead of grouping rows together, they group values from the same column together. A query that touches three of fifty columns reads roughly 6% of the file rather than 100%. Apache Parquet has become the de facto standard for analytical lakes; ORC is its close cousin, more common in legacy Hive deployments.

1.2 Parquet's Hierarchical Layout

Parquet organizes a file hierarchically: the file is split into row groups of around 128 MB, each row group is sliced into per-column column chunks, and each chunk is broken into pages that hold the actual encoded values. At every level, Parquet stores rich metadata: min/max values, null counts, distinct counts, and optional bloom filters.

flowchart TD File["Parquet File"] File --> RG1["Row Group 1 (~128 MB)"] File --> RG2["Row Group 2 (~128 MB)"] File --> Footer["Footer (file metadata + schema)"] RG1 --> CC1["Column Chunk: order_id"] RG1 --> CC2["Column Chunk: customer_id"] RG1 --> CC3["Column Chunk: amount"] CC1 --> P1["Page 1 (encoded values + stats)"] CC1 --> P2["Page 2 (encoded values + stats)"] CC1 --> P3["Page N ..."] Footer --> Stats["Min/Max, null counts, bloom filters"] Stats -. "predicate pushdown skips chunks/pages" .-> RG2

That metadata enables predicate pushdown — the engine consults statistics and skips data that cannot match the filter without reading the values. If a query asks WHERE shipdate ≤ '1996-09-02' and a row group's max shipdate is '1996-08-15', the engine reads it. If the row group's min shipdate is '1997-01-01', the engine skips the entire row group. On TPC-H SF20 this kind of pruning skips roughly 30% of data on selective filters.

Columnar Predicate Pushdown

Query: SELECT SUM(amount) WHERE amount > 100. Only the amount column is fetched; row groups whose max(amount) is below 100 are skipped via pushdown.

1.3 Encodings That Compound the Win

Parquet stacks several encodings on top of compression:

Encoding	What It Does	Best For
Dictionary encoding	Replaces repeated values with integer IDs	Low-cardinality columns (status, country)
Run-length encoding (RLE)	Compresses runs of identical values into (value, count)	Sorted or naturally clustered columns
Bit-packing	Stores small integers using only the bits they need	Counts, IDs, encoded categoricals
Bloom filters	Probabilistic structure for “definitely not present” checks	Point lookups by ID

The numbers are striking. On a 194 GB Allstate dataset, Parquet compressed the data to 4.7 GB — ~97% reduction — and read 3.5x less data on full scans. On TPC-H SF20, Parquet was 5x smaller than CSV (3.2 GB vs 16 GB) and 7-10x faster on joins (2 s versus 20 s). On AWS Athena, the same query scanned 117 KB of Parquet versus 48 MB of CSV, which directly shows up on the bill.

1.4 Choosing a Format by Access Pattern

The choice is rarely “Parquet always wins.” It depends on access pattern.

Pattern	Recommended Format	Why
Streaming ingest with schema evolution	Avro	Compact binary, embedded schema, row-based writes
Webhook landing zone, debug-friendly	JSON or CSV	Human-readable, no tooling required
Analytical scans, BI dashboards	Parquet (or ORC on Hive)	Columnar, compressed, predicate pushdown
Small (<100 MB) reference data	CSV is fine	Tooling overhead exceeds benefit
OLAP at >100 MB scale	Parquet	5-40x smaller, 7-100x faster queries

A common pattern is to land raw data as JSON or Avro for fidelity, then immediately convert to Parquet for analytics. The raw layer is the audit trail; the Parquet layer is what dashboards actually query.

Post-Section Quiz: File Formats and Columnar Storage

1. Why does Parquet typically read 5-40x less data than CSV for an analytical query that filters and aggregates a few columns?

2. What is "predicate pushdown" in the context of Parquet?

3. You are designing a Kafka pipeline where producers and consumers evolve schemas independently. Which file format is the natural fit for the on-the-wire records?

4. Inside a Parquet file, what is the correct hierarchy of physical units from largest to smallest?

Page → Column Chunk → Row Group → File File → Row Group → Column Chunk → Page File → Page → Row Group → Column Chunk Row Group → File → Page → Column Chunk

5. Which encoding inside Parquet is best suited for a low-cardinality column like country with values that repeat thousands of times?

Pre-Section Quiz: Open Table Formats

1. What core problem do open table formats solve that raw Parquet on object storage does not?

They compress files more aggressively than Parquet alone They add ACID transactions, schema evolution, and time travel on top of immutable data files They replace Parquet entirely with a proprietary columnar layout They eliminate the need for a query engine

2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?

Catalog → manifest list → metadata.json → manifest → data files Catalog → metadata.json → snapshot/manifest list → manifest → data files metadata.json → catalog → data files → manifest → snapshot Snapshot → catalog → manifest → metadata.json → data files

3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?

It stores a separate Parquet file per column version Each column has an immutable numeric ID, so name changes only update metadata Iceberg keeps every column as text and re-parses on read Iceberg disables schema enforcement entirely

4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?

Copy-on-Write, because it minimizes read latency Merge-on-Read, because new changes are written as small delta logs and merged at read time Iceberg snapshots, because they are atomic Delta Lake's _delta_log, because it is append-only

5. Your organization wants vendor-neutral lakehouse storage queryable from Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB. Which open table format has the broadest native engine support?

Delta Lake, because it is open source Apache Hudi, because it is the oldest of the three Apache Iceberg, with native support across Spark, Flink, Trino, Athena, BigQuery, Snowflake, and DuckDB Plain Parquet, because every engine reads it

2. Open Table Formats: Iceberg, Delta, Hudi

A Parquet file is just a Parquet file. It has no notion of “the current state of the orders table.” If two writers append concurrently, or one writer fails halfway through, you can end up with partial data, duplicates, or queries that see inconsistent snapshots. Open table formats are thin metadata layers on top of Parquet/ORC that provide ACID transactions, schema evolution, and time travel on commodity object storage.

Key Points

Apache Iceberg uses a tree of JSON manifests rooted at a catalog pointer; serializable isolation via optimistic concurrency; broadest engine support; aggressive schema and partition evolution backed by stable column IDs.
Delta Lake uses an ordered _delta_log of JSON commits replayed from checkpoints; tightest Databricks/Spark integration; default 30-day version retention.
Apache Hudi is streaming-first with two table types — Copy-on-Write (rewrite affected files on each write) and Merge-on-Read (append delta logs, compact later).
All three deliver serializable isolation, but they differ sharply on schema flexibility, partition evolution, and engine ecosystem.
Pick Iceberg for multi-engine flexibility, Delta for Databricks-centric workloads, Hudi for streaming upserts and CDC.

2.1 Apache Iceberg: A Tree of Snapshots

Iceberg organizes a table as a tree of metadata files. At the root sits a metadata.json file pointing at the current snapshot. Each snapshot lists manifest files, and each manifest lists the data files (Parquet or ORC) that belong to the table at that moment. Writers append new data files and produce a new snapshot atomically; readers see whatever snapshot was current when their query started. The result is serializable isolation via optimistic concurrency control — two writers can prepare commits in parallel; the second one detects a conflict and retries.

flowchart TD Catalog["Catalog (Glue / Hive / REST)"] Catalog --> MetaJSON["metadata.json (current snapshot pointer)"] MetaJSON --> Snap1["Snapshot S1 (older)"] MetaJSON --> Snap2["Snapshot S2 (current)"] Snap1 --> ML1["Manifest List S1"] Snap2 --> ML2["Manifest List S2"] ML1 --> M1["Manifest File A"] ML2 --> M1 ML2 --> M2["Manifest File B (new)"] M1 --> D1["data-001.parquet"] M1 --> D2["data-002.parquet"] M2 --> D3["data-003.parquet (newly added)"]

Iceberg Metadata Layers Reveal

From the catalog pointer down to data files, each layer dereferences the next. A reader walks this tree once at query start.

Because every snapshot is preserved (until expired), Iceberg supports time travel:

-- Query Iceberg as of a specific time
SELECT * FROM iceberg_table
FOR SYSTEM_TIME AS OF '2026-05-07 14:00:00';

-- Or by snapshot ID
SELECT * FROM iceberg_table
FOR SYSTEM_VERSION AS OF 4538291;

Iceberg's biggest superpower is schema evolution backed by stable column IDs. Each column has an immutable numeric ID, so you can rename, reorder, drop, or add columns without rewriting any data files. Partition evolution works the same way: change the partition spec on a go-forward basis — old partitions and new partitions coexist seamlessly.

2.2 Delta Lake: A Transaction Log

Delta Lake takes a different route. Instead of a tree of manifests, it maintains a _delta_log directory containing an ordered series of JSON commit files, each describing one transaction (add file X, remove file Y, update schema, etc.). To compute the current table state, the reader replays the log from the last checkpoint forward — much like a database write-ahead log applied to object storage.

-- Delta time travel by version
SELECT * FROM delta_table VERSION AS OF 5;

-- By timestamp
SELECT * FROM delta_table TIMESTAMP AS OF '2026-05-01';

Delta's default version retention is 30 days. Schema evolution is more limited than Iceberg's: you can add columns at the end and merge schemas during writes, but native rename/drop is constrained. Where Delta shines is the Databricks ecosystem: zero-copy table cloning, deep MERGE optimizations, and tight integration with MLflow and Unity Catalog.

2.3 Apache Hudi: Streaming-First Upserts

Hudi was built at Uber to solve a problem the others didn't initially tackle: efficient row-level upserts and deletes on a lake. Hudi maintains a timeline of instants and offers two table types:

Table Type	How It Works	Best For
Copy-on-Write (CoW)	Each write rewrites affected files in full. Reads are fast and uniform.	Read-heavy analytics, batch ETL
Merge-on-Read (MoR)	New changes are written as small delta logs and merged at read time (with periodic compaction). Writes are fast; reads pay a small merge cost.	Streaming ingestion, CDC, low-latency upserts

flowchart LR Upsert["Incoming upsert batch"] Upsert --> CoW["Copy-on-Write path"] Upsert --> MoR["Merge-on-Read path"] CoW --> Rewrite["Rewrite affected base Parquet files"] Rewrite --> CoWRead["Reader: scan base files (fast, uniform)"] MoR --> Delta["Append small delta log files (Avro)"] Delta --> Compact["Periodic compaction job"] Compact --> Base["Merged base Parquet files"] Delta --> MoRRead["Reader: merge base + deltas at query time"] Base --> MoRRead

Hudi Copy-on-Write vs Merge-on-Read

Top: CoW rewrites affected base files on every upsert — uniform reads. Bottom: MoR appends delta logs and merges at read time — fast writes.

Hudi also brings record-level indexing (Bloom, Hash File, HBase-backed), so updating one row by primary key doesn't require scanning the whole partition. This makes Hudi the format of choice for CDC pipelines from transactional sources. Schema evolution is more limited than Iceberg's, and ecosystem support outside Spark and Flink is thinner.

2.4 Side by Side

Feature	Iceberg	Delta Lake	Hudi
Metadata	JSON manifest tree	`_delta_log` transaction log	Timeline (DeltaLog)
Isolation	Serializable	Serializable	Serializable
Schema evolution	Most advanced (rename, drop, reorder, promote)	Moderate (add at end, merge)	Basic
Partition evolution	Hidden + dynamic	Explicit, requires rewrite	Largely fixed
Time travel	Snapshot-based	Version & timestamp	Instant-based
Best for	Multi-engine warehouses	Databricks/Spark, ML	Streaming, CDC, upserts
Native engines	Spark, Flink, Trino, Athena, BigQuery, Snowflake	Spark, Databricks-centric	Spark, Flink

An analogy: Iceberg is like Git for your lake — every commit is a snapshot you can travel back to. Delta is like a single transaction log a la PostgreSQL's WAL, replayed to derive state. Hudi is like a journaling filesystem optimized for many small updates with periodic compaction.

Post-Section Quiz: Open Table Formats

1. What core problem do open table formats solve that raw Parquet on object storage does not?

2. From the catalog pointer down to physical data, what is the correct order of Iceberg's metadata layers?

3. Why can Iceberg rename, drop, reorder, or add columns without rewriting any data files?

4. A pipeline applies thousands of CDC upserts per minute and needs writes to land quickly even at the cost of a small read overhead. Which Hudi mode fits best?

Pre-Section Quiz: S3 Object Storage and Partitioning

1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?

S3 Standard, to keep retrieval instant S3 Glacier Deep Archive, since you do not query it frequently S3 Glacier Instant Retrieval, which is cheap to store and still returns in milliseconds S3 One Zone-IA, since data is stored in one AZ for durability

2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?

It uses a different file format than Parquet AWS automates compaction, snapshot expiration, and unreferenced file removal as a managed service It eliminates the need for a query engine It removes the per-byte scanned charge from Athena

3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?

Hive partitions only work for time columns Without the partition filter the engine cannot narrow down directories and falls back to a full scan Hive partitions corrupt unless every column is filtered Partition pruning requires a special PRUNE SQL keyword

4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?

It removes partitioning entirely Partition values are derived from source columns via transforms (e.g., days(order_ts)) so users write natural filters and pruning still works It encodes partition values into S3 object tags rather than directories It scans every directory to determine the partition

5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?

Hive-partition by user_id alone Bucket the entire table by country Partition by days(event_ts) and Z-order within partition by (user_id, country) Store as a single 5 TB Parquet file

3. S3 Object Storage and Partitioning

This partition covers the bottom and the connective layer: where the bytes physically live (S3 storage classes, S3 Tables, lifecycle policies) and how they are arranged for fast pruning (Hive partitioning, Iceberg hidden partitioning, bucketing, clustering, Z-ordering).

Key Points

S3 is a family of storage classes — Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant/Flexible/Deep — with cost differences of 50-95% across tiers.
Amazon S3 Tables is a fully managed Iceberg service launched at re:Invent 2024; up to 3x query throughput and 10x TPS versus self-managed Iceberg, with automated compaction and snapshot expiry.
Hive partitioning encodes partition values in directory paths; simple and universal, but requires users to filter on partition columns or pruning silently fails.
Iceberg hidden partitioning derives partition values via transforms (days, hours, bucket, truncate) from source columns — users write natural SQL.
Combine coarse partitioning on the dominant filter (often time) with Z-ordering on high-cardinality columns to make billion-row tables feel indexed.

3.1 S3 Storage Classes

Amazon S3 is the de facto storage backbone for AWS-based lakes. What many teams miss is that S3 is not one storage tier — it's a family of classes with very different cost and access profiles. Choosing the right class per object can cut the storage bill 50-95%.

Storage Class	Designed For	Access Latency	Cost Profile
S3 Standard	Hot, frequently accessed data	Milliseconds	Highest storage, lowest retrieval
S3 Intelligent-Tiering	Unknown/changing access patterns	Milliseconds	Auto-moves objects between tiers
S3 Standard-IA	Less frequent but rapid access	Milliseconds	~45% cheaper storage, retrieval fee
S3 One Zone-IA	Recreatable infrequent data	Milliseconds	Single AZ, ~20% cheaper than Standard-IA
S3 Glacier Instant Retrieval	Archive with millisecond access	Milliseconds	Low storage, higher retrieval
S3 Glacier Flexible Retrieval	Archive, minutes-to-hours retrieval	Minutes-hours	Very low storage
S3 Glacier Deep Archive	Long-term compliance archives	Hours (12+)	Lowest storage cost

3.2 Amazon S3 Tables: Managed Iceberg

Running Iceberg yourself is workable but operationally heavy: schedule compaction (small file problem), expire old snapshots, clean up unreferenced files, tune metadata layout. Most teams underinvest, and table performance silently degrades.

Amazon S3 Tables, announced at AWS re:Invent 2024, is a fully managed Iceberg service built on a new bucket type called table buckets. Tables are first-class AWS resources with their own ARNs, IAM policies, and dedicated endpoints.

Performance — up to 3x query throughput and 10x TPS versus self-managed Iceberg.
Automatic maintenance — background compaction, snapshot expiration, unreferenced file removal.
Native integrations — auto-registered in AWS Glue Data Catalog; queryable from Athena, EMR, Spark, Redshift, Kinesis Data Firehose, and QuickSight.

3.3 Lifecycle Policies and Intelligent Tiering

For data not in S3 Tables, control cost through lifecycle policies — declarative rules that transition objects between storage classes by age. A common lake lifecycle:

# Conceptual S3 lifecycle policy
- Day 0-30:     S3 Standard (hot ETL output)
- Day 30-90:    S3 Standard-IA (occasional ad-hoc queries)
- Day 90-365:   S3 Glacier Instant Retrieval (audit access)
- Day 365+:     S3 Glacier Deep Archive (compliance only)
- Day 2555:     Delete (7 years)

stateDiagram-v2 [*] --> S3_Standard: Object created (Day 0) S3_Standard --> S3_Standard_IA: Day 30 (infrequent access) S3_Standard_IA --> Glacier_Instant: Day 90 (audit-only access) Glacier_Instant --> Glacier_Deep_Archive: Day 365 (compliance only) Glacier_Deep_Archive --> [*]: Day 2555 (delete after 7 years) S3_Standard --> Intelligent_Tiering: Unknown access pattern Intelligent_Tiering --> Intelligent_Tiering: Auto-move Frequent/Infrequent/Archive

3.4 Hive-Style Partitioning

The classical scheme, inherited from Apache Hive, encodes partition values directly into the directory path:

s3://lake/orders/
  region=us/year=2026/month=05/day=07/part-0001.parquet
  region=us/year=2026/month=05/day=07/part-0002.parquet
  region=eu/year=2026/month=05/day=07/part-0001.parquet

A query with WHERE region='us' AND year=2026 AND month=5 only lists those directories. The query engine never sees the EU files. Downsides: users must include partition columns explicitly or pruning silently fails; partition values are tied to physical layout, so changing the scheme requires rewriting all historical data; too-fine partitioning produces millions of tiny files.

3.5 Hidden Partitioning in Iceberg

Iceberg solves the awkward parts with hidden partitioning. The partition spec lives in metadata, not the directory layout, and Iceberg automatically derives partition values from source columns using transforms:

-- Define partitioning by day, derived from a timestamp column
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  order_ts TIMESTAMP,
  amount DECIMAL(12,2)
) PARTITIONED BY (days(order_ts));

-- Users write natural SQL, Iceberg prunes automatically
SELECT SUM(amount)
FROM orders
WHERE order_ts BETWEEN '2026-05-01' AND '2026-05-07';

Available transforms include years, months, days, hours, bucket(N, col), and truncate(N, col); the partition spec can evolve over time without rewriting old data.

3.6 Bucketing, Clustering, and Z-Ordering

Partitioning works best for low-cardinality columns. For high-cardinality columns like customer_id or order_id, partitioning is impossible — you'd get millions of tiny partitions. The answer is bucketing, clustering, and Z-ordering.

Technique	Best For	Cardinality	Mechanism
Hive partitioning	Coarse filters on a few columns	Low (10s-1000s)	Directory paths
Hidden partitioning (Iceberg)	Same as Hive but with evolution	Low	Metadata + transforms
Bucketing	Point lookups, equi-joins	High	Hash to fixed buckets
Clustering / sort	Range queries on one dominant column	High	Sort within file
Z-ordering	Multi-column range filters	High, multi-dim	Interleaved bit clustering

A worked example: a 5 TB clickstream table queried mostly by event_date, occasionally by user_id, sometimes by country:

Partition by days(event_ts) — coarse pruning on the dominant filter.
Z-order within each partition by (user_id, country) — second-dimension pruning.
Target file size 128-512 MB to keep file count manageable.

flowchart TD Q["Query: WHERE event_ts in last 24h AND user_id = 12345"] Q --> P["Partition prune: keep day=2026-05-07"] P --> Z["Z-order prune: skip files whose (user_id, country) range excludes 12345"] Z --> RG["Parquet row-group prune: skip groups via min/max stats"] RG --> Page["Page-level scan: read only matching pages"] Page --> Result["Return matching rows"] P -. "skips ~99% of partitions" .-> Skip1["Skipped TB"] Z -. "skips most surviving files" .-> Skip2["Skipped GB"] RG -. "skips ~30% on selective filters" .-> Skip3["Skipped MB"]

Layered Pruning: Partition → Z-Order → Row Group

Each layer narrows the search space. Red tiles are skipped without being read; blue tiles survive and are passed down to the next layer.

Queries filtering on date alone hit the partition prune. Queries also filtering on user_id additionally skip most files within the day. The metadata stays small because there are at most a few thousand date partitions, not millions of user partitions.

Post-Section Quiz: S3 Object Storage and Partitioning

1. You have raw landing data that is rarely queried after 90 days but must remain accessible for audits within minutes. Which S3 storage class is the most cost-effective fit at the 90-365 day stage?

2. What is the main operational benefit of Amazon S3 Tables over a self-managed Iceberg deployment in standard S3 buckets?

3. With Hive-style partitioning, why does forgetting to include the partition column in a WHERE clause silently kill performance?

4. Why is Iceberg's hidden partitioning more user-friendly than classical Hive partitioning?

5. A 5 TB clickstream table is queried mostly by event_date, occasionally by user_id, and sometimes by country. Which layout best balances pruning across these patterns?

Hive-partition by user_id alone Bucket the entire table by country Partition by days(event_ts) and Z-order within partition by (user_id, country) Store as a single 5 TB Parquet file

Chapter 3: Storage Foundations and Open Table Formats

Learning Objectives

1. File Formats and Columnar Storage

Key Points

1.1 Row-Oriented vs Columnar: The Core Trade-off

1.2 Parquet's Hierarchical Layout

Columnar Predicate Pushdown

1.3 Encodings That Compound the Win

1.4 Choosing a Format by Access Pattern

2. Open Table Formats: Iceberg, Delta, Hudi

Key Points

2.1 Apache Iceberg: A Tree of Snapshots

Iceberg Metadata Layers Reveal

2.2 Delta Lake: A Transaction Log

2.3 Apache Hudi: Streaming-First Upserts

Hudi Copy-on-Write vs Merge-on-Read

2.4 Side by Side

3. S3 Object Storage and Partitioning

Key Points

3.1 S3 Storage Classes

3.2 Amazon S3 Tables: Managed Iceberg

3.3 Lifecycle Policies and Intelligent Tiering

3.4 Hive-Style Partitioning

3.5 Hidden Partitioning in Iceberg

3.6 Bucketing, Clustering, and Z-Ordering

Layered Pruning: Partition → Z-Order → Row Group

Your Progress

Answer Explanations