Study Guide: Chapter 8 — Interactive Querying and Federated Analytics

Pre-Reading Quiz — Athena Fundamentals

1. A team converts a 1 TB CSV clickstream table to Snappy-compressed Parquet partitioned by event_date, and a single-day query then scans about 800 MB. Which design principle of Athena does this best illustrate?

A. Athena's optimization mindset rewards reducing wall-clock time, not bytes.

B. Athena bills per terabyte scanned, so data layout (not query complexity) is the dominant cost driver.

C. Parquet eliminates the need for partitioning because columnar files always scan less data.

D. Compression alone explains the cost reduction; partitioning is a secondary factor.

2. Why is Athena described as having "always-on availability" as a structural property rather than a marketing claim?

A. AWS replicates the cluster across three Availability Zones.

B. There is no dedicated cluster that can be paused or downsized, or that can fail; workers are allocated per query from a shared pool.

C. Athena keeps a hot standby cluster in every region.

D. Athena uses Redshift Serverless under the hood, which auto-pauses and resumes.

3. What is the relationship between Athena engine version 3 and Trino?

A. Engine v3 is built on Trino and adds Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.

B. Engine v3 is built on PrestoDB; Trino is only used in EMR.

C. Trino was renamed Athena in 2020; engine v3 simply rebrands the v2 engine.

D. Trino is a serialization format used by Athena to ship results to S3.

4. Why is Trino the right execution model for interactive Athena queries but a poor fit for multi-hour ETL?

A. Trino writes shuffle data to disk like Spark, making it slow on small queries.

B. Trino keeps intermediate results in memory across worker nodes, enabling sub-second responses but failing on jobs that exceed cluster memory.

C. Trino can only read CSV, so it cannot handle Parquet ETL output.

D. Trino is single-threaded per query, which limits both interactive and batch scenarios.

5. Where does Athena obtain table definitions, column types, and partition locations?

A. From inline DDL parameters submitted with each query.

B. From the AWS Glue Data Catalog, which is shared with Glue ETL and Lake Formation.

C. From a private metastore inside each Trino worker, separate from Glue.

D. From a Redshift system catalog accessed via Spectrum.

Serverless Query Model

Most data warehouses follow a "load before you query" model: stand up a cluster, ingest data, then run SQL. Amazon Athena inverts that contract. The data already lives in Amazon S3; you point Athena at it and run SQL immediately. There are no nodes to size, no maintenance windows, and no idle capacity to amortize.

The clearest analogy is a public library with a fleet of on-call researchers. You arrive with a question, hand it to the front desk (the Athena API), and behind the scenes a researcher pulls exactly the books your question requires. When the answer comes back, the researcher disappears. You pay only for the books they had to open.

Architecturally, Athena is a managed deployment of the Trino distributed SQL engine (renamed from PrestoSQL in 2020). When you submit a query, Athena allocates worker capacity from a shared, multi-tenant pool, plans the query, scans S3 objects in parallel, and streams results back. Metadata — table definitions, column types, partition locations — comes from the AWS Glue Data Catalog, so any table registered for Glue ETL or Lake Formation governance is immediately queryable from Athena.
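The submission step described above can be sketched in Python. This is a minimal illustration, not a production client: the database name, table name, and S3 results bucket are hypothetical, and the call to boto3's `start_query_execution` is wrapped in a function that is defined but not invoked here, since it needs AWS credentials.

```python
def build_query_request(sql, database, output_location, workgroup="primary"):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The database name is resolved against the Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
        "WorkGroup": workgroup,
    }


def submit_query(request):
    """Submit the request; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


request = build_query_request(
    sql="SELECT page, COUNT(*) AS views FROM clickstream GROUP BY page",
    database="analytics_db",                         # hypothetical database
    output_location="s3://example-athena-results/",  # hypothetical bucket
)
```

Note what the request does not contain: no cluster identifier, node count, or instance type. The only infrastructure-like parameter is the workgroup, which groups queries for settings and cost tracking rather than naming provisioned capacity.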

Because the engine is serverless, "always-on availability" is a structural property: there is no cluster that can be paused or downsized, or that can fail. The same SQL that worked yesterday at 2 a.m. will work today at 2 p.m., even if no one queried in between.

Animation: Athena serverless query lifecycle

SQL query → Athena API → Glue Catalog metadata lookup → Trino worker pool → S3 columnar scans → results streamed back. Workers are released when the query ends.

Athena Engine Versions and the Trino Lineage

Athena exposes its underlying engine through versioned releases. Engine version 2 was based on PrestoDB; engine version 3 (the current default for new workgroups) is based on Trino and ships with Apache Iceberg native support, improved geospatial functions, EXPLAIN ANALYZE, and a cost-based optimizer that consumes Glue table statistics.

Lineage            Origin                                 Status in Athena
Presto (original)  Facebook, 2012                         Renamed PrestoDB; ancestor of engine v2
PrestoSQL          2018 fork by original Presto creators  Renamed Trino in 2020
Trino              Active open-source project             Powers Athena engine v3 and EMR Trino

Trino's distinguishing features are its massively parallel execution, its connector architecture (which we exploit for federation later), and its memory-resident intermediate results. Unlike Spark, which writes shuffle data to disk, Trino keeps query state in memory across worker nodes. That design makes interactive sub-second responses possible on terabyte-scale data, but it also means a multi-hour ETL job whose intermediate state exceeds cluster memory fails outright instead of finishing slowly. Trino is the tool for interactive analytics; Spark is the tool for batch transformation.
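The memory trade-off can be illustrated with a toy model. The sketch below is not Trino's actual memory manager; `max_groups` is a hypothetical stand-in for the cluster's memory budget, and the aggregation simply refuses to grow its in-memory state past that budget instead of spilling to disk.

```python
class MemoryLimitExceeded(Exception):
    """Raised when the toy engine's in-memory state outgrows its budget."""


def aggregate_in_memory(rows, max_groups):
    """Count rows per key, holding all state in a dict (no spill to disk)."""
    counts = {}
    for key in rows:
        if key not in counts and len(counts) >= max_groups:
            raise MemoryLimitExceeded(
                f"state needs more than {max_groups} groups; query fails"
            )
        counts[key] = counts.get(key, 0) + 1
    return counts


# Small interactive-style query: the state fits in memory and returns.
small = aggregate_in_memory(["a", "b", "a", "c"], max_groups=10)

# Large batch-style job: the state outgrows the budget, so the query fails.
try:
    aggregate_in_memory(range(1000), max_groups=100)
    large_failed = False
except MemoryLimitExceeded:
    large_failed = True
```

An engine that shuffles through disk would trade speed for the ability to finish the second job, which is the sense in which the two tools suit different workloads.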

Pricing Per Terabyte Scanned

Athena bills $5.00 per terabyte of data scanned in S3 (US regions, on-demand pricing), with a minimum of 10 MB billed per query. There is no charge for query planning, no cluster-hour charge, and no data-stored charge (S3 storage is billed separately).

This pricing model inverts the optimization mindset most engineers bring from warehouses. In Redshift or Snowflake, you optimize for wall-clock time — a faster query frees up cluster capacity. In Athena, you optimize for bytes read from S3. A query that scans 10 GB in 45 seconds costs the same as one that scans 10 GB in 5 seconds.
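The billing rule can be written as a small function. This is a sketch under stated assumptions: it uses the $5.00/TB rate and 10 MB minimum quoted above, treats a terabyte as 10^12 bytes, and ignores any finer-grained rounding the service may apply. The function name and constants are illustrative, not part of any AWS SDK.

```python
PRICE_PER_TB_USD = 5.00         # on-demand rate quoted in the text
BYTES_PER_TB = 10**12           # assumption: decimal terabytes
MIN_BILLED_BYTES = 10 * 10**6   # 10 MB minimum billed per query


def query_cost_usd(bytes_scanned):
    """Estimate the on-demand charge for one query from bytes scanned."""
    billed_bytes = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


ten_gb_cost = query_cost_usd(10 * 10**9)  # a query that scans 10 GB
tiny_cost = query_cost_usd(1)             # billed as if it scanned 10 MB
```

The function deliberately takes no runtime argument: under this model, a 45-second query and a 5-second query over the same 10 GB produce the same charge.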

Concrete example: a daily report scans a 1 TB CSV clickstream table. At $5/TB, that costs $5/day or roughly $1,825/year. Convert to Snappy-compressed Parquet partitioned by event_date, and a single-day query scans about 800 MB, which is roughly $0.004 per run. The annual cost drops below $1.50.
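The arithmetic in this example, and the effect of partition pruning behind it, can be reproduced with a short simulation. The layout here is hypothetical: 365 Hive-style `event_date=` prefixes of about 800 MB each, with a terabyte taken as 10^12 bytes. No real S3 objects are read; the dictionary stands in for the partition listing a catalog would provide.

```python
from datetime import date, timedelta

PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def build_partitions(start, days, bytes_per_day):
    """Map Hive-style partition prefixes to their size in bytes."""
    return {
        f"event_date={start + timedelta(days=i)}/": bytes_per_day
        for i in range(days)
    }


def bytes_scanned(partitions, event_date=None):
    """Sum the bytes a query reads; a date filter prunes other partitions."""
    if event_date is None:
        return sum(partitions.values())
    return partitions.get(f"event_date={event_date}/", 0)


def annual_cost_usd(bytes_per_run, runs_per_year=365):
    """Yearly cost of a report that runs once a day."""
    return bytes_per_run / BYTES_PER_TB * PRICE_PER_TB_USD * runs_per_year


# Hypothetical Parquet layout: one ~800 MB partition per day.
parquet = build_partitions(date(2024, 1, 1), 365, 800 * 10**6)
one_day = bytes_scanned(parquet, date(2024, 3, 15))

csv_annual = annual_cost_usd(10**12)       # full 1 TB CSV scan every day
parquet_annual = annual_cost_usd(one_day)  # single-partition scan every day
```

Calling `bytes_scanned(parquet)` without a date filter reads every partition, which shows why the filter on the partition column matters as much as the file format.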

Figure 8.1: Athena Serverless Trino Execution Model

Key Takeaways

Athena is a serverless Trino deployment that queries S3 directly via Glue Data Catalog metadata.
Engine v3 is Trino-based, adding Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
Trino is memory-resident and built for interactive queries; Spark remains the tool for batch ETL.
Pricing is $5/TB scanned — data layout, not query complexity, is the dominant cost driver.

Post-Reading Quiz — Athena Fundamentals