Chapter 8: Interactive Querying and Federated Analytics
Learning Objectives
Use Amazon Athena to run SQL queries against S3 data without provisioning servers.
Optimize Athena query cost and performance through partitioning, columnar formats, compression, bucketing, and CTAS rewrites.
Build federated queries that span S3, Amazon RDS, DynamoDB, and external systems such as Snowflake and BigQuery.
Compare Trino/Presto-based engines (Athena, EMR Trino) with persistent warehouses (Redshift) and decide when each is the right tool.
Partition 1: Athena Fundamentals
Pre-Reading Quiz — Athena Fundamentals
1. A team converts a 1 TB CSV clickstream table to Snappy-compressed Parquet partitioned by event_date, and a single-day query then scans about 800 MB. Which design principle of Athena does this best illustrate?
A. Athena's optimization mindset rewards reducing wall-clock time, not bytes.
B. Athena bills per terabyte scanned, so data layout (not query complexity) is the dominant cost driver.
C. Parquet eliminates the need for partitioning because columnar files always scan less data.
D. Compression alone explains the cost reduction; partitioning is a secondary factor.
2. Why is Athena described as having "always-on availability" as a structural property rather than a marketing claim?
A. AWS replicates the cluster across three Availability Zones.
B. There is no dedicated cluster that can be paused, downsized, or fail; workers are allocated per query from a shared pool.
C. Athena keeps a hot standby cluster on every region.
D. Athena uses Redshift Serverless under the hood, which auto-pauses and resumes.
3. What is the relationship between Athena engine version 3 and Trino?
A. Engine v3 is built on Trino and adds Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
B. Engine v3 is built on PrestoDB; Trino is only used in EMR.
C. Trino was renamed Athena in 2020; engine v3 simply rebrands the v2 engine.
D. Trino is a serialization format used by Athena to ship results to S3.
4. Why is Trino the right execution model for interactive Athena queries but a poor fit for multi-hour ETL?
A. Trino writes shuffle data to disk like Spark, making it slow on small queries.
B. Trino keeps intermediate results in memory across worker nodes, enabling sub-second responses but failing on jobs that exceed cluster memory.
C. Trino can only read CSV, so it cannot handle Parquet ETL output.
D. Trino is single-threaded per query, which limits both interactive and batch scenarios.
5. Where does Athena obtain table definitions, column types, and partition locations?
A. From inline DDL parameters submitted with each query.
B. From the AWS Glue Data Catalog, which is shared with Glue ETL and Lake Formation.
C. From a private metastore inside each Trino worker, separate from Glue.
D. From a Redshift system catalog accessed via Spectrum.
Serverless Query Model
Most data warehouses follow a "load before you query" model: stand up a cluster, ingest data, then run SQL. Amazon Athena inverts that contract. The data already lives in Amazon S3; you point Athena at it and run SQL immediately. There are no nodes to size, no maintenance windows, and no idle capacity to amortize.
The clearest analogy is a public library with a fleet of on-call researchers. You arrive with a question, hand it to the front desk (the Athena API), and behind the scenes a researcher pulls exactly the books your question requires. When the answer comes back, the researcher disappears. You pay only for the books they had to open.
Architecturally, Athena is a managed deployment of the Trino distributed SQL engine (renamed from PrestoSQL in 2020). When you submit a query, Athena allocates worker capacity from a shared, multi-tenant pool, plans the query, scans S3 objects in parallel, and streams results back. Metadata — table definitions, column types, partition locations — comes from the AWS Glue Data Catalog, so any table registered for Glue ETL or Lake Formation governance is immediately queryable from Athena.
Because the engine is serverless, "always-on availability" is a structural property: there is no cluster that can be paused, downsized, or fail. The same SQL that worked yesterday at 2 a.m. will work today at 2 p.m., even if no one queried in between.
Animation: Athena serverless query lifecycle
SQL query → Athena API → Glue Catalog metadata lookup → Trino worker pool → S3 columnar scans → results streamed back. Workers are released when the query ends.
Athena Engine Versions and the Trino Lineage
Athena exposes its underlying engine through versioned releases. Engine version 2 was based on PrestoDB; engine version 3 (the current default for new workgroups) is based on Trino and ships with Apache Iceberg native support, improved geospatial functions, EXPLAIN ANALYZE, and a cost-based optimizer that consumes Glue table statistics.
Lineage
Origin
Status in Athena
Presto (original)
Facebook, 2012
Renamed PrestoDB; ancestor of engine v2
PrestoSQL
2018 fork by original Presto creators
Renamed Trino in 2020
Trino
Active open-source project
Powers Athena engine v3 and EMR Trino
Trino's distinguishing features are its massively parallel execution, its connector architecture (which we exploit for federation later), and its memory-resident intermediate results. Unlike Spark, which writes shuffle data to disk, Trino keeps query state in memory across worker nodes — making interactive sub-second responses possible on terabyte-scale data, but causing Trino to fall over on multi-hour ETL jobs that exceed cluster memory. Trino is the tool for interactive analytics; Spark is the tool for batch transformation.
Pricing Per Terabyte Scanned
Athena bills $5.00 per terabyte of data scanned in S3 (US regions, on-demand pricing), with a 10 MB minimum charge per query. There is no charge for query planning, no cluster-hour charge, and no data-stored charge (S3 storage is billed separately).
This pricing model inverts the optimization mindset most engineers bring from warehouses. In Redshift or Snowflake, you optimize for wall-clock time — a faster query frees up cluster capacity. In Athena, you optimize for bytes read from S3. A query that scans 10 GB in 45 seconds costs the same as one that scans 10 GB in 5 seconds.
Concrete example: a daily report scans a 1 TB CSV clickstream table. At $5/TB, that costs $5/day or roughly $1,825/year. Convert to Snappy-compressed Parquet partitioned by event_date, and a single-day query scans about 800 MB. The annual cost drops below $1.50.
Figure 8.1: Athena Serverless Trino Execution Model
flowchart LR
User[SQL Client / Console] -->|Submit query| API[Athena API]
API -->|Lookup table + partitions| Glue[(AWS Glue Data Catalog)]
API -->|Plan + dispatch| Pool[Shared Trino Worker Pool]
Pool -->|Parallel scans| S3[(Amazon S3 objects)]
S3 -->|Columnar reads| Pool
Pool -->|Stream results| API
API -->|Result set| User
Pool -.->|Workers released after query| Pool
Key Takeaways
Athena is a serverless Trino deployment that queries S3 directly via Glue Data Catalog metadata.
Engine v3 is Trino-based, adding Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
Trino is memory-resident and built for interactive queries; Spark remains the tool for batch ETL.
Pricing is $5/TB scanned — data layout, not query complexity, is the dominant cost driver.
Post-Reading Quiz — Athena Fundamentals
1. A team converts a 1 TB CSV clickstream table to Snappy-compressed Parquet partitioned by event_date, and a single-day query then scans about 800 MB. Which design principle of Athena does this best illustrate?
A. Athena's optimization mindset rewards reducing wall-clock time, not bytes.
B. Athena bills per terabyte scanned, so data layout (not query complexity) is the dominant cost driver.
C. Parquet eliminates the need for partitioning because columnar files always scan less data.
D. Compression alone explains the cost reduction; partitioning is a secondary factor.
2. Why is Athena described as having "always-on availability" as a structural property rather than a marketing claim?
A. AWS replicates the cluster across three Availability Zones.
B. There is no dedicated cluster that can be paused, downsized, or fail; workers are allocated per query from a shared pool.
C. Athena keeps a hot standby cluster on every region.
D. Athena uses Redshift Serverless under the hood, which auto-pauses and resumes.
3. What is the relationship between Athena engine version 3 and Trino?
A. Engine v3 is built on Trino and adds Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
B. Engine v3 is built on PrestoDB; Trino is only used in EMR.
C. Trino was renamed Athena in 2020; engine v3 simply rebrands the v2 engine.
D. Trino is a serialization format used by Athena to ship results to S3.
4. Why is Trino the right execution model for interactive Athena queries but a poor fit for multi-hour ETL?
A. Trino writes shuffle data to disk like Spark, making it slow on small queries.
B. Trino keeps intermediate results in memory across worker nodes, enabling sub-second responses but failing on jobs that exceed cluster memory.
C. Trino can only read CSV, so it cannot handle Parquet ETL output.
D. Trino is single-threaded per query, which limits both interactive and batch scenarios.
5. Where does Athena obtain table definitions, column types, and partition locations?
A. From inline DDL parameters submitted with each query.
B. From the AWS Glue Data Catalog, which is shared with Glue ETL and Lake Formation.
C. From a private metastore inside each Trino worker, separate from Glue.
D. From a Redshift system catalog accessed via Spectrum.
Partition 2: Athena Query Optimization
Pre-Reading Quiz — Athena Query Optimization
1. What does partition projection change about how Athena resolves which S3 prefixes to scan?
A. It forces Athena to crawl all S3 objects on every query.
B. It synthesizes the partition list on the fly from declared range patterns, avoiding a Glue Catalog list-partitions call.
C. It replaces partitioning with bucketing for time-series data.
D. It caches partitions inside Trino workers' local SSDs.
2. A query runs SELECT amount FROM sales WHERE amount > 1000 against a Parquet table. Which of the following best explains why Athena reads dramatically less than the full table even before partition pruning?
A. Parquet automatically pre-aggregates all numeric columns to skip filtering.
B. Parquet stores values column-by-column and embeds min/max statistics per row group, so Trino reads only the amount column and skips row groups whose max is ≤ 1000.
C. Parquet shards each column into a separate S3 object that is independently encrypted.
D. Parquet rewrites the query into a Glue Data Catalog stored procedure.
3. Why is the recommended Parquet object size 128–256 MB rather than dramatically smaller?
A. Athena rejects Parquet files smaller than 128 MB.
B. S3 charges higher fees for sub-128 MB GET requests.
C. Files smaller than 128 MB hurt Trino parallelism — the fixed S3 GET overhead per file dominates wall-clock time when there are thousands of tiny files.
D. Snappy compression cannot operate on files below 128 MB.
4. What is the primary benefit of bucketing a table by a high-cardinality join column like customer_id?
A. Bucketing replaces partitioning entirely, simplifying the table layout.
B. Bucketing enables co-located joins — bucket N of one table joins only against bucket N of the other, dramatically reducing shuffle traffic.
C. Bucketing reduces S3 storage cost by deduplicating identical customer records.
D. Bucketing forces Athena to evaluate every bucket sequentially, which lowers concurrency.
5. What is a structural constraint of a single CTAS statement in Athena that pushes teams toward INSERT INTO for ongoing loads?
A. CTAS cannot specify a target file format.
B. CTAS produces at most 100 partitions, and the destination S3 location must be empty.
C. CTAS must run during nightly maintenance windows.
D. CTAS automatically deletes the source table after writing.
Once you internalize "minimize bytes scanned," the optimization toolkit organizes itself naturally. There are three layers to attack: what you read (partitioning), how you read it (columnar formats), and what you compute on it once read (CTAS for materialization).
Partition Projection
Partitioning arranges files in S3 under a directory hierarchy that encodes a column value in the path. The Hive convention uses key=value segments:
When you submit SELECT * FROM sales WHERE year=2024 AND month=11 AND day=07, Athena consults the Glue Catalog for partition locations, identifies that only one prefix matches, and instructs Trino workers to read only that prefix. On a table with 365 day-partitions, a single-day query touches roughly 1/365th of the data — a 99.7% reduction in bytes scanned.
The traditional Hive approach requires MSCK REPAIR TABLE or a Glue crawler whenever new partitions appear. Partition projection takes a different path: instead of materializing every partition in the catalog, you declare the partition pattern in table properties, and Athena synthesizes the partition list on the fly from the WHERE clause.
An analogy: traditional Hive partitioning is a library where every book is individually catalogued by hand. Partition projection is a library where the librarian knows the shelving rule and computes where any book lives without consulting an index card.
Animation: Partition projection prunes the search to a single day
A predicate on year/month/day fades 83 of 84 partitions and lights up the single surviving prefix — partition projection synthesizes that path without a Glue listing call.
Columnar Formats and Compression
Once partitioning has narrowed the file set, the next question is how each file is laid out internally. Row-based formats (CSV, JSON, Avro) store all columns of one row together; column-based formats (Parquet, ORC) store all values of one column together.
Format
Layout
Athena cost on SELECT amount FROM sales
CSV
Row-based, no compression
Reads 100% of file
JSON
Row-based, often verbose
Reads 100% + parsing overhead
Avro
Row-based, schema-aware, compressed
Reads 100% but smaller bytes
Parquet
Columnar, Snappy/ZSTD compressed
Reads only the amount column
ORC
Columnar, Snappy/Zlib compressed
Reads only the amount column
Parquet also embeds min/max statistics per row group (~128 MB chunks). When your WHERE clause says amount > 1000, Trino can skip entire row groups whose maximum amount is below 1000, without reading any actual values. This is predicate pushdown at the file level, stacked on top of partition pruning.
The recommended file size is 128–256 MB per Parquet object, with internal row groups of 64–128 MB. Files smaller than 128 MB hurt parallelism — each worker pays a fixed S3 GET overhead per file, and with thousands of tiny files that overhead dominates wall-clock time (the "tail latency" problem). Files larger than 1 GB underutilize the worker fleet because individual workers cannot subdivide a file beyond row-group boundaries.
Compression: Snappy is the Parquet default (fast, modest ratio). ZSTD is 10–30% better at slightly higher CPU cost. GZIP is single-threaded per block — avoid for analytical Parquet.
CTAS and INSERT INTO Patterns
CTAS (CREATE TABLE AS SELECT) is the Swiss Army knife of Athena optimization. A CTAS reads from a source, applies any transformation (projection, filter, partitioning, formatting, bucketing), and writes a new table in a single operation:
CREATE TABLE optimized_sales
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['year', 'region'],
bucketed_by = ARRAY['customer_id'],
bucket_count = 32,
external_location = 's3://my-lake/curated/sales/'
) AS
SELECT
sale_id,
amount,
customer_id,
year(sale_date) AS year,
region
FROM raw_sales
WHERE sale_date >= DATE '2020-01-01';
Bucketing hashes a chosen column into a fixed number of files per partition. When two bucketed tables are joined on the bucket column, Trino performs a co-located join — bucket 7 of sales joins only against bucket 7 of customers — dramatically reducing shuffle traffic.
CTAS has two important constraints: the destination S3 location must be empty (Athena refuses to overwrite), and a single CTAS produces at most 100 partitions. For larger datasets you use INSERT INTO ... SELECT to append additional partitions:
INSERT INTO optimized_sales
SELECT sale_id, amount, customer_id, year(sale_date) AS year, region
FROM raw_sales
WHERE sale_date >= DATE '2024-01-01' AND sale_date < DATE '2024-02-01';
A common production pattern combines CTAS for backfills with scheduled INSERT INTO for daily increments. Many teams have replaced this with Apache Iceberg tables (Athena engine 3+), which provide atomic commits, hidden partitioning, and time travel.
Figure 8.2: Hive Partition Lookup vs Partition Projection
flowchart TD
Q[Query: WHERE year=2024 AND month=11 AND day=07]
Q --> Mode{Partition mode?}
Mode -->|Hive partitioning| H1[Glue Catalog list partitions API]
H1 --> H2[Materialize all registered partitions]
H2 --> H3[Filter against predicate]
H3 --> Scan[Issue S3 GET on matching prefix]
Mode -->|Partition projection| P1[Read TBLPROPERTIES projection.* ranges]
P1 --> P2[Compute partition path from storage.location.template]
P2 --> Scan
Scan --> Result[(Parquet row groups)]
Key Takeaways
Partitioning prunes the file set before reading; partition projection synthesizes paths without Glue list calls — ideal for high-cardinality time-series.
Aim for 128–256 MB Parquet files with Snappy or ZSTD compression to balance Trino parallelism and S3 GET overhead.
CTAS materializes optimized tables in a single statement; bucketing on join columns enables co-located joins; INSERT INTO (or Iceberg) handles incremental loads.
Post-Reading Quiz — Athena Query Optimization
1. What does partition projection change about how Athena resolves which S3 prefixes to scan?
A. It forces Athena to crawl all S3 objects on every query.
B. It synthesizes the partition list on the fly from declared range patterns, avoiding a Glue Catalog list-partitions call.
C. It replaces partitioning with bucketing for time-series data.
D. It caches partitions inside Trino workers' local SSDs.
2. A query runs SELECT amount FROM sales WHERE amount > 1000 against a Parquet table. Which of the following best explains why Athena reads dramatically less than the full table even before partition pruning?
A. Parquet automatically pre-aggregates all numeric columns to skip filtering.
B. Parquet stores values column-by-column and embeds min/max statistics per row group, so Trino reads only the amount column and skips row groups whose max is ≤ 1000.
C. Parquet shards each column into a separate S3 object that is independently encrypted.
D. Parquet rewrites the query into a Glue Data Catalog stored procedure.
3. Why is the recommended Parquet object size 128–256 MB rather than dramatically smaller?
A. Athena rejects Parquet files smaller than 128 MB.
B. S3 charges higher fees for sub-128 MB GET requests.
C. Files smaller than 128 MB hurt Trino parallelism — the fixed S3 GET overhead per file dominates wall-clock time when there are thousands of tiny files.
D. Snappy compression cannot operate on files below 128 MB.
4. What is the primary benefit of bucketing a table by a high-cardinality join column like customer_id?
A. Bucketing replaces partitioning entirely, simplifying the table layout.
B. Bucketing enables co-located joins — bucket N of one table joins only against bucket N of the other, dramatically reducing shuffle traffic.
C. Bucketing reduces S3 storage cost by deduplicating identical customer records.
D. Bucketing forces Athena to evaluate every bucket sequentially, which lowers concurrency.
5. What is a structural constraint of a single CTAS statement in Athena that pushes teams toward INSERT INTO for ongoing loads?
A. CTAS cannot specify a target file format.
B. CTAS produces at most 100 partitions, and the destination S3 location must be empty.
C. CTAS must run during nightly maintenance windows.
D. CTAS automatically deletes the source table after writing.
1. In Athena Federated Query, what is the role of an AWS Lambda function and why is it architecturally significant?
A. The Lambda is an authentication broker; it never touches data.
B. The Lambda is the data-source connector that translates Trino reads into native API calls and returns Apache Arrow blocks, letting Athena treat external data uniformly.
C. The Lambda mounts the external database as a local block device.
D. The Lambda is only used to decompress Parquet payloads from external sources.
2. A federated query runs SELECT user_id, email FROM lambda_dynamo.users WHERE last_login > '2024-01-01' where last_login is not a partition or sort key. What happens, and why is it expensive?
A. DynamoDB satisfies it as a GetItem; cost is single-digit millisecond latency.
B. The connector degenerates the query to a full DynamoDB Scan because the predicate is on a non-key attribute, so DynamoDB returns the whole table for Trino to post-filter.
C. The query fails with a syntax error because Athena requires partition keys in WHERE.
D. The query is silently rejected by Lake Formation row-level policies.
3. For an organization running ten ad-hoc Athena queries per day — each scanning ~5 GB of well-partitioned Parquet — versus a small idle Redshift provisioned cluster, which engine wins on cost and why?
A. Redshift wins because clusters are always cheaper than per-query pricing.
B. Athena wins by orders of magnitude (~$1/year vs ~$1,750/month) because Athena charges nothing when idle and per-TB-scanned is tiny on small workloads.
C. Both cost the same because AWS sets a flat data-platform price.
D. EMR is cheaper than either for ad-hoc SQL.
4. Which combination of properties best captures when Redshift is the right engine over Athena and EMR?
A. Sporadic SQL workloads where idle cost must be zero and query patterns are unpredictable.
B. Non-SQL big-data processing in Spark, Flink, or Hive frameworks.
C. Predictable, high-concurrency BI dashboards with sub-second latency expectations and persistent compute.
D. Multi-hour ETL transforming raw event logs into Parquet.
5. Why do mature data platforms run Athena, Redshift, and EMR side by side rather than picking one?
A. AWS contracts require multi-engine deployments above a usage tier.
B. Each engine is optimized for a different workload type, and they unify on a shared Glue Data Catalog (with Lake Formation governance), often using Iceberg on S3 to give every engine warehouse-quality semantics.
C. Athena, Redshift, and EMR are simply rebrandings of the same engine and load-balance behind the scenes.
D. Glue Data Catalog can only be queried when all three engines are concurrently provisioned.
Athena Federated Query Connectors
Up to this point, every example has assumed data lives in S3. In practice, organizations have customer profiles in DynamoDB, transactional records in RDS, financial metrics in Snowflake, and event logs in S3 — and the analytical question that matters often spans all of them. Athena Federated Query is the bridge.
Federated Query introduces a catalog that points not at the Glue Data Catalog but at an AWS Lambda function. That Lambda function is the data source connector — the runtime intermediary between Athena and the external system.
SQL Query
-> Athena Trino engine
-> Federated Query Handler
-> Lambda Data Source Connector (per source)
-> Native API call (DynamoDB GetItem, RDS SQL, Redshift, etc.)
-> Apache Arrow result blocks
-> Athena query plan continues (joins, aggregates)
-> Final result to user
Every connector implements four required handler interfaces:
Handler
Responsibility
Analogy
MetadataHandler
Lists schemas, tables, columns
A library catalog
GetTableHandler
Returns table schema for a specific table
A book's table of contents
GetSplitsHandler
Divides data into chunks for parallel reads
Assigning aisles to multiple researchers
ReadRecordsHandler
Streams actual rows back as Apache Arrow
The researcher reading the book aloud
The Apache Arrow detail is more than a technicality. Arrow is a columnar in-memory format that allows Trino, Lambda, and external systems to share data without serialization/deserialization overhead. When the connector returns Arrow blocks (up to 64 MB each), Trino consumes them directly, treating remote data with the same primitives it uses for native S3 Parquet.
Animation: Federated query flow — Athena to external source via Lambda
SQL flows Athena → Lambda → external source; Apache Arrow blocks stream back so Trino can join them with S3 data without serialization overhead.
Querying RDS, DynamoDB, and External Sources
Each connector maps Athena's query operations to the source system's native primitives. The mapping matters because it determines what gets pushed down (good — less data moves) versus pulled up to Trino for filtering (bad — more data moves).
DynamoDB is the trickiest because it is not a relational engine. WHERE user_id = 'u-123' becomes a DynamoDB GetItem — single-digit-millisecond latency. Partition-key equality becomes a Query. But WHERE last_login > '2024-01-01' degenerates into a full-table Scan — slow and expensive. Treat DynamoDB like an OLTP key-value store; treating it like a warehouse is a path to surprise bills.
-- Efficient: pushes down to GetItem
SELECT user_id, email, last_login
FROM lambda_dynamo.production.user_profiles
WHERE user_id = 'user_12345';
-- Expensive: degenerates to a full Scan
SELECT user_id, email
FROM lambda_dynamo.production.user_profiles
WHERE last_login > TIMESTAMP '2024-01-01 00:00:00';
RDS connectors are more conventional — the Lambda receives the WHERE clause and column list and constructs a real SQL query against the database. Pitfalls are operational: Lambda must share the VPC with RDS, credentials live in Secrets Manager, and concurrent queries can saturate connection limits. Route federated reads to a read replica when possible.
SELECT
c.customer_id,
c.name,
c.tier,
s.total_revenue_2024
FROM lambda_rds.prod.customers c
JOIN s3_lake.curated.revenue_summary s
ON c.customer_id = s.customer_id
WHERE c.created_date > DATE '2024-01-01'
AND c.tier IN ('gold', 'platinum');
Connectors to Snowflake and BigQuery
The connector model extends naturally to non-AWS warehouses. AWS publishes a Snowflake connector (JDBC) and a BigQuery connector (BigQuery client libraries). Both push down filters and projections; both bill on two axes (Athena scanned bytes plus the source warehouse's own pricing).
Latency: pure S3 queries return in 2–5 s; federated DynamoDB in 3–8 s; RDS or warehouse queries in 5–15 s; cross-source joins in 10–30 s. Lambda's 15-minute hard limit caps long-running reads, and the 64 MB Arrow block size can fragment very large result sets. Federation is excellent for joining moderate volumes; it is not a replacement for nightly terabyte ETL. The right pattern is exploration via federation, then materialization via CTAS for production.
When Athena vs Redshift vs EMR
The fundamental difference is whether your workload is interactive or persistent.
Athena — ad-hoc analysis. Data scientist exploring a new dataset, SRE searching CloudTrail logs, analyst answering a one-time CFO question. No idle cluster.
EMR — non-SQL big-data processing. Spark for ML feature engineering, Flink for streaming, Hive for legacy batch. Heavy ETL that produces curated tables Athena and Redshift query.
Three workshops in a factory: Athena is the bench for quick repairs as customers walk in; Redshift is the assembly line running the same product day after day; EMR is the heavy machine shop fabricating custom components.
Cost models compared
Service
Primary cost driver
Idle cost
Athena
$5/TB scanned (US)
Zero
Redshift Serverless
RPU-hours (~$0.36/RPU-hour)
Zero (after auto-pause)
Redshift Provisioned
Node-hours (24/7)
Full cluster cost
EMR Provisioned
EC2 + EMR fee per instance-hour
Full cluster cost
EMR Serverless
Worker resource-seconds
Zero
Ten ad-hoc Athena queries/day at 5 GB each ≈ $1/year. The same workload on a small Redshift provisioned cluster costs roughly $1,750/month even when idle. Conversely, 200 high-concurrency BI queries per minute on a 10 TB warehouse: Redshift wins; Athena would bill aggressively and likely throttle.
Figure 8.4: Decision Flow for Athena vs Redshift vs EMR
In practice, mature platforms use all three engines, each for the workload it is optimized for, unified by the Glue Data Catalog and Lake Formation governance.
Raw landing (S3, JSON/CSV)
|
| EMR Spark (heavy transformation)
v
Curated lake (S3, Parquet/Iceberg, Glue Catalog)
|
+--------+--------+
| |
v v
Athena (ad-hoc) Redshift Spectrum + RA3 (BI marts)
| |
v v
Data scientists BI dashboards
The Glue Data Catalog is the glue (literally): a table registered in Glue is queryable from Athena, accessible via Redshift Spectrum, and visible to EMR Spark — without redefinition. Lake Formation layers row- and column-level access control over the catalog so that a marketing analyst querying via Athena and a data engineer querying via Spark see the same governed view.
The data lakehouse pattern captures this in a single phrase: warehouse-quality structure (ACID, schema evolution, time travel) on lake-economics storage (S3, no duplication, multi-engine access). Athena engine v3's native Iceberg support makes this practical without third-party tooling.
Key Takeaways
Federation routes Athena reads through Lambda connectors that translate to native source APIs and return Apache Arrow — making Athena polyglot SQL.
Predicate pushdown is the dividing line between cheap and expensive federated queries; for DynamoDB use partition keys, for RDS use indexed columns.
Athena dominates sporadic/exploratory workloads (zero idle cost); Redshift dominates predictable, high-concurrency BI; EMR is chosen for non-SQL frameworks (Spark/Flink/Hive).
Mature platforms run all three side by side, unified by Glue Catalog + Lake Formation, with Iceberg on S3 as the data lakehouse lingua franca.
1. In Athena Federated Query, what is the role of an AWS Lambda function and why is it architecturally significant?
A. The Lambda is an authentication broker; it never touches data.
B. The Lambda is the data-source connector that translates Trino reads into native API calls and returns Apache Arrow blocks, letting Athena treat external data uniformly.
C. The Lambda mounts the external database as a local block device.
D. The Lambda is only used to decompress Parquet payloads from external sources.
2. A federated query runs SELECT user_id, email FROM lambda_dynamo.users WHERE last_login > '2024-01-01' where last_login is not a partition or sort key. What happens, and why is it expensive?
A. DynamoDB satisfies it as a GetItem; cost is single-digit millisecond latency.
B. The connector degenerates the query to a full DynamoDB Scan because the predicate is on a non-key attribute, so DynamoDB returns the whole table for Trino to post-filter.
C. The query fails with a syntax error because Athena requires partition keys in WHERE.
D. The query is silently rejected by Lake Formation row-level policies.
3. For an organization running ten ad-hoc Athena queries per day — each scanning ~5 GB of well-partitioned Parquet — versus a small idle Redshift provisioned cluster, which engine wins on cost and why?
A. Redshift wins because clusters are always cheaper than per-query pricing.
B. Athena wins by orders of magnitude (~$1/year vs ~$1,750/month) because Athena charges nothing when idle and per-TB-scanned is tiny on small workloads.
C. Both cost the same because AWS sets a flat data-platform price.
D. EMR is cheaper than either for ad-hoc SQL.
4. Which combination of properties best captures when Redshift is the right engine over Athena and EMR?
A. Sporadic SQL workloads where idle cost must be zero and query patterns are unpredictable.
B. Non-SQL big-data processing in Spark, Flink, or Hive frameworks.
C. Predictable, high-concurrency BI dashboards with sub-second latency expectations and persistent compute.
D. Multi-hour ETL transforming raw event logs into Parquet.
5. Why do mature data platforms run Athena, Redshift, and EMR side by side rather than picking one?
A. AWS contracts require multi-engine deployments above a usage tier.
B. Each engine is optimized for a different workload type, and they unify on a shared Glue Data Catalog (with Lake Formation governance), often using Iceberg on S3 to give every engine warehouse-quality semantics.
C. Athena, Redshift, and EMR are simply rebrandings of the same engine and load-balance behind the scenes.
D. Glue Data Catalog can only be queried when all three engines are concurrently provisioned.