Chapter 8: Interactive Querying and Federated Analytics

Learning Objectives

Partition 1: Athena Fundamentals

Pre-Reading Quiz — Athena Fundamentals

1. A team converts a 1 TB CSV clickstream table to Snappy-compressed Parquet partitioned by event_date, and a single-day query then scans about 800 MB. Which design principle of Athena does this best illustrate?

A. Athena's optimization mindset rewards reducing wall-clock time, not bytes.
B. Athena bills per terabyte scanned, so data layout (not query complexity) is the dominant cost driver.
C. Parquet eliminates the need for partitioning because columnar files always scan less data.
D. Compression alone explains the cost reduction; partitioning is a secondary factor.

2. Why is Athena described as having "always-on availability" as a structural property rather than a marketing claim?

A. AWS replicates the cluster across three Availability Zones.
B. There is no dedicated cluster that can be paused, downsized, or fail; workers are allocated per query from a shared pool.
C. Athena keeps a hot standby cluster on every region.
D. Athena uses Redshift Serverless under the hood, which auto-pauses and resumes.

3. What is the relationship between Athena engine version 3 and Trino?

A. Engine v3 is built on Trino and adds Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
B. Engine v3 is built on PrestoDB; Trino is only used in EMR.
C. Trino was renamed Athena in 2020; engine v3 simply rebrands the v2 engine.
D. Trino is a serialization format used by Athena to ship results to S3.

4. Why is Trino the right execution model for interactive Athena queries but a poor fit for multi-hour ETL?

A. Trino writes shuffle data to disk like Spark, making it slow on small queries.
B. Trino keeps intermediate results in memory across worker nodes, enabling sub-second responses but failing on jobs that exceed cluster memory.
C. Trino can only read CSV, so it cannot handle Parquet ETL output.
D. Trino is single-threaded per query, which limits both interactive and batch scenarios.

5. Where does Athena obtain table definitions, column types, and partition locations?

A. From inline DDL parameters submitted with each query.
B. From the AWS Glue Data Catalog, which is shared with Glue ETL and Lake Formation.
C. From a private metastore inside each Trino worker, separate from Glue.
D. From a Redshift system catalog accessed via Spectrum.

Serverless Query Model

Most data warehouses follow a "load before you query" model: stand up a cluster, ingest data, then run SQL. Amazon Athena inverts that contract. The data already lives in Amazon S3; you point Athena at it and run SQL immediately. There are no nodes to size, no maintenance windows, and no idle capacity to amortize.

The clearest analogy is a public library with a fleet of on-call researchers. You arrive with a question, hand it to the front desk (the Athena API), and behind the scenes a researcher pulls exactly the books your question requires. When the answer comes back, the researcher disappears. You pay only for the books they had to open.

Architecturally, Athena is a managed deployment of the Trino distributed SQL engine (renamed from PrestoSQL in 2020). When you submit a query, Athena allocates worker capacity from a shared, multi-tenant pool, plans the query, scans S3 objects in parallel, and streams results back. Metadata — table definitions, column types, partition locations — comes from the AWS Glue Data Catalog, so any table registered for Glue ETL or Lake Formation governance is immediately queryable from Athena.

Because the engine is serverless, "always-on availability" is a structural property: there is no cluster that can be paused, downsized, or fail. The same SQL that worked yesterday at 2 a.m. will work today at 2 p.m., even if no one queried in between.

Animation: Athena serverless query lifecycle
SQL Client SELECT ... Athena API plan + dispatch Glue Catalog tables + partitions Trino Worker Pool parallel scans (released after query) Amazon S3 columnar reads Stream results to client
SQL query → Athena API → Glue Catalog metadata lookup → Trino worker pool → S3 columnar scans → results streamed back. Workers are released when the query ends.

Athena Engine Versions and the Trino Lineage

Athena exposes its underlying engine through versioned releases. Engine version 2 was based on PrestoDB; engine version 3 (the current default for new workgroups) is based on Trino and ships with Apache Iceberg native support, improved geospatial functions, EXPLAIN ANALYZE, and a cost-based optimizer that consumes Glue table statistics.

LineageOriginStatus in Athena
Presto (original)Facebook, 2012Renamed PrestoDB; ancestor of engine v2
PrestoSQL2018 fork by original Presto creatorsRenamed Trino in 2020
TrinoActive open-source projectPowers Athena engine v3 and EMR Trino

Trino's distinguishing features are its massively parallel execution, its connector architecture (which we exploit for federation later), and its memory-resident intermediate results. Unlike Spark, which writes shuffle data to disk, Trino keeps query state in memory across worker nodes — making interactive sub-second responses possible on terabyte-scale data, but causing Trino to fall over on multi-hour ETL jobs that exceed cluster memory. Trino is the tool for interactive analytics; Spark is the tool for batch transformation.

Pricing Per Terabyte Scanned

Athena bills $5.00 per terabyte of data scanned in S3 (US regions, on-demand pricing), with a 10 MB minimum charge per query. There is no charge for query planning, no cluster-hour charge, and no data-stored charge (S3 storage is billed separately).

This pricing model inverts the optimization mindset most engineers bring from warehouses. In Redshift or Snowflake, you optimize for wall-clock time — a faster query frees up cluster capacity. In Athena, you optimize for bytes read from S3. A query that scans 10 GB in 45 seconds costs the same as one that scans 10 GB in 5 seconds.

Concrete example: a daily report scans a 1 TB CSV clickstream table. At $5/TB, that costs $5/day or roughly $1,825/year. Convert to Snappy-compressed Parquet partitioned by event_date, and a single-day query scans about 800 MB. The annual cost drops below $1.50.

Figure 8.1: Athena Serverless Trino Execution Model

flowchart LR User[SQL Client / Console] -->|Submit query| API[Athena API] API -->|Lookup table + partitions| Glue[(AWS Glue Data Catalog)] API -->|Plan + dispatch| Pool[Shared Trino Worker Pool] Pool -->|Parallel scans| S3[(Amazon S3 objects)] S3 -->|Columnar reads| Pool Pool -->|Stream results| API API -->|Result set| User Pool -.->|Workers released after query| Pool

Key Takeaways

Post-Reading Quiz — Athena Fundamentals

1. A team converts a 1 TB CSV clickstream table to Snappy-compressed Parquet partitioned by event_date, and a single-day query then scans about 800 MB. Which design principle of Athena does this best illustrate?

A. Athena's optimization mindset rewards reducing wall-clock time, not bytes.
B. Athena bills per terabyte scanned, so data layout (not query complexity) is the dominant cost driver.
C. Parquet eliminates the need for partitioning because columnar files always scan less data.
D. Compression alone explains the cost reduction; partitioning is a secondary factor.

2. Why is Athena described as having "always-on availability" as a structural property rather than a marketing claim?

A. AWS replicates the cluster across three Availability Zones.
B. There is no dedicated cluster that can be paused, downsized, or fail; workers are allocated per query from a shared pool.
C. Athena keeps a hot standby cluster on every region.
D. Athena uses Redshift Serverless under the hood, which auto-pauses and resumes.

3. What is the relationship between Athena engine version 3 and Trino?

A. Engine v3 is built on Trino and adds Iceberg, EXPLAIN ANALYZE, and a cost-based optimizer.
B. Engine v3 is built on PrestoDB; Trino is only used in EMR.
C. Trino was renamed Athena in 2020; engine v3 simply rebrands the v2 engine.
D. Trino is a serialization format used by Athena to ship results to S3.

4. Why is Trino the right execution model for interactive Athena queries but a poor fit for multi-hour ETL?

A. Trino writes shuffle data to disk like Spark, making it slow on small queries.
B. Trino keeps intermediate results in memory across worker nodes, enabling sub-second responses but failing on jobs that exceed cluster memory.
C. Trino can only read CSV, so it cannot handle Parquet ETL output.
D. Trino is single-threaded per query, which limits both interactive and batch scenarios.

5. Where does Athena obtain table definitions, column types, and partition locations?

A. From inline DDL parameters submitted with each query.
B. From the AWS Glue Data Catalog, which is shared with Glue ETL and Lake Formation.
C. From a private metastore inside each Trino worker, separate from Glue.
D. From a Redshift system catalog accessed via Spectrum.

Partition 2: Athena Query Optimization

Pre-Reading Quiz — Athena Query Optimization

1. What does partition projection change about how Athena resolves which S3 prefixes to scan?

A. It forces Athena to crawl all S3 objects on every query.
B. It synthesizes the partition list on the fly from declared range patterns, avoiding a Glue Catalog list-partitions call.
C. It replaces partitioning with bucketing for time-series data.
D. It caches partitions inside Trino workers' local SSDs.

2. A query runs SELECT amount FROM sales WHERE amount > 1000 against a Parquet table. Which of the following best explains why Athena reads dramatically less than the full table even before partition pruning?

A. Parquet automatically pre-aggregates all numeric columns to skip filtering.
B. Parquet stores values column-by-column and embeds min/max statistics per row group, so Trino reads only the amount column and skips row groups whose max is ≤ 1000.
C. Parquet shards each column into a separate S3 object that is independently encrypted.
D. Parquet rewrites the query into a Glue Data Catalog stored procedure.

3. Why is the recommended Parquet object size 128–256 MB rather than dramatically smaller?

A. Athena rejects Parquet files smaller than 128 MB.
B. S3 charges higher fees for sub-128 MB GET requests.
C. Files smaller than 128 MB hurt Trino parallelism — the fixed S3 GET overhead per file dominates wall-clock time when there are thousands of tiny files.
D. Snappy compression cannot operate on files below 128 MB.

4. What is the primary benefit of bucketing a table by a high-cardinality join column like customer_id?

A. Bucketing replaces partitioning entirely, simplifying the table layout.
B. Bucketing enables co-located joins — bucket N of one table joins only against bucket N of the other, dramatically reducing shuffle traffic.
C. Bucketing reduces S3 storage cost by deduplicating identical customer records.
D. Bucketing forces Athena to evaluate every bucket sequentially, which lowers concurrency.

5. What is a structural constraint of a single CTAS statement in Athena that pushes teams toward INSERT INTO for ongoing loads?

A. CTAS cannot specify a target file format.
B. CTAS produces at most 100 partitions, and the destination S3 location must be empty.
C. CTAS must run during nightly maintenance windows.
D. CTAS automatically deletes the source table after writing.

Once you internalize "minimize bytes scanned," the optimization toolkit organizes itself naturally. There are three layers to attack: what you read (partitioning), how you read it (columnar formats), and what you compute on it once read (CTAS for materialization).

Partition Projection

Partitioning arranges files in S3 under a directory hierarchy that encodes a column value in the path. The Hive convention uses key=value segments:

s3://my-lake/sales/year=2024/month=11/day=07/sales-001.parquet
s3://my-lake/sales/year=2024/month=11/day=07/sales-002.parquet
s3://my-lake/sales/year=2024/month=11/day=08/sales-001.parquet

When you submit SELECT * FROM sales WHERE year=2024 AND month=11 AND day=07, Athena consults the Glue Catalog for partition locations, identifies that only one prefix matches, and instructs Trino workers to read only that prefix. On a table with 365 day-partitions, a single-day query touches roughly 1/365th of the data — a 99.7% reduction in bytes scanned.

The traditional Hive approach requires MSCK REPAIR TABLE or a Glue crawler whenever new partitions appear. Partition projection takes a different path: instead of materializing every partition in the catalog, you declare the partition pattern in table properties, and Athena synthesizes the partition list on the fly from the WHERE clause.

CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://my-lake/sales/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'storage.location.template' = 's3://my-lake/sales/year=${year}/month=${month}/day=${day}/'
);

An analogy: traditional Hive partitioning is a library where every book is individually catalogued by hand. Partition projection is a library where the librarian knows the shelving rule and computes where any book lives without consulting an index card.

Animation: Partition projection prunes the search to a single day
WHERE year=2024 AND month=11 AND day=07 year month=1 2 3 4 5 6 7 8 9 10 11 12 2020 2021 2022 2023 2024 2025 2026 Single matching prefix scanned ~99.7% pruned
A predicate on year/month/day fades 83 of 84 partitions and lights up the single surviving prefix — partition projection synthesizes that path without a Glue listing call.

Columnar Formats and Compression

Once partitioning has narrowed the file set, the next question is how each file is laid out internally. Row-based formats (CSV, JSON, Avro) store all columns of one row together; column-based formats (Parquet, ORC) store all values of one column together.

FormatLayoutAthena cost on SELECT amount FROM sales
CSVRow-based, no compressionReads 100% of file
JSONRow-based, often verboseReads 100% + parsing overhead
AvroRow-based, schema-aware, compressedReads 100% but smaller bytes
ParquetColumnar, Snappy/ZSTD compressedReads only the amount column
ORCColumnar, Snappy/Zlib compressedReads only the amount column

Parquet also embeds min/max statistics per row group (~128 MB chunks). When your WHERE clause says amount > 1000, Trino can skip entire row groups whose maximum amount is below 1000, without reading any actual values. This is predicate pushdown at the file level, stacked on top of partition pruning.

The recommended file size is 128–256 MB per Parquet object, with internal row groups of 64–128 MB. Files smaller than 128 MB hurt parallelism — each worker pays a fixed S3 GET overhead per file, and with thousands of tiny files that overhead dominates wall-clock time (the "tail latency" problem). Files larger than 1 GB underutilize the worker fleet because individual workers cannot subdivide a file beyond row-group boundaries.

Compression: Snappy is the Parquet default (fast, modest ratio). ZSTD is 10–30% better at slightly higher CPU cost. GZIP is single-threaded per block — avoid for analytical Parquet.

CTAS and INSERT INTO Patterns

CTAS (CREATE TABLE AS SELECT) is the Swiss Army knife of Athena optimization. A CTAS reads from a source, applies any transformation (projection, filter, partitioning, formatting, bucketing), and writes a new table in a single operation:

CREATE TABLE optimized_sales
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY['year', 'region'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 32,
  external_location = 's3://my-lake/curated/sales/'
) AS
SELECT
  sale_id,
  amount,
  customer_id,
  year(sale_date) AS year,
  region
FROM raw_sales
WHERE sale_date >= DATE '2020-01-01';

Bucketing hashes a chosen column into a fixed number of files per partition. When two bucketed tables are joined on the bucket column, Trino performs a co-located join — bucket 7 of sales joins only against bucket 7 of customers — dramatically reducing shuffle traffic.

CTAS has two important constraints: the destination S3 location must be empty (Athena refuses to overwrite), and a single CTAS produces at most 100 partitions. For larger datasets you use INSERT INTO ... SELECT to append additional partitions:

INSERT INTO optimized_sales
SELECT sale_id, amount, customer_id, year(sale_date) AS year, region
FROM raw_sales
WHERE sale_date >= DATE '2024-01-01' AND sale_date < DATE '2024-02-01';

A common production pattern combines CTAS for backfills with scheduled INSERT INTO for daily increments. Many teams have replaced this with Apache Iceberg tables (Athena engine 3+), which provide atomic commits, hidden partitioning, and time travel.

Figure 8.2: Hive Partition Lookup vs Partition Projection

flowchart TD Q[Query: WHERE year=2024 AND month=11 AND day=07] Q --> Mode{Partition mode?} Mode -->|Hive partitioning| H1[Glue Catalog
list partitions API] H1 --> H2[Materialize all
registered partitions] H2 --> H3[Filter against predicate] H3 --> Scan[Issue S3 GET on
matching prefix] Mode -->|Partition projection| P1[Read TBLPROPERTIES
projection.* ranges] P1 --> P2[Compute partition path
from storage.location.template] P2 --> Scan Scan --> Result[(Parquet row groups)]

Key Takeaways

Post-Reading Quiz — Athena Query Optimization

1. What does partition projection change about how Athena resolves which S3 prefixes to scan?

A. It forces Athena to crawl all S3 objects on every query.
B. It synthesizes the partition list on the fly from declared range patterns, avoiding a Glue Catalog list-partitions call.
C. It replaces partitioning with bucketing for time-series data.
D. It caches partitions inside Trino workers' local SSDs.

2. A query runs SELECT amount FROM sales WHERE amount > 1000 against a Parquet table. Which of the following best explains why Athena reads dramatically less than the full table even before partition pruning?

A. Parquet automatically pre-aggregates all numeric columns to skip filtering.
B. Parquet stores values column-by-column and embeds min/max statistics per row group, so Trino reads only the amount column and skips row groups whose max is ≤ 1000.
C. Parquet shards each column into a separate S3 object that is independently encrypted.
D. Parquet rewrites the query into a Glue Data Catalog stored procedure.

3. Why is the recommended Parquet object size 128–256 MB rather than dramatically smaller?

A. Athena rejects Parquet files smaller than 128 MB.
B. S3 charges higher fees for sub-128 MB GET requests.
C. Files smaller than 128 MB hurt Trino parallelism — the fixed S3 GET overhead per file dominates wall-clock time when there are thousands of tiny files.
D. Snappy compression cannot operate on files below 128 MB.

4. What is the primary benefit of bucketing a table by a high-cardinality join column like customer_id?

A. Bucketing replaces partitioning entirely, simplifying the table layout.
B. Bucketing enables co-located joins — bucket N of one table joins only against bucket N of the other, dramatically reducing shuffle traffic.
C. Bucketing reduces S3 storage cost by deduplicating identical customer records.
D. Bucketing forces Athena to evaluate every bucket sequentially, which lowers concurrency.

5. What is a structural constraint of a single CTAS statement in Athena that pushes teams toward INSERT INTO for ongoing loads?

A. CTAS cannot specify a target file format.
B. CTAS produces at most 100 partitions, and the destination S3 location must be empty.
C. CTAS must run during nightly maintenance windows.
D. CTAS automatically deletes the source table after writing.

Partition 3: Federated Querying & Engine Selection

Pre-Reading Quiz — Federated Querying & Engine Selection

1. In Athena Federated Query, what is the role of an AWS Lambda function and why is it architecturally significant?

A. The Lambda is an authentication broker; it never touches data.
B. The Lambda is the data-source connector that translates Trino reads into native API calls and returns Apache Arrow blocks, letting Athena treat external data uniformly.
C. The Lambda mounts the external database as a local block device.
D. The Lambda is only used to decompress Parquet payloads from external sources.

2. A federated query runs SELECT user_id, email FROM lambda_dynamo.users WHERE last_login > '2024-01-01' where last_login is not a partition or sort key. What happens, and why is it expensive?

A. DynamoDB satisfies it as a GetItem; cost is single-digit millisecond latency.
B. The connector degenerates the query to a full DynamoDB Scan because the predicate is on a non-key attribute, so DynamoDB returns the whole table for Trino to post-filter.
C. The query fails with a syntax error because Athena requires partition keys in WHERE.
D. The query is silently rejected by Lake Formation row-level policies.

3. For an organization running ten ad-hoc Athena queries per day — each scanning ~5 GB of well-partitioned Parquet — versus a small idle Redshift provisioned cluster, which engine wins on cost and why?

A. Redshift wins because clusters are always cheaper than per-query pricing.
B. Athena wins by orders of magnitude (~$1/year vs ~$1,750/month) because Athena charges nothing when idle and per-TB-scanned is tiny on small workloads.
C. Both cost the same because AWS sets a flat data-platform price.
D. EMR is cheaper than either for ad-hoc SQL.

4. Which combination of properties best captures when Redshift is the right engine over Athena and EMR?

A. Sporadic SQL workloads where idle cost must be zero and query patterns are unpredictable.
B. Non-SQL big-data processing in Spark, Flink, or Hive frameworks.
C. Predictable, high-concurrency BI dashboards with sub-second latency expectations and persistent compute.
D. Multi-hour ETL transforming raw event logs into Parquet.

5. Why do mature data platforms run Athena, Redshift, and EMR side by side rather than picking one?

A. AWS contracts require multi-engine deployments above a usage tier.
B. Each engine is optimized for a different workload type, and they unify on a shared Glue Data Catalog (with Lake Formation governance), often using Iceberg on S3 to give every engine warehouse-quality semantics.
C. Athena, Redshift, and EMR are simply rebrandings of the same engine and load-balance behind the scenes.
D. Glue Data Catalog can only be queried when all three engines are concurrently provisioned.

Athena Federated Query Connectors

Up to this point, every example has assumed data lives in S3. In practice, organizations have customer profiles in DynamoDB, transactional records in RDS, financial metrics in Snowflake, and event logs in S3 — and the analytical question that matters often spans all of them. Athena Federated Query is the bridge.

Federated Query introduces a catalog that points not at the Glue Data Catalog but at an AWS Lambda function. That Lambda function is the data source connector — the runtime intermediary between Athena and the external system.

SQL Query
  -> Athena Trino engine
  -> Federated Query Handler
  -> Lambda Data Source Connector (per source)
  -> Native API call (DynamoDB GetItem, RDS SQL, Redshift, etc.)
  -> Apache Arrow result blocks
  -> Athena query plan continues (joins, aggregates)
  -> Final result to user

Every connector implements four required handler interfaces:

HandlerResponsibilityAnalogy
MetadataHandlerLists schemas, tables, columnsA library catalog
GetTableHandlerReturns table schema for a specific tableA book's table of contents
GetSplitsHandlerDivides data into chunks for parallel readsAssigning aisles to multiple researchers
ReadRecordsHandlerStreams actual rows back as Apache ArrowThe researcher reading the book aloud

The Apache Arrow detail is more than a technicality. Arrow is a columnar in-memory format that allows Trino, Lambda, and external systems to share data without serialization/deserialization overhead. When the connector returns Arrow blocks (up to 64 MB each), Trino consumes them directly, treating remote data with the same primitives it uses for native S3 Parquet.

Animation: Federated query flow — Athena to external source via Lambda
Athena Trino SQL plan joins / aggregates Lambda Connector Metadata / Splits / Read pushdown predicates External Source DynamoDB / RDS Snowflake / BigQuery User result set Trino read native API Apache Arrow blocks (≤64 MB) flow back SQL Arrow rows SELECT * FROM lambda_catalog.t WHERE tier = 'platinum'
SQL flows Athena → Lambda → external source; Apache Arrow blocks stream back so Trino can join them with S3 data without serialization overhead.

Querying RDS, DynamoDB, and External Sources

Each connector maps Athena's query operations to the source system's native primitives. The mapping matters because it determines what gets pushed down (good — less data moves) versus pulled up to Trino for filtering (bad — more data moves).

DynamoDB is the trickiest because it is not a relational engine. WHERE user_id = 'u-123' becomes a DynamoDB GetItem — single-digit-millisecond latency. Partition-key equality becomes a Query. But WHERE last_login > '2024-01-01' degenerates into a full-table Scan — slow and expensive. Treat DynamoDB like an OLTP key-value store; treating it like a warehouse is a path to surprise bills.

-- Efficient: pushes down to GetItem
SELECT user_id, email, last_login
FROM lambda_dynamo.production.user_profiles
WHERE user_id = 'user_12345';

-- Expensive: degenerates to a full Scan
SELECT user_id, email
FROM lambda_dynamo.production.user_profiles
WHERE last_login > TIMESTAMP '2024-01-01 00:00:00';

RDS connectors are more conventional — the Lambda receives the WHERE clause and column list and constructs a real SQL query against the database. Pitfalls are operational: Lambda must share the VPC with RDS, credentials live in Secrets Manager, and concurrent queries can saturate connection limits. Route federated reads to a read replica when possible.

SELECT
    c.customer_id,
    c.name,
    c.tier,
    s.total_revenue_2024
FROM lambda_rds.prod.customers c
JOIN s3_lake.curated.revenue_summary s
    ON c.customer_id = s.customer_id
WHERE c.created_date > DATE '2024-01-01'
  AND c.tier IN ('gold', 'platinum');

Connectors to Snowflake and BigQuery

The connector model extends naturally to non-AWS warehouses. AWS publishes a Snowflake connector (JDBC) and a BigQuery connector (BigQuery client libraries). Both push down filters and projections; both bill on two axes (Athena scanned bytes plus the source warehouse's own pricing).

Latency: pure S3 queries return in 2–5 s; federated DynamoDB in 3–8 s; RDS or warehouse queries in 5–15 s; cross-source joins in 10–30 s. Lambda's 15-minute hard limit caps long-running reads, and the 64 MB Arrow block size can fragment very large result sets. Federation is excellent for joining moderate volumes; it is not a replacement for nightly terabyte ETL. The right pattern is exploration via federation, then materialization via CTAS for production.

When Athena vs Redshift vs EMR

The fundamental difference is whether your workload is interactive or persistent.

Three workshops in a factory: Athena is the bench for quick repairs as customers walk in; Redshift is the assembly line running the same product day after day; EMR is the heavy machine shop fabricating custom components.

Cost models compared

ServicePrimary cost driverIdle cost
Athena$5/TB scanned (US)Zero
Redshift ServerlessRPU-hours (~$0.36/RPU-hour)Zero (after auto-pause)
Redshift ProvisionedNode-hours (24/7)Full cluster cost
EMR ProvisionedEC2 + EMR fee per instance-hourFull cluster cost
EMR ServerlessWorker resource-secondsZero

Ten ad-hoc Athena queries/day at 5 GB each ≈ $1/year. The same workload on a small Redshift provisioned cluster costs roughly $1,750/month even when idle. Conversely, 200 high-concurrency BI queries per minute on a 10 TB warehouse: Redshift wins; Athena would bill aggressively and likely throttle.

Figure 8.4: Decision Flow for Athena vs Redshift vs EMR

flowchart TD Start[New analytics workload] --> Q1{Workload type?} Q1 -->|Non-SQL: Spark / Flink / ML| EMR[EMR
provisioned or serverless] Q1 -->|SQL-only| Q2{Access pattern?} Q2 -->|Ad-hoc, exploratory,
unpredictable| Q3{Daily scan volume?} Q3 -->|Sub-50 GB / day| Athena1[Athena
per-TB scanned] Q3 -->|Multi-TB sustained| Q4 Q2 -->|Predictable BI dashboards,
high concurrency| Q4{Latency SLA?} Q4 -->|Sub-second, 100s of users| Redshift[Redshift Provisioned
or Serverless] Q4 -->|Seconds OK,
cost-sensitive| Athena2[Athena +
partitioned Parquet] EMR --> Output[Curated Parquet / Iceberg in S3] Output --> Athena1 Output --> Redshift

Hybrid Usage Patterns

In practice, mature platforms use all three engines, each for the workload it is optimized for, unified by the Glue Data Catalog and Lake Formation governance.

Raw landing (S3, JSON/CSV)
        |
        | EMR Spark (heavy transformation)
        v
Curated lake (S3, Parquet/Iceberg, Glue Catalog)
        |
        +--------+--------+
        |                 |
        v                 v
   Athena (ad-hoc)   Redshift Spectrum + RA3 (BI marts)
        |                 |
        v                 v
    Data scientists    BI dashboards

The Glue Data Catalog is the glue (literally): a table registered in Glue is queryable from Athena, accessible via Redshift Spectrum, and visible to EMR Spark — without redefinition. Lake Formation layers row- and column-level access control over the catalog so that a marketing analyst querying via Athena and a data engineer querying via Spark see the same governed view.

The data lakehouse pattern captures this in a single phrase: warehouse-quality structure (ACID, schema evolution, time travel) on lake-economics storage (S3, no duplication, multi-engine access). Athena engine v3's native Iceberg support makes this practical without third-party tooling.

Key Takeaways

Post-Reading Quiz — Federated Querying & Engine Selection

1. In Athena Federated Query, what is the role of an AWS Lambda function and why is it architecturally significant?

A. The Lambda is an authentication broker; it never touches data.
B. The Lambda is the data-source connector that translates Trino reads into native API calls and returns Apache Arrow blocks, letting Athena treat external data uniformly.
C. The Lambda mounts the external database as a local block device.
D. The Lambda is only used to decompress Parquet payloads from external sources.

2. A federated query runs SELECT user_id, email FROM lambda_dynamo.users WHERE last_login > '2024-01-01' where last_login is not a partition or sort key. What happens, and why is it expensive?

A. DynamoDB satisfies it as a GetItem; cost is single-digit millisecond latency.
B. The connector degenerates the query to a full DynamoDB Scan because the predicate is on a non-key attribute, so DynamoDB returns the whole table for Trino to post-filter.
C. The query fails with a syntax error because Athena requires partition keys in WHERE.
D. The query is silently rejected by Lake Formation row-level policies.

3. For an organization running ten ad-hoc Athena queries per day — each scanning ~5 GB of well-partitioned Parquet — versus a small idle Redshift provisioned cluster, which engine wins on cost and why?

A. Redshift wins because clusters are always cheaper than per-query pricing.
B. Athena wins by orders of magnitude (~$1/year vs ~$1,750/month) because Athena charges nothing when idle and per-TB-scanned is tiny on small workloads.
C. Both cost the same because AWS sets a flat data-platform price.
D. EMR is cheaper than either for ad-hoc SQL.

4. Which combination of properties best captures when Redshift is the right engine over Athena and EMR?

A. Sporadic SQL workloads where idle cost must be zero and query patterns are unpredictable.
B. Non-SQL big-data processing in Spark, Flink, or Hive frameworks.
C. Predictable, high-concurrency BI dashboards with sub-second latency expectations and persistent compute.
D. Multi-hour ETL transforming raw event logs into Parquet.

5. Why do mature data platforms run Athena, Redshift, and EMR side by side rather than picking one?

A. AWS contracts require multi-engine deployments above a usage tier.
B. Each engine is optimized for a different workload type, and they unify on a shared Glue Data Catalog (with Lake Formation governance), often using Iceberg on S3 to give every engine warehouse-quality semantics.
C. Athena, Redshift, and EMR are simply rebrandings of the same engine and load-balance behind the scenes.
D. Glue Data Catalog can only be queried when all three engines are concurrently provisioned.

Your Progress

Answer Explanations