Study Guide: Chapter 13 — BI, ML, and Cost Optimization in Production

flowchart TD subgraph SRC["Sources"] DB[Operational DBs] EV[Events / IoT] SAAS[SaaS] end subgraph ING["Ingestion"] DMS[DMS CDC] KDS[Kinesis Streams] AF[AppFlow] GL[Glue Batch] end subgraph LAKE["Layered Lakehouse Storage"] BR["Bronze: S3 Raw
Parquet partitioned"] SI["Silver: Iceberg / Delta
Conformed Zone"] GO["Gold: Redshift / Snowflake
Star Schemas + MVs"] end subgraph BI["Analytics Branch"] QS[QuickSight + SPICE] ATH[Athena Ad-hoc] end subgraph ML["ML Branch"] FS["Feature Store
Online + Offline"] SM["SageMaker Training
+ Endpoints"] end subgraph CROSS["Cross-Cutting Planes"] GOV[Lake Formation Governance] OBS["Observability:
OpenLineage + Monte Carlo"] SEC["Security: IAM + KMS + VPC"] end DB --> DMS EV --> KDS SAAS --> AF DB --> GL DMS --> BR KDS --> BR AF --> BR GL --> BR BR --> SI SI --> GO GO --> QS GO --> ATH SI --> FS FS --> SM GOV -.-> LAKE GOV -.-> BI GOV -.-> ML OBS -.-> LAKE SEC -.-> ING SEC -.-> LAKE

Figure 13.5 — Multi-Tenant Isolation Models

Section 1: BI with QuickSight and SPICE

Pre-Reading Check — Section 1

1. A QuickSight Standard-edition account is hitting its dashboard refresh ceiling. The team needs hourly freshness on a 50 GB fact table. Which constraint is the binding one and what is the typical fix?

Standard caps refreshes at 4 per day; the binding constraint is refresh frequency, fixed by upgrading to Enterprise edition (up to 32 refreshes/day). Standard caps SPICE capacity at 10 GB total; the binding constraint is SPICE size, fixed by partitioning the dataset into smaller business-unit slices. Standard does not support direct query, so the fix is enabling direct query against Redshift instead of SPICE. Standard does not support VPC connectors, so the fix is migrating the warehouse out of a private VPC.

2. A business user complains that Q in QuickSight cannot answer "show me revenue by customer type" even though the data exists. The column in the dataset is named cust_acct_typ_cd_v2. Which fix most directly addresses the failure?

Increase the SPICE refresh frequency so Q has fresher data to translate. Add a business-friendly column name and description in the QuickSight dataset editor and configure synonyms. Migrate the dataset from SPICE to direct query mode so Q can rerun the question. Enable row-level security so Q is allowed to read the column.

3. Why is incremental refresh the preferred SPICE mode for a clickstream fact table that is append-only with monotonic event timestamps?

It is the only mode supported on Enterprise edition. It bypasses SPICE entirely and queries the source live, eliminating staleness. It loads only rows newer than a timestamp watermark, avoiding the cost of re-ingesting unchanged history on every refresh. It triggers a full refresh whenever the schema changes, ensuring the dataset is always current.

What QuickSight Is

Business intelligence is the layer where engineered data finally meets the people who make decisions with it. A pipeline that ingests terabytes of clickstream and conforms it to a star schema delivers zero business value until a regional sales lead can open a dashboard at 7 AM and see whether yesterday's promotion worked. Amazon QuickSight is AWS's native managed BI service, designed to plug directly into Redshift warehouses, S3 data lakes, and Athena query layers. It is serverless, per-session billed, and tightly integrated with AWS identity and networking.

QuickSight has three distinct capability layers, each with its own cost model and failure modes:

Connector and security layer — VPC connectors, IAM roles, and row-level security that govern how QuickSight reaches data.
SPICE in-memory acceleration layer — the columnar cache that decouples dashboard latency from source-system load.
End-user experience layer — dashboards, embedded analytics, and the natural-language assistant Q.

SPICE: Super-fast, Parallel, In-memory Calculation Engine

SPICE is QuickSight's proprietary columnar in-memory store. When you ingest a dataset into SPICE, QuickSight extracts rows from the source, compresses them columnarly (typical ratios are 4:1 to 6:1), and stores them in a distributed memory tier the dashboard renderer queries directly. Aggregation latency falls to the 10–100 millisecond range regardless of how loaded the source warehouse is.

The mental model is a coffee thermos. The espresso machine (your warehouse) is expensive, slow to warm up, and shared. Instead of walking back for every sip, you brew a thermos in the morning (SPICE ingestion) and pour from it instantly all day. The thermos is smaller than the machine's output, so you choose what goes in; contents go stale unless you refill on a schedule.

Capacity is a finite shared resource: Standard edition gives 10 GB per account; Enterprise gives 10 GB base plus 0.5 GB per provisioned user up to 512 GB. A careless team can fill it with overlapping snapshots and starve the organization.

Refresh Strategies

Refresh Mode	Behavior	When to Use
Full refresh	Re-ingests entire dataset from source	Small datasets, dimension tables, schema changes
Incremental refresh	Loads only rows newer than a timestamp watermark	Large append-only fact tables with monotonic event time
On-demand refresh	API or console-triggered manual run	Post-ETL completion via EventBridge, ad-hoc fixes

Standard caps refreshes at four per day; Enterprise allows up to 32. Schedule refreshes during low-query windows so SPICE ingestion does not contend with interactive analytics. The trade-off is always freshness versus cost: a dashboard accurate to the minute is dramatically more expensive than one accurate to the hour.

Q Natural Language Queries

Q in QuickSight is the natural-language layer that lets a business user type "what were our top five products by revenue last quarter in EMEA" and receive a generated chart instead of a SQL prompt. Under the hood, Q is a text-to-SQL pipeline: the question is parsed by an NLP model, mapped against the dataset schema and metadata, translated to SQL, executed against SPICE or the source, and rendered as a visualization.

Q is Enterprise-only and its accuracy is almost entirely determined by metadata quality:

Business-friendly column names and descriptions in the QuickSight dataset editor — not inherited from raw source DDL.
Hierarchies declared explicitly (Date → Quarter → Month → Day; Region → Country → City).
Custom synonyms and grouping definitions so "EMEA" maps to your organization's country list.
Data category tags (currency, percentage, date) so gross_margin is formatted as a percentage and aggregated as a weighted average.

Embedded Analytics

Embedded analytics surfaces dashboards or the full Q bar inside another application via a signed embed URL. SaaS companies use it to expose per-customer reporting inside their own product. Three dimensions matter:

Identity — row-level security and namespace isolation prevent tenant cross-contamination.
Session — short-lived embed URLs generated server-side via GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser; never long-lived credentials.
Theming — QuickSight themes and custom CSS so the experience feels native, not bolted-on.

For multi-tenant scenarios, the anonymous-user embed pattern with session tags is standard: each session is tagged with the tenant ID, RLS rules filter datasets on that tag, and the host application controls which dashboard a user can request.

Post-Reading Check — Section 1

3. Why is incremental refresh the preferred SPICE mode for a clickstream fact table that is append-only with monotonic event timestamps?

Section 2: ML Convergence and SageMaker Integration

Pre-Reading Check — Section 2

1. A churn model trained on the SageMaker offline feature store hits 99% accuracy in development but only 60% in production. What is the most likely root cause?

The online store has insufficient capacity for production traffic, causing GetRecord throttling. The training set was built using current feature values rather than ASOF joins on EventTime, leaking post-cancellation profile updates into training. SPICE compression introduced precision loss in numeric features. Lake Formation row-level security blocked some labels at training time but not inference.

2. Which SageMaker Feature Store property best reflects how it integrates into the lakehouse rather than being a sibling silo?

It uses a proprietary file format only readable by SageMaker training jobs. The offline store is S3 Parquet partitioned by event time and registered in the Glue Data Catalog, queryable from Athena, Redshift Spectrum, or EMR. The online store replaces Redshift as the warehouse for analytical workloads. It bypasses Lake Formation and uses its own access control surface separate from the rest of the data platform.

3. Generative-BI assistants tend to fail when pointed at a raw lake. What discipline makes them succeed instead?

Fine-tuning the LLM on every column in every raw bronze table. Disabling guardrails so the LLM has unrestricted query authority. Publishing a curated semantic layer of metrics with definitions, owners, and grain, then treating the LLM as a translator into that layer. Removing lineage and synonym dictionaries so the LLM is not biased by metadata.

Lakehouse-Native ML

For most of BI's history, analytics and ML lived in separate tools with separate data copies. The convergence story of the past few years is that those worlds are merging onto the same lakehouse storage, and the bridge between them is the SageMaker Feature Store and a new generation of generative-BI assistants that operate over warehouse data.

SageMaker is AWS's managed ML platform covering notebooks, training, model registry, deployment, and monitoring. Its critical integration point is that SageMaker training jobs, batch-transform jobs, and Studio notebooks read directly from the same S3 lakehouse and Glue Data Catalog as analytics workloads. Training a churn model no longer requires an export to a separate ML store; it reads the same Iceberg or Hudi table the dashboards read.

Concrete integration patterns:

Glue ETL into SageMaker training — a scheduled Glue job lands curated features in S3 as partitioned Parquet, and a training job uses the Glue Data Catalog table as input.
Athena Federated Query in notebooks — SageMaker Studio runs Athena queries against lake tables to build training samples, with full pushdown of partition filters.
EMR + Spark connector — large-scale feature engineering writes directly to feature groups via sagemaker-feature-store-pyspark for historical backfills.
Redshift ML — in-warehouse SQL training (XGBoost, linear learner) where the warehouse is the source of truth and exporting would add latency.

Feature Store: Online + Offline Pair

A feature store manages ML model inputs with two non-negotiable properties: low-latency online lookups for inference, and point-in-time-correct historical retrieval for training.

Store	Backend	Latency	Primary Use
Online	Managed in-memory KV	Single-digit ms	Real-time inference via `GetRecord` / `BatchGetRecord`
Offline	S3 Parquet (Glue cataloged)	Seconds to minutes	Training set creation, batch inference, analytics

The offline store is literally just S3 Parquet partitioned by event time, registered in the Glue Data Catalog, and queryable from Athena, Redshift Spectrum, or EMR. Feature data is part of the lakehouse, not a sibling. An analyst queries feature group history with the same Athena workgroup used for fact tables; a Lake Formation administrator grants access using the same policy primitives.

Ingestion Patterns

Batch ingestion — scheduled Glue or EMR jobs read from S3 or Redshift, transform, and call PutRecord (or use the Spark connector). The connector dual-writes to online and offline stores in one operation.
Stream ingestion — Kinesis or Managed Service for Apache Flink push records through Lambda handlers calling PutRecord continuously.
Backfill — historical Parquet files are written directly to the offline-store S3 path and registered, bypassing per-record API overhead.
CDC ingestion — DMS or Debezium captures change events through Kinesis or Kafka; a stream handler updates features as upstream rows change.

For upsert-heavy workloads (a customer profile updated daily), the Iceberg-format offline store enables ACID upserts, schema evolution, and time travel. For append-only event histories (every transaction), the Glue/Standard offline store is simpler and cheaper.

Point-in-Time Correctness: The Discipline That Separates Production from Prototype

Every offline-store record carries both EventTime (when the value was true in the real world) and WriteTime (when it was written). Training queries use ASOF joins to retrieve the feature value current at prediction time, not the latest value. Without this, a churn model is poisoned by post-cancellation profile updates — classic label leakage that produces 99% dev accuracy and 60% production accuracy.

Generative-BI Assistants

Generative-BI assistants are LLM-powered agents that take a user's natural-language question, plan a sequence of warehouse queries, and synthesize a narrative answer with charts and citations. Q in QuickSight sits at the simpler end (single-question text-to-SQL); Bedrock Agents and similar frameworks orchestrate multi-step reasoning across catalogs, semantic layers, and external knowledge bases.

The data engineering work is largely metadata work:

Semantic layer — published curated metrics ("monthly active users," "gross margin") with definitions, owners, and grain.
Glossary and synonyms — industry-specific terms mapped to columns and tables.
Lineage — tracked from raw source through transformations to the cited metric.
Guardrails — query cost quotas, schema-restriction policies, result-size caps.

The pattern that fails is pointing an LLM at a raw lake. The pattern that succeeds is publishing a curated semantic layer first and treating the LLM as a translator into it.

Post-Reading Check — Section 2

1. A churn model trained on the SageMaker offline feature store hits 99% accuracy in development but only 60% in production. What is the most likely root cause?

2. Which SageMaker Feature Store property best reflects how it integrates into the lakehouse rather than being a sibling silo?

3. Generative-BI assistants tend to fail when pointed at a raw lake. What discipline makes them succeed instead?

Section 3: Cost Optimization and Reference Architectures

Pre-Reading Check — Section 3

1. An Athena query SELECT * FROM events WHERE LOWER(country) = 'us' runs a full table scan on a clickstream table partitioned by country. What is the most direct fix?

Increase the Athena workgroup scan limit so the query is allowed to complete. Switch to BigQuery, which natively understands case-insensitive partition pruning. Normalize country codes at ingestion so the filter becomes country = 'US' and partition pruning works. Move the table from Parquet to CSV so the optimizer can re-evaluate the filter.

2. A platform team is choosing between S3 Standard, Intelligent-Tiering, and Glacier Flexible Retrieval for a 300 TB lakehouse with mixed and unpredictable access patterns. Which choice and rationale is correct?

Glacier Flexible Retrieval, because it is cheapest per GB regardless of access pattern. Intelligent-Tiering, because it auto-moves objects between Frequent and Infrequent tiers and typically saves 40-50% with no engineering work for unpredictable access. S3 Standard, because Intelligent-Tiering retrieval fees always exceed Standard pricing. A custom mix where every object is tagged manually each week.

3. A production team adopts a multi-tenant pool model with row-level security and tenant_id columns. A new enterprise customer demands dedicated infrastructure for compliance reasons. Which evolution best fits the canonical pattern?

Convert the entire platform to silo, giving every tenant a dedicated cluster. Refuse the customer because pool is the only valid multi-tenant model. Move to a hybrid: keep pool for the long tail of small tenants and add silo (dedicated cluster or account) for the enterprise customer. Switch to bridge for everyone by giving each tenant its own schema in the shared cluster.

Cost Optimization as Workload Shaping

Cost optimization in modern data platforms is about shaping workloads to the billing model of each service. Snowflake bills credits per second (60s minimum); BigQuery bills per terabyte scanned or by reserved slots; Redshift bills per provisioned hour or per serverless second; Athena bills per terabyte scanned; S3 bills per gigabyte-month with tier-dependent rates. Cost optimization means reducing cost (compute hours, bytes scanned, storage tier) without harming value (query latency, freshness, completeness).

Storage Class and Lifecycle Tuning

Tier	Approx. Cost vs Standard	Retrieval Latency	Best For
S3 Standard	100%	ms	Hot, frequently queried
S3 Intelligent-Tiering	~60%	ms (auto-managed)	Mixed/unknown access
S3 Standard-IA	~55%	ms (per-GB retrieval fee)	Known infrequent, > 30 days
Glacier Instant Retrieval	~25%	ms	Rare, instant needed
Glacier Flexible Retrieval	~15%	minutes-hours	Archive with occasional restore
Glacier Deep Archive	~5%	12+ hours	Compliance archive

S3 Intelligent-Tiering is the default-good choice for lakehouses with unpredictable access. Glacier tiers save 90+ percent for compliance archives and rarely-accessed raw event logs. A typical lakehouse lifecycle policy: keep raw bronze in Standard for 30 days, move to Intelligent-Tiering for 90 days, transition to Glacier Flexible Retrieval at 180 days, expire at 7 years.

Query Economics: Scan Reduction

In scan-billed services every byte read is a billed byte. The four levers, by typical impact:

Partitioning — physically separates data; filters on partition keys skip entire directories.
Clustering / sort keys — within a partition, sorts rows so range filters touch contiguous blocks (BigQuery clustering, Redshift sort keys, Z-ordering in Iceberg/Delta).
Materialized views — pre-compute expensive aggregations once, query them many times.
BI Engine / result caching — BigQuery BI Engine, Snowflake result cache, Redshift result cache, QuickSight SPICE all cut scans dramatically for hot dashboard data.

Anti-pattern: SELECT * FROM big_table WHERE LOWER(country) = 'us' defeats partition pruning because the optimizer cannot prove the function preserves filter on storage layout. Normalize country codes at ingestion. Production deployments enforce per-query scan quotas (BigQuery custom quotas, Athena workgroup limits) so a typo cannot become a five-figure bill.

Right-Sizing and Reserved Capacity

Snowflake — warehouse size per workload (XS for ad-hoc, M for ETL, L for heavy analytics); auto-suspend 60-300s; monitor WAREHOUSE_LOAD_HISTORY weekly.
Redshift — Serverless for variable workloads (per-second after 60s minimum); elastic resize for diurnal patterns; provisioned + RIs for steady 24/7 (up to 75% savings).
BigQuery — on-demand by default; slot reservations for steady high-volume; edition-based pricing (Standard/Enterprise/Enterprise Plus).
Athena — primary right-sizing lever is data layout (Iceberg + columnar formats), not compute.

Reserved capacity yields up to 75% savings for stable workloads. The trap is overcommitment: size reservations from 90 days of historical burn, cover only the steady baseline, and let on-demand absorb peaks. Per-query tagging (Snowflake QUERY_TAG, BigQuery labels, AWS resource tags) maps spend back to incurring teams — the precondition for any chargeback model.

Cost Optimization Decision Flow

flowchart TD Start(["Workload Cost Review"]) --> Q1{"Storage
dominates bill?"} Q1 -->|Yes| Tier["Apply S3 Lifecycle:
Standard 30d to IT 90d to
Glacier IR 1y to Deep Archive"] Q1 -->|No| Q2{"Per-byte scan
billing?"} Tier --> Q2 Q2 -->|Yes| Scan["Add Partitioning + Clustering
+ Materialized Views
+ Result Cache"] Q2 -->|No| Q3{"Compute
under-utilized?"} Scan --> Quota["Enforce Per-Query
Scan Quotas"] Quota --> Q3 Q3 -->|Yes| Right["Right-Size:
Auto-Suspend
Serverless Mode"] Q3 -->|No| Q4{"Steady
baseline load?"} Right --> Q4 Q4 -->|Yes| Reserve["Reserved Capacity:
RIs / Slot Reservations
up to 75% savings"] Q4 -->|No| Tag["Per-Query Tagging
+ FinOps Chargeback"] Reserve --> Tag Tag --> End(["50-75% Savings"])

Reference Architectures

A reference architecture is a documented, tested combination of components solving a class of problems with known trade-offs. The point is that new projects start from a known-good template instead of relitigating fundamental decisions.

The canonical AWS shape:

Streaming and batch share storage — both write to the same S3-backed Iceberg silver tables.
Schema-on-read at bronze, schema-enforced at silver — raw zones tolerate sloppy upstream schemas; conformed zones reject contract violations.
Governance through Lake Formation — one access-control surface for analytics, ML, and reverse-ETL.
Observability cross-cutting — OpenLineage events, Monte Carlo data quality, Airflow/Step Functions metadata.

Multi-Tenant Patterns

Model	Isolation Boundary	Cost Efficiency	Blast Radius	Use Case
Pool (shared)	Row-level (tenant_id column)	Highest	Largest	Many small tenants, similar workloads
Bridge (silo within shared cluster)	Schema or database	Medium	Medium	Mid-size tenants, regulatory variation
Silo (dedicated)	Account or cluster	Lowest	Smallest	Few large tenants, strict compliance

Most production multi-tenant platforms end up hybrid: pool for the long tail of small tenants, silo for the handful of enterprise customers whose contracts demand dedicated infrastructure, with a clear migration path between the two as customers grow.

Capstone Discipline

The capstone is not finding the theoretically optimal stack; it is making every component choice deliberately, documenting the trade-off, and designing for the team that will operate the system at 3 AM on Tuesday when something breaks. A worked e-commerce example combines DMS + Kinesis ingestion, S3 Intelligent-Tiering bronze, Iceberg silver, Redshift Serverless gold, QuickSight Enterprise + SPICE BI, SageMaker Feature Store ML, Hightouch reverse-ETL, Lake Formation + Atlan governance, and a tight cost-control posture (lifecycle policies, workgroup scan limits, per-team tags) within a $1.5M annual budget.

Section 3 Key Points

Storage tiering with Intelligent-Tiering and Glacier saves 40-90% with little engineering effort; lifecycle policies automate the transitions.
Scan reduction levers in order of impact: partitioning > clustering/sort keys > materialized views > result caching. Function calls on partition columns silently defeat pruning.
Right-size with auto-suspend and serverless modes; reserve only the steady baseline (sized from 90 days of burn) and let on-demand absorb peaks.
Per-query tagging is the precondition for chargeback — without it, optimization is a central crusade rather than a distributed practice.
Reference architectures are templates, not products: layered S3/Iceberg storage, parallel BI and ML branches, Lake Formation governance, orthogonal observability/security planes.
Multi-tenancy in practice is hybrid (pool small + silo enterprise); the engineering work is propagating tenant identity through every layer.

Post-Reading Check — Section 3

1. An Athena query SELECT * FROM events WHERE LOWER(country) = 'us' runs a full table scan on a clickstream table partitioned by country. What is the most direct fix?

Chapter 13: BI, ML, and Cost Optimization in Production

Learning Objectives

Chapter Overview Diagrams

Figure 13.1 — QuickSight Three-Layer Architecture

Figure 13.4 — End-to-End Reference Architecture