Chapter 13: BI, ML, and Cost Optimization in Production
Learning Objectives
Connect Amazon QuickSight to lakehouse and warehouse sources, configuring SPICE refreshes, VPC connectors, and IAM roles for production dashboards.
Integrate machine learning pipelines with analytics platforms via SageMaker Feature Store, building point-in-time-correct training datasets and serving real-time inference from warehouse-resident features.
Apply cost optimization patterns including right-sizing, storage tiering, and query economics to reduce platform spend by 50-75 percent while preserving SLA targets.
Synthesize a reference architecture combining ingestion, lakehouse storage, transformation, governance, BI, and ML services from prior chapters into an end-to-end production design.
Chapter Overview Diagrams
Figure 13.1 — QuickSight Three-Layer Architecture
flowchart TD
subgraph Sources["Data Sources"]
RS[Redshift]
S3[S3 / Athena]
RDS[RDS / Aurora]
SaaS[SaaS Connectors]
end
subgraph Connect["Connector and Security Layer"]
VPC[VPC Connector]
IAM[IAM Roles]
RLS[Row-Level Security]
end
subgraph SPICE["SPICE Acceleration Layer"]
Ingest[Scheduled Ingestion]
Mem[("Columnar In-Memory Store 4:1-6:1 compression")]
Refresh[Full / Incremental / On-Demand]
end
subgraph Experience["End-User Experience"]
Dash[Dashboards]
Embed[Embedded Analytics]
Q[Q Natural-Language]
end
Sources --> Connect
Connect --> SPICE
Ingest --> Mem
Refresh --> Ingest
SPICE --> Experience
Connect -. Direct Query .-> Experience
Figure 13.4 — End-to-End Reference Architecture
flowchart TD
subgraph SRC["Sources"]
DB[Operational DBs]
EV[Events / IoT]
SAAS[SaaS]
end
subgraph ING["Ingestion"]
DMS[DMS CDC]
KDS[Kinesis Streams]
AF[AppFlow]
GL[Glue Batch]
end
subgraph LAKE["Layered Lakehouse Storage"]
BR["Bronze: S3 Raw Parquet partitioned"]
SI["Silver: Iceberg / Delta Conformed Zone"]
GO["Gold: Redshift / Snowflake Star Schemas + MVs"]
end
subgraph BI["Analytics Branch"]
QS[QuickSight + SPICE]
ATH[Athena Ad-hoc]
end
subgraph ML["ML Branch"]
FS["Feature Store Online + Offline"]
SM["SageMaker Training + Endpoints"]
end
subgraph CROSS["Cross-Cutting Planes"]
GOV[Lake Formation Governance]
OBS["Observability: OpenLineage + Monte Carlo"]
SEC["Security: IAM + KMS + VPC"]
end
DB --> DMS
EV --> KDS
SAAS --> AF
DB --> GL
DMS --> BR
KDS --> BR
AF --> BR
GL --> BR
BR --> SI
SI --> GO
GO --> QS
GO --> ATH
SI --> FS
FS --> SM
GOV -.-> LAKE
GOV -.-> BI
GOV -.-> ML
OBS -.-> LAKE
SEC -.-> ING
SEC -.-> LAKE
Figure 13.5 — Multi-Tenant Isolation Models
graph TD
subgraph Pool["Pool Model: Shared Everything"]
P1[Shared Cluster] --> P2[Shared Schema] --> P3[Tables with tenant_id column] --> P4[Row-Level Security + Session Tags]
end
subgraph Bridge["Bridge Model: Schema-Per-Tenant"]
B1[Shared Cluster] --> B2[Tenant A Schema]
B1 --> B3[Tenant B Schema]
B1 --> B4[Tenant C Schema]
end
subgraph Silo["Silo Model: Dedicated Infrastructure"]
S1[Tenant A Account / Cluster]
S2[Tenant B Account / Cluster]
S3[Tenant C Account / Cluster]
end
Cost["Cost Efficiency: Pool > Bridge > Silo"] -.-> Pool
Iso["Isolation Strength: Silo > Bridge > Pool"] -.-> Silo
Section 1: BI with QuickSight and SPICE
Pre-Reading Check — Section 1
1. A QuickSight Standard-edition account is hitting its dashboard refresh ceiling. The team needs hourly freshness on a 50 GB fact table. Which constraint is the binding one and what is the typical fix?
2. A business user complains that Q in QuickSight cannot answer "show me revenue by customer type" even though the data exists. The column in the dataset is named cust_acct_typ_cd_v2. Which fix most directly addresses the failure?
3. Why is incremental refresh the preferred SPICE mode for a clickstream fact table that is append-only with monotonic event timestamps?
What QuickSight Is
Business intelligence is the layer where engineered data finally meets the people who make decisions with it. A pipeline that ingests terabytes of clickstream and conforms it to a star schema delivers zero business value until a regional sales lead can open a dashboard at 7 AM and see whether yesterday's promotion worked. Amazon QuickSight is AWS's native managed BI service, designed to plug directly into Redshift warehouses, S3 data lakes, and Athena query layers. It is serverless, per-session billed, and tightly integrated with AWS identity and networking.
QuickSight has three distinct capability layers, each with its own cost model and failure modes:
Connector and security layer — VPC connectors, IAM roles, and row-level security that govern how QuickSight reaches data.
SPICE in-memory acceleration layer — the columnar cache that decouples dashboard latency from source-system load.
End-user experience layer — dashboards, embedded analytics, and the natural-language assistant Q.
SPICE is QuickSight's proprietary columnar in-memory store. When you ingest a dataset into SPICE, QuickSight extracts rows from the source, compresses them columnarly (typical ratios are 4:1 to 6:1), and stores them in a distributed memory tier the dashboard renderer queries directly. Aggregation latency falls to the 10–100 millisecond range regardless of how loaded the source warehouse is.
The mental model is a coffee thermos. The espresso machine (your warehouse) is expensive, slow to warm up, and shared. Instead of walking back for every sip, you brew a thermos in the morning (SPICE ingestion) and pour from it instantly all day. The thermos is smaller than the machine's output, so you choose what goes in; contents go stale unless you refill on a schedule.
Capacity is a finite shared resource: Standard edition gives 10 GB per account; Enterprise gives 10 GB base plus 0.5 GB per provisioned user up to 512 GB. A careless team can fill it with overlapping snapshots and starve the organization.
Animation 1: SPICE In-Memory Caching Speedup
Source warehouse query takes seconds; SPICE in-memory aggregation returns in 10–100 ms. Watch the bars grow proportionally to query latency.
Refresh Strategies
Refresh Mode
Behavior
When to Use
Full refresh
Re-ingests entire dataset from source
Small datasets, dimension tables, schema changes
Incremental refresh
Loads only rows newer than a timestamp watermark
Large append-only fact tables with monotonic event time
On-demand refresh
API or console-triggered manual run
Post-ETL completion via EventBridge, ad-hoc fixes
Standard caps refreshes at four per day; Enterprise allows up to 32. Schedule refreshes during low-query windows so SPICE ingestion does not contend with interactive analytics. The trade-off is always freshness versus cost: a dashboard accurate to the minute is dramatically more expensive than one accurate to the hour.
Q Natural Language Queries
Q in QuickSight is the natural-language layer that lets a business user type "what were our top five products by revenue last quarter in EMEA" and receive a generated chart instead of a SQL prompt. Under the hood, Q is a text-to-SQL pipeline: the question is parsed by an NLP model, mapped against the dataset schema and metadata, translated to SQL, executed against SPICE or the source, and rendered as a visualization.
Q is Enterprise-only and its accuracy is almost entirely determined by metadata quality:
Business-friendly column names and descriptions in the QuickSight dataset editor — not inherited from raw source DDL.
Hierarchies declared explicitly (Date → Quarter → Month → Day; Region → Country → City).
Custom synonyms and grouping definitions so "EMEA" maps to your organization's country list.
Data category tags (currency, percentage, date) so gross_margin is formatted as a percentage and aggregated as a weighted average.
Embedded Analytics
Embedded analytics surfaces dashboards or the full Q bar inside another application via a signed embed URL. SaaS companies use it to expose per-customer reporting inside their own product. Three dimensions matter:
Identity — row-level security and namespace isolation prevent tenant cross-contamination.
Session — short-lived embed URLs generated server-side via GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser; never long-lived credentials.
Theming — QuickSight themes and custom CSS so the experience feels native, not bolted-on.
For multi-tenant scenarios, the anonymous-user embed pattern with session tags is standard: each session is tagged with the tenant ID, RLS rules filter datasets on that tag, and the host application controls which dashboard a user can request.
Section 1 Key Points
SPICE is QuickSight's columnar in-memory engine: 4:1–6:1 compression, 10–100 ms aggregation latency, 10 GB (Standard) to 512 GB (Enterprise) account-wide capacity.
Choose refresh mode by data shape: full for small/dimensional, incremental for large append-only with watermark, on-demand for post-ETL triggers.
Q accuracy is a metadata investment, not a model upgrade: friendly names, hierarchies, synonyms, and data-category tags drive answer quality.
Embedded analytics demands short-lived embed URLs, namespace isolation, and RLS with session tags — never long-lived credentials.
Post-Reading Check — Section 1
1. A QuickSight Standard-edition account is hitting its dashboard refresh ceiling. The team needs hourly freshness on a 50 GB fact table. Which constraint is the binding one and what is the typical fix?
2. A business user complains that Q in QuickSight cannot answer "show me revenue by customer type" even though the data exists. The column in the dataset is named cust_acct_typ_cd_v2. Which fix most directly addresses the failure?
3. Why is incremental refresh the preferred SPICE mode for a clickstream fact table that is append-only with monotonic event timestamps?
Section 2: ML Convergence and SageMaker Integration
Pre-Reading Check — Section 2
1. A churn model trained on the SageMaker offline feature store hits 99% accuracy in development but only 60% in production. What is the most likely root cause?
2. Which SageMaker Feature Store property best reflects how it integrates into the lakehouse rather than being a sibling silo?
3. Generative-BI assistants tend to fail when pointed at a raw lake. What discipline makes them succeed instead?
Lakehouse-Native ML
For most of BI's history, analytics and ML lived in separate tools with separate data copies. The convergence story of the past few years is that those worlds are merging onto the same lakehouse storage, and the bridge between them is the SageMaker Feature Store and a new generation of generative-BI assistants that operate over warehouse data.
SageMaker is AWS's managed ML platform covering notebooks, training, model registry, deployment, and monitoring. Its critical integration point is that SageMaker training jobs, batch-transform jobs, and Studio notebooks read directly from the same S3 lakehouse and Glue Data Catalog as analytics workloads. Training a churn model no longer requires an export to a separate ML store; it reads the same Iceberg or Hudi table the dashboards read.
Concrete integration patterns:
Glue ETL into SageMaker training — a scheduled Glue job lands curated features in S3 as partitioned Parquet, and a training job uses the Glue Data Catalog table as input.
Athena Federated Query in notebooks — SageMaker Studio runs Athena queries against lake tables to build training samples, with full pushdown of partition filters.
EMR + Spark connector — large-scale feature engineering writes directly to feature groups via sagemaker-feature-store-pyspark for historical backfills.
Redshift ML — in-warehouse SQL training (XGBoost, linear learner) where the warehouse is the source of truth and exporting would add latency.
Feature Store: Online + Offline Pair
A feature store manages ML model inputs with two non-negotiable properties: low-latency online lookups for inference, and point-in-time-correct historical retrieval for training.
Store
Backend
Latency
Primary Use
Online
Managed in-memory KV
Single-digit ms
Real-time inference via GetRecord / BatchGetRecord
Offline
S3 Parquet (Glue cataloged)
Seconds to minutes
Training set creation, batch inference, analytics
The offline store is literally just S3 Parquet partitioned by event time, registered in the Glue Data Catalog, and queryable from Athena, Redshift Spectrum, or EMR. Feature data is part of the lakehouse, not a sibling. An analyst queries feature group history with the same Athena workgroup used for fact tables; a Lake Formation administrator grants access using the same policy primitives.
Animation 2: Feature Store Online/Offline Split with Consumer Paths
Ingestion paths converge on a single feature group, which dual-writes to the online KV (real-time inference) and offline S3 Parquet (training and analytics).
Ingestion Patterns
Batch ingestion — scheduled Glue or EMR jobs read from S3 or Redshift, transform, and call PutRecord (or use the Spark connector). The connector dual-writes to online and offline stores in one operation.
Stream ingestion — Kinesis or Managed Service for Apache Flink push records through Lambda handlers calling PutRecord continuously.
Backfill — historical Parquet files are written directly to the offline-store S3 path and registered, bypassing per-record API overhead.
CDC ingestion — DMS or Debezium captures change events through Kinesis or Kafka; a stream handler updates features as upstream rows change.
For upsert-heavy workloads (a customer profile updated daily), the Iceberg-format offline store enables ACID upserts, schema evolution, and time travel. For append-only event histories (every transaction), the Glue/Standard offline store is simpler and cheaper.
Point-in-Time Correctness: The Discipline That Separates Production from Prototype
Every offline-store record carries both EventTime (when the value was true in the real world) and WriteTime (when it was written). Training queries use ASOF joins to retrieve the feature value current at prediction time, not the latest value. Without this, a churn model is poisoned by post-cancellation profile updates — classic label leakage that produces 99% dev accuracy and 60% production accuracy.
Generative-BI Assistants
Generative-BI assistants are LLM-powered agents that take a user's natural-language question, plan a sequence of warehouse queries, and synthesize a narrative answer with charts and citations. Q in QuickSight sits at the simpler end (single-question text-to-SQL); Bedrock Agents and similar frameworks orchestrate multi-step reasoning across catalogs, semantic layers, and external knowledge bases.
The data engineering work is largely metadata work:
Semantic layer — published curated metrics ("monthly active users," "gross margin") with definitions, owners, and grain.
Glossary and synonyms — industry-specific terms mapped to columns and tables.
Lineage — tracked from raw source through transformations to the cited metric.
The pattern that fails is pointing an LLM at a raw lake. The pattern that succeeds is publishing a curated semantic layer first and treating the LLM as a translator into it.
Section 2 Key Points
SageMaker reads the same S3/Glue lakehouse as analytics — ML stops being a downstream silo and becomes a peer workload under the same Lake Formation governance.
Feature Store pairs a single-digit-ms online KV with an S3 Parquet offline store registered in Glue, so analytics and ML share one feature definition.
Point-in-time correctness via ASOF joins on EventTime/WriteTime is the difference between 99% dev accuracy and 99% production accuracy — without it, label leakage is silent and lethal.
Generative-BI quality is metadata quality: curated semantic layers, synonyms, lineage, and guardrails make LLM-driven analysis reliable; raw-lake LLM access does not.
Post-Reading Check — Section 2
1. A churn model trained on the SageMaker offline feature store hits 99% accuracy in development but only 60% in production. What is the most likely root cause?
2. Which SageMaker Feature Store property best reflects how it integrates into the lakehouse rather than being a sibling silo?
3. Generative-BI assistants tend to fail when pointed at a raw lake. What discipline makes them succeed instead?
Section 3: Cost Optimization and Reference Architectures
Pre-Reading Check — Section 3
1. An Athena query SELECT * FROM events WHERE LOWER(country) = 'us' runs a full table scan on a clickstream table partitioned by country. What is the most direct fix?
2. A platform team is choosing between S3 Standard, Intelligent-Tiering, and Glacier Flexible Retrieval for a 300 TB lakehouse with mixed and unpredictable access patterns. Which choice and rationale is correct?
3. A production team adopts a multi-tenant pool model with row-level security and tenant_id columns. A new enterprise customer demands dedicated infrastructure for compliance reasons. Which evolution best fits the canonical pattern?
Cost Optimization as Workload Shaping
Cost optimization in modern data platforms is about shaping workloads to the billing model of each service. Snowflake bills credits per second (60s minimum); BigQuery bills per terabyte scanned or by reserved slots; Redshift bills per provisioned hour or per serverless second; Athena bills per terabyte scanned; S3 bills per gigabyte-month with tier-dependent rates. Cost optimization means reducing cost (compute hours, bytes scanned, storage tier) without harming value (query latency, freshness, completeness).
Storage Class and Lifecycle Tuning
Tier
Approx. Cost vs Standard
Retrieval Latency
Best For
S3 Standard
100%
ms
Hot, frequently queried
S3 Intelligent-Tiering
~60%
ms (auto-managed)
Mixed/unknown access
S3 Standard-IA
~55%
ms (per-GB retrieval fee)
Known infrequent, > 30 days
Glacier Instant Retrieval
~25%
ms
Rare, instant needed
Glacier Flexible Retrieval
~15%
minutes-hours
Archive with occasional restore
Glacier Deep Archive
~5%
12+ hours
Compliance archive
S3 Intelligent-Tiering is the default-good choice for lakehouses with unpredictable access. Glacier tiers save 90+ percent for compliance archives and rarely-accessed raw event logs. A typical lakehouse lifecycle policy: keep raw bronze in Standard for 30 days, move to Intelligent-Tiering for 90 days, transition to Glacier Flexible Retrieval at 180 days, expire at 7 years.
Query Economics: Scan Reduction
In scan-billed services every byte read is a billed byte. The four levers, by typical impact:
Clustering / sort keys — within a partition, sorts rows so range filters touch contiguous blocks (BigQuery clustering, Redshift sort keys, Z-ordering in Iceberg/Delta).
Materialized views — pre-compute expensive aggregations once, query them many times.
BI Engine / result caching — BigQuery BI Engine, Snowflake result cache, Redshift result cache, QuickSight SPICE all cut scans dramatically for hot dashboard data.
Anti-pattern:SELECT * FROM big_table WHERE LOWER(country) = 'us' defeats partition pruning because the optimizer cannot prove the function preserves filter on storage layout. Normalize country codes at ingestion. Production deployments enforce per-query scan quotas (BigQuery custom quotas, Athena workgroup limits) so a typo cannot become a five-figure bill.
Right-Sizing and Reserved Capacity
Snowflake — warehouse size per workload (XS for ad-hoc, M for ETL, L for heavy analytics); auto-suspend 60-300s; monitor WAREHOUSE_LOAD_HISTORY weekly.
Redshift — Serverless for variable workloads (per-second after 60s minimum); elastic resize for diurnal patterns; provisioned + RIs for steady 24/7 (up to 75% savings).
BigQuery — on-demand by default; slot reservations for steady high-volume; edition-based pricing (Standard/Enterprise/Enterprise Plus).
Athena — primary right-sizing lever is data layout (Iceberg + columnar formats), not compute.
Reserved capacity yields up to 75% savings for stable workloads. The trap is overcommitment: size reservations from 90 days of historical burn, cover only the steady baseline, and let on-demand absorb peaks. Per-query tagging (Snowflake QUERY_TAG, BigQuery labels, AWS resource tags) maps spend back to incurring teams — the precondition for any chargeback model.
Sources flow into ingestion, which feeds the bronze/silver/gold layers; gold branches to BI and ML, with governance/observability/security as cross-cutting planes.
Cost Optimization Decision Flow
flowchart TD
Start(["Workload Cost Review"]) --> Q1{"Storage dominates bill?"}
Q1 -->|Yes| Tier["Apply S3 Lifecycle: Standard 30d to IT 90d to Glacier IR 1y to Deep Archive"]
Q1 -->|No| Q2{"Per-byte scan billing?"}
Tier --> Q2
Q2 -->|Yes| Scan["Add Partitioning + Clustering + Materialized Views + Result Cache"]
Q2 -->|No| Q3{"Compute under-utilized?"}
Scan --> Quota["Enforce Per-Query Scan Quotas"]
Quota --> Q3
Q3 -->|Yes| Right["Right-Size: Auto-Suspend Serverless Mode"]
Q3 -->|No| Q4{"Steady baseline load?"}
Right --> Q4
Q4 -->|Yes| Reserve["Reserved Capacity: RIs / Slot Reservations up to 75% savings"]
Q4 -->|No| Tag["Per-Query Tagging + FinOps Chargeback"]
Reserve --> Tag
Tag --> End(["50-75% Savings"])
Reference Architectures
A reference architecture is a documented, tested combination of components solving a class of problems with known trade-offs. The point is that new projects start from a known-good template instead of relitigating fundamental decisions.
The canonical AWS shape:
Streaming and batch share storage — both write to the same S3-backed Iceberg silver tables.
Schema-on-read at bronze, schema-enforced at silver — raw zones tolerate sloppy upstream schemas; conformed zones reject contract violations.
Governance through Lake Formation — one access-control surface for analytics, ML, and reverse-ETL.
Observability cross-cutting — OpenLineage events, Monte Carlo data quality, Airflow/Step Functions metadata.
Multi-Tenant Patterns
Model
Isolation Boundary
Cost Efficiency
Blast Radius
Use Case
Pool (shared)
Row-level (tenant_id column)
Highest
Largest
Many small tenants, similar workloads
Bridge (silo within shared cluster)
Schema or database
Medium
Medium
Mid-size tenants, regulatory variation
Silo (dedicated)
Account or cluster
Lowest
Smallest
Few large tenants, strict compliance
Most production multi-tenant platforms end up hybrid: pool for the long tail of small tenants, silo for the handful of enterprise customers whose contracts demand dedicated infrastructure, with a clear migration path between the two as customers grow.
Capstone Discipline
The capstone is not finding the theoretically optimal stack; it is making every component choice deliberately, documenting the trade-off, and designing for the team that will operate the system at 3 AM on Tuesday when something breaks. A worked e-commerce example combines DMS + Kinesis ingestion, S3 Intelligent-Tiering bronze, Iceberg silver, Redshift Serverless gold, QuickSight Enterprise + SPICE BI, SageMaker Feature Store ML, Hightouch reverse-ETL, Lake Formation + Atlan governance, and a tight cost-control posture (lifecycle policies, workgroup scan limits, per-team tags) within a $1.5M annual budget.
Section 3 Key Points
Storage tiering with Intelligent-Tiering and Glacier saves 40-90% with little engineering effort; lifecycle policies automate the transitions.
Scan reduction levers in order of impact: partitioning > clustering/sort keys > materialized views > result caching. Function calls on partition columns silently defeat pruning.
Right-size with auto-suspend and serverless modes; reserve only the steady baseline (sized from 90 days of burn) and let on-demand absorb peaks.
Per-query tagging is the precondition for chargeback — without it, optimization is a central crusade rather than a distributed practice.
Reference architectures are templates, not products: layered S3/Iceberg storage, parallel BI and ML branches, Lake Formation governance, orthogonal observability/security planes.
Multi-tenancy in practice is hybrid (pool small + silo enterprise); the engineering work is propagating tenant identity through every layer.
Post-Reading Check — Section 3
1. An Athena query SELECT * FROM events WHERE LOWER(country) = 'us' runs a full table scan on a clickstream table partitioned by country. What is the most direct fix?
2. A platform team is choosing between S3 Standard, Intelligent-Tiering, and Glacier Flexible Retrieval for a 300 TB lakehouse with mixed and unpredictable access patterns. Which choice and rationale is correct?
3. A production team adopts a multi-tenant pool model with row-level security and tenant_id columns. A new enterprise customer demands dedicated infrastructure for compliance reasons. Which evolution best fits the canonical pattern?