Chapter 10: Zero-ETL and SaaS Integration

Learning Objectives

Part 1: Zero-ETL Concept and Mechanics

Pre-Reading Quiz — Test Your Intuition

1. What does "zero-ETL" actually eliminate?

All extract, transform, and load work disappears entirely. User-managed pipeline code and infrastructure; the cloud provider runs the ETL invisibly. The need for a destination warehouse like Redshift. Schema evolution and DDL changes on the source database.

2. Under the hood, how does Aurora-to-Redshift zero-ETL capture ongoing changes from MySQL?

It runs scheduled SELECT * queries against each table. It snapshots Aurora to S3 every 60 seconds and reloads Redshift. It tails the binary log (binlog) in ROW format for log-based CDC. It uses Aurora triggers that synchronously call a Redshift webhook.

3. Why is Aurora-to-Redshift replication described as "eventually consistent" rather than synchronous?

Because Redshift is technically slower than Aurora at writes. Because committed rows take a propagation window (often under 15 seconds) to appear in Redshift. Because zero-ETL only replicates rows once per day at midnight. Because Aurora and Redshift use incompatible character encodings.

What Zero-ETL Means

For most of data warehousing's history, getting data from an operational database into an analytics warehouse meant building a pipeline: extraction code, Airflow DAGs, Debezium connectors, AWS DMS instances, S3 staging, and COPY commands. Each layer cost engineering time, infrastructure dollars, and latency. Zero-ETL is the cloud industry's collective decision that for the most common case—operational data flowing into an analytical store—the cloud provider should own the pipeline so you don't have to.

The phrase is slightly misleading. The Extract, Transform, Load work still happens; it's just that you no longer write or operate it. The provider replaces three artifacts you used to maintain:

Traditional ArtifactZero-ETL Replacement
Extraction job (DMS task, Debezium connector, custom script)Managed CDC stream owned by the source service
Staging bucket / Kafka topicInternal AWS-managed transport (invisible)
Load script (COPY, MERGE, dbt incremental)Auto-generated apply into Redshift / OpenSearch / Lakehouse

Change Capture Under the Hood

Although AWS markets zero-ETL as "magic," the mechanism is the same log-based Change Data Capture pattern that has powered Debezium and DMS for years. For Aurora MySQL, the integration reads the binary log (binlog) in ROW format, capturing every insert, update, and delete as a structured event. For Aurora PostgreSQL, it reads the write-ahead log (WAL) using logical replication.

The flow has two distinct phases: an initial snapshot that seeds Redshift with whatever rows already live in Aurora, then ongoing CDC that tails the change log and applies events through a MERGE into the target. Because the source is the binary log rather than a SQL SELECT, the operational database's CPU and IO impact stays low.

Animation: Aurora → Redshift Zero-ETL Flow
Phase 1 (orange) seeds Redshift with a snapshot. Phase 2 (blue) tails the binlog continuously, streaming row deltas through the apply engine.
Aurora MySQL / Postgres binlog ROW / WAL Zero-ETL Apply Engine snapshot + CDC Redshift case-sensitive target tables Phase 1: Initial Snapshot (orange) - bulk export PITR rows Phase 2: Ongoing CDC (blue) - tail binlog, MERGE row deltas

Latency and Consistency Guarantees

The defining performance claim is near real-time: replicated rows typically appear in the analytics target within seconds of a source commit. That's an order of magnitude better than batch ETL (hours to days) and competitive with self-managed Debezium or DMS, without the operational overhead of either. Two consistency caveats matter:

  1. Eventual consistency, not synchronous replication. A transaction committed on Aurora is not immediately visible in Redshift. There is a propagation window—usually under 15 seconds, occasionally longer under load.
  2. DDL replication is supported but not always instant. Adding columns and renaming tables propagate, but aggressive schema changes (incompatible type alterations, dropping primary keys) can pause or break the integration.

Figure 10.1: Aurora-to-Redshift Sequence

sequenceDiagram participant App as Application participant Aurora as Aurora Cluster participant Binlog as Binlog / WAL participant ZE as Zero-ETL Service participant RS as Redshift Target Note over Aurora,RS: Phase 1 - Initial Snapshot ZE->>Aurora: Trigger snapshot export Aurora-->>ZE: PITR snapshot ZE->>RS: Bulk load existing rows RS-->>ZE: Snapshot complete Note over Aurora,RS: Phase 2 - Ongoing CDC (seconds) App->>Aurora: INSERT / UPDATE / DELETE Aurora->>Binlog: Write ROW-format change record ZE->>Binlog: Tail change events Binlog-->>ZE: Stream of row deltas ZE->>RS: MERGE into target tables RS-->>App: Available for analytics

Key Points — Part 1

Post-Reading Quiz — Apply What You Learned

1. What does "zero-ETL" actually eliminate?

All extract, transform, and load work disappears entirely. User-managed pipeline code and infrastructure; the cloud provider runs the ETL invisibly. The need for a destination warehouse like Redshift. Schema evolution and DDL changes on the source database.

2. Under the hood, how does Aurora-to-Redshift zero-ETL capture ongoing changes from MySQL?

It runs scheduled SELECT * queries against each table. It snapshots Aurora to S3 every 60 seconds and reloads Redshift. It tails the binary log (binlog) in ROW format for log-based CDC. It uses Aurora triggers that synchronously call a Redshift webhook.

3. Why is Aurora-to-Redshift replication described as "eventually consistent" rather than synchronous?

Because Redshift is technically slower than Aurora at writes. Because committed rows take a propagation window (often under 15 seconds) to appear in Redshift. Because zero-ETL only replicates rows once per day at midnight. Because Aurora and Redshift use incompatible character encodings.

Part 2: Database and SaaS Integrations

Pre-Reading Quiz — Test Your Intuition

1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?

binlog_format = STATEMENT and aws_pitr_enabled = 0. binlog_format = ROW and aws_pitr_enabled = 1, set on a custom (non-default) cluster parameter group. No parameter changes are needed; the default group works. replica_mode = SYNCHRONOUS on the default parameter group.

2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?

Salesforce session token expired and OAuth needs re-authorization. Missing Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the destination Glue database. Salesforce Bulk API rate-limited the integration. Redshift cluster ran out of free storage capacity.

3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?

SAP automatically emits delta tokens for any non-ODP entity. Glue uses full reload plus upserts — deletes are silently lost, a real correctness footgun. The integration refuses to start until a timestamp field is added. Glue falls back to scanning Salesforce Bulk API instead.

Aurora → Redshift Setup

Aurora-to-Redshift is the flagship zero-ETL integration. The setup involves two consoles—RDS for the source, Redshift for the target—and a small set of prerequisites that, if missed, cause cryptic failures.

Aurora source prerequisites: A custom DB cluster parameter group (the default cannot be modified) with:

binlog_format = ROW
aws_pitr_enabled = 1

binlog_format = ROW ensures every row change is logged with full before/after values—statement-based binlogs cannot be safely replicated. aws_pitr_enabled = 1 enables enhanced point-in-time recovery, required for the snapshot phase.

Redshift target prerequisites: The target cluster (or serverless workgroup) must have case sensitivity enabled at the cluster level—a one-time setting that cannot be retrofitted. The cluster also needs an authorization policy identifying the Aurora source ARN as an allowed integration writer.

Two-console workflow: In RDS, "Create zero-ETL integration" with the Aurora cluster as source, optionally apply data filters to scope replication to specific schemas/tables, and target the Redshift namespace. Then in Redshift Query Editor v2, run "Create database from integration" to produce a queryable database. From that point you can SELECT, build materialized views, and feed dbt models or Redshift ML.

Pricing: There is no separate fee for the integration. You pay only for Aurora storage/compute you already pay for, Redshift compute that ingests/stores the data, and any cross-region transfer charges. Compared to a DMS instance ($150–$1,500+ per month), this is often the cheapest path.

Animation: Traditional Pipeline vs Zero-ETL
Four legacy components (Debezium, Kafka, S3 staging, Airflow COPY) collapse into a single managed integration object.
TRADITIONAL PIPELINE [1] Debezium connector + EC2 Kafka cluster ($600/mo) [2] S3 sink connector + maintenance ($200/mo) [3] Airflow / MWAA COPY orchestration ($300/mo) [4] ~40 hrs/mo pipeline incident response ZERO-ETL Single Integration Object Aurora cluster ARN + Redshift workgroup + optional data filters No integration fee

DynamoDB → Redshift / OpenSearch and RDS → Redshift

DynamoDB cannot run group-by queries, complex joins, or full-text search efficiently. Zero-ETL gives it two integration targets: DynamoDB → Redshift for SQL analytics, BI dashboards, and ML training; DynamoDB → OpenSearch for full-text and log-style search. Under the hood it uses point-in-time recovery for the initial export plus DynamoDB Streams for ongoing CDC.

Beyond Aurora, AWS has progressively extended zero-ETL to standard RDS engines (RDS for MySQL today, more on the roadmap). The mechanics mirror Aurora: parameter group, binlog enable, target authorization, console-based integration creation. Always consult current AWS documentation for the supported engine matrix.

Figure 10.2: Database Zero-ETL Fan-In

flowchart BT AM[Aurora MySQL
binlog ROW] -->|zero-ETL| RS[(Redshift Target
case-sensitive cluster)] AP[Aurora PostgreSQL
logical WAL] -->|zero-ETL| RS DDB[DynamoDB
PITR + Streams] -->|zero-ETL| RS DDB -->|zero-ETL| OS[(OpenSearch
full-text index)] RDSm[RDS for MySQL
binlog ROW] -->|zero-ETL| RS RS --> BI[BI / dbt / Redshift ML] OS --> SRCH[Search and log analytics]

SaaS Zero-ETL via AWS Glue

AWS Glue Zero-ETL extends the managed-integration pattern to SaaS sources. Currently supported sources include Salesforce, Salesforce Marketing Cloud, SAP OData (ODP and non-ODP), ServiceNow, Zendesk, Zoho CRM, Facebook Ads, and Instagram Ads. Configuration is a two-step process:

  1. Create a Glue Data Catalog connection — reusable artifact storing the SaaS instance URL, IAM service role, and credentials (OAuth tokens for Salesforce, static creds for SAP). Always click "Test connection."
  2. Create a zero-ETL integration — pick the connector type, choose the connection, select entities/tables, and pick a destination (Redshift or SageMaker Lakehouse).

Critical permission requirement: The IAM service role must have AWS Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the target Glue database. Missing Lake Formation grants is the #1 cause of IngestionFailed errors.

Salesforce specifics: Glue uses the Salesforce Bulk API rather than the standard REST API—asynchronous, optimized for large datasets, and roughly 10x cheaper in API call quota for large extracts. SAP specifics: the Operational Data Provisioning (ODP) framework is the gold path because it produces native delta links. Non-ODP entities require a timestamp field for incremental sync, or fall back to full reload + upsert—which means deletes are silently lost.

Animation: SaaS Connector Schema Mapping with Watermark
Source rows flow through the schema mapper (orange → blue type-coerced columns) into Redshift; the watermark advances after each successful sync.
Salesforce Account Id, Name, Updated (Bulk API) Schema Mapper type coercion flatten composites Lake Formation grant Redshift account VARCHAR, TIMESTAMP (case-sensitive) Watermark / Delta Token t0 last sync now

Schema Mapping and Incremental Sync

When zero-ETL pulls Salesforce or SAP records into Redshift, schema mapping happens automatically: source field types map to Redshift-compatible types (VARCHAR, TIMESTAMP, DECIMAL); composite types flatten into joinable columns; new source columns propagate on the next sync. Incremental sync is what makes the model economical—a nightly full-table reload of 50 million Salesforce rows is impossibly expensive in API quota; pulling only the deltas since the last watermark is trivial.

Mechanism by source type:

Figure 10.3: SaaS Connector Schema Mapping

flowchart LR subgraph SaaS[SaaS Sources] SF[Salesforce
Bulk API] SAP[SAP OData
ODP delta links] SN[ServiceNow
change tracking] end subgraph Glue[AWS Glue Zero-ETL] CONN[Glue Data Catalog
Connection
OAuth / creds] INT[Zero-ETL Integration
entity selection] MAP[Schema Mapper
type coercion
flatten composites] WM[Watermark / Delta Token] end subgraph Target[Targets with Lake Formation grants] RS[(Redshift)] LH[(SageMaker Lakehouse)] end SF --> CONN SAP --> CONN SN --> CONN CONN --> INT INT --> MAP INT <--> WM MAP --> RS MAP --> LH

Key Points — Part 2

Post-Reading Quiz — Apply What You Learned

1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?

binlog_format = STATEMENT and aws_pitr_enabled = 0. binlog_format = ROW and aws_pitr_enabled = 1, set on a custom (non-default) cluster parameter group. No parameter changes are needed; the default group works. replica_mode = SYNCHRONOUS on the default parameter group.

2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?

Salesforce session token expired and OAuth needs re-authorization. Missing Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the destination Glue database. Salesforce Bulk API rate-limited the integration. Redshift cluster ran out of free storage capacity.

3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?

SAP automatically emits delta tokens for any non-ODP entity. Glue uses full reload plus upserts — deletes are silently lost, a real correctness footgun. The integration refuses to start until a timestamp field is added. Glue falls back to scanning Salesforce Bulk API instead.

Part 3: Trade-offs and When to Use Pipelines

Pre-Reading Quiz — Test Your Intuition

1. Your ingestion logic requires a 12-table join, complex aggregations, and ML feature engineering before landing in the warehouse. Which approach is the right fit?

Zero-ETL — it natively supports in-flight transformations. Batch ETL or hybrid (zero-ETL into bronze, then dbt/SQL into silver/gold) — zero-ETL alone is replication-only. AWS DMS — it has built-in Spark transformations. Debezium with stream processing replaces all transformation needs.

2. You need CDC from a Postgres database into Kafka feeding microservices in a multi-cloud architecture. Which tool fits best?

AWS zero-ETL — it works across any cloud. Debezium — multi-cloud flexibility, true Kafka streaming, fine-grained control. Batch ETL on a daily schedule. AWS DMS routed through a VPN tunnel to other clouds.

3. Although the zero-ETL integration itself is free, which hidden cost is most likely to surprise teams?

A monthly per-integration license fee from AWS. Redshift storage/compute, cross-region data transfer, and SaaS API quota consumption. Required pre-purchase of Aurora reserved instances. A consulting engagement to enable case-sensitivity on Redshift.

When You Still Need Pipelines

Despite its appeal, zero-ETL is not universal. The decision tree is sharpest when comparing it head-to-head:

FactorZero-ETLDebeziumAWS DMSBatch ETL
LatencySecondsSecondsMinutesHours/days
CostNo charge (AWS-managed)Infra + engineeringPer-instance feesCompute + orchestration
InfrastructureZero managementSelf-managed KafkaManaged instanceScheduled compute
FlexibilityAWS onlyAny DB + any sinkAWS-focusedAny source/target
TransformationReplication onlyStream processingLimitedFull SQL/Python in-flight

Choose Debezium when you need multi-cloud flexibility, non-AWS sources/sinks, true streaming into Kafka for microservices, or fine-grained application-level control over the change stream. Choose AWS DMS when you're performing a one-time migration with minimal ongoing CDC, or replicating between heterogeneous DBs not yet supported by zero-ETL. Choose batch ETL when latency genuinely doesn't matter (weekly executive reports), the source lacks log-based CDC (legacy DBs, REST APIs, FTP feeds), or your ingestion includes heavy in-flight joins/aggregations/ML feature engineering.

The brutal truth: zero-ETL is excellent at one job—replicating operational data into an analytics store—and bad at everything else. Try to use it for transformation, multi-cloud federation, or fine-grained event filtering, and you'll quickly hit walls.

Figure 10.4: Decision Tree

flowchart TD Start[Need to move data
from source to analytics] --> Q1{Latency budget?} Q1 -->|Hours / days OK| Batch[Batch ETL
scheduled compute] Q1 -->|Seconds to minutes| Q2{Heavy in-flight
transformation?} Q2 -->|Yes - joins, aggs, ML feats| Batch Q2 -->|No - replication only| Q3{Multi-cloud or
non-AWS sink?} Q3 -->|Yes| Deb[Debezium + Kafka
self-managed CDC] Q3 -->|No - all AWS| Q4{One-time
migration?} Q4 -->|Yes| DMS[AWS DMS
per-instance fee] Q4 -->|No - ongoing| Q5{Source supported
by zero-ETL?} Q5 -->|Yes| ZE[Zero-ETL
managed, no integration fee] Q5 -->|No| DMS

Cost and Quota Considerations

Although zero-ETL is free of integration charges, hidden costs are real:

  1. Redshift storage and compute. Replicated data takes Redshift space; query workload consumes RPU/credits.
  2. Cross-region transfer. Aurora in us-east-1 and Redshift in us-west-2 means every byte incurs cross-region fees. Co-locate aggressively.
  3. Source-side API quotas (SaaS). Salesforce Bulk API is cheaper than REST but not free; SAP gateway throughput, ServiceNow rate limits, Facebook Ads insights all impose ceilings.
  4. Filtering hygiene. Replicating 800 Aurora tables when you need 40 means 20x the cost. Use data filters at integration creation time—harder to remove tables later.

Hybrid Zero-ETL + Transformation

In real architectures, zero-ETL rarely stands alone. The dominant production pattern is hybrid: use zero-ETL to land raw operational data into a Redshift "raw" or "bronze" schema, then use dbt, materialized views, or scheduled batch SQL to transform that raw data into "silver" and "gold" layers consumable by BI and ML. This preserves the strengths of both: zero-ETL eliminates the brittle ingestion layer; downstream transformations handle joins, aggregations, and quality checks zero-ETL was never designed for.

Figure 10.5: Hybrid Bronze/Silver/Gold Architecture

flowchart LR subgraph Sources[Operational Sources] AU[Aurora] DD[DynamoDB] SAAS[SaaS via Glue] end subgraph Bronze[Raw / Bronze] RAW[(Redshift raw schema
zero-ETL landing)] end subgraph Silver[Silver - cleaned/joined] DBT[dbt models
Materialized views
Scheduled SQL] CLEAN[(Conformed marts)] end subgraph Gold[Gold - business-ready] MARTS[(Domain marts)] end subgraph Consumers BI[Tableau / QuickSight] ML[Redshift ML / SageMaker] REV[Reverse ETL to Salesforce] end AU -->|zero-ETL| RAW DD -->|zero-ETL| RAW SAAS -->|zero-ETL| RAW RAW --> DBT --> CLEAN --> MARTS MARTS --> BI MARTS --> ML MARTS --> REV

Phased Migration Strategy

The 9-week phased pattern for migrating an existing pipeline-based architecture to zero-ETL:

The structure preserves the legacy pipeline as a fallback for at least two weeks of dual-write validation, dramatically reducing migration risk.

Key Points — Part 3

Post-Reading Quiz — Apply What You Learned

1. Your ingestion logic requires a 12-table join, complex aggregations, and ML feature engineering before landing in the warehouse. Which approach is the right fit?

Zero-ETL — it natively supports in-flight transformations. Batch ETL or hybrid (zero-ETL into bronze, then dbt/SQL into silver/gold) — zero-ETL alone is replication-only. AWS DMS — it has built-in Spark transformations. Debezium with stream processing replaces all transformation needs.

2. You need CDC from a Postgres database into Kafka feeding microservices in a multi-cloud architecture. Which tool fits best?

AWS zero-ETL — it works across any cloud. Debezium — multi-cloud flexibility, true Kafka streaming, fine-grained control. Batch ETL on a daily schedule. AWS DMS routed through a VPN tunnel to other clouds.

3. Although the zero-ETL integration itself is free, which hidden cost is most likely to surprise teams?

A monthly per-integration license fee from AWS. Redshift storage/compute, cross-region data transfer, and SaaS API quota consumption. Required pre-purchase of Aurora reserved instances. A consulting engagement to enable case-sensitivity on Redshift.

Your Progress

Answer Explanations