Study Guide: Zero-ETL and SaaS Integration

What Zero-ETL Means

For most of data warehousing's history, getting data from an operational database into an analytics warehouse meant building a pipeline: extraction code, Airflow DAGs, Debezium connectors, AWS DMS instances, S3 staging, and COPY commands. Each layer cost engineering time, infrastructure dollars, and latency. Zero-ETL is the cloud industry's collective decision that for the most common case—operational data flowing into an analytical store—the cloud provider should own the pipeline so you don't have to.

The phrase is slightly misleading. The Extract, Transform, Load work still happens; it's just that you no longer write or operate it. The provider replaces three artifacts you used to maintain:

Traditional Artifact	Zero-ETL Replacement
Extraction job (DMS task, Debezium connector, custom script)	Managed CDC stream owned by the source service
Staging bucket / Kafka topic	Internal AWS-managed transport (invisible)
Load script (`COPY`, `MERGE`, dbt incremental)	Auto-generated apply into Redshift / OpenSearch / Lakehouse

Change Capture Under the Hood

Although AWS markets zero-ETL as "magic," the mechanism is the same log-based Change Data Capture pattern that has powered Debezium and DMS for years. For Aurora MySQL, the integration reads the binary log (binlog) in ROW format, capturing every insert, update, and delete as a structured event. For Aurora PostgreSQL, it reads the write-ahead log (WAL) using logical replication.

The flow has two distinct phases: an initial snapshot that seeds Redshift with whatever rows already live in Aurora, then ongoing CDC that tails the change log and applies events through a MERGE into the target. Because the source is the binary log rather than a SQL SELECT, the operational database's CPU and IO impact stays low.

Animation: Aurora → Redshift Zero-ETL Flow

Phase 1 (orange) seeds Redshift with a snapshot. Phase 2 (blue) tails the binlog continuously, streaming row deltas through the apply engine.

Latency and Consistency Guarantees

The defining performance claim is near real-time: replicated rows typically appear in the analytics target within seconds of a source commit. That's an order of magnitude better than batch ETL (hours to days) and competitive with self-managed Debezium or DMS, without the operational overhead of either. Two consistency caveats matter:

Eventual consistency, not synchronous replication. A transaction committed on Aurora is not immediately visible in Redshift. There is a propagation window—usually under 15 seconds, occasionally longer under load.
DDL replication is supported but not always instant. Adding columns and renaming tables propagate, but aggressive schema changes (incompatible type alterations, dropping primary keys) can pause or break the integration.

Figure 10.1: Aurora-to-Redshift Sequence

sequenceDiagram participant App as Application participant Aurora as Aurora Cluster participant Binlog as Binlog / WAL participant ZE as Zero-ETL Service participant RS as Redshift Target Note over Aurora,RS: Phase 1 - Initial Snapshot ZE->>Aurora: Trigger snapshot export Aurora-->>ZE: PITR snapshot ZE->>RS: Bulk load existing rows RS-->>ZE: Snapshot complete Note over Aurora,RS: Phase 2 - Ongoing CDC (seconds) App->>Aurora: INSERT / UPDATE / DELETE Aurora->>Binlog: Write ROW-format change record ZE->>Binlog: Tail change events Binlog-->>ZE: Stream of row deltas ZE->>RS: MERGE into target tables RS-->>App: Available for analytics

Key Points — Part 1

Zero-ETL relocates pipeline ownership to the cloud provider; the work still happens, you just don't operate it.
Under the hood it is still log-based CDC (binlog ROW format for MySQL, WAL logical replication for Postgres).
Two phases: Phase 1 PITR snapshot seeds the target; Phase 2 ongoing CDC streams every change as it happens.
Latency is second-scale and eventually consistent—Redshift is for analytics, not read-your-writes operational queries.
DDL like ADD COLUMN replicates; aggressive incompatible schema changes can pause the integration.

Part 2: Database and SaaS Integrations

Pre-Reading Quiz — Test Your Intuition

1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?

binlog_format = STATEMENT and aws_pitr_enabled = 0. binlog_format = ROW and aws_pitr_enabled = 1, set on a custom (non-default) cluster parameter group. No parameter changes are needed; the default group works. replica_mode = SYNCHRONOUS on the default parameter group.

2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?

Salesforce session token expired and OAuth needs re-authorization. Missing Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the destination Glue database. Salesforce Bulk API rate-limited the integration. Redshift cluster ran out of free storage capacity.

3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?

SAP automatically emits delta tokens for any non-ODP entity. Glue uses full reload plus upserts — deletes are silently lost, a real correctness footgun. The integration refuses to start until a timestamp field is added. Glue falls back to scanning Salesforce Bulk API instead.

Aurora → Redshift Setup

Aurora-to-Redshift is the flagship zero-ETL integration. The setup involves two consoles—RDS for the source, Redshift for the target—and a small set of prerequisites that, if missed, cause cryptic failures.

Aurora source prerequisites: A custom DB cluster parameter group (the default cannot be modified) with:

binlog_format = ROW
aws_pitr_enabled = 1

binlog_format = ROW ensures every row change is logged with full before/after values—statement-based binlogs cannot be safely replicated. aws_pitr_enabled = 1 enables enhanced point-in-time recovery, required for the snapshot phase.

Redshift target prerequisites: The target cluster (or serverless workgroup) must have case sensitivity enabled at the cluster level—a one-time setting that cannot be retrofitted. The cluster also needs an authorization policy identifying the Aurora source ARN as an allowed integration writer.

Two-console workflow: In RDS, "Create zero-ETL integration" with the Aurora cluster as source, optionally apply data filters to scope replication to specific schemas/tables, and target the Redshift namespace. Then in Redshift Query Editor v2, run "Create database from integration" to produce a queryable database. From that point you can SELECT, build materialized views, and feed dbt models or Redshift ML.

Pricing: There is no separate fee for the integration. You pay only for Aurora storage/compute you already pay for, Redshift compute that ingests/stores the data, and any cross-region transfer charges. Compared to a DMS instance ($150–$1,500+ per month), this is often the cheapest path.

Animation: Traditional Pipeline vs Zero-ETL

Four legacy components (Debezium, Kafka, S3 staging, Airflow COPY) collapse into a single managed integration object.

DynamoDB → Redshift / OpenSearch and RDS → Redshift

DynamoDB cannot run group-by queries, complex joins, or full-text search efficiently. Zero-ETL gives it two integration targets: DynamoDB → Redshift for SQL analytics, BI dashboards, and ML training; DynamoDB → OpenSearch for full-text and log-style search. Under the hood it uses point-in-time recovery for the initial export plus DynamoDB Streams for ongoing CDC.

Beyond Aurora, AWS has progressively extended zero-ETL to standard RDS engines (RDS for MySQL today, more on the roadmap). The mechanics mirror Aurora: parameter group, binlog enable, target authorization, console-based integration creation. Always consult current AWS documentation for the supported engine matrix.

Figure 10.2: Database Zero-ETL Fan-In

SaaS Zero-ETL via AWS Glue

AWS Glue Zero-ETL extends the managed-integration pattern to SaaS sources. Currently supported sources include Salesforce, Salesforce Marketing Cloud, SAP OData (ODP and non-ODP), ServiceNow, Zendesk, Zoho CRM, Facebook Ads, and Instagram Ads. Configuration is a two-step process:

Create a Glue Data Catalog connection — reusable artifact storing the SaaS instance URL, IAM service role, and credentials (OAuth tokens for Salesforce, static creds for SAP). Always click "Test connection."
Create a zero-ETL integration — pick the connector type, choose the connection, select entities/tables, and pick a destination (Redshift or SageMaker Lakehouse).

Critical permission requirement: The IAM service role must have AWS Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the target Glue database. Missing Lake Formation grants is the #1 cause of IngestionFailed errors.

Salesforce specifics: Glue uses the Salesforce Bulk API rather than the standard REST API—asynchronous, optimized for large datasets, and roughly 10x cheaper in API call quota for large extracts. SAP specifics: the Operational Data Provisioning (ODP) framework is the gold path because it produces native delta links. Non-ODP entities require a timestamp field for incremental sync, or fall back to full reload + upsert—which means deletes are silently lost.

Animation: SaaS Connector Schema Mapping with Watermark

Source rows flow through the schema mapper (orange → blue type-coerced columns) into Redshift; the watermark advances after each successful sync.

Schema Mapping and Incremental Sync

When zero-ETL pulls Salesforce or SAP records into Redshift, schema mapping happens automatically: source field types map to Redshift-compatible types (VARCHAR, TIMESTAMP, DECIMAL); composite types flatten into joinable columns; new source columns propagate on the next sync. Incremental sync is what makes the model economical—a nightly full-table reload of 50 million Salesforce rows is impossibly expensive in API quota; pulling only the deltas since the last watermark is trivial.

Mechanism by source type:

Salesforce / ServiceNow / Zendesk: automatic platform-native change tracking.
SAP ODP: delta links from the SAP server tell Glue exactly which rows changed.
SAP non-ODP with timestamp: Glue runs WHERE LastChangeDate > :watermark.
SAP non-ODP without timestamp: full reload + upsert — deletes silently lost.

Figure 10.3: SaaS Connector Schema Mapping

flowchart LR subgraph SaaS[SaaS Sources] SF[Salesforce
Bulk API] SAP[SAP OData
ODP delta links] SN[ServiceNow
change tracking] end subgraph Glue[AWS Glue Zero-ETL] CONN[Glue Data Catalog
Connection
OAuth / creds] INT[Zero-ETL Integration
entity selection] MAP[Schema Mapper
type coercion
flatten composites] WM[Watermark / Delta Token] end subgraph Target[Targets with Lake Formation grants] RS[(Redshift)] LH[(SageMaker Lakehouse)] end SF --> CONN SAP --> CONN SN --> CONN CONN --> INT INT --> MAP INT <--> WM MAP --> RS MAP --> LH

Key Points — Part 2

Aurora source needs a custom parameter group with binlog_format = ROW and aws_pitr_enabled = 1.
Redshift target needs case sensitivity enabled (one-time, cannot be retrofitted) and an authorization policy.
The integration itself is free; you pay only for Aurora compute, Redshift compute, and any cross-region transfer.
DynamoDB zero-ETL has two targets: Redshift for analytics, OpenSearch for full-text/log search.
SaaS Glue Zero-ETL is a two-step pattern: Glue Data Catalog connection → integration with entity selection.
The #1 SaaS failure is IngestionFailed due to missing Lake Formation grants on the destination Glue database.
Salesforce uses the Bulk API (10x cheaper in quota); SAP gold path is ODP delta links; non-ODP without a timestamp loses deletes.

Post-Reading Quiz — Apply What You Learned

1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?

2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?

3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?

Part 3: Trade-offs and When to Use Pipelines

When You Still Need Pipelines

Despite its appeal, zero-ETL is not universal. The decision tree is sharpest when comparing it head-to-head:

Factor	Zero-ETL	Debezium	AWS DMS	Batch ETL
Latency	Seconds	Seconds	Minutes	Hours/days
Cost	No charge (AWS-managed)	Infra + engineering	Per-instance fees	Compute + orchestration
Infrastructure	Zero management	Self-managed Kafka	Managed instance	Scheduled compute
Flexibility	AWS only	Any DB + any sink	AWS-focused	Any source/target
Transformation	Replication only	Stream processing	Limited	Full SQL/Python in-flight

Choose Debezium when you need multi-cloud flexibility, non-AWS sources/sinks, true streaming into Kafka for microservices, or fine-grained application-level control over the change stream. Choose AWS DMS when you're performing a one-time migration with minimal ongoing CDC, or replicating between heterogeneous DBs not yet supported by zero-ETL. Choose batch ETL when latency genuinely doesn't matter (weekly executive reports), the source lacks log-based CDC (legacy DBs, REST APIs, FTP feeds), or your ingestion includes heavy in-flight joins/aggregations/ML feature engineering.

The brutal truth: zero-ETL is excellent at one job—replicating operational data into an analytics store—and bad at everything else. Try to use it for transformation, multi-cloud federation, or fine-grained event filtering, and you'll quickly hit walls.

Figure 10.4: Decision Tree

flowchart TD Start[Need to move data
from source to analytics] --> Q1{Latency budget?} Q1 -->|Hours / days OK| Batch[Batch ETL
scheduled compute] Q1 -->|Seconds to minutes| Q2{Heavy in-flight
transformation?} Q2 -->|Yes - joins, aggs, ML feats| Batch Q2 -->|No - replication only| Q3{Multi-cloud or
non-AWS sink?} Q3 -->|Yes| Deb[Debezium + Kafka
self-managed CDC] Q3 -->|No - all AWS| Q4{One-time
migration?} Q4 -->|Yes| DMS[AWS DMS
per-instance fee] Q4 -->|No - ongoing| Q5{Source supported
by zero-ETL?} Q5 -->|Yes| ZE[Zero-ETL
managed, no integration fee] Q5 -->|No| DMS

Cost and Quota Considerations

Although zero-ETL is free of integration charges, hidden costs are real:

Redshift storage and compute. Replicated data takes Redshift space; query workload consumes RPU/credits.
Cross-region transfer. Aurora in us-east-1 and Redshift in us-west-2 means every byte incurs cross-region fees. Co-locate aggressively.
Source-side API quotas (SaaS). Salesforce Bulk API is cheaper than REST but not free; SAP gateway throughput, ServiceNow rate limits, Facebook Ads insights all impose ceilings.
Filtering hygiene. Replicating 800 Aurora tables when you need 40 means 20x the cost. Use data filters at integration creation time—harder to remove tables later.

Hybrid Zero-ETL + Transformation

In real architectures, zero-ETL rarely stands alone. The dominant production pattern is hybrid: use zero-ETL to land raw operational data into a Redshift "raw" or "bronze" schema, then use dbt, materialized views, or scheduled batch SQL to transform that raw data into "silver" and "gold" layers consumable by BI and ML. This preserves the strengths of both: zero-ETL eliminates the brittle ingestion layer; downstream transformations handle joins, aggregations, and quality checks zero-ETL was never designed for.

Figure 10.5: Hybrid Bronze/Silver/Gold Architecture

flowchart LR subgraph Sources[Operational Sources] AU[Aurora] DD[DynamoDB] SAAS[SaaS via Glue] end subgraph Bronze[Raw / Bronze] RAW[(Redshift raw schema
zero-ETL landing)] end subgraph Silver[Silver - cleaned/joined] DBT[dbt models
Materialized views
Scheduled SQL] CLEAN[(Conformed marts)] end subgraph Gold[Gold - business-ready] MARTS[(Domain marts)] end subgraph Consumers BI[Tableau / QuickSight] ML[Redshift ML / SageMaker] REV[Reverse ETL to Salesforce] end AU -->|zero-ETL| RAW DD -->|zero-ETL| RAW SAAS -->|zero-ETL| RAW RAW --> DBT --> CLEAN --> MARTS MARTS --> BI MARTS --> ML MARTS --> REV

Phased Migration Strategy

The 9-week phased pattern for migrating an existing pipeline-based architecture to zero-ETL:

Week 1–2: Deploy zero-ETL in shadow mode alongside the existing pipeline.
Week 3–4: Dual-write comparison; validate row counts, schemas, latencies.
Week 5: Cut over non-critical consumers (internal dashboards) first.
Week 6–8: Migrate primary warehouse feeds to zero-ETL backed tables.
Week 9+: Decommission legacy DMS/Debezium pipeline; archive code.

The structure preserves the legacy pipeline as a fallback for at least two weeks of dual-write validation, dramatically reducing migration risk.

Key Points — Part 3

Zero-ETL is excellent at replication only; transformation-heavy ingestion still belongs in batch or downstream dbt.
Pick Debezium for multi-cloud / non-AWS sinks / true Kafka streaming, DMS for one-time migrations, batch ETL for non-CDC sources or heavy in-flight joins.
Free integration does not mean free data — Redshift compute, cross-region transfer, SaaS API quotas, and filtering hygiene all matter.
The dominant production pattern is hybrid: zero-ETL into bronze, dbt/SQL into silver and gold marts.
Migrate from legacy pipelines via a phased 9-week shadow-mode strategy with dual-write validation.

Chapter 10: Zero-ETL and SaaS Integration

Learning Objectives

Part 1: Zero-ETL Concept and Mechanics