Define zero-ETL and explain how managed integrations change the economics and operations of traditional ingestion pipelines.
Configure an Aurora-to-Redshift zero-ETL integration end to end, including the prerequisite parameter groups, authorization, and database creation steps.
Use AWS Glue Zero-ETL connectors to ingest SaaS data from Salesforce, SAP OData, ServiceNow, and advertising platforms.
Identify when a managed integration is the right tool versus when traditional CDC, batch ETL, or open-source streaming (Debezium) remains the better choice.
Part 1: Zero-ETL Concept and Mechanics
Pre-Reading Quiz — Test Your Intuition
1. What does "zero-ETL" actually eliminate?
All extract, transform, and load work disappears entirely.User-managed pipeline code and infrastructure; the cloud provider runs the ETL invisibly.The need for a destination warehouse like Redshift.Schema evolution and DDL changes on the source database.
2. Under the hood, how does Aurora-to-Redshift zero-ETL capture ongoing changes from MySQL?
It runs scheduled SELECT * queries against each table.It snapshots Aurora to S3 every 60 seconds and reloads Redshift.It tails the binary log (binlog) in ROW format for log-based CDC.It uses Aurora triggers that synchronously call a Redshift webhook.
3. Why is Aurora-to-Redshift replication described as "eventually consistent" rather than synchronous?
Because Redshift is technically slower than Aurora at writes.Because committed rows take a propagation window (often under 15 seconds) to appear in Redshift.Because zero-ETL only replicates rows once per day at midnight.Because Aurora and Redshift use incompatible character encodings.
What Zero-ETL Means
For most of data warehousing's history, getting data from an operational database into an analytics warehouse meant building a pipeline: extraction code, Airflow DAGs, Debezium connectors, AWS DMS instances, S3 staging, and COPY commands. Each layer cost engineering time, infrastructure dollars, and latency. Zero-ETL is the cloud industry's collective decision that for the most common case—operational data flowing into an analytical store—the cloud provider should own the pipeline so you don't have to.
The phrase is slightly misleading. The Extract, Transform, Load work still happens; it's just that you no longer write or operate it. The provider replaces three artifacts you used to maintain:
Auto-generated apply into Redshift / OpenSearch / Lakehouse
Change Capture Under the Hood
Although AWS markets zero-ETL as "magic," the mechanism is the same log-based Change Data Capture pattern that has powered Debezium and DMS for years. For Aurora MySQL, the integration reads the binary log (binlog) in ROW format, capturing every insert, update, and delete as a structured event. For Aurora PostgreSQL, it reads the write-ahead log (WAL) using logical replication.
The flow has two distinct phases: an initial snapshot that seeds Redshift with whatever rows already live in Aurora, then ongoing CDC that tails the change log and applies events through a MERGE into the target. Because the source is the binary log rather than a SQL SELECT, the operational database's CPU and IO impact stays low.
Animation: Aurora → Redshift Zero-ETL Flow
Phase 1 (orange) seeds Redshift with a snapshot. Phase 2 (blue) tails the binlog continuously, streaming row deltas through the apply engine.
Latency and Consistency Guarantees
The defining performance claim is near real-time: replicated rows typically appear in the analytics target within seconds of a source commit. That's an order of magnitude better than batch ETL (hours to days) and competitive with self-managed Debezium or DMS, without the operational overhead of either. Two consistency caveats matter:
Eventual consistency, not synchronous replication. A transaction committed on Aurora is not immediately visible in Redshift. There is a propagation window—usually under 15 seconds, occasionally longer under load.
DDL replication is supported but not always instant. Adding columns and renaming tables propagate, but aggressive schema changes (incompatible type alterations, dropping primary keys) can pause or break the integration.
Figure 10.1: Aurora-to-Redshift Sequence
sequenceDiagram
participant App as Application
participant Aurora as Aurora Cluster
participant Binlog as Binlog / WAL
participant ZE as Zero-ETL Service
participant RS as Redshift Target
Note over Aurora,RS: Phase 1 - Initial Snapshot
ZE->>Aurora: Trigger snapshot export
Aurora-->>ZE: PITR snapshot
ZE->>RS: Bulk load existing rows
RS-->>ZE: Snapshot complete
Note over Aurora,RS: Phase 2 - Ongoing CDC (seconds)
App->>Aurora: INSERT / UPDATE / DELETE
Aurora->>Binlog: Write ROW-format change record
ZE->>Binlog: Tail change events
Binlog-->>ZE: Stream of row deltas
ZE->>RS: MERGE into target tables
RS-->>App: Available for analytics
Key Points — Part 1
Zero-ETL relocates pipeline ownership to the cloud provider; the work still happens, you just don't operate it.
Under the hood it is still log-based CDC (binlog ROW format for MySQL, WAL logical replication for Postgres).
Two phases: Phase 1 PITR snapshot seeds the target; Phase 2 ongoing CDC streams every change as it happens.
Latency is second-scale and eventually consistent—Redshift is for analytics, not read-your-writes operational queries.
DDL like ADD COLUMN replicates; aggressive incompatible schema changes can pause the integration.
Post-Reading Quiz — Apply What You Learned
1. What does "zero-ETL" actually eliminate?
All extract, transform, and load work disappears entirely.User-managed pipeline code and infrastructure; the cloud provider runs the ETL invisibly.The need for a destination warehouse like Redshift.Schema evolution and DDL changes on the source database.
2. Under the hood, how does Aurora-to-Redshift zero-ETL capture ongoing changes from MySQL?
It runs scheduled SELECT * queries against each table.It snapshots Aurora to S3 every 60 seconds and reloads Redshift.It tails the binary log (binlog) in ROW format for log-based CDC.It uses Aurora triggers that synchronously call a Redshift webhook.
3. Why is Aurora-to-Redshift replication described as "eventually consistent" rather than synchronous?
Because Redshift is technically slower than Aurora at writes.Because committed rows take a propagation window (often under 15 seconds) to appear in Redshift.Because zero-ETL only replicates rows once per day at midnight.Because Aurora and Redshift use incompatible character encodings.
Part 2: Database and SaaS Integrations
Pre-Reading Quiz — Test Your Intuition
1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?
binlog_format = STATEMENT and aws_pitr_enabled = 0.binlog_format = ROW and aws_pitr_enabled = 1, set on a custom (non-default) cluster parameter group.No parameter changes are needed; the default group works.replica_mode = SYNCHRONOUS on the default parameter group.
2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?
Salesforce session token expired and OAuth needs re-authorization.Missing Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the destination Glue database.Salesforce Bulk API rate-limited the integration.Redshift cluster ran out of free storage capacity.
3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?
SAP automatically emits delta tokens for any non-ODP entity.Glue uses full reload plus upserts — deletes are silently lost, a real correctness footgun.The integration refuses to start until a timestamp field is added.Glue falls back to scanning Salesforce Bulk API instead.
Aurora → Redshift Setup
Aurora-to-Redshift is the flagship zero-ETL integration. The setup involves two consoles—RDS for the source, Redshift for the target—and a small set of prerequisites that, if missed, cause cryptic failures.
Aurora source prerequisites: A custom DB cluster parameter group (the default cannot be modified) with:
binlog_format = ROW
aws_pitr_enabled = 1
binlog_format = ROW ensures every row change is logged with full before/after values—statement-based binlogs cannot be safely replicated. aws_pitr_enabled = 1 enables enhanced point-in-time recovery, required for the snapshot phase.
Redshift target prerequisites: The target cluster (or serverless workgroup) must have case sensitivity enabled at the cluster level—a one-time setting that cannot be retrofitted. The cluster also needs an authorization policy identifying the Aurora source ARN as an allowed integration writer.
Two-console workflow: In RDS, "Create zero-ETL integration" with the Aurora cluster as source, optionally apply data filters to scope replication to specific schemas/tables, and target the Redshift namespace. Then in Redshift Query Editor v2, run "Create database from integration" to produce a queryable database. From that point you can SELECT, build materialized views, and feed dbt models or Redshift ML.
Pricing: There is no separate fee for the integration. You pay only for Aurora storage/compute you already pay for, Redshift compute that ingests/stores the data, and any cross-region transfer charges. Compared to a DMS instance ($150–$1,500+ per month), this is often the cheapest path.
Animation: Traditional Pipeline vs Zero-ETL
Four legacy components (Debezium, Kafka, S3 staging, Airflow COPY) collapse into a single managed integration object.
DynamoDB → Redshift / OpenSearch and RDS → Redshift
DynamoDB cannot run group-by queries, complex joins, or full-text search efficiently. Zero-ETL gives it two integration targets: DynamoDB → Redshift for SQL analytics, BI dashboards, and ML training; DynamoDB → OpenSearch for full-text and log-style search. Under the hood it uses point-in-time recovery for the initial export plus DynamoDB Streams for ongoing CDC.
Beyond Aurora, AWS has progressively extended zero-ETL to standard RDS engines (RDS for MySQL today, more on the roadmap). The mechanics mirror Aurora: parameter group, binlog enable, target authorization, console-based integration creation. Always consult current AWS documentation for the supported engine matrix.
Figure 10.2: Database Zero-ETL Fan-In
flowchart BT
AM[Aurora MySQL binlog ROW] -->|zero-ETL| RS[(Redshift Target case-sensitive cluster)]
AP[Aurora PostgreSQL logical WAL] -->|zero-ETL| RS
DDB[DynamoDB PITR + Streams] -->|zero-ETL| RS
DDB -->|zero-ETL| OS[(OpenSearch full-text index)]
RDSm[RDS for MySQL binlog ROW] -->|zero-ETL| RS
RS --> BI[BI / dbt / Redshift ML]
OS --> SRCH[Search and log analytics]
SaaS Zero-ETL via AWS Glue
AWS Glue Zero-ETL extends the managed-integration pattern to SaaS sources. Currently supported sources include Salesforce, Salesforce Marketing Cloud, SAP OData (ODP and non-ODP), ServiceNow, Zendesk, Zoho CRM, Facebook Ads, and Instagram Ads. Configuration is a two-step process:
Create a Glue Data Catalog connection — reusable artifact storing the SaaS instance URL, IAM service role, and credentials (OAuth tokens for Salesforce, static creds for SAP). Always click "Test connection."
Create a zero-ETL integration — pick the connector type, choose the connection, select entities/tables, and pick a destination (Redshift or SageMaker Lakehouse).
Critical permission requirement: The IAM service role must have AWS Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the target Glue database. Missing Lake Formation grants is the #1 cause of IngestionFailed errors.
Salesforce specifics: Glue uses the Salesforce Bulk API rather than the standard REST API—asynchronous, optimized for large datasets, and roughly 10x cheaper in API call quota for large extracts. SAP specifics: the Operational Data Provisioning (ODP) framework is the gold path because it produces native delta links. Non-ODP entities require a timestamp field for incremental sync, or fall back to full reload + upsert—which means deletes are silently lost.
Animation: SaaS Connector Schema Mapping with Watermark
Source rows flow through the schema mapper (orange → blue type-coerced columns) into Redshift; the watermark advances after each successful sync.
Schema Mapping and Incremental Sync
When zero-ETL pulls Salesforce or SAP records into Redshift, schema mapping happens automatically: source field types map to Redshift-compatible types (VARCHAR, TIMESTAMP, DECIMAL); composite types flatten into joinable columns; new source columns propagate on the next sync. Incremental sync is what makes the model economical—a nightly full-table reload of 50 million Salesforce rows is impossibly expensive in API quota; pulling only the deltas since the last watermark is trivial.
SAP ODP: delta links from the SAP server tell Glue exactly which rows changed.
SAP non-ODP with timestamp: Glue runs WHERE LastChangeDate > :watermark.
SAP non-ODP without timestamp: full reload + upsert — deletes silently lost.
Figure 10.3: SaaS Connector Schema Mapping
flowchart LR
subgraph SaaS[SaaS Sources]
SF[Salesforce Bulk API]
SAP[SAP OData ODP delta links]
SN[ServiceNow change tracking]
end
subgraph Glue[AWS Glue Zero-ETL]
CONN[Glue Data Catalog Connection OAuth / creds]
INT[Zero-ETL Integration entity selection]
MAP[Schema Mapper type coercion flatten composites]
WM[Watermark / Delta Token]
end
subgraph Target[Targets with Lake Formation grants]
RS[(Redshift)]
LH[(SageMaker Lakehouse)]
end
SF --> CONN
SAP --> CONN
SN --> CONN
CONN --> INT
INT --> MAP
INT <--> WM
MAP --> RS
MAP --> LH
Key Points — Part 2
Aurora source needs a custom parameter group with binlog_format = ROW and aws_pitr_enabled = 1.
Redshift target needs case sensitivity enabled (one-time, cannot be retrofitted) and an authorization policy.
The integration itself is free; you pay only for Aurora compute, Redshift compute, and any cross-region transfer.
DynamoDB zero-ETL has two targets: Redshift for analytics, OpenSearch for full-text/log search.
SaaS Glue Zero-ETL is a two-step pattern: Glue Data Catalog connection → integration with entity selection.
The #1 SaaS failure is IngestionFailed due to missing Lake Formation grants on the destination Glue database.
Salesforce uses the Bulk API (10x cheaper in quota); SAP gold path is ODP delta links; non-ODP without a timestamp loses deletes.
Post-Reading Quiz — Apply What You Learned
1. Which Aurora parameter group settings are required before creating an Aurora-to-Redshift zero-ETL integration?
binlog_format = STATEMENT and aws_pitr_enabled = 0.binlog_format = ROW and aws_pitr_enabled = 1, set on a custom (non-default) cluster parameter group.No parameter changes are needed; the default group works.replica_mode = SYNCHRONOUS on the default parameter group.
2. The most common Salesforce zero-ETL failure surfaces as IngestionFailed in CloudWatch logs. What is almost always the root cause?
Salesforce session token expired and OAuth needs re-authorization.Missing Lake Formation permissions (DESCRIBE, CREATE_TABLE, ALTER) on the destination Glue database.Salesforce Bulk API rate-limited the integration.Redshift cluster ran out of free storage capacity.
3. For SAP OData entities that are not ODP-aware and lack a timestamp field, how is incremental sync performed?
SAP automatically emits delta tokens for any non-ODP entity.Glue uses full reload plus upserts — deletes are silently lost, a real correctness footgun.The integration refuses to start until a timestamp field is added.Glue falls back to scanning Salesforce Bulk API instead.
Part 3: Trade-offs and When to Use Pipelines
Pre-Reading Quiz — Test Your Intuition
1. Your ingestion logic requires a 12-table join, complex aggregations, and ML feature engineering before landing in the warehouse. Which approach is the right fit?
Zero-ETL — it natively supports in-flight transformations.Batch ETL or hybrid (zero-ETL into bronze, then dbt/SQL into silver/gold) — zero-ETL alone is replication-only.AWS DMS — it has built-in Spark transformations.Debezium with stream processing replaces all transformation needs.
2. You need CDC from a Postgres database into Kafka feeding microservices in a multi-cloud architecture. Which tool fits best?
AWS zero-ETL — it works across any cloud.Debezium — multi-cloud flexibility, true Kafka streaming, fine-grained control.Batch ETL on a daily schedule.AWS DMS routed through a VPN tunnel to other clouds.
3. Although the zero-ETL integration itself is free, which hidden cost is most likely to surprise teams?
A monthly per-integration license fee from AWS.Redshift storage/compute, cross-region data transfer, and SaaS API quota consumption.Required pre-purchase of Aurora reserved instances.A consulting engagement to enable case-sensitivity on Redshift.
When You Still Need Pipelines
Despite its appeal, zero-ETL is not universal. The decision tree is sharpest when comparing it head-to-head:
Factor
Zero-ETL
Debezium
AWS DMS
Batch ETL
Latency
Seconds
Seconds
Minutes
Hours/days
Cost
No charge (AWS-managed)
Infra + engineering
Per-instance fees
Compute + orchestration
Infrastructure
Zero management
Self-managed Kafka
Managed instance
Scheduled compute
Flexibility
AWS only
Any DB + any sink
AWS-focused
Any source/target
Transformation
Replication only
Stream processing
Limited
Full SQL/Python in-flight
Choose Debezium when you need multi-cloud flexibility, non-AWS sources/sinks, true streaming into Kafka for microservices, or fine-grained application-level control over the change stream. Choose AWS DMS when you're performing a one-time migration with minimal ongoing CDC, or replicating between heterogeneous DBs not yet supported by zero-ETL. Choose batch ETL when latency genuinely doesn't matter (weekly executive reports), the source lacks log-based CDC (legacy DBs, REST APIs, FTP feeds), or your ingestion includes heavy in-flight joins/aggregations/ML feature engineering.
The brutal truth: zero-ETL is excellent at one job—replicating operational data into an analytics store—and bad at everything else. Try to use it for transformation, multi-cloud federation, or fine-grained event filtering, and you'll quickly hit walls.
Figure 10.4: Decision Tree
flowchart TD
Start[Need to move data from source to analytics] --> Q1{Latency budget?}
Q1 -->|Hours / days OK| Batch[Batch ETL scheduled compute]
Q1 -->|Seconds to minutes| Q2{Heavy in-flight transformation?}
Q2 -->|Yes - joins, aggs, ML feats| Batch
Q2 -->|No - replication only| Q3{Multi-cloud or non-AWS sink?}
Q3 -->|Yes| Deb[Debezium + Kafka self-managed CDC]
Q3 -->|No - all AWS| Q4{One-time migration?}
Q4 -->|Yes| DMS[AWS DMS per-instance fee]
Q4 -->|No - ongoing| Q5{Source supported by zero-ETL?}
Q5 -->|Yes| ZE[Zero-ETL managed, no integration fee]
Q5 -->|No| DMS
Cost and Quota Considerations
Although zero-ETL is free of integration charges, hidden costs are real:
Redshift storage and compute. Replicated data takes Redshift space; query workload consumes RPU/credits.
Cross-region transfer. Aurora in us-east-1 and Redshift in us-west-2 means every byte incurs cross-region fees. Co-locate aggressively.
Source-side API quotas (SaaS). Salesforce Bulk API is cheaper than REST but not free; SAP gateway throughput, ServiceNow rate limits, Facebook Ads insights all impose ceilings.
Filtering hygiene. Replicating 800 Aurora tables when you need 40 means 20x the cost. Use data filters at integration creation time—harder to remove tables later.
Hybrid Zero-ETL + Transformation
In real architectures, zero-ETL rarely stands alone. The dominant production pattern is hybrid: use zero-ETL to land raw operational data into a Redshift "raw" or "bronze" schema, then use dbt, materialized views, or scheduled batch SQL to transform that raw data into "silver" and "gold" layers consumable by BI and ML. This preserves the strengths of both: zero-ETL eliminates the brittle ingestion layer; downstream transformations handle joins, aggregations, and quality checks zero-ETL was never designed for.
flowchart LR
subgraph Sources[Operational Sources]
AU[Aurora]
DD[DynamoDB]
SAAS[SaaS via Glue]
end
subgraph Bronze[Raw / Bronze]
RAW[(Redshift raw schema zero-ETL landing)]
end
subgraph Silver[Silver - cleaned/joined]
DBT[dbt models Materialized views Scheduled SQL]
CLEAN[(Conformed marts)]
end
subgraph Gold[Gold - business-ready]
MARTS[(Domain marts)]
end
subgraph Consumers
BI[Tableau / QuickSight]
ML[Redshift ML / SageMaker]
REV[Reverse ETL to Salesforce]
end
AU -->|zero-ETL| RAW
DD -->|zero-ETL| RAW
SAAS -->|zero-ETL| RAW
RAW --> DBT --> CLEAN --> MARTS
MARTS --> BI
MARTS --> ML
MARTS --> REV
Phased Migration Strategy
The 9-week phased pattern for migrating an existing pipeline-based architecture to zero-ETL:
Week 1–2: Deploy zero-ETL in shadow mode alongside the existing pipeline.
The structure preserves the legacy pipeline as a fallback for at least two weeks of dual-write validation, dramatically reducing migration risk.
Key Points — Part 3
Zero-ETL is excellent at replication only; transformation-heavy ingestion still belongs in batch or downstream dbt.
Pick Debezium for multi-cloud / non-AWS sinks / true Kafka streaming, DMS for one-time migrations, batch ETL for non-CDC sources or heavy in-flight joins.
Free integration does not mean free data — Redshift compute, cross-region transfer, SaaS API quotas, and filtering hygiene all matter.
The dominant production pattern is hybrid: zero-ETL into bronze, dbt/SQL into silver and gold marts.
Migrate from legacy pipelines via a phased 9-week shadow-mode strategy with dual-write validation.
Post-Reading Quiz — Apply What You Learned
1. Your ingestion logic requires a 12-table join, complex aggregations, and ML feature engineering before landing in the warehouse. Which approach is the right fit?
Zero-ETL — it natively supports in-flight transformations.Batch ETL or hybrid (zero-ETL into bronze, then dbt/SQL into silver/gold) — zero-ETL alone is replication-only.AWS DMS — it has built-in Spark transformations.Debezium with stream processing replaces all transformation needs.
2. You need CDC from a Postgres database into Kafka feeding microservices in a multi-cloud architecture. Which tool fits best?
AWS zero-ETL — it works across any cloud.Debezium — multi-cloud flexibility, true Kafka streaming, fine-grained control.Batch ETL on a daily schedule.AWS DMS routed through a VPN tunnel to other clouds.
3. Although the zero-ETL integration itself is free, which hidden cost is most likely to surprise teams?
A monthly per-integration license fee from AWS.Redshift storage/compute, cross-region data transfer, and SaaS API quota consumption.Required pre-purchase of Aurora reserved instances.A consulting engagement to enable case-sensitivity on Redshift.