Study Guide: Streaming Ingestion and Real-Time Pipelines

Pre-Reading Quiz — Streaming Foundations and AWS Services

1. A pub/sub log differs from a traditional queue primarily because:

It uses TCP instead of UDP for transport

Reads do not destroy entries, so multiple consumer groups can each track their own offset and replay history

It guarantees global ordering across all partitions

It only supports a single producer at a time

2. A streaming pipeline keyed by tenant_id sees 80% of traffic concentrated on one shard because Acme is the largest customer. What is the recommended fix?

Add more shards without changing the key — total throughput will balance out

Switch to a random partition key and lose all per-tenant ordering

Use a composite key like acme:user_5 that spreads load while preserving the per-user ordering you actually need

Migrate to a queue (SQS) so messages are evenly distributed

3. You need a managed AWS streaming service that supports cross-topic transactions and the full Apache Kafka API surface. Which fits best?

Amazon Kinesis Data Streams

Amazon Data Firehose

Amazon MSK

Amazon SQS FIFO

4. Which statement about Amazon Data Firehose is correct?

It is a replayable log with consumer offsets and shard iterators

It provides ~10 ms end-to-end latency, making it ideal for fraud detection

It buffers records and flushes them by size or time threshold to S3, Redshift, OpenSearch, or HTTP endpoints with 1–60 s latency

It guarantees exactly-once delivery natively without any destination-side deduplication

5. In a Kinesis Data Streams pipeline, what kind of delivery semantic do you get out of the box, and what does exactly-once require?

Exactly-once natively; no extra work needed

At-most-once natively; transactions are not supported on KDS

At-least-once natively; exactly-once requires application-level deduplication using sequence numbers stored alongside business writes

Best-effort only; Kinesis cannot recover from consumer crashes

Streaming Foundations and AWS Services

From Queues to Pub/Sub Logs

A stream is an unbounded sequence of events ordered (loosely) by time. A traditional queue like SQS is a one-shot mailbox — one consumer reads, the message disappears. A pub/sub log like Kafka or Kinesis is more like a bookshelf: producers append, multiple consumer groups read independently from any offset, and reading does not erase entries. Logs separate storage from position, enabling replay, multiple independent consumers, and time-travel debugging.

Property	Queue (SQS)	Pub/Sub Log (Kinesis, Kafka)
Read pattern	Consume-and-delete	Append-only, position-based read
Multiple consumers	Compete for messages	Independent groups, each tracks own offset
Replay	Generally no	Yes, within retention window
Ordering	Best-effort or per-group	Per-partition strict ordering

Producers, Partitions, and Offsets

A partition key decides which partition (Kafka/MSK) or shard (Kinesis) a record lands in. Records with the same key preserve order; different keys parallelize across partitions. Each record gets a monotonically increasing offset (Kafka) or sequence number (Kinesis) that consumers checkpoint to durable storage so they can resume after a crash.

The central trade-off: ordering is per-partition, never global, because global ordering would require a single bottleneck and destroy horizontal scalability.

Animation 1 — Pub/Sub Log Fan-Out

Producers append keyed records into ordered partitions. Multiple consumer groups read independently, each tracking its own offset. A new consumer (Z) replays from offset 0.

Producer A records Producer B records Producer C records Producer D records

Backpressure and Ordering Guarantees

When consumers fall behind producers, real systems apply backpressure: Kafka brokers raise ack latency, Kinesis throws ProvisionedThroughputExceededException, and Flink propagates fullness back through the operator DAG to the source. The "at-least-once" default means a consumer crash after processing but before checkpointing replays the record — double-counting events. Most of this chapter is about how to fix that.

Key Points: Streaming Foundations

Logs vs queues: Logs decouple storage from read position, enabling replay and multiple independent consumers.
Partition keys are the most consequential schema decision: they control both ordering (same key → same partition) and parallelism.
Ordering is per-partition, never global. Global ordering requires one partition and kills parallelism.
Default semantics are at-least-once. Exactly-once requires effort.

The Three AWS Streaming Services

AWS exposes three first-party streaming primitives. Picking incorrectly is the most common — and most expensive — mistake in AWS streaming architecture.

Amazon Kinesis Data Streams (KDS)

A shard-based replayable log. Each shard supports 1 MB/s ingest, 1,000 records/s, and 2 MB/s egress. Default 24-hour retention, extendable to 365 days. End-to-end latency ~200 ms. The dominant failure mode is the hot shard — skewed partition keys overwhelm one shard while the rest sit idle. Native delivery is at-least-once; exactly-once requires storing sequenceNumber alongside the business write.

Amazon MSK (managed Apache Kafka)

Real Apache Kafka brokers. Full Kafka API: idempotent producers, transactions, log compaction, Kafka Streams, Kafka Connect, read_committed isolation. Native exactly-once with enable.idempotence=true + transactional writes. Latency 10–100 ms. Two flavors: Provisioned (predictable cost-efficient) and Serverless (auto-scaling, simpler billing).

Amazon Data Firehose

Not a stream — a delivery service. No consumer API, no replay. Receives records via PUT or from a Kinesis/MSK source, optionally transforms with Lambda or converts JSON→Parquet, then pushes to S3, Redshift, OpenSearch, Splunk, or HTTP endpoints. Buffer-and-flush model; latency 1–60 s. Cheapest option (~$0.029/GB). At-least-once delivery; exactly-once happens at the destination.

Service	Latency	Replay	Exactly-once	Best for
KDS	~200 ms	Yes (24h–365d)	App-level dedup	AWS-native low-latency log
MSK	10–100 ms	Yes	Native (transactions)	Full Kafka ecosystem
Firehose	1–60 s	No	Destination-side	Cheap delivery to S3/Redshift

Diagram: AWS Streaming Fan-Out

flowchart LR Prod["Producers (SDK / KPL / Agent)"] --> KDS["Kinesis Data Streams
shards, ~200ms"] KDS --> Flink["Flink on MSF
stateful, exactly-once"] KDS --> FH["Data Firehose
buffer + flush, 1-60s"] Flink --> DDB["DynamoDB / SNS
fraud alerts"] FH --> S3["S3 (Parquet)"] S3 --> Ath["Athena / Redshift Spectrum"] MSK["Amazon MSK
(full Kafka API)"] -.alt source.-> Flink

Key Points: AWS Streaming Services

KDS = AWS-native replayable log, ~200 ms, hours-to-days retention; budget effort for hot-shard mitigation.
MSK = full Apache Kafka with native cross-topic transactions; pick when you need real Kafka, not just a stream.
Firehose = fire-and-forget delivery to S3/Redshift/OpenSearch with 1–60 s latency; cheapest for high-volume durable loading.
Common pattern: fan out from KDS into both Flink (real-time) and Firehose (durable Parquet on S3).

Post-Reading Quiz — Streaming Foundations and AWS Services

1. A pub/sub log differs from a traditional queue primarily because:

It uses TCP instead of UDP for transport

Reads do not destroy entries, so multiple consumer groups can each track their own offset and replay history

It guarantees global ordering across all partitions

It only supports a single producer at a time

2. A streaming pipeline keyed by tenant_id sees 80% of traffic concentrated on one shard because Acme is the largest customer. What is the recommended fix?

Add more shards without changing the key — total throughput will balance out

Switch to a random partition key and lose all per-tenant ordering

Use a composite key like acme:user_5 that spreads load while preserving the per-user ordering you actually need

Migrate to a queue (SQS) so messages are evenly distributed

3. You need a managed AWS streaming service that supports cross-topic transactions and the full Apache Kafka API surface. Which fits best?

Amazon Kinesis Data Streams

Amazon Data Firehose

Amazon MSK

Amazon SQS FIFO

4. Which statement about Amazon Data Firehose is correct?

It is a replayable log with consumer offsets and shard iterators

It provides ~10 ms end-to-end latency, making it ideal for fraud detection

It buffers records and flushes them by size or time threshold to S3, Redshift, OpenSearch, or HTTP endpoints with 1–60 s latency

It guarantees exactly-once delivery natively without any destination-side deduplication

5. In a Kinesis Data Streams pipeline, what kind of delivery semantic do you get out of the box, and what does exactly-once require?

Exactly-once natively; no extra work needed

At-most-once natively; transactions are not supported on KDS

At-least-once natively; exactly-once requires application-level deduplication using sequence numbers stored alongside business writes

Best-effort only; Kinesis cannot recover from consumer crashes

Pre-Reading Quiz — Stream Processing with Flink

1. What is a Kinesis Processing Unit (KPU) on Amazon Managed Service for Apache Flink?

A pricing tier for Kinesis Data Streams shards

A managed compute unit (~1 vCPU, 4 GB RAM, 50 GB local storage) on which Flink TaskManagers run

A type of Kinesis consumer that processes records exactly once

A throttle limit imposed on Firehose buffer flushes

2. When should you choose the DataStream API over Flink SQL / Table API?

For all pipelines — SQL is deprecated in modern Flink

Only when you need to write Python instead of Java

When you need fine-grained control: custom state, timers, side outputs, or complex event processing

Only for reading from Kinesis — SQL cannot connect to KDS

3. The default state backend on Amazon Managed Service for Apache Flink is:

HashMap (JVM heap)

Filesystem (full snapshots to S3)

RocksDB with incremental checkpoints

Apache Cassandra

4. In Flink's barrier-based checkpoint algorithm, what happens at an operator running in EXACTLY_ONCE mode when barrier N arrives on one input but not yet on others?

The operator immediately snapshots and discards records arriving on slow inputs

The operator aligns inputs — buffering records that have already passed the barrier on fast inputs until barrier N arrives on every input

The operator emits the barrier downstream immediately to keep latency low

The job fails and restarts from the previous checkpoint

5. What is a savepoint, and how does it differ from a regular checkpoint?

A savepoint is an automatically generated snapshot used during recovery; checkpoints are user-triggered

A savepoint is a user-triggered snapshot (e.g., for upgrades, parallelism changes, or version migrations); checkpoints are automatic and managed by Flink

Savepoints are stored in DynamoDB; checkpoints in S3

There is no difference — the terms are interchangeable

Stream Processing with Flink

Amazon Managed Service for Apache Flink (MSF)

Apache Flink is the de-facto standard for stateful, exactly-once stream processing. Amazon Managed Service for Apache Flink (MSF) — formerly Kinesis Data Analytics for Apache Flink — runs Flink TaskManagers on AWS-managed compute units called Kinesis Processing Units (KPUs): ~1 vCPU, 4 GB RAM, 50 GB local storage each. AWS handles patching, JobManager HA, and durable checkpoint storage to S3.

Native connectors exist for KDS, MSK, Firehose, DynamoDB Streams, S3, OpenSearch, and Lambda. The default state backend is RocksDB with incremental checkpoints; the default checkpoint interval is 60 seconds.

Key CloudWatch Metrics

Metric	What it tells you
`lastCheckpointDuration`	How long the most recent snapshot took (rising = backpressure)
`lastCheckpointSize`	State size (rising = key cardinality growing)
`numberOfFailedCheckpoints`	Recovery health (non-zero = investigate sinks/state)
`currentInputWatermark`	How far the pipeline has advanced in event time
`currentOutputWatermark`	What downstream sees (large gap = late firing)

DataStream API vs Table / SQL API

Flink offers two programming models that compile to the same execution graph.

The DataStream API is imperative (Java/Scala/Python): you write operators (map, filter, keyBy, window, process) explicitly. Use it for complex event processing, custom state machines, side outputs, and operator-level control.

DataStream<TempReading> readings = env
    .fromSource(kinesisSource, watermarkStrategy, "iot-temperature")
    .map(json -> mapper.readValue(json, TempReading.class));

DataStream<Alert> alerts = readings
    .keyBy(TempReading::getDeviceId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AvgTemperature())
    .filter(avg -> avg.value > 80.0)
    .map(avg -> new Alert(avg.deviceId, "Overheating", avg.value));

alerts.sinkTo(kinesisAlertSink);

The Table API / Flink SQL is declarative: describe the computation in SQL, and Flink's planner produces an optimal execution graph. Shorter, more maintainable, accessible to analysts.

CREATE TABLE iot_temperature (
    device_id STRING,
    ts TIMESTAMP_LTZ(3),
    temperature_c DOUBLE,
    WATERMARK FOR ts AS ts - INTERVAL '5' SECONDS
) WITH ('connector' = 'kinesis', ...);

SELECT device_id, window_start, AVG(temperature_c) AS avg_c
FROM TABLE(TUMBLE(TABLE iot_temperature, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
GROUP BY device_id, window_start
HAVING AVG(temperature_c) > 80.0;

The two APIs interoperate: convert DataStream to Table and back. A common pattern: do raw deserialization and enrichment in DataStream API, expose as a Table, let analysts build downstream analytics in SQL.

State Backends and Checkpointing

Stateful streaming means operators remember things between events: a windowed aggregator stores running totals, a join stores unmatched left-side records, a session detector stores last-seen timestamps.

Backend	Storage	When to use
HashMap	JVM heap	Small state, lowest latency, easy debugging
RocksDB (incremental)	Local disk + remote upload	Production default; gigabytes of state, low pause times
Filesystem (legacy)	Heap + full snapshot to FS	Mostly superseded by RocksDB

The Barrier-Based Checkpoint Algorithm

Flink's checkpoint is a barrier-based asynchronous distributed snapshot (a Chandy-Lamport variant):

The JobManager triggers a checkpoint at a configured interval (e.g., every 60 s).
Sources receive a numbered barrier and record their offsets/sequence numbers.
Barriers flow through the operator DAG alongside data, preserving order.
Each operator aligns its inputs (in EXACTLY_ONCE mode), then asynchronously snapshots state to the backend.
Operators acknowledge the checkpoint to the JobManager.
Once all tasks ack, the checkpoint is globally durable; sinks commit two-phase transactions.

Imagine a parade: the JobManager periodically inserts a flag-bearer (the barrier). Each viewing station (operator) waits until flag-bearers from every parallel route arrive, then photographs itself (snapshot). The photographs collectively form a consistent global snapshot.

Animation 2 — Flink Barrier Checkpoint Flowing Through Operators

JobManager triggers checkpoint N. The barrier flows from Source → KeyBy → Window → Sink, highlighting each operator as it aligns inputs and snapshots state to S3.

Checkpoint barrier Active operator Snapshot upload / ACK

Incremental checkpoints (RocksDB) persist only the SST files that changed since the previous snapshot, dramatically reducing checkpoint duration and S3 cost. Savepoints are user-triggered checkpoints used for upgrades, parallelism changes, and version migrations — you stop the job with a savepoint, deploy a new JAR, and restart from the savepoint with zero data loss.

Diagram: Flink Checkpoint Sequence

sequenceDiagram participant JM as JobManager participant Src as Source participant Op as Window Op participant Sink as 2PC Sink participant S3 as S3 backend JM->>Src: trigger CP N (every 60s) Src->>Src: record offsets Src->>Op: barrier N Op->>Op: align inputs Op->>S3: async snapshot (incremental) Op->>JM: ack Op->>Sink: barrier N Sink->>Sink: preCommit (flush, prepare 2PC) Sink->>JM: ack JM->>Sink: notifyCheckpointComplete(N) Sink->>Sink: commit (atomic publish)

Key Points: Stream Processing with Flink

MSF hides cluster ops behind KPUs and S3-backed checkpoints; defaults are RocksDB + 60-second incremental checkpoints.
Use Flink SQL / Table API for standard aggregations and joins; drop into the DataStream API only for custom state/timers/side outputs.
Checkpoints are barrier-based asynchronous snapshots: barriers flow through the DAG, operators align inputs and snapshot to S3.
Savepoints let you upgrade Flink versions or change parallelism with zero data loss.
EXACTLY_ONCE mode aligns barriers at multi-input operators (slight latency cost); AT_LEAST_ONCE skips alignment but allows duplicates on recovery.

Post-Reading Quiz — Stream Processing with Flink

1. What is a Kinesis Processing Unit (KPU) on Amazon Managed Service for Apache Flink?

A pricing tier for Kinesis Data Streams shards

A managed compute unit (~1 vCPU, 4 GB RAM, 50 GB local storage) on which Flink TaskManagers run

A type of Kinesis consumer that processes records exactly once

A throttle limit imposed on Firehose buffer flushes

2. When should you choose the DataStream API over Flink SQL / Table API?

For all pipelines — SQL is deprecated in modern Flink

Only when you need to write Python instead of Java

When you need fine-grained control: custom state, timers, side outputs, or complex event processing

Only for reading from Kinesis — SQL cannot connect to KDS

3. The default state backend on Amazon Managed Service for Apache Flink is:

HashMap (JVM heap)

Filesystem (full snapshots to S3)

RocksDB with incremental checkpoints

Apache Cassandra

4. In Flink's barrier-based checkpoint algorithm, what happens at an operator running in EXACTLY_ONCE mode when barrier N arrives on one input but not yet on others?

The operator immediately snapshots and discards records arriving on slow inputs

The operator aligns inputs — buffering records that have already passed the barrier on fast inputs until barrier N arrives on every input

The operator emits the barrier downstream immediately to keep latency low

The job fails and restarts from the previous checkpoint

5. What is a savepoint, and how does it differ from a regular checkpoint?

A savepoint is an automatically generated snapshot used during recovery; checkpoints are user-triggered

A savepoint is a user-triggered snapshot (e.g., for upgrades, parallelism changes, or version migrations); checkpoints are automatic and managed by Flink

Savepoints are stored in DynamoDB; checkpoints in S3

There is no difference — the terms are interchangeable

Pre-Reading Quiz — Time, Windows, and Correctness

1. Why should business analytics generally use event time rather than processing time?

Event time is faster because it skips watermark generation

Processing time is unsupported by Flink and AWS streaming services

Event time produces deterministic, reproducible aggregates that don't change under network delays, replays, or backfills

Event time eliminates the need for any state or checkpoints

2. You need a "5-minute rolling CPU average updated every minute." Which window type fits?

Tumbling window of size 5 minutes

Sliding (hopping) window with size 5 minutes and slide 1 minute

Session window with 5-minute gap

Global window with no trigger

3. In Flink, what does WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)) express?

Drop any record that arrives more than 5 seconds after the wall clock

A promise that the engine will fire windows every 5 seconds regardless of data

An assertion that records may arrive at most 5 seconds out of order, so watermark = max event-time seen − 5 s

A maximum allowed checkpoint interval of 5 seconds

4. A Flink job consumes from a Kinesis stream where one shard has gone idle (no producers writing to it). Downstream windows stop firing. What is the correct fix?

Kill the job and restart with parallelism=1

Add withIdleness(Duration) to the watermark strategy so idle partitions are excluded from the min-watermark calculation

Switch to processing time so watermarks are not needed

Increase the checkpoint interval to 10 minutes

5. End-to-end exactly-once in Flink requires the composition of which three layers?

VPC isolation, IAM roles, and KMS encryption

Replayable sources (offsets stored in checkpoint), barrier-based snapshots of operator state, and transactional or idempotent sinks

Tumbling windows, idempotent producers, and acks=1

Lambda triggers, DynamoDB streams, and SQS dead-letter queues

Time, Windows, and Correctness

Event Time vs Processing Time

Event time is the timestamp embedded in the record — when the sensor sampled, when the user clicked, when the trade was executed. Processing time is the wall-clock time when the engine sees the record. They diverge under network delay, mobile offline buffering, queue lag, or backfill.

Consider a phone buffering GPS pings while the user is in the subway. After 20 minutes underground, all pings flush at once. Processing-time view: "1,200 events arrived at 10:23 AM." Event-time view: "the user moved through these stations between 10:00 and 10:20 AM, in this order." The event-time view is correct for any meaningful analytics.

Aspect	Event time	Processing time
Source of timestamp	Record field	System clock
Reproducibility	Deterministic under reprocessing	Non-deterministic
Latency	Higher (wait for late data)	Lower
Use cases	Business analytics, billing, audit	Monitoring, alerting

Rule of thumb: use event time for any business logic that must produce the same result if you replay yesterday's data. Use processing time only for "is the pipeline currently alive" monitoring.

Window Types: Tumbling, Sliding, Session, Global

Window	Best for	Watch out for
Tumbling (fixed, non-overlapping)	Periodic reports, billing, hourly revenue	Boundary spikes (records hopping windows)
Sliding (fixed, overlapping)	Rolling averages, "5-min avg every minute"	State multiplication (size/slide copies per event) = memory pressure
Session (gap-based, dynamic)	User activity, page sessions, IoT bursts	Gap tuning is application-specific
Global (single never-closing window per key)	Lifetime totals, count-based triggers	State grows unbounded without TTL

-- Tumbling 1-hour by region
SELECT window_start, region, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_ts), INTERVAL '1' HOUR))
GROUP BY window_start, region;

-- Sliding 5-min/1-min CPU average
SELECT window_start, AVG(cpu_pct)
FROM TABLE(HOP(TABLE host_metrics, DESCRIPTOR(ts), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
GROUP BY window_start;

-- Session 30-min gap
SELECT user_id, window_start, window_end, COUNT(*) AS pageviews
FROM TABLE(SESSION(TABLE clickstream, DESCRIPTOR(ts), INTERVAL '30' MINUTES))
GROUP BY user_id, window_start, window_end;

Watermarks: The Engine's Promise

A watermark W(t) is a monotonically increasing assertion that "no more events with event-time timestamp ≤ t will arrive." It is how the engine knows it is safe to fire an event-time window: once W(t) advances past the window's end, no more in-time data should be coming.

WatermarkStrategy	Use when
`forMonotonousTimestamps()`	Records strictly in order (rare)
`forBoundedOutOfOrderness(Duration)`	Most common; allows N seconds of out-of-order tolerance
`noWatermarks()`	Processing-time pipelines
`withIdleness(Duration)`	Mark idle partitions inactive so downstream watermarks can advance

Watermark propagation through multi-input operators: the operator's output watermark equals the minimum of its input watermarks. This guarantees correctness but creates the watermark stall — one slow or idle partition holds back the entire pipeline. Fix with withIdleness(...) and instrument currentInputWatermark.

Late Events: Three Treatments

Dropped (default in many APIs).
Allowed lateness via allowedLateness(Duration): window state stays alive past the watermark; each late event re-fires the window. Downstream sinks must handle multiple emissions (use upsert sinks).
Side output via sideOutputLateData(OutputTag): late events route to a separate stream for inspection or repair.

Animation 3 — Event-Time Window Firing Driven by Watermark

Events arrive (out of order) into a tumbling window. The watermark advances behind them. When W(t) crosses the window end, the window fires. A late event arrives after the watermark and is routed to a side output.

In-time event Watermark / out-of-order Window fire Late event → side output

Diagram: Watermark-Driven Window Firing

flowchart TD In["Source records
(event-time ts embedded)"] --> WM["WatermarkStrategy
forBoundedOutOfOrderness(5s)
+ withIdleness(1m)"] WM --> KB["keyBy(deviceId)"] KB --> Win["TumblingEventTimeWindow (1m)"] Win --> Check{event ts vs W(t)?} Check -- "in window, W(t) < end" --> Buf[Buffer in RocksDB state] Check -- "W(t) ≥ window end" --> Fire[Fire window: emit aggregate] Check -- "ts < W(t) (late)" --> Late{allowedLateness still open?} Late -- yes --> ReFire[Re-fire window upsert] Late -- no --> Side[sideOutputLateData
audit / repair] Fire --> Sink[2PC Sink: preCommit on barrier, commit on CP complete] ReFire --> Sink

End-to-End Exactly-Once: Three Composed Layers

Replayable sources — Kafka offsets and Kinesis sequence numbers stored inside the checkpoint. On recovery, the source resumes from the last committed offset.
Internal state — checkpoint barriers ensure operator state is consistent across the snapshot boundary.
Transactional or idempotent sinks — either two-phase commit (TwoPhaseCommitSinkFunction) or idempotent operations (key-based upsert).

The two-phase commit pattern:

beginTransaction() ──> invoke(record) ──> preCommit() ──> commit()
   (start of CP)        (during CP)        (CP barrier)    (CP complete)

When the barrier arrives, preCommit() flushes buffers and prepares the transaction. After notifyCheckpointComplete(checkpointId), commit() atomically exposes writes (Kafka transaction commit, S3 staging-file rename to final). On crash between pre-commit and commit, the transaction is recovered from checkpoint state and either committed or aborted on restart.

Common Flink Failure Modes

Symptom	Likely cause	Fix
Watermark stalls; windows never fire	Idle partition pulls min watermark to −∞	`withIdleness(...)`
Checkpoint timeout	Sink backpressure or slow state persistence	Increase `checkpointTimeout`, reduce parallelism, switch to incremental
Duplicates after restart	Sink is non-transactional / non-idempotent	`FileSink` commit-on-checkpoint or transactional Kafka producer
Growing checkpoint size	Unbounded keyed state	Set TTL on state, or use session windows with a gap

Key Points: Time, Windows, and Correctness

Event time is reproducible under replay; processing time is not. Use event time for business logic.
Window types match question types: tumbling = "what happened in this period," sliding = "what is the trend now," session = "one user-sitting," global = "all-time per key."
Watermarks are monotonically increasing event-time assertions; they trigger window firing.
withIdleness prevents idle partitions from stalling the min-watermark; allowedLateness + side outputs handle late data.
End-to-end exactly-once = replayable sources + barrier-snapshot state + transactional / idempotent sinks (two-phase commit).

Post-Reading Quiz — Time, Windows, and Correctness

1. Why should business analytics generally use event time rather than processing time?

Event time is faster because it skips watermark generation

Processing time is unsupported by Flink and AWS streaming services

Event time produces deterministic, reproducible aggregates that don't change under network delays, replays, or backfills

Event time eliminates the need for any state or checkpoints

2. You need a "5-minute rolling CPU average updated every minute." Which window type fits?

Tumbling window of size 5 minutes

Sliding (hopping) window with size 5 minutes and slide 1 minute

Session window with 5-minute gap

Global window with no trigger

3. In Flink, what does WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)) express?

Drop any record that arrives more than 5 seconds after the wall clock

A promise that the engine will fire windows every 5 seconds regardless of data

An assertion that records may arrive at most 5 seconds out of order, so watermark = max event-time seen − 5 s

A maximum allowed checkpoint interval of 5 seconds

4. A Flink job consumes from a Kinesis stream where one shard has gone idle (no producers writing to it). Downstream windows stop firing. What is the correct fix?

Kill the job and restart with parallelism=1

Add withIdleness(Duration) to the watermark strategy so idle partitions are excluded from the min-watermark calculation

Switch to processing time so watermarks are not needed

Increase the checkpoint interval to 10 minutes

5. End-to-end exactly-once in Flink requires the composition of which three layers?

VPC isolation, IAM roles, and KMS encryption

Replayable sources (offsets stored in checkpoint), barrier-based snapshots of operator state, and transactional or idempotent sinks

Tumbling windows, idempotent producers, and acks=1

Lambda triggers, DynamoDB streams, and SQS dead-letter queues

Chapter 5: Streaming Ingestion and Real-Time Pipelines

Learning Objectives

Streaming Foundations and AWS Services

From Queues to Pub/Sub Logs

Producers, Partitions, and Offsets

Animation 1 — Pub/Sub Log Fan-Out

Backpressure and Ordering Guarantees

Key Points: Streaming Foundations

The Three AWS Streaming Services

Amazon Kinesis Data Streams (KDS)

Amazon MSK (managed Apache Kafka)

Amazon Data Firehose

Diagram: AWS Streaming Fan-Out

Key Points: AWS Streaming Services

Stream Processing with Flink

Amazon Managed Service for Apache Flink (MSF)

Key CloudWatch Metrics

DataStream API vs Table / SQL API

State Backends and Checkpointing

The Barrier-Based Checkpoint Algorithm

Animation 2 — Flink Barrier Checkpoint Flowing Through Operators

Diagram: Flink Checkpoint Sequence

Key Points: Stream Processing with Flink

Time, Windows, and Correctness

Event Time vs Processing Time

Window Types: Tumbling, Sliding, Session, Global

Watermarks: The Engine's Promise

Late Events: Three Treatments

Animation 3 — Event-Time Window Firing Driven by Watermark

Diagram: Watermark-Driven Window Firing

End-to-End Exactly-Once: Three Composed Layers

Common Flink Failure Modes

Key Points: Time, Windows, and Correctness

Your Progress

Answer Explanations