Chapter 5: Streaming Ingestion and Real-Time Pipelines

Learning Objectives

Pre-Reading Quiz — Streaming Foundations and AWS Services

1. A pub/sub log differs from a traditional queue primarily because:

It uses TCP instead of UDP for transport
Reads do not destroy entries, so multiple consumer groups can each track their own offset and replay history
It guarantees global ordering across all partitions
It only supports a single producer at a time

2. A streaming pipeline keyed by tenant_id sees 80% of traffic concentrated on one shard because Acme is the largest customer. What is the recommended fix?

Add more shards without changing the key — total throughput will balance out
Switch to a random partition key and lose all per-tenant ordering
Use a composite key like acme:user_5 that spreads load while preserving the per-user ordering you actually need
Migrate to a queue (SQS) so messages are evenly distributed

3. You need a managed AWS streaming service that supports cross-topic transactions and the full Apache Kafka API surface. Which fits best?

Amazon Kinesis Data Streams
Amazon Data Firehose
Amazon MSK
Amazon SQS FIFO

4. Which statement about Amazon Data Firehose is correct?

It is a replayable log with consumer offsets and shard iterators
It provides ~10 ms end-to-end latency, making it ideal for fraud detection
It buffers records and flushes them by size or time threshold to S3, Redshift, OpenSearch, or HTTP endpoints with 1–60 s latency
It guarantees exactly-once delivery natively without any destination-side deduplication

5. In a Kinesis Data Streams pipeline, what kind of delivery semantic do you get out of the box, and what does exactly-once require?

Exactly-once natively; no extra work needed
At-most-once natively; transactions are not supported on KDS
At-least-once natively; exactly-once requires application-level deduplication using sequence numbers stored alongside business writes
Best-effort only; Kinesis cannot recover from consumer crashes

Streaming Foundations and AWS Services

From Queues to Pub/Sub Logs

A stream is an unbounded sequence of events ordered (loosely) by time. A traditional queue like SQS is a one-shot mailbox — one consumer reads, the message disappears. A pub/sub log like Kafka or Kinesis is more like a bookshelf: producers append, multiple consumer groups read independently from any offset, and reading does not erase entries. Logs separate storage from position, enabling replay, multiple independent consumers, and time-travel debugging.

PropertyQueue (SQS)Pub/Sub Log (Kinesis, Kafka)
Read patternConsume-and-deleteAppend-only, position-based read
Multiple consumersCompete for messagesIndependent groups, each tracks own offset
ReplayGenerally noYes, within retention window
OrderingBest-effort or per-groupPer-partition strict ordering

Producers, Partitions, and Offsets

A partition key decides which partition (Kafka/MSK) or shard (Kinesis) a record lands in. Records with the same key preserve order; different keys parallelize across partitions. Each record gets a monotonically increasing offset (Kafka) or sequence number (Kinesis) that consumers checkpoint to durable storage so they can resume after a crash.

The central trade-off: ordering is per-partition, never global, because global ordering would require a single bottleneck and destroy horizontal scalability.

Animation 1 — Pub/Sub Log Fan-Out

Producers append keyed records into ordered partitions. Multiple consumer groups read independently, each tracking its own offset. A new consumer (Z) replays from offset 0.
Producer A Producer B Producer C Producer D Pub/Sub Log (per-partition strict ordering) P0 P1 P2 P3 Consumer Group X (analytics) offset = 19,431 reading P0, P1 Consumer Group Y (audit) offset = 12,005 reading P2, P3 Consumer Group Z (NEW) offset = 0 (replay) reading from start Logs separate "where data lives" from "who has read it"
Producer A records Producer B records Producer C records Producer D records

Backpressure and Ordering Guarantees

When consumers fall behind producers, real systems apply backpressure: Kafka brokers raise ack latency, Kinesis throws ProvisionedThroughputExceededException, and Flink propagates fullness back through the operator DAG to the source. The "at-least-once" default means a consumer crash after processing but before checkpointing replays the record — double-counting events. Most of this chapter is about how to fix that.

Key Points: Streaming Foundations

The Three AWS Streaming Services

AWS exposes three first-party streaming primitives. Picking incorrectly is the most common — and most expensive — mistake in AWS streaming architecture.

Amazon Kinesis Data Streams (KDS)

A shard-based replayable log. Each shard supports 1 MB/s ingest, 1,000 records/s, and 2 MB/s egress. Default 24-hour retention, extendable to 365 days. End-to-end latency ~200 ms. The dominant failure mode is the hot shard — skewed partition keys overwhelm one shard while the rest sit idle. Native delivery is at-least-once; exactly-once requires storing sequenceNumber alongside the business write.

Amazon MSK (managed Apache Kafka)

Real Apache Kafka brokers. Full Kafka API: idempotent producers, transactions, log compaction, Kafka Streams, Kafka Connect, read_committed isolation. Native exactly-once with enable.idempotence=true + transactional writes. Latency 10–100 ms. Two flavors: Provisioned (predictable cost-efficient) and Serverless (auto-scaling, simpler billing).

Amazon Data Firehose

Not a stream — a delivery service. No consumer API, no replay. Receives records via PUT or from a Kinesis/MSK source, optionally transforms with Lambda or converts JSON→Parquet, then pushes to S3, Redshift, OpenSearch, Splunk, or HTTP endpoints. Buffer-and-flush model; latency 1–60 s. Cheapest option (~$0.029/GB). At-least-once delivery; exactly-once happens at the destination.

ServiceLatencyReplayExactly-onceBest for
KDS~200 msYes (24h–365d)App-level dedupAWS-native low-latency log
MSK10–100 msYesNative (transactions)Full Kafka ecosystem
Firehose1–60 sNoDestination-sideCheap delivery to S3/Redshift

Diagram: AWS Streaming Fan-Out

flowchart LR Prod["Producers (SDK / KPL / Agent)"] --> KDS["Kinesis Data Streams
shards, ~200ms"] KDS --> Flink["Flink on MSF
stateful, exactly-once"] KDS --> FH["Data Firehose
buffer + flush, 1-60s"] Flink --> DDB["DynamoDB / SNS
fraud alerts"] FH --> S3["S3 (Parquet)"] S3 --> Ath["Athena / Redshift Spectrum"] MSK["Amazon MSK
(full Kafka API)"] -.alt source.-> Flink

Key Points: AWS Streaming Services

Post-Reading Quiz — Streaming Foundations and AWS Services

1. A pub/sub log differs from a traditional queue primarily because:

It uses TCP instead of UDP for transport
Reads do not destroy entries, so multiple consumer groups can each track their own offset and replay history
It guarantees global ordering across all partitions
It only supports a single producer at a time

2. A streaming pipeline keyed by tenant_id sees 80% of traffic concentrated on one shard because Acme is the largest customer. What is the recommended fix?

Add more shards without changing the key — total throughput will balance out
Switch to a random partition key and lose all per-tenant ordering
Use a composite key like acme:user_5 that spreads load while preserving the per-user ordering you actually need
Migrate to a queue (SQS) so messages are evenly distributed

3. You need a managed AWS streaming service that supports cross-topic transactions and the full Apache Kafka API surface. Which fits best?

Amazon Kinesis Data Streams
Amazon Data Firehose
Amazon MSK
Amazon SQS FIFO

4. Which statement about Amazon Data Firehose is correct?

It is a replayable log with consumer offsets and shard iterators
It provides ~10 ms end-to-end latency, making it ideal for fraud detection
It buffers records and flushes them by size or time threshold to S3, Redshift, OpenSearch, or HTTP endpoints with 1–60 s latency
It guarantees exactly-once delivery natively without any destination-side deduplication

5. In a Kinesis Data Streams pipeline, what kind of delivery semantic do you get out of the box, and what does exactly-once require?

Exactly-once natively; no extra work needed
At-most-once natively; transactions are not supported on KDS
At-least-once natively; exactly-once requires application-level deduplication using sequence numbers stored alongside business writes
Best-effort only; Kinesis cannot recover from consumer crashes
Pre-Reading Quiz — Stream Processing with Flink

1. What is a Kinesis Processing Unit (KPU) on Amazon Managed Service for Apache Flink?

A pricing tier for Kinesis Data Streams shards
A managed compute unit (~1 vCPU, 4 GB RAM, 50 GB local storage) on which Flink TaskManagers run
A type of Kinesis consumer that processes records exactly once
A throttle limit imposed on Firehose buffer flushes

2. When should you choose the DataStream API over Flink SQL / Table API?

For all pipelines — SQL is deprecated in modern Flink
Only when you need to write Python instead of Java
When you need fine-grained control: custom state, timers, side outputs, or complex event processing
Only for reading from Kinesis — SQL cannot connect to KDS

3. The default state backend on Amazon Managed Service for Apache Flink is:

HashMap (JVM heap)
Filesystem (full snapshots to S3)
RocksDB with incremental checkpoints
Apache Cassandra

4. In Flink's barrier-based checkpoint algorithm, what happens at an operator running in EXACTLY_ONCE mode when barrier N arrives on one input but not yet on others?

The operator immediately snapshots and discards records arriving on slow inputs
The operator aligns inputs — buffering records that have already passed the barrier on fast inputs until barrier N arrives on every input
The operator emits the barrier downstream immediately to keep latency low
The job fails and restarts from the previous checkpoint

5. What is a savepoint, and how does it differ from a regular checkpoint?

A savepoint is an automatically generated snapshot used during recovery; checkpoints are user-triggered
A savepoint is a user-triggered snapshot (e.g., for upgrades, parallelism changes, or version migrations); checkpoints are automatic and managed by Flink
Savepoints are stored in DynamoDB; checkpoints in S3
There is no difference — the terms are interchangeable

Stream Processing with Flink

Amazon Managed Service for Apache Flink (MSF)

Apache Flink is the de-facto standard for stateful, exactly-once stream processing. Amazon Managed Service for Apache Flink (MSF) — formerly Kinesis Data Analytics for Apache Flink — runs Flink TaskManagers on AWS-managed compute units called Kinesis Processing Units (KPUs): ~1 vCPU, 4 GB RAM, 50 GB local storage each. AWS handles patching, JobManager HA, and durable checkpoint storage to S3.

Native connectors exist for KDS, MSK, Firehose, DynamoDB Streams, S3, OpenSearch, and Lambda. The default state backend is RocksDB with incremental checkpoints; the default checkpoint interval is 60 seconds.

Key CloudWatch Metrics

MetricWhat it tells you
lastCheckpointDurationHow long the most recent snapshot took (rising = backpressure)
lastCheckpointSizeState size (rising = key cardinality growing)
numberOfFailedCheckpointsRecovery health (non-zero = investigate sinks/state)
currentInputWatermarkHow far the pipeline has advanced in event time
currentOutputWatermarkWhat downstream sees (large gap = late firing)

DataStream API vs Table / SQL API

Flink offers two programming models that compile to the same execution graph.

The DataStream API is imperative (Java/Scala/Python): you write operators (map, filter, keyBy, window, process) explicitly. Use it for complex event processing, custom state machines, side outputs, and operator-level control.

DataStream<TempReading> readings = env
    .fromSource(kinesisSource, watermarkStrategy, "iot-temperature")
    .map(json -> mapper.readValue(json, TempReading.class));

DataStream<Alert> alerts = readings
    .keyBy(TempReading::getDeviceId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AvgTemperature())
    .filter(avg -> avg.value > 80.0)
    .map(avg -> new Alert(avg.deviceId, "Overheating", avg.value));

alerts.sinkTo(kinesisAlertSink);

The Table API / Flink SQL is declarative: describe the computation in SQL, and Flink's planner produces an optimal execution graph. Shorter, more maintainable, accessible to analysts.

CREATE TABLE iot_temperature (
    device_id STRING,
    ts TIMESTAMP_LTZ(3),
    temperature_c DOUBLE,
    WATERMARK FOR ts AS ts - INTERVAL '5' SECONDS
) WITH ('connector' = 'kinesis', ...);

SELECT device_id, window_start, AVG(temperature_c) AS avg_c
FROM TABLE(TUMBLE(TABLE iot_temperature, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
GROUP BY device_id, window_start
HAVING AVG(temperature_c) > 80.0;

The two APIs interoperate: convert DataStream to Table and back. A common pattern: do raw deserialization and enrichment in DataStream API, expose as a Table, let analysts build downstream analytics in SQL.

State Backends and Checkpointing

Stateful streaming means operators remember things between events: a windowed aggregator stores running totals, a join stores unmatched left-side records, a session detector stores last-seen timestamps.

BackendStorageWhen to use
HashMapJVM heapSmall state, lowest latency, easy debugging
RocksDB (incremental)Local disk + remote uploadProduction default; gigabytes of state, low pause times
Filesystem (legacy)Heap + full snapshot to FSMostly superseded by RocksDB

The Barrier-Based Checkpoint Algorithm

Flink's checkpoint is a barrier-based asynchronous distributed snapshot (a Chandy-Lamport variant):

  1. The JobManager triggers a checkpoint at a configured interval (e.g., every 60 s).
  2. Sources receive a numbered barrier and record their offsets/sequence numbers.
  3. Barriers flow through the operator DAG alongside data, preserving order.
  4. Each operator aligns its inputs (in EXACTLY_ONCE mode), then asynchronously snapshots state to the backend.
  5. Operators acknowledge the checkpoint to the JobManager.
  6. Once all tasks ack, the checkpoint is globally durable; sinks commit two-phase transactions.

Imagine a parade: the JobManager periodically inserts a flag-bearer (the barrier). Each viewing station (operator) waits until flag-bearers from every parallel route arrive, then photographs itself (snapshot). The photographs collectively form a consistent global snapshot.

Animation 2 — Flink Barrier Checkpoint Flowing Through Operators

JobManager triggers checkpoint N. The barrier flows from Source → KeyBy → Window → Sink, highlighting each operator as it aligns inputs and snapshots state to S3.
JobManager Source KDS / MSK keyBy + map stateless Window agg RocksDB state Transactional Sink 2PC: preCommit / commit trigger CP N S3 (state backend) RocksDB incremental SST upload All tasks ACK → CP N globally durable → commit()
Checkpoint barrier Active operator Snapshot upload / ACK

Incremental checkpoints (RocksDB) persist only the SST files that changed since the previous snapshot, dramatically reducing checkpoint duration and S3 cost. Savepoints are user-triggered checkpoints used for upgrades, parallelism changes, and version migrations — you stop the job with a savepoint, deploy a new JAR, and restart from the savepoint with zero data loss.

Diagram: Flink Checkpoint Sequence

sequenceDiagram participant JM as JobManager participant Src as Source participant Op as Window Op participant Sink as 2PC Sink participant S3 as S3 backend JM->>Src: trigger CP N (every 60s) Src->>Src: record offsets Src->>Op: barrier N Op->>Op: align inputs Op->>S3: async snapshot (incremental) Op->>JM: ack Op->>Sink: barrier N Sink->>Sink: preCommit (flush, prepare 2PC) Sink->>JM: ack JM->>Sink: notifyCheckpointComplete(N) Sink->>Sink: commit (atomic publish)

Key Points: Stream Processing with Flink

Post-Reading Quiz — Stream Processing with Flink

1. What is a Kinesis Processing Unit (KPU) on Amazon Managed Service for Apache Flink?

A pricing tier for Kinesis Data Streams shards
A managed compute unit (~1 vCPU, 4 GB RAM, 50 GB local storage) on which Flink TaskManagers run
A type of Kinesis consumer that processes records exactly once
A throttle limit imposed on Firehose buffer flushes

2. When should you choose the DataStream API over Flink SQL / Table API?

For all pipelines — SQL is deprecated in modern Flink
Only when you need to write Python instead of Java
When you need fine-grained control: custom state, timers, side outputs, or complex event processing
Only for reading from Kinesis — SQL cannot connect to KDS

3. The default state backend on Amazon Managed Service for Apache Flink is:

HashMap (JVM heap)
Filesystem (full snapshots to S3)
RocksDB with incremental checkpoints
Apache Cassandra

4. In Flink's barrier-based checkpoint algorithm, what happens at an operator running in EXACTLY_ONCE mode when barrier N arrives on one input but not yet on others?

The operator immediately snapshots and discards records arriving on slow inputs
The operator aligns inputs — buffering records that have already passed the barrier on fast inputs until barrier N arrives on every input
The operator emits the barrier downstream immediately to keep latency low
The job fails and restarts from the previous checkpoint

5. What is a savepoint, and how does it differ from a regular checkpoint?

A savepoint is an automatically generated snapshot used during recovery; checkpoints are user-triggered
A savepoint is a user-triggered snapshot (e.g., for upgrades, parallelism changes, or version migrations); checkpoints are automatic and managed by Flink
Savepoints are stored in DynamoDB; checkpoints in S3
There is no difference — the terms are interchangeable
Pre-Reading Quiz — Time, Windows, and Correctness

1. Why should business analytics generally use event time rather than processing time?

Event time is faster because it skips watermark generation
Processing time is unsupported by Flink and AWS streaming services
Event time produces deterministic, reproducible aggregates that don't change under network delays, replays, or backfills
Event time eliminates the need for any state or checkpoints

2. You need a "5-minute rolling CPU average updated every minute." Which window type fits?

Tumbling window of size 5 minutes
Sliding (hopping) window with size 5 minutes and slide 1 minute
Session window with 5-minute gap
Global window with no trigger

3. In Flink, what does WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)) express?

Drop any record that arrives more than 5 seconds after the wall clock
A promise that the engine will fire windows every 5 seconds regardless of data
An assertion that records may arrive at most 5 seconds out of order, so watermark = max event-time seen − 5 s
A maximum allowed checkpoint interval of 5 seconds

4. A Flink job consumes from a Kinesis stream where one shard has gone idle (no producers writing to it). Downstream windows stop firing. What is the correct fix?

Kill the job and restart with parallelism=1
Add withIdleness(Duration) to the watermark strategy so idle partitions are excluded from the min-watermark calculation
Switch to processing time so watermarks are not needed
Increase the checkpoint interval to 10 minutes

5. End-to-end exactly-once in Flink requires the composition of which three layers?

VPC isolation, IAM roles, and KMS encryption
Replayable sources (offsets stored in checkpoint), barrier-based snapshots of operator state, and transactional or idempotent sinks
Tumbling windows, idempotent producers, and acks=1
Lambda triggers, DynamoDB streams, and SQS dead-letter queues

Time, Windows, and Correctness

Event Time vs Processing Time

Event time is the timestamp embedded in the record — when the sensor sampled, when the user clicked, when the trade was executed. Processing time is the wall-clock time when the engine sees the record. They diverge under network delay, mobile offline buffering, queue lag, or backfill.

Consider a phone buffering GPS pings while the user is in the subway. After 20 minutes underground, all pings flush at once. Processing-time view: "1,200 events arrived at 10:23 AM." Event-time view: "the user moved through these stations between 10:00 and 10:20 AM, in this order." The event-time view is correct for any meaningful analytics.

AspectEvent timeProcessing time
Source of timestampRecord fieldSystem clock
ReproducibilityDeterministic under reprocessingNon-deterministic
LatencyHigher (wait for late data)Lower
Use casesBusiness analytics, billing, auditMonitoring, alerting

Rule of thumb: use event time for any business logic that must produce the same result if you replay yesterday's data. Use processing time only for "is the pipeline currently alive" monitoring.

Window Types: Tumbling, Sliding, Session, Global

WindowBest forWatch out for
Tumbling (fixed, non-overlapping)Periodic reports, billing, hourly revenueBoundary spikes (records hopping windows)
Sliding (fixed, overlapping)Rolling averages, "5-min avg every minute"State multiplication (size/slide copies per event) = memory pressure
Session (gap-based, dynamic)User activity, page sessions, IoT burstsGap tuning is application-specific
Global (single never-closing window per key)Lifetime totals, count-based triggersState grows unbounded without TTL
-- Tumbling 1-hour by region
SELECT window_start, region, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_ts), INTERVAL '1' HOUR))
GROUP BY window_start, region;

-- Sliding 5-min/1-min CPU average
SELECT window_start, AVG(cpu_pct)
FROM TABLE(HOP(TABLE host_metrics, DESCRIPTOR(ts), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
GROUP BY window_start;

-- Session 30-min gap
SELECT user_id, window_start, window_end, COUNT(*) AS pageviews
FROM TABLE(SESSION(TABLE clickstream, DESCRIPTOR(ts), INTERVAL '30' MINUTES))
GROUP BY user_id, window_start, window_end;

Watermarks: The Engine's Promise

A watermark W(t) is a monotonically increasing assertion that "no more events with event-time timestamp ≤ t will arrive." It is how the engine knows it is safe to fire an event-time window: once W(t) advances past the window's end, no more in-time data should be coming.

WatermarkStrategyUse when
forMonotonousTimestamps()Records strictly in order (rare)
forBoundedOutOfOrderness(Duration)Most common; allows N seconds of out-of-order tolerance
noWatermarks()Processing-time pipelines
withIdleness(Duration)Mark idle partitions inactive so downstream watermarks can advance

Watermark propagation through multi-input operators: the operator's output watermark equals the minimum of its input watermarks. This guarantees correctness but creates the watermark stall — one slow or idle partition holds back the entire pipeline. Fix with withIdleness(...) and instrument currentInputWatermark.

Late Events: Three Treatments

  1. Dropped (default in many APIs).
  2. Allowed lateness via allowedLateness(Duration): window state stays alive past the watermark; each late event re-fires the window. Downstream sinks must handle multiple emissions (use upsert sinks).
  3. Side output via sideOutputLateData(OutputTag): late events route to a separate stream for inspection or repair.

Animation 3 — Event-Time Window Firing Driven by Watermark

Events arrive (out of order) into a tumbling window. The watermark advances behind them. When W(t) crosses the window end, the window fires. A late event arrives after the watermark and is routed to a side output.
10:00 10:01 10:02 10:03 Tumbling Window [10:00, 10:01) [10:01, 10:02) ts=10:00:10 ts=10:00:35 10:00:25 (oo) ts=10:00:50 ts=10:01:20 W(t) advances › FIRE W > end LATE: ts=10:00:18 Side Output sideOutputLateData() → audit / repair sink Event-Time Window + Watermark + Late Side-Output Watermark = max(event-time) − 5s. Window fires when W(t) ≥ end. Records older than W(t) are late.
In-time event Watermark / out-of-order Window fire Late event → side output

Diagram: Watermark-Driven Window Firing

flowchart TD In["Source records
(event-time ts embedded)"] --> WM["WatermarkStrategy
forBoundedOutOfOrderness(5s)
+ withIdleness(1m)"] WM --> KB["keyBy(deviceId)"] KB --> Win["TumblingEventTimeWindow (1m)"] Win --> Check{event ts vs W(t)?} Check -- "in window, W(t) < end" --> Buf[Buffer in RocksDB state] Check -- "W(t) ≥ window end" --> Fire[Fire window: emit aggregate] Check -- "ts < W(t) (late)" --> Late{allowedLateness still open?} Late -- yes --> ReFire[Re-fire window upsert] Late -- no --> Side[sideOutputLateData
audit / repair] Fire --> Sink[2PC Sink: preCommit on barrier, commit on CP complete] ReFire --> Sink

End-to-End Exactly-Once: Three Composed Layers

  1. Replayable sources — Kafka offsets and Kinesis sequence numbers stored inside the checkpoint. On recovery, the source resumes from the last committed offset.
  2. Internal state — checkpoint barriers ensure operator state is consistent across the snapshot boundary.
  3. Transactional or idempotent sinks — either two-phase commit (TwoPhaseCommitSinkFunction) or idempotent operations (key-based upsert).

The two-phase commit pattern:

beginTransaction() ──> invoke(record) ──> preCommit() ──> commit()
   (start of CP)        (during CP)        (CP barrier)    (CP complete)

When the barrier arrives, preCommit() flushes buffers and prepares the transaction. After notifyCheckpointComplete(checkpointId), commit() atomically exposes writes (Kafka transaction commit, S3 staging-file rename to final). On crash between pre-commit and commit, the transaction is recovered from checkpoint state and either committed or aborted on restart.

Common Flink Failure Modes

SymptomLikely causeFix
Watermark stalls; windows never fireIdle partition pulls min watermark to −∞withIdleness(...)
Checkpoint timeoutSink backpressure or slow state persistenceIncrease checkpointTimeout, reduce parallelism, switch to incremental
Duplicates after restartSink is non-transactional / non-idempotentFileSink commit-on-checkpoint or transactional Kafka producer
Growing checkpoint sizeUnbounded keyed stateSet TTL on state, or use session windows with a gap

Key Points: Time, Windows, and Correctness

Post-Reading Quiz — Time, Windows, and Correctness

1. Why should business analytics generally use event time rather than processing time?

Event time is faster because it skips watermark generation
Processing time is unsupported by Flink and AWS streaming services
Event time produces deterministic, reproducible aggregates that don't change under network delays, replays, or backfills
Event time eliminates the need for any state or checkpoints

2. You need a "5-minute rolling CPU average updated every minute." Which window type fits?

Tumbling window of size 5 minutes
Sliding (hopping) window with size 5 minutes and slide 1 minute
Session window with 5-minute gap
Global window with no trigger

3. In Flink, what does WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)) express?

Drop any record that arrives more than 5 seconds after the wall clock
A promise that the engine will fire windows every 5 seconds regardless of data
An assertion that records may arrive at most 5 seconds out of order, so watermark = max event-time seen − 5 s
A maximum allowed checkpoint interval of 5 seconds

4. A Flink job consumes from a Kinesis stream where one shard has gone idle (no producers writing to it). Downstream windows stop firing. What is the correct fix?

Kill the job and restart with parallelism=1
Add withIdleness(Duration) to the watermark strategy so idle partitions are excluded from the min-watermark calculation
Switch to processing time so watermarks are not needed
Increase the checkpoint interval to 10 minutes

5. End-to-end exactly-once in Flink requires the composition of which three layers?

VPC isolation, IAM roles, and KMS encryption
Replayable sources (offsets stored in checkpoint), barrier-based snapshots of operator state, and transactional or idempotent sinks
Tumbling windows, idempotent producers, and acks=1
Lambda triggers, DynamoDB streams, and SQS dead-letter queues

Your Progress

Answer Explanations