Chapter 9: Replication, Disaster Recovery, and SiteContinuity

Cohesity Certified Architect Expert (CCAE) Study Guide

Learning Objectives

Design active-passive and active-active replication topologies for one-to-one, one-to-many, many-to-one, and cross-cloud patterns.
Calculate replication bandwidth requirements from FETB, daily change rate, deduplication efficiency, and replication window.
Architect orchestrated DR with Cohesity SiteContinuity runbooks, including the full Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete state machine.
Differentiate CloudSpin, CloudReplicate, CloudArchive, and CloudTier and justify the right option for a given recovery scenario.
Identify the network ports, bonding modes, and MTU settings that must be present for replication to succeed at scale.

1. Replication Topologies

Pre-Section Check — Section 1

1. A regional bank operates a primary data center and 12 branch offices. Backup data from each branch must consolidate to the central DC for retention and reporting. Which Cohesity replication topology fits best?

One-to-one
One-to-many
Many-to-one (fan-in)
Cross-cloud only

2. Where is replication configured in Cohesity?

On the Protection Group
On the Protection Policy
On the View Box only
On the source object directly

Cohesity replication is a Protection-Group-aware operation: the source cluster sends only unique, deduplicated, compressed chunks to the target cluster, and the target keeps a fully addressable copy that can be searched, recovered, mounted, or orchestrated independently of the source. The replica is a working cluster, not a tape, which is what makes the four classic topologies viable.

The Four Canonical Topologies

Topology	Pattern	Use Case	Pros	Cons
One-to-one	A -> B	Two-site enterprise DR	Simple, symmetric failback	No tertiary copy
One-to-many	A -> B + C	In-region DR + cyber-vault	Geographic diversity	Higher WAN cost
Many-to-one	A,B,C,D -> Hub	ROBO consolidation	Centralized retention	Hub is a single failure domain
Cross-cloud	On-prem -> AWS/Azure CE	Cloud-as-DR	OPEX, no second DC	Egress costs on recall

Figure 9.1: Replication Topology Variants

Animation: Replication Topology Build-up (1:1 -> 1:Many -> Many:1 -> Cross-Cloud)

Replication Policies and Retention

Replication is configured on the Protection Policy, not on the Protection Group. Alongside the local snapshot schedule, a policy can specify multiple copy targets, such as a replication target cluster and CloudArchive, each with its own retention. Retention on the replica is independent of the source: for example, 14 daily on primary, 30 daily + 12 monthly on DR, and 7 yearly on CloudArchive, all driven by one policy.
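
The retention example above can be pictured as plain data: one policy, several copy targets, each with its own retention. The sketch below is illustrative only. The class names, field names, and the policy name "gold" are hypothetical and do not mirror the Cohesity API or policy schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Retention:
    """How many copies of each granularity a target keeps (hypothetical model)."""
    daily: int = 0
    monthly: int = 0
    yearly: int = 0


@dataclass
class ProtectionPolicy:
    """Toy stand-in for a policy: one retention rule per copy target."""
    name: str
    targets: dict[str, Retention] = field(default_factory=dict)

    def describe(self) -> list[str]:
        lines = []
        for target, r in self.targets.items():
            counts = ((r.daily, "daily"), (r.monthly, "monthly"), (r.yearly, "yearly"))
            parts = [f"{n} {label}" for n, label in counts if n]
            lines.append(f"{target}: " + " + ".join(parts))
        return lines


policy = ProtectionPolicy(
    name="gold",  # hypothetical policy name
    targets={
        "primary (local snapshot)": Retention(daily=14),
        "DR cluster (replication)": Retention(daily=30, monthly=12),
        "CloudArchive": Retention(yearly=7),
    },
)

for line in policy.describe():
    print(line)
```

The point of the model is that retention hangs off the target, not off the Protection Group, so changing the DR retention never touches the primary copy.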

Key Takeaway: Topology decisions are driven by failure domain, not by bandwidth. Pick the topology that aligns with your blast-radius assumptions, then size WAN and retention to match.

Key Points

Four canonical topologies: 1:1, 1:Many, Many:1 (fan-in), and Cross-Cloud — choose by failure domain.
Replication is configured on the Protection Policy, not on the Protection Group.
The replica is a working cluster — searchable, mountable, and SiteContinuity-orchestratable.
Retention on the target is fully independent of the source.
A typical enterprise blends all four topologies (hub-and-spoke for ROBO, 1:1 between DCs, cross-cloud for cyber-vault).

2. Replication Mechanics and Tuning

Pre-Section Check — Section 2

3. The SaaS Connector accepts a throttle value of "100" with a setting of MB/s. What wire-rate does this represent?

100 Mbps
100 MBps = 800 Mbps
12.5 Mbps
1 Gbps regardless

4. Which of the following is NOT a required TCP port between Cohesity replication source and target?

443
111
20000
3389

Encrypted, Compressed, Deduplicated Wire Format

Cohesity replication is always encrypted in flight (TLS) and is deduplicated and compressed at the chunk level before the wire. The source queries the target's chunk fingerprint database and only ships chunks the target does not already have. This global dedup on the wire is the dominant reason Cohesity hits aggressive RPOs over modest WAN links.
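
The fingerprint exchange described above can be illustrated with a toy model. This is a minimal sketch and not Cohesity's implementation: SHA-256 over fixed byte strings is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint for a chunk (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_ship(source_chunks: list[bytes], target_index: set[str]) -> list[bytes]:
    """Return only the chunks whose fingerprint the target does not already hold."""
    to_ship = []
    seen = set(target_index)  # copy so the caller's index is not mutated
    for chunk in source_chunks:
        fp = fingerprint(chunk)
        if fp not in seen:
            to_ship.append(chunk)
            seen.add(fp)  # a chunk repeated within this run is shipped once
    return to_ship


# The target already holds block-A and block-B from earlier runs.
target_index = {fingerprint(b"block-A"), fingerprint(b"block-B")}
run = [b"block-A", b"block-C", b"block-C", b"block-D"]

shipped = chunks_to_ship(run, target_index)
print(shipped)  # [b'block-C', b'block-D']
```

Four chunks changed on the source, but only two cross the WAN. That ratio is the on-wire reduction used in the bandwidth math later in this chapter.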

Three Throttling Surfaces

SaaS Connector throttling for DataProtect-as-a-Service — per-connector, day/time windows, split upload/download. Values are in bytes per second.
Source-agent throttling for individual physical hosts.
Cluster-to-cluster windows on the Protection Policy.

Common gotcha: SaaS Connector throttle is in bytes per second, opposite of Veritas AIR convention. "100 MB/s" is 800 Mbps on the wire.
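
The conversion is a factor of 8 in either direction. A minimal sketch with hypothetical helper names:

```python
def mb_per_s_to_mbps(megabytes_per_second: float) -> float:
    """Bytes-based throttle value -> bits-based wire rate."""
    return megabytes_per_second * 8


def mbps_to_mb_per_s(megabits_per_second: float) -> float:
    """Bits-based circuit speed -> bytes-based throttle value."""
    return megabits_per_second / 8


print(mb_per_s_to_mbps(100))  # 800
print(mbps_to_mb_per_s(100))  # 12.5
```

The second call is the trap in reverse: to cap a connector at 100 Mbps of circuit, the throttle value to enter is 12.5 MB/s, not 100.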

Network Architecture and Required Ports

Cohesity recommends 2x 10 GbE LACP Bond Mode 4 with MLAG/VPC for replication-target clusters and ROBO nodes — 20 Gbps combined per node.

Port	Protocol	Purpose
443	TCP	HTTPS / API control plane
111	TCP	Portmap / RPC
20000	TCP	Replication data channel
24444	TCP	Replication control / metadata
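
A firewall change can be checked before the first replication run with a plain TCP connect test. The sketch below uses only the Python standard library. It is a generic reachability probe, not a Cohesity tool, and the function name is hypothetical.

```python
import socket

# Ports from the table above.
REPLICATION_PORTS = (443, 111, 20000, 24444)


def check_tcp_ports(host: str, ports=REPLICATION_PORTS, timeout: float = 3.0) -> dict[int, bool]:
    """Attempt a TCP connect to each port; True means the handshake completed."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            results[port] = False
    return results
```

Run from a host on the source cluster's network, `check_tcp_ports("dr-cluster.example.com")` (a placeholder hostname) returns a port-to-boolean map. A completed handshake only shows that the path is open; it does not prove the replication service itself is healthy.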

Enable jumbo frames (MTU 9000) end-to-end. A single device that does not honor 9000-byte frames silently fragments or drops, tanking throughput.
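
One common way to verify an end-to-end MTU is a ping with the do-not-fragment bit set and a payload sized to fill the frame exactly. The payload is the MTU minus the IPv4 header (20 bytes without options) and the ICMP header (8 bytes). A small sketch of that arithmetic, with a hypothetical helper name:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(max_icmp_payload(9000))  # 8972
print(max_icmp_payload(1500))  # 1472
```

On Linux, for example, `ping -M do -s 8972 <target>` should succeed across a path that honors 9000-byte frames and fail if any hop does not. Flag names differ on other operating systems.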

Initial Seed Strategies

Wire seeding with extended window — quiet multi-day window, relaxed throttles.
Physical seed transport — replicate to a portable cluster on-site, ship it, re-home the target.

Replication Failure Handling

Cohesity replication is checkpointed: a partial run resumes from the last committed chunk rather than restarting. Persistent failures generate Helios alerts and pause the policy after configurable retry attempts.
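
The resume behavior can be pictured with a toy transfer loop. This is a simplified sketch, not the product's internals: the function name, the list-of-strings chunk model, and the `fail_at` parameter are hypothetical.

```python
def replicate(chunks, checkpoint=0, fail_at=None):
    """Send chunks[checkpoint:] and return (next_checkpoint, chunks_sent).

    fail_at simulates a link drop just before that chunk index is sent.
    """
    sent = []
    for i in range(checkpoint, len(chunks)):
        if fail_at is not None and i == fail_at:
            return i, sent  # checkpoint is the first uncommitted chunk
        sent.append(chunks[i])
    return len(chunks), sent


chunks = ["c0", "c1", "c2", "c3", "c4"]

checkpoint, first = replicate(chunks, fail_at=3)               # link drops before c3
checkpoint, second = replicate(chunks, checkpoint=checkpoint)  # retry resumes at c3

print(first, second)  # ['c0', 'c1', 'c2'] ['c3', 'c4']
```

Nothing committed in the first attempt is resent in the second, which is why a flapping WAN link costs time but not a full restart.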

Key Takeaway: Replication mechanics are built on dedup, compression, and TLS — but they only work if you've sized 2x10 GbE LACP, opened TCP 443/111/20000/24444, enabled MTU 9000 end-to-end, and remembered SaaS Connector throttles are in bytes per second.

Key Points

Replication is always TLS-encrypted, dedup'd, and compressed before the wire — "global dedup on the wire."
Required TCP ports source-to-target: 443, 111, 20000, 24444.
Baseline network: 2x10 GbE LACP Bond Mode 4 with MLAG/VPC; jumbo frames (MTU 9000) end-to-end.
SaaS Connector throttles are bytes per second — multiply by 8 to get bits per second.
Replication is checkpointed; resume from last committed chunk after failure.

3. Replication Bandwidth Math

Pre-Section Check — Section 3

5. A workload has 100 TB FETB, 4% daily change, 50% on-wire reduction (dedup+compress), and an 8-hour replication window. What is the approximate required throughput in MB/s?

~25 MB/s
~70 MB/s
~140 MB/s
~560 MB/s

6. What is the recommended planning ceiling for sustainable replication throughput on a WAN link?

100% of nominal wire speed
75% of nominal wire speed
50% of nominal wire speed
25% of nominal wire speed

The single formula every CCAE must memorize:

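Required throughput = (FETB × daily change rate × (1 - dedup and compression reduction)) / replication window

Here the reduction is the combined on-wire saving from deduplication and compression, expressed as a fraction (0.60 for a 60% reduction), and the window is in seconds when the result is wanted per second.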

Bytes-vs-Bits Reference

Wire Speed (bits)	Theoretical Bytes	50% Utilization Target
1 Gbps	125 MB/s	62.5 MB/s
10 Gbps	1,250 MB/s (1.25 GB/s)	625 MB/s
40 Gbps	5,000 MB/s	2,500 MB/s
100 Gbps	12,500 MB/s	6,250 MB/s

Worked Example: 50 TB FETB, 5% Daily Change, 4-Hour Window

Step 1 - Daily change in bytes:
  50 TB x 0.05 = 2.5 TB of change per day

Step 2 - Apply dedup/compression (60% reduction):
  2.5 TB x 0.40 = 1.0 TB on the wire

Step 3 - Divide by replication window (4 hr = 14,400 s):
  1,000,000 MB / 14,400 s = ~69.4 MB/s

Step 4 - Convert to bits per second:
  69.4 MB/s x 8 = ~556 Mbps

Step 5 - Apply 50% utilization headroom:
  556 Mbps / 0.50 = ~1.11 Gbps minimum WAN provisioning
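
The five steps above can be reproduced in a few lines. This is a sketch under the same assumptions as the worked example: decimal units (1 TB = 1,000,000 MB) and a 50% utilization ceiling. The function and key names are hypothetical.

```python
def wan_requirement(fetb_tb, daily_change, reduction, window_hours, utilization=0.50):
    """Follow the worked example: bytes first, bits last, then headroom."""
    changed_mb = fetb_tb * 1_000_000 * daily_change   # Step 1: daily change, TB -> MB
    on_wire_mb = changed_mb * (1 - reduction)         # Step 2: after dedup/compression
    mb_per_s = on_wire_mb / (window_hours * 3600)     # Step 3: divide by window in seconds
    mbps = mb_per_s * 8                               # Step 4: bytes -> bits
    provision_gbps = mbps / utilization / 1000        # Step 5: headroom, Mbps -> Gbps
    return {"mb_per_s": mb_per_s, "mbps": mbps, "provision_gbps": provision_gbps}


r = wan_requirement(fetb_tb=50, daily_change=0.05, reduction=0.60, window_hours=4)
print(f"{r['mb_per_s']:.1f} MB/s, {r['mbps']:.0f} Mbps, {r['provision_gbps']:.2f} Gbps")
# 69.4 MB/s, 556 Mbps, 1.11 Gbps
```

Feeding in the Section 3 check question (100 TB, 4% change, 50% reduction, 8-hour window) gives the same ~69 MB/s, because doubling FETB and doubling the window cancel out.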

The RPO / Bandwidth / Dedup Trade-offs

Knob	What It Means	Cost / Constraint
Increase WAN bandwidth	Buy more circuit	$$$, weeks lead time
Lengthen RPO (window)	Replicate every 8 hr instead of 4	Free, but business decision
Reduce change rate	Tighter PG scoping; exclude logs/scratch	Cheap, bounded
Improve dedup on wire	Better source filtering, larger target retention pool	Modest gains
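
To see how far each knob moves the requirement, the formula can be evaluated with one input changed at a time. The scenario values below are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def required_mb_per_s(fetb_tb, daily_change, reduction, window_hours):
    """Required replication throughput in MB/s (decimal units)."""
    return fetb_tb * 1_000_000 * daily_change * (1 - reduction) / (window_hours * 3600)


# Baseline matches the worked example in this section.
baseline = dict(fetb_tb=50, daily_change=0.05, reduction=0.60, window_hours=4)

# Each scenario overrides exactly one input (illustrative values).
scenarios = {
    "baseline": {},
    "window 4 h -> 8 h": {"window_hours": 8},
    "change rate 5% -> 3%": {"daily_change": 0.03},
    "reduction 60% -> 70%": {"reduction": 0.70},
}

for label, override in scenarios.items():
    rate = required_mb_per_s(**{**baseline, **override})
    print(f"{label:<24} {rate:6.1f} MB/s")
```

Buying more WAN bandwidth is the one knob that does not appear here: it leaves the requirement unchanged and raises the ceiling the requirement is compared against.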

Key Takeaway: Memorize the bandwidth equation, work in bytes first, multiply by 8 last, and budget for 50% of wire speed. The bytes-vs-bits trap has cost more than one architect a six-figure WAN over-provisioning bill.

Key Points

Formula: Throughput = (FETB × change rate × (1 - dedup)) / window.
Always compute in bytes, then multiply by 8 to get bits-per-second for the WAN team.
Plan for 50% of nominal wire speed as the sustainable ceiling.
Halving the window doubles throughput requirement; doubling change rate doubles it again.
Four knobs to close the math: bandwidth, window/RPO, change rate, and on-wire dedup.

4. SiteContinuity Orchestration

Pre-Section Check — Section 4

7. After a successful Failover, a DR Plan is in "Failover Complete." The architect attempts to Failback directly. What happens?

Failback proceeds normally
Failback fails: Prepare for Failback must run first to reverse-seed data
Failback runs but with degraded RPO
The plan automatically transitions to Failback Ready

8. Does running a Test Failover on a SiteContinuity DR Plan change the plan's underlying state?

Yes, it transitions to Failover In Progress
Yes, but only briefly
No: Test variants are non-disruptive and do not change state
Only if the test passes

9. Which of the following bounds RPO in a SiteContinuity DR design?

Runbook execution time
Replication frequency
Number of VMs in the DR Application
Boot-order serialization

Replication moves the data; SiteContinuity moves the application. SiteContinuity is Cohesity's DR orchestration engine that turns "we have replicated VMs" into "we have a runnable runbook with a measurable RTO."

The Runbook Analogy

A SiteContinuity runbook is an emergency-evacuation plan for an office building:

Who goes first? — Dependency order (DCs/DNS first, then app servers, then web).
Where do they go? — Resource Profile (target compute, port group, datastore).
What address? — Re-IP and VLAN mapping at the DR site.
How do you know everyone got out? — Validation steps.

Failback is going home after the storm: same plan in reverse, only after the building has been certified safe.

Runbook Building Blocks

DR Applications — logical groups of VMs with defined boot order.
Resource Profiles — reusable mappings of target vCenter/datastore/port group/IP rules.
Failback Resource Set — separate resource definition for failback (Edit -> Add Resource Set).
Snapshot Selection — latest by default, override for explicit RPO.
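
Boot order inside a DR Application is a dependency-ordering problem. The sketch below shows one way to derive boot tiers from declared dependencies using Python's standard `graphlib` module. The VM names and the `boot_tiers` helper are hypothetical, and this is an illustration of the idea rather than how SiteContinuity computes its order.

```python
from graphlib import TopologicalSorter

# Each key depends on the VMs in its value set (all names are hypothetical).
dependencies = {
    "dc01": set(),
    "dns01": set(),
    "app01": {"dc01", "dns01"},
    "app02": {"dc01", "dns01"},
    "web01": {"app01", "app02"},
}


def boot_tiers(deps):
    """Group VMs into tiers; every VM in a tier can power on in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    tiers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        tiers.append(ready)
        sorter.done(*ready)
    return tiers


print(boot_tiers(dependencies))
# [['dc01', 'dns01'], ['app01', 'app02'], ['web01']]
```

Tiers run one after another, so the number of tiers, not the number of VMs, is what stretches runbook execution time.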

Figure 9.2: SiteContinuity DR Plan State Machine

stateDiagram-v2
    [*] --> FailoverReady
    FailoverReady --> FailoverInProgress: Failover
    FailoverInProgress --> FailoverComplete
    FailoverComplete --> PrepareFailbackInProgress: Prepare for Failback
    PrepareFailbackInProgress --> FailbackReady: reverse seed complete
    FailbackReady --> FailbackInProgress: Failback
    FailbackInProgress --> FailbackComplete
    FailbackComplete --> FailoverReady: Prepare for Failover
    FailoverReady --> FailoverReady: Test Failover (no state change)
    FailbackReady --> FailbackReady: Test Failback (no state change)
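
The state machine in Figure 9.2 can be expressed as a small transition table. The sketch below is a study aid with hypothetical class and exception names. It collapses the transient "In Progress" states for brevity and does not call any Cohesity API.

```python
class InvalidTransition(Exception):
    """Raised when an action is not allowed from the plan's current state."""


# (current state, action) -> resulting state, per Figure 9.2.
TRANSITIONS = {
    ("Failover Ready", "Failover"): "Failover Complete",
    ("Failover Complete", "Prepare for Failback"): "Failback Ready",
    ("Failback Ready", "Failback"): "Failback Complete",
    ("Failback Complete", "Prepare for Failover"): "Failover Ready",
}

# Test variants are allowed from these states and never change state.
TEST_ACTIONS = {
    ("Failover Ready", "Test Failover"),
    ("Failback Ready", "Test Failback"),
}


class DRPlan:
    def __init__(self):
        self.state = "Failover Ready"

    def run(self, action):
        key = (self.state, action)
        if key in TEST_ACTIONS:
            return self.state  # non-disruptive: state is unchanged
        if key not in TRANSITIONS:
            raise InvalidTransition(f"{action!r} is not allowed from {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


plan = DRPlan()
plan.run("Test Failover")  # still Failover Ready
plan.run("Failover")       # now Failover Complete
try:
    plan.run("Failback")   # skips Prepare for Failback
except InvalidTransition as err:
    print(err)  # 'Failback' is not allowed from 'Failover Complete'
```

The rejected call is the operational error this section warns about: Failback has no edge out of Failover Complete until Prepare for Failback has reverse-seeded the primary.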

Animation: SiteContinuity State Machine Walkthrough

Failover Procedure (Production Cutover)

DR Plans -> Disaster Recovery Plans.
Select plan -> Actions -> Failover.
Choose Resource Profile (network mapping, IP customization, target compute).
Optionally enable Protect VMs at DR Site — closes the backup gap at moment of failover.
Confirm snapshot (latest or earlier RPO).
Type YES to confirm.
Validate via DR vCenter: boot order, networks, application functionality, PG state.

Figure 9.3: Failover Orchestration Sequence

sequenceDiagram
    participant Op as Operator
    participant Helios as Helios / SiteContinuity
    participant Src as Source Cluster
    participant Tgt as DR Target Cluster
    participant vC as DR vCenter
    participant VM as Recovered VMs
    Op->>Helios: Trigger Failover (DR Plan)
    Helios->>Helios: Validate Resource Profile + snapshot
    Helios->>Src: Quiesce replication (if reachable)
    Helios->>Tgt: Select latest replicated snapshot
    Tgt->>Tgt: Mount SnapTree view
    Helios->>vC: Register VMs from mounted view
    vC->>VM: Apply IP customization + VLAN mapping
    vC->>VM: Power on (boot order: DC, App, Web)
    VM-->>vC: Services up
    vC-->>Helios: Boot validation OK
    Helios->>Tgt: Optionally protect VMs at DR site
    Helios-->>Op: State = Failover Complete

Animation: Failover Orchestration Timeline (Source Down -> Recovery Complete). Stages: Source Site (site down) -> Helios Detect (alert + validate plan) -> DR Cluster (mount SnapTree snapshot) -> DR vCenter (VM power-on, boot order DC -> App -> Web) -> Users (DNS/GSLB redirect complete).

Prepare for Failback

Confirm plan is in Failover Complete.
Actions -> Prepare for Failback — drives reverse replication from DR back to primary.
Plan moves: Prepare for Failback In Progress -> Failback Ready.

Failback Procedure

Actions -> Failback.
Pick failback Resource Profile; confirm snapshot (VADP or CDP).
Type YES.
Validate against primary vCenter.
After Failback Complete -> Actions -> Prepare for Failover to re-seed the DR site.

RTO and RPO Measurement

RPO is bounded by replication frequency (4-hr replication = 4-hr RPO worst case, plus however long the replication run itself takes to complete).
RTO is bounded by runbook execution time (boot order, IP customization, app warmup).
Regular Test Failovers are the single dominant predictor of successful real-world recoveries.

Key Takeaway: A SiteContinuity DR Plan is a state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> back to Failover Ready. Test variants do not change state. Skipping "Prepare for Failback" is the most common operational error.

Key Points

SiteContinuity transitions: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> (Prepare for Failover) -> Failover Ready.
Test Failover and Test Failback do not change state — safe to run on a schedule.
"Prepare for Failback" runs reverse replication — mandatory before actual Failback.
RPO is bounded by replication frequency; RTO by runbook execution time.
Resource Profile = target vCenter + datastore + port group + IP customization rules; Failback Resource Set is added separately.

5. Cloud-Based DR Options

Pre-Section Check — Section 5

10. A customer wants both a cloud-based DR target with full SiteContinuity orchestration AND long-term compliance retention in cloud object storage. Which combination is correct?

CloudTier + CloudArchive
CloudReplicate + CloudArchive
CloudSpin only
CloudTier alone

11. What is the most important operational caveat about CloudTier?

It produces a redundant DR copy
Once enabled on a View Box, it cannot be disabled
It only works with Azure Blob
It requires SiteContinuity

12. Which Cohesity feature converts an on-prem backup snapshot into a native AWS EC2 instance with EBS volumes on demand?

CloudArchive Direct
CloudReplicate
CloudSpin
CloudTier

The Four Cloud Features

CloudReplicate — DR Replica as a Working Cluster

Replicates Protection Group snapshots to a Cohesity Cloud Edition cluster in AWS or Azure. Granular file recovery, IMR, View mounting, and SiteContinuity all work on the cloud-side replica. Use when the cloud is your DR site.

CloudSpin — On-Demand Native Cloud VMs

Converts a backup snapshot into a native cloud VM (EC2 + EBS, or Azure VM + Managed Disks). Active trigger by user. Use for dev/test clones or lightweight cloud DR without a Cloud Edition cluster.

CloudArchive — Long-Term Retention

Separate archival copy in cloud object storage (Amazon S3, Azure Blob, Google Cloud Storage), driven by Protection Policy schedules. The source cluster keeps a local index for search.

CloudArchive Direct streams full data blocks directly to the cloud; only the index stays on-prem.

CloudTier — Capacity Overflow (NOT a DR Tool)

Auto-tiers cold blocks (default >60 days) to cloud object storage when local capacity exceeds 80%. Two critical facts:

Once enabled, CloudTier cannot be disabled.
CloudTier is not a DR copy — the cluster is still the only authoritative copy.

Decision Matrix

Feature	Purpose	Result in Cloud	Reversible?	Trigger	Use Case
CloudSpin	Native cloud VM	EC2 / Azure VM	Yes (ephemeral)	On-demand	Dev/test, light DR
CloudReplicate	Cloud DR replica	Full Cohesity cluster	Yes	Policy	Cloud-as-DR-site
CloudArchive	LTR / compliance	Cohesity-format archive	Yes	Policy	7-yr retention
CloudArchive Direct	LTR, minimal local	Cohesity archive (data direct)	Yes	Policy	Local archive exhausted
CloudTier	Capacity relief	Object-storage extension	No	Auto threshold	Capacity overflow only

Figure 9.4: Cloud DR Decision Tree

graph TD
    Start[Cloud Use Case?] --> Q1{Primary Goal?}
    Q1 -->|Disaster Recovery| Q2{Need Cohesity features<br/>at recovery time?}
    Q1 -->|Long-Term Retention| Q3{Local archive<br/>capacity available?}
    Q1 -->|Capacity Relief| CT[CloudTier<br/>WARNING: Irreversible<br/>NOT a DR tool]
    Q2 -->|Yes - IMR, search,<br/>SiteContinuity| CR[CloudReplicate<br/>Full Cohesity cluster<br/>in AWS/Azure]
    Q2 -->|No - just need<br/>native cloud VM| CS[CloudSpin<br/>EC2/Azure VM<br/>on-demand conversion]
    Q3 -->|Yes| CA[CloudArchive<br/>Cohesity-format<br/>in object storage]
    Q3 -->|No - exhausted| CAD[CloudArchive Direct<br/>Index local, data<br/>streamed direct to cloud]
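
The decision tree in Figure 9.4 reduces to three questions, which makes it easy to rehearse as a function. The sketch below is a study aid. The function name, parameter names, and the goal labels "dr", "ltr", and "capacity" are hypothetical.

```python
def recommend_cloud_feature(goal, needs_cohesity_features=False, local_archive_capacity=True):
    """Walk the Figure 9.4 decision tree and return the matching feature name."""
    if goal == "dr":
        # IMR, search, or SiteContinuity at recovery time needs a working cluster.
        return "CloudReplicate" if needs_cohesity_features else "CloudSpin"
    if goal == "ltr":
        return "CloudArchive" if local_archive_capacity else "CloudArchive Direct"
    if goal == "capacity":
        return "CloudTier"  # irreversible once enabled, and not a DR copy
    raise ValueError(f"unknown goal: {goal!r}")


# A customer who wants orchestrated cloud DR plus compliance retention
# asks the tree two separate questions and combines the answers.
answers = [
    recommend_cloud_feature("dr", needs_cohesity_features=True),
    recommend_cloud_feature("ltr"),
]
print(" + ".join(answers))  # CloudReplicate + CloudArchive
```

Modelling the goal as a single value mirrors the tree's first branch: each feature answers one goal, so a combined requirement means walking the tree once per goal.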

Recovery into VMware Cloud and Azure VMware Solution

VMC and AVS are first-class targets — native vCenter that SiteContinuity drives directly, no CloudSpin conversion, no Cloud Edition cluster required. Trade-off: SDDC cost vs. Cloud Edition cost.

Comparing Cost vs. Recovery Options

Recovery Option	Cost Profile	Recovery Speed	Operational Familiarity
Second physical DC + Cohesity	Highest CAPEX, lowest OPEX	Fastest	Highest
Cloud Edition + CloudReplicate + SiteContinuity	Medium	Fast	High
VMC / AVS + CloudReplicate	High	Fast	Highest (native vCenter)
CloudSpin only	Lowest	Slow (per-VM conversion)	Lower
CloudArchive + manual restore	Cheapest	Slowest	Lowest

Key Takeaway: CloudReplicate gives you a working cluster, CloudSpin gives you a native VM, CloudArchive gives you compliance retention, and CloudTier gives you capacity relief. Only the first three are DR options. CloudTier is irreversible once enabled.

Key Points

CloudReplicate: continuous policy-driven replication to a full Cohesity cluster in cloud (full DR semantics).
CloudSpin: on-demand conversion of a snapshot into a native EC2 / Azure VM — no Cohesity at recovery time.
CloudArchive: long-term retention as a separate archive copy in object storage.
CloudTier is irreversible and is NOT a DR tool — capacity overflow only.
VMC and AVS allow SiteContinuity to drive native vCenter directly, no conversion needed.

Chapter Summary

Topology choice (1:1, 1:Many, Many:1, Cross-Cloud) is driven by failure-domain assumptions, not bandwidth.
Replication is encrypted, deduplicated, and compressed on the wire; throughput scales linearly with node count.
Baseline: 2x10 GbE LACP, jumbo frames end-to-end, TCP 443/111/20000/24444.
Bandwidth formula: (FETB × change × (1 - dedup)) / window. Work in bytes, multiply by 8 last, plan for 50% of wire speed.
SaaS Connector throttles are bytes/second, not bits/second.
SiteContinuity state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete. Test variants do not change state.
CloudReplicate, CloudSpin, CloudArchive, CloudTier — four distinct tools for four distinct problems. CloudTier is irreversible and is NOT a DR tool.
Test Failovers are the dominant predictor of real-world recovery success.
