Chapter 9: Replication, Disaster Recovery, and SiteContinuity
Cohesity Certified Architect Expert (CCAE) Study Guide
Learning Objectives
Design active-passive and active-active replication topologies for one-to-one, one-to-many, many-to-one, and cross-cloud patterns.
Calculate replication bandwidth requirements from FETB, daily change rate, deduplication efficiency, and replication window.
Architect orchestrated DR with Cohesity SiteContinuity runbooks, including the full Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete state machine.
Differentiate CloudSpin, CloudReplicate, CloudArchive, and CloudTier and justify the right option for a given recovery scenario.
Identify the network ports, bonding modes, and MTU settings that must be present for replication to succeed at scale.
1. Replication Topologies
Pre-Section Check — Section 1
1. A regional bank operates a primary data center and 12 branch offices. Backup data from each branch must consolidate to the central DC for retention and reporting. Which Cohesity replication topology fits best?
A. One-to-one
B. One-to-many
C. Many-to-one (fan-in)
D. Cross-cloud only
2. Where is replication configured in Cohesity?
A. On the Protection Group
B. On the Protection Policy
C. On the View Box only
D. On the source object directly
Cohesity replication is a Protection-Group-aware operation: the source cluster sends only unique, deduplicated, compressed chunks to the target cluster, and the target keeps a fully addressable copy that can be searched, recovered, mounted, or orchestrated independently of the source. The replica is a working cluster, not a tape, which is what makes the four classic topologies viable.
The Four Canonical Topologies
| Topology | Pattern | Use Case | Pros | Cons |
|---|---|---|---|---|
| One-to-one | A -> B | Two-site enterprise DR | Simple, symmetric failback | No tertiary copy |
| One-to-many | A -> B + C | In-region DR + cyber-vault | Geographic diversity | Higher WAN cost |
| Many-to-one | A,B,C,D -> Hub | ROBO consolidation | Centralized retention | Hub is a single failure domain |
| Cross-cloud | On-prem -> AWS/Azure CE | Cloud-as-DR | OPEX, no second DC | Egress costs on recall |
Figure 9.1: Replication Topology Variants
flowchart LR
subgraph OneToOne["1:1 (Active-Passive DR)"]
A1[Cluster A Primary] -->|replicate| B1[Cluster B DR Site]
end
subgraph OneToMany["1:Many (Geographic Diversity)"]
A2[Cluster A Primary] -->|replicate| B2[Cluster B In-Region DR]
A2 -->|replicate| C2[Cluster C Cross-Region Vault]
end
subgraph ManyToOne["Many:1 (ROBO Fan-In)"]
S1[Spoke A ROBO] -->|replicate| H[Hub Cluster Central DC]
S2[Spoke B ROBO] -->|replicate| H
S3[Spoke C ROBO] -->|replicate| H
end
subgraph CrossCloud["Cross-Cloud (Cloud-as-DR)"]
OP[On-Prem Cluster] -->|replicate| CE[Cloud Edition AWS / Azure]
end
Replication is configured on the Protection Policy, not on the Protection Group. A single policy can define several copies, each with its own retention: the local snapshot plus "external" targets such as a replication target cluster and CloudArchive. Retention on the replica is independent of the source. For example, one policy can keep 14 dailies on the primary, 30 dailies plus 12 monthlies on the DR cluster, and 7 yearlies on CloudArchive.
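To make the "one policy, independent retention per copy" idea concrete, here is a minimal data-model sketch. The class names, field names, and target names are hypothetical and chosen for illustration; they are not Cohesity API objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CopyTarget:
    """One copy produced by a policy (illustrative model, not a Cohesity API)."""
    kind: str          # "local", "replication", or "archive"
    destination: str   # hypothetical cluster or external-target name
    retention: dict    # e.g. {"daily": 14}


@dataclass
class ProtectionPolicy:
    name: str
    targets: list = field(default_factory=list)

    def retention_for(self, kind: str) -> dict:
        """Return the retention rules of the first target of the given kind."""
        for target in self.targets:
            if target.kind == kind:
                return target.retention
        raise KeyError(f"policy {self.name!r} has no {kind!r} target")


# The retention example from the text, expressed as one policy.
policy = ProtectionPolicy(
    name="gold-dr",
    targets=[
        CopyTarget("local", "primary-cluster", {"daily": 14}),
        CopyTarget("replication", "dr-cluster", {"daily": 30, "monthly": 12}),
        CopyTarget("archive", "cloud-archive", {"yearly": 7}),
    ],
)

print(policy.retention_for("replication"))
```

The point of the model is that retention hangs off each target, not off the policy as a whole, so changing the DR retention never touches the primary's.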
Key Takeaway: Topology decisions are driven by failure domain, not by bandwidth. Pick the topology that aligns with your blast-radius assumptions, then size WAN and retention to match.
Key Points
Four canonical topologies: 1:1, 1:Many, Many:1 (fan-in), and Cross-Cloud — choose by failure domain.
Replication is configured on the Protection Policy, not on the Protection Group.
The replica is a working cluster — searchable, mountable, and SiteContinuity-orchestratable.
Retention on the target is fully independent of the source.
A typical enterprise blends several of these topologies (hub-and-spoke for ROBO, 1:1 between data centers, cross-cloud for the cyber-vault).
Post-Section Check — Section 1
1. A regional bank operates a primary data center and 12 branch offices. Backup data from each branch must consolidate to the central DC for retention and reporting. Which Cohesity replication topology fits best?
A. One-to-one
B. One-to-many
C. Many-to-one (fan-in)
D. Cross-cloud only
2. Where is replication configured in Cohesity?
A. On the Protection Group
B. On the Protection Policy
C. On the View Box only
D. On the source object directly
2. Replication Mechanics and Tuning
Pre-Section Check — Section 2
3. The SaaS Connector accepts a throttle value of "100" with a setting of MB/s. What wire-rate does this represent?
4. Which of the following is NOT a required TCP port between Cohesity replication source and target?
A. 443
B. 111
C. 20000
D. 3389
Encrypted, Compressed, Deduplicated Wire Format
Cohesity replication is always encrypted in flight (TLS) and is deduplicated and compressed at the chunk level before the wire. The source queries the target's chunk fingerprint database and only ships chunks the target does not already have. This global dedup on the wire is the dominant reason Cohesity hits aggressive RPOs over modest WAN links.
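The fingerprint exchange described above can be sketched in a few lines. This is a simplified illustration, not Cohesity's wire protocol: the SHA-256 fingerprint, the in-memory set standing in for the target's fingerprint database, and the function names are all assumptions.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_ship(source_chunks, target_fingerprints):
    """Return only the chunks the target does not already hold.

    A chunk that appears several times on the source is shipped once.
    """
    known = set(target_fingerprints)
    to_send = []
    for chunk in source_chunks:
        fp = fingerprint(chunk)
        if fp not in known:
            to_send.append(chunk)
            known.add(fp)  # the target will hold it after this send
    return to_send


# The target already holds b"aaa"; b"bbb" appears twice on the source.
target_db = {fingerprint(b"aaa")}
shipped = chunks_to_ship([b"aaa", b"bbb", b"bbb", b"ccc"], target_db)
print(shipped)
```

Only two of the four source chunks cross the wire here, which is the effect the text credits for aggressive RPOs over modest links.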
Three Throttling Surfaces
SaaS Connector throttling for DataProtect-as-a-Service — per-connector, day/time windows, split upload/download. Values are in bytes per second.
Source-agent throttling for individual physical hosts.
Cluster-to-cluster windows on the Protection Policy.
Common gotcha: SaaS Connector throttle is in bytes per second, opposite of Veritas AIR convention. "100 MB/s" is 800 Mbps on the wire.
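A small conversion helper makes the unit trap explicit. The function names are illustrative; the only fact encoded is 1 byte = 8 bits.

```python
def mbytes_to_mbits(mb_per_s):
    """Convert a throttle in megabytes per second to megabits per second."""
    return mb_per_s * 8


def mbits_to_mbytes(mbps):
    """Convert a WAN rate in megabits per second to megabytes per second."""
    return mbps / 8


# A SaaS Connector throttle of "100" MB/s occupies 800 Mbps of the circuit.
print(mbytes_to_mbits(100))
```

Running the conversion in the other direction is just as useful: a 1,000 Mbps circuit can carry at most 125 MB/s before any headroom is applied.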
Network Architecture and Required Ports
Cohesity recommends 2x 10 GbE in an LACP bond (bond mode 4, 802.3ad) with MLAG/vPC on the switch side for replication-target clusters and ROBO nodes, giving 20 Gbps of combined bandwidth per node.
| Port | Protocol | Purpose |
|---|---|---|
| 443 | TCP | HTTPS / API control plane |
| 111 | TCP | Portmap / RPC |
| 20000 | TCP | Replication data channel |
| 24444 | TCP | Replication control / metadata |
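Before the first replication run, it can help to confirm that the four ports in the table are reachable from the source side. The sketch below uses only the Python standard library; the function name and the idea of running it from a host on the source network are assumptions, and a successful TCP connect proves reachability only, not that replication itself will succeed.

```python
import socket

# Ports taken from the table above.
REQUIRED_TCP_PORTS = {
    443: "HTTPS / API control plane",
    111: "Portmap / RPC",
    20000: "Replication data channel",
    24444: "Replication control / metadata",
}


def check_ports(host, ports=REQUIRED_TCP_PORTS, timeout=3.0):
    """Try a TCP connect to each port; return {port: True if reachable}."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:  # refused, timed out, unreachable, or DNS failure
            results[port] = False
    return results
```

Calling `check_ports("dr-cluster.example.com")` (a placeholder hostname) returns a dictionary keyed by port, so any `False` entry points at the firewall rule to chase.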
Enable jumbo frames (MTU 9000) end-to-end. A single device that does not honor 9000-byte frames silently fragments or drops, tanking throughput.
Physical seed transport is an option for the initial copy: replicate to a portable cluster on-site, ship it, then re-home the target.
Replication Failure Handling
Cohesity replication is checkpointed: a partial run resumes from the last committed chunk rather than restarting. Persistent failures generate Helios alerts and pause the policy after configurable retry attempts.
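The checkpoint-and-resume behavior can be illustrated with a toy loop. This is a sketch of the general technique under stated assumptions, not Cohesity's implementation: the chunk list, the `send` callback, and the retry count are hypothetical.

```python
def replicate_from(chunks, send, checkpoint=0):
    """Send chunks[checkpoint:] in order; return the index of the next unsent chunk."""
    for index in range(checkpoint, len(chunks)):
        try:
            send(chunks[index])
        except ConnectionError:
            return index  # nothing from this index onward is committed
    return len(chunks)


def run_with_retries(chunks, send, max_retries=3):
    """Resume from the last checkpoint after each failure.

    Returns True when every chunk was sent, False when retries ran out
    (the point at which a real system would alert and pause).
    """
    checkpoint = 0
    for _ in range(max_retries + 1):
        checkpoint = replicate_from(chunks, send, checkpoint)
        if checkpoint == len(chunks):
            return True
    return False
```

Because each retry starts at the saved index, chunks committed before a link drop are never sent twice.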
Key Takeaway: Replication mechanics are built on dedup, compression, and TLS — but they only work if you've sized 2x10 GbE LACP, opened TCP 443/111/20000/24444, enabled MTU 9000 end-to-end, and remembered SaaS Connector throttles are in bytes per second.
Key Points
Replication is always TLS-encrypted, dedup'd, and compressed before the wire — "global dedup on the wire."
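Post-Section Check — Section 2
3. The SaaS Connector accepts a throttle value of "100" with a setting of MB/s. What wire-rate does this represent?
4. Which of the following is NOT a required TCP port between Cohesity replication source and target?
A. 443
B. 111
C. 20000
D. 3389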
3. Replication Bandwidth Math
Pre-Section Check — Section 3
5. A workload has 100 TB FETB, 4% daily change, 50% on-wire reduction (dedup+compress), and an 8-hour replication window. What is the approximate required throughput in MB/s?
A. ~25 MB/s
B. ~70 MB/s
C. ~140 MB/s
D. ~560 MB/s
6. What is the recommended planning ceiling for sustainable replication throughput on a WAN link?
A. 100% of nominal wire speed
B. 75% of nominal wire speed
C. 50% of nominal wire speed
D. 25% of nominal wire speed
Worked Example: 50 TB FETB, 5% Daily Change, 4-Hour Window
Step 1 - Daily change in bytes:
50 TB x 0.05 = 2.5 TB of change per day
Step 2 - Apply dedup/compression (60% reduction):
2.5 TB x 0.40 = 1.0 TB on the wire
Step 3 - Divide by replication window (4 hr = 14,400 s):
1,000,000 MB / 14,400 s = ~69.4 MB/s
Step 4 - Convert to bits per second:
69.4 MB/s x 8 = ~556 Mbps
Step 5 - Apply 50% utilization headroom:
556 Mbps / 0.50 = ~1.11 Gbps minimum WAN provisioning
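The five steps above translate directly into a small calculator. The function and parameter names are illustrative; the arithmetic follows the worked example, using decimal units (1 TB = 1,000,000 MB) as the example does.

```python
def replication_bandwidth(fetb_tb, daily_change, on_wire_reduction,
                          window_hours, utilization=0.50):
    """Return (MB/s on the wire, Mbps on the wire, Mbps to provision)."""
    changed_tb = fetb_tb * daily_change                # Step 1: daily change
    wire_tb = changed_tb * (1.0 - on_wire_reduction)   # Step 2: dedup/compression
    wire_mb = wire_tb * 1_000_000                      # decimal TB -> MB
    mb_per_s = wire_mb / (window_hours * 3600)         # Step 3: divide by window
    mbps = mb_per_s * 8                                # Step 4: bytes -> bits
    provisioned_mbps = mbps / utilization              # Step 5: headroom
    return mb_per_s, mbps, provisioned_mbps


mb_per_s, mbps, provisioned = replication_bandwidth(50, 0.05, 0.60, 4)
print(f"{mb_per_s:.1f} MB/s, {mbps:.0f} Mbps, {provisioned / 1000:.2f} Gbps")
# prints: 69.4 MB/s, 556 Mbps, 1.11 Gbps
```

Keeping the bytes-to-bits conversion as its own line, after the division by the window, mirrors the "work in bytes first, multiply by 8 last" habit the section recommends.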
The RPO / Bandwidth / Dedup Trade-offs
| Knob | Meaning | Cost / Limits |
|---|---|---|
| Increase WAN bandwidth | Buy more circuit | $$$, weeks lead time |
| Lengthen RPO (window) | Replicate every 8 hr instead of 4 | Free, but business decision |
| Reduce change rate | Tighter PG scoping; exclude logs/scratch | Cheap, bounded |
| Improve dedup on wire | Better source filtering, larger target retention pool | Modest gains |
Key Takeaway: Memorize the bandwidth equation, work in bytes first, multiply by 8 last, and budget for 50% of wire speed. The bytes-vs-bits trap has cost more than one architect a six-figure WAN over-provisioning bill.
Key Points
Always compute in bytes, then multiply by 8 to get bits-per-second for the WAN team.
Plan for 50% of nominal wire speed as the sustainable ceiling.
Halving the window doubles throughput requirement; doubling change rate doubles it again.
Four knobs to close the math: bandwidth, window/RPO, change rate, and on-wire dedup.
Post-Section Check — Section 3
5. A workload has 100 TB FETB, 4% daily change, 50% on-wire reduction (dedup+compress), and an 8-hour replication window. What is the approximate required throughput in MB/s?
A. ~25 MB/s
B. ~70 MB/s
C. ~140 MB/s
D. ~560 MB/s
6. What is the recommended planning ceiling for sustainable replication throughput on a WAN link?
A. 100% of nominal wire speed
B. 75% of nominal wire speed
C. 50% of nominal wire speed
D. 25% of nominal wire speed
4. SiteContinuity Orchestration
Pre-Section Check — Section 4
7. After a successful Failover, a DR Plan is in "Failover Complete." The architect attempts to Failback directly. What happens?
A. Failback proceeds normally
B. Failback fails; Prepare for Failback must run first to reverse-seed data
C. Failback runs but with degraded RPO
D. The plan automatically transitions to Failback Ready
8. Does running a Test Failover on a SiteContinuity DR Plan change the plan's underlying state?
A. Yes, it transitions to Failover In Progress
B. Yes, but only briefly
C. No; Test variants are non-disruptive and do not change state
D. Only if the test passes
9. Which of the following bounds RPO in a SiteContinuity DR design?
A. Runbook execution time
B. Replication frequency
C. Number of VMs in the DR Application
D. Boot-order serialization
Replication moves the data; SiteContinuity moves the application. SiteContinuity is Cohesity's DR orchestration engine that turns "we have replicated VMs" into "we have a runnable runbook with a measurable RTO."
The Runbook Analogy
A SiteContinuity runbook is an emergency-evacuation plan for an office building:
Who goes first? Dependency order (domain controllers and DNS first, then app servers, then web).
Where do they go? Resource Profile (target compute, port group, datastore).
What address? Re-IP and VLAN mapping at the DR site.
How do you know everyone got out? Validation steps.
Failback is going home after the storm: same plan in reverse, only after the building has been certified safe.
Runbook Building Blocks
DR Applications — logical groups of VMs with defined boot order.
Resource Profiles — reusable mappings of target vCenter/datastore/port group/IP rules.
Failback Resource Set — separate resource definition for failback (Edit -> Add Resource Set).
Snapshot Selection — latest by default, override for explicit RPO.
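Boot order within a DR Application is a dependency-ordering problem, which the standard library's `graphlib` (Python 3.9+) can illustrate. The VM names and the dependency map below are hypothetical; this shows the general technique, not how SiteContinuity computes its order internally.

```python
from graphlib import TopologicalSorter


def boot_tiers(dependencies):
    """Group VMs into tiers; every VM boots after all VMs it depends on.

    `dependencies` maps a VM name to the set of VMs that must be up first.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    tiers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable listing
        tiers.append(ready)
        sorter.done(*ready)
    return tiers


# Hypothetical three-tier application: infrastructure, then app, then web.
deps = {
    "app01": {"dc01", "dns01"},
    "web01": {"app01"},
}
print(boot_tiers(deps))
# prints: [['dc01', 'dns01'], ['app01'], ['web01']]
```

VMs in the same tier have no dependency on each other, so they are candidates for parallel power-on, which is one lever on runbook execution time.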
Figure 9.2: SiteContinuity DR Plan State Machine
stateDiagram-v2
[*] --> FailoverReady
FailoverReady --> FailoverInProgress: Failover
FailoverInProgress --> FailoverComplete
FailoverComplete --> PrepareFailbackInProgress: Prepare for Failback
PrepareFailbackInProgress --> FailbackReady: reverse seed complete
FailbackReady --> FailbackInProgress: Failback
FailbackInProgress --> FailbackComplete
FailbackComplete --> FailoverReady: Prepare for Failover
FailoverReady --> FailoverReady: Test Failover (no state change)
FailbackReady --> FailbackReady: Test Failback (no state change)
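The state machine in Figure 9.2 can be expressed as a transition table, which makes the "no Failback without Prepare for Failback" rule easy to see. This is a study aid, not product code: the enum and function names are illustrative, and the transient "In Progress" states are collapsed so that each action lands directly on its resting state.

```python
from enum import Enum


class PlanState(Enum):
    FAILOVER_READY = "Failover Ready"
    FAILOVER_COMPLETE = "Failover Complete"
    FAILBACK_READY = "Failback Ready"
    FAILBACK_COMPLETE = "Failback Complete"


# (current state, action) -> resting state once the action finishes.
TRANSITIONS = {
    (PlanState.FAILOVER_READY, "Failover"): PlanState.FAILOVER_COMPLETE,
    (PlanState.FAILOVER_COMPLETE, "Prepare for Failback"): PlanState.FAILBACK_READY,
    (PlanState.FAILBACK_READY, "Failback"): PlanState.FAILBACK_COMPLETE,
    (PlanState.FAILBACK_COMPLETE, "Prepare for Failover"): PlanState.FAILOVER_READY,
    # Test variants are self-loops: the plan state does not change.
    (PlanState.FAILOVER_READY, "Test Failover"): PlanState.FAILOVER_READY,
    (PlanState.FAILBACK_READY, "Test Failback"): PlanState.FAILBACK_READY,
}


def apply_action(state, action):
    """Return the next state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed from {state.value!r}") from None


state = PlanState.FAILOVER_READY
for action in ("Failover", "Prepare for Failback", "Failback", "Prepare for Failover"):
    state = apply_action(state, action)
print(state.value)
# prints: Failover Ready
```

Calling `apply_action(PlanState.FAILOVER_COMPLETE, "Failback")` raises `ValueError`, which mirrors the operational error of skipping Prepare for Failback.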
Failover Procedure (Production Cutover)
Step 1 - Navigate to DR Plans -> Disaster Recovery Plans.
Step 2 - Select the plan, then Actions -> Failover.
Step 3 - Choose a Resource Profile (network mapping, IP customization, target compute).
Step 4 - Optionally enable Protect VMs at DR Site, which closes the backup gap at the moment of failover.
Step 5 - Confirm the snapshot (latest, or an earlier one for a specific RPO).
Step 6 - Type YES to confirm.
Step 7 - Validate in the DR vCenter: boot order, networks, application functionality, and Protection Group state.
Figure 9.3: Failover Orchestration Sequence
sequenceDiagram
participant Op as Operator
participant Helios as Helios / SiteContinuity
participant Src as Source Cluster
participant Tgt as DR Target Cluster
participant vC as DR vCenter
participant VM as Recovered VMs
Op->>Helios: Trigger Failover (DR Plan)
Helios->>Helios: Validate Resource Profile + snapshot
Helios->>Src: Quiesce replication (if reachable)
Helios->>Tgt: Select latest replicated snapshot
Tgt->>Tgt: Mount SnapTree view
Helios->>vC: Register VMs from mounted view
vC->>VM: Apply IP customization + VLAN mapping
vC->>VM: Power on (boot order: DC, App, Web)
VM-->>vC: Services up
vC-->>Helios: Boot validation OK
Helios->>Tgt: Optionally protect VMs at DR site
Helios-->>Op: State = Failover Complete
Animation: Failover Orchestration Timeline (Source Down -> Recovery Complete). Stages: source site down; Helios detects, alerts, and validates the plan; DR cluster mounts the SnapTree snapshot; DR vCenter registers VMs and applies re-IP; VMs power on in boot order (DC -> App -> Web); DNS/GSLB redirect completes for users.
Prepare for Failback
Step 1 - Confirm the plan is in Failover Complete.
Step 2 - Choose Actions -> Prepare for Failback, which drives reverse replication from DR back to primary.
Step 3 - Watch the plan move from Prepare for Failback In Progress to Failback Ready.
Failback Procedure
Step 1 - Choose Actions -> Failback.
Step 2 - Pick the failback Resource Profile and confirm the snapshot (VADP or CDP).
Step 3 - Type YES to confirm.
Step 4 - Validate against the primary vCenter.
Step 5 - Once the plan reaches Failback Complete, choose Actions -> Prepare for Failover to re-seed the DR site.
RTO and RPO Measurement
RPO is bounded by replication frequency (4-hr replication = 4-hr RPO worst case).
RTO is bounded by runbook execution time (boot order, IP customization, app warmup).
Regular Test Failovers are the single dominant predictor of successful real-world recoveries.
Key Takeaway: A SiteContinuity DR Plan is a state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> back to Failover Ready. Test variants do not change state. Skipping "Prepare for Failback" is the most common operational error.
Key Points
Test Failover and Test Failback do not change state — safe to run on a schedule.
"Prepare for Failback" runs reverse replication — mandatory before actual Failback.
RPO is bounded by replication frequency; RTO by runbook execution time.
Resource Profile = target vCenter + datastore + port group + IP customization rules; Failback Resource Set is added separately.
Post-Section Check — Section 4
7. After a successful Failover, a DR Plan is in "Failover Complete." The architect attempts to Failback directly. What happens?
A. Failback proceeds normally
B. Failback fails; Prepare for Failback must run first to reverse-seed data
C. Failback runs but with degraded RPO
D. The plan automatically transitions to Failback Ready
8. Does running a Test Failover on a SiteContinuity DR Plan change the plan's underlying state?
A. Yes, it transitions to Failover In Progress
B. Yes, but only briefly
C. No; Test variants are non-disruptive and do not change state
D. Only if the test passes
9. Which of the following bounds RPO in a SiteContinuity DR design?
A. Runbook execution time
B. Replication frequency
C. Number of VMs in the DR Application
D. Boot-order serialization
5. Cloud-Based DR Options
Pre-Section Check — Section 5
10. A customer wants both a cloud-based DR target with full SiteContinuity orchestration AND long-term compliance retention in cloud object storage. Which combination is correct?
CloudReplicate — Cloud DR Replica
CloudReplicate replicates Protection Group snapshots to a Cohesity Cloud Edition cluster in AWS or Azure. Granular file recovery, IMR, View mounting, and SiteContinuity all work on the cloud-side replica. Use it when the cloud is your DR site.
CloudSpin — On-Demand Native Cloud VMs
Converts a backup snapshot into a native cloud VM (EC2 + EBS, or Azure VM + Managed Disks). Active trigger by user. Use for dev/test clones or lightweight cloud DR without a Cloud Edition cluster.
CloudArchive — Long-Term Retention
Separate archival copy in cloud object storage (Amazon S3, Azure Blob, Google Cloud Storage), driven by Protection Policy schedules. The source cluster keeps a local index for search.
CloudArchive Direct streams full data blocks directly to the cloud; only the index stays on-prem.
CloudTier — Capacity Overflow (NOT a DR Tool)
Auto-tiers cold blocks (default >60 days) to cloud object storage when local capacity exceeds 80%. Two critical facts:
Once enabled, CloudTier cannot be disabled.
CloudTier is not a DR copy — the cluster is still the only authoritative copy.
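The tiering rule stated above combines two conditions: the block must be cold, and the cluster must be over its capacity threshold. A minimal sketch of that rule follows, using the defaults quoted in the text; the function and parameter names are illustrative and do not correspond to any Cohesity setting.

```python
def eligible_for_cloudtier(block_age_days, used_capacity_fraction,
                           cold_after_days=60, capacity_threshold=0.80):
    """Return True only when the block is cold AND local capacity is over threshold."""
    is_cold = block_age_days > cold_after_days
    over_threshold = used_capacity_fraction > capacity_threshold
    return is_cold and over_threshold


print(eligible_for_cloudtier(90, 0.85))   # cold block, cluster 85% full
print(eligible_for_cloudtier(90, 0.50))   # cold block, cluster 50% full
```

The second call returns `False`: age alone does not move data, which is part of why CloudTier behaves as capacity overflow and not as a scheduled copy.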
Decision Matrix
| Feature | Purpose | Result in Cloud | Reversible? | Trigger | Use Case |
|---|---|---|---|---|---|
| CloudSpin | Native cloud VM | EC2 / Azure VM | Yes (ephemeral) | On-demand | Dev/test, light DR |
| CloudReplicate | Cloud DR replica | Full Cohesity cluster | Yes | Policy | Cloud-as-DR-site |
| CloudArchive | LTR / compliance | Cohesity-format archive | Yes | Policy | 7-yr retention |
| CloudArchive Direct | LTR, minimal local | Cohesity archive (data direct) | Yes | Policy | Local archive exhausted |
| CloudTier | Capacity relief | Object-storage extension | No | Auto threshold | Capacity overflow only |
Figure 9.4: Cloud DR Decision Tree
graph TD
Start[Cloud Use Case?] --> Q1{Primary Goal?}
Q1 -->|Disaster Recovery| Q2{Need Cohesity features at recovery time?}
Q1 -->|Long-Term Retention| Q3{Local archive capacity available?}
Q1 -->|Capacity Relief| CT[CloudTier WARNING: Irreversible NOT a DR tool]
Q2 -->|Yes - IMR, search, SiteContinuity| CR[CloudReplicate Full Cohesity cluster in AWS/Azure]
Q2 -->|No - just need native cloud VM| CS[CloudSpin EC2/Azure VM on-demand conversion]
Q3 -->|Yes| CA[CloudArchive Cohesity-format in object storage]
Q3 -->|No - exhausted| CAD[CloudArchive Direct Index local, data streamed direct to cloud]
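The decision tree in Figure 9.4 can be written as a short function for self-testing. The goal strings and parameter names are made up for this sketch; the branches follow the figure.

```python
def choose_cloud_option(goal, needs_cohesity_features=False,
                        local_archive_capacity=True):
    """Walk the Figure 9.4 decision tree and return the matching feature name."""
    if goal == "disaster_recovery":
        # IMR, search, or SiteContinuity at recovery time needs a full cluster.
        return "CloudReplicate" if needs_cohesity_features else "CloudSpin"
    if goal == "long_term_retention":
        return "CloudArchive" if local_archive_capacity else "CloudArchive Direct"
    if goal == "capacity_relief":
        return "CloudTier"  # irreversible once enabled, and not a DR tool
    raise ValueError(f"unknown goal: {goal!r}")


# A customer with two goals gets one answer per goal.
print(choose_cloud_option("disaster_recovery", needs_cohesity_features=True))
print(choose_cloud_option("long_term_retention"))
```

Because each goal maps to its own feature, a design that needs both orchestrated cloud DR and compliance retention ends up combining two features, not picking one.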
Recovery into VMware Cloud and Azure VMware Solution
VMC and AVS are first-class targets — native vCenter that SiteContinuity drives directly, no CloudSpin conversion, no Cloud Edition cluster required. Trade-off: SDDC cost vs. Cloud Edition cost.
Comparing Cost vs. Recovery Options
| Recovery Option | Cost | Recovery Speed | Operational Familiarity |
|---|---|---|---|
| Second physical DC + Cohesity | Highest CAPEX, lowest OPEX | Fastest | Highest |
| Cloud Edition + CloudReplicate + SiteContinuity | Medium | Fast | High |
| VMC / AVS + CloudReplicate | High | Fast | Highest (native vCenter) |
| CloudSpin only | Lowest | Slow (per-VM conversion) | Lower |
| CloudArchive + manual restore | Cheapest | Slowest | Lowest |
Key Takeaway: CloudReplicate gives you a working cluster, CloudSpin gives you a native VM, CloudArchive gives you compliance retention, and CloudTier gives you capacity relief. Only the first three are DR options. CloudTier is irreversible once enabled.
Key Points
CloudReplicate: continuous policy-driven replication to a full Cohesity cluster in cloud (full DR semantics).
CloudSpin: on-demand conversion of a snapshot into a native EC2 / Azure VM — no Cohesity at recovery time.
CloudArchive: long-term retention as a separate archive copy in object storage.
CloudTier is irreversible and is NOT a DR tool — capacity overflow only.
VMC and AVS allow SiteContinuity to drive native vCenter directly, no conversion needed.
Post-Section Check — Section 5
10. A customer wants both a cloud-based DR target with full SiteContinuity orchestration AND long-term compliance retention in cloud object storage. Which combination is correct?