Chapter 9: Replication, Disaster Recovery, and SiteContinuity

Cohesity Certified Architect Expert (CCAE) Study Guide

Learning Objectives

1. Replication Topologies

Pre-Section Check — Section 1

1. A regional bank operates a primary data center and 12 branch offices. Backup data from each branch must consolidate to the central DC for retention and reporting. Which Cohesity replication topology fits best?

  a. One-to-one
  b. One-to-many
  c. Many-to-one (fan-in)
  d. Cross-cloud only

2. Where is replication configured in Cohesity?

  a. On the Protection Group
  b. On the Protection Policy
  c. On the View Box only
  d. On the source object directly

Cohesity replication is a Protection-Group-aware operation: the source cluster sends only unique, deduplicated, compressed chunks to the target cluster, and the target keeps a fully addressable copy that can be searched, recovered, mounted, or orchestrated independently of the source. The replica is a working cluster, not a tape, which is what makes the four classic topologies viable.

The Four Canonical Topologies

| Topology | Pattern | Use Case | Pros | Cons |
| --- | --- | --- | --- | --- |
| One-to-one | A -> B | Two-site enterprise DR | Simple, symmetric failback | No tertiary copy |
| One-to-many | A -> B + C | In-region DR + cyber-vault | Geographic diversity | Higher WAN cost |
| Many-to-one | A,B,C,D -> Hub | ROBO consolidation | Centralized retention | Hub is a single failure domain |
| Cross-cloud | On-prem -> AWS/Azure CE | Cloud-as-DR | OPEX, no second DC | Egress costs on recall |

Figure 9.1: Replication Topology Variants

flowchart LR
    subgraph OneToOne["1:1 (Active-Passive DR)"]
        A1[Cluster A<br/>Primary] -->|replicate| B1[Cluster B<br/>DR Site]
    end
    subgraph OneToMany["1:Many (Geographic Diversity)"]
        A2[Cluster A<br/>Primary] -->|replicate| B2[Cluster B<br/>In-Region DR]
        A2 -->|replicate| C2[Cluster C<br/>Cross-Region Vault]
    end
    subgraph ManyToOne["Many:1 (ROBO Fan-In)"]
        S1[Spoke A<br/>ROBO] -->|replicate| H[Hub Cluster<br/>Central DC]
        S2[Spoke B<br/>ROBO] -->|replicate| H
        S3[Spoke C<br/>ROBO] -->|replicate| H
    end
    subgraph CrossCloud["Cross-Cloud (Cloud-as-DR)"]
        OP[On-Prem Cluster] -->|replicate| CE[Cloud Edition<br/>AWS / Azure]
    end

Replication Policies and Retention

Replication is configured on the Protection Policy, not on the Protection Group. A policy can specify several copy targets, each with its own retention: the local snapshot, a replication target cluster, and CloudArchive. Retention on the replica is independent of the source. For example, a single policy can keep 14 dailies on the primary, 30 dailies plus 12 monthlies on DR, and 7 yearlies on CloudArchive.
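
The per-target retention idea can be sketched as a small data model. This is only an illustration: the class names, field names, and target labels below are hypothetical and do not reflect the Cohesity API schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Retention:
    """Number of copies kept at each granularity (hypothetical model)."""
    daily: int = 0
    monthly: int = 0
    yearly: int = 0


@dataclass
class ProtectionPolicy:
    """Hypothetical policy: one local retention plus independent copy targets."""
    name: str
    local: Retention
    copies: dict[str, Retention] = field(default_factory=dict)

    def retention_for(self, target: str) -> Retention:
        # Each target carries its own retention, independent of the source.
        return self.local if target == "local" else self.copies[target]


# The example from the text: 14 daily on primary, 30 daily + 12 monthly on DR,
# 7 yearly on CloudArchive, all driven by one policy.
policy = ProtectionPolicy(
    name="gold",
    local=Retention(daily=14),
    copies={
        "replication:dr-cluster": Retention(daily=30, monthly=12),
        "archive:cloud": Retention(yearly=7),
    },
)
```

Attaching the policy to several Protection Groups would then give all of them the same replication and retention behavior, which is why the setting lives on the policy rather than on the group.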

Key Takeaway: Topology decisions are driven by failure domain, not by bandwidth. Pick the topology that aligns with your blast-radius assumptions, then size WAN and retention to match.

Key Points

Post-Section Check — Section 1

1. A regional bank operates a primary data center and 12 branch offices. Backup data from each branch must consolidate to the central DC for retention and reporting. Which Cohesity replication topology fits best?

  a. One-to-one
  b. One-to-many
  c. Many-to-one (fan-in)
  d. Cross-cloud only

2. Where is replication configured in Cohesity?

  a. On the Protection Group
  b. On the Protection Policy
  c. On the View Box only
  d. On the source object directly

2. Replication Mechanics and Tuning

Pre-Section Check — Section 2

3. The SaaS Connector accepts a throttle value of "100" with a setting of MB/s. What wire-rate does this represent?

  a. 100 Mbps
  b. 100 MBps = 800 Mbps
  c. 12.5 Mbps
  d. 1 Gbps regardless

4. Which of the following is NOT a required TCP port between Cohesity replication source and target?

  a. 443
  b. 111
  c. 20000
  d. 3389

Encrypted, Compressed, Deduplicated Wire Format

Cohesity replication is always encrypted in flight (TLS), and data is deduplicated and compressed at the chunk level before it goes on the wire. The source queries the target's chunk fingerprint database and ships only the chunks the target does not already have. This global dedup on the wire is the main reason Cohesity can meet aggressive RPOs over modest WAN links.
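
The "ship only what the target lacks" step can be illustrated with a minimal sketch. It assumes SHA-256 fingerprints and pre-split chunks purely for illustration; the actual fingerprinting and chunking used by the product are not described here, and the function names are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_ship(source_chunks, target_fingerprints):
    """Return only the chunks whose fingerprints the target does not already hold."""
    known = set(target_fingerprints)
    to_ship = []
    for chunk in source_chunks:
        fp = fingerprint(chunk)
        if fp not in known:
            to_ship.append(chunk)
            known.add(fp)  # a new chunk repeated in the same run is sent once
    return to_ship


# The target already holds block-a and block-b; the source has one known
# chunk, one new chunk that appears twice, and one other new chunk.
target_index = {fingerprint(b"block-a"), fingerprint(b"block-b")}
source = [b"block-a", b"block-c", b"block-c", b"block-d"]
shipped = chunks_to_ship(source, target_index)
```

Here only two of the four source chunks cross the wire, which is the effect the on-wire reduction factor captures in the bandwidth math later in this chapter.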

Three Throttling Surfaces

  1. SaaS Connector throttling for DataProtect-as-a-Service — per-connector, day/time windows, split upload/download. Values are in bytes per second.
  2. Source-agent throttling for individual physical hosts.
  3. Cluster-to-cluster windows on the Protection Policy.

Common gotcha: the SaaS Connector throttle is in bytes per second, the opposite of the Veritas AIR convention, so "100 MB/s" is 800 Mbps on the wire.
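
A pair of helper functions makes the bytes-versus-bits conversion explicit. The function names are illustrative only.

```python
def mbytes_to_mbits(mb_per_s: float) -> float:
    """Convert megabytes per second (MB/s) to megabits per second (Mbps)."""
    return mb_per_s * 8


def mbits_to_mbytes(mbps: float) -> float:
    """Convert megabits per second (Mbps) to megabytes per second (MB/s)."""
    return mbps / 8


# A SaaS Connector throttle of "100" in MB/s, expressed as wire rate in Mbps.
wire_rate_mbps = mbytes_to_mbits(100)

# Going the other way: a 100 Mbps circuit expressed in MB/s.
circuit_mb_per_s = mbits_to_mbytes(100)
```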

Network Architecture and Required Ports

Cohesity recommends 2x 10 GbE LACP Bond Mode 4 with MLAG/VPC for replication-target clusters and ROBO nodes — 20 Gbps combined per node.

| Port | Protocol | Purpose |
| --- | --- | --- |
| 443 | TCP | HTTPS / API control plane |
| 111 | TCP | Portmap / RPC |
| 20000 | TCP | Replication data channel |
| 24444 | TCP | Replication control / metadata |
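
A quick pre-flight check of the ports in the table might look like the sketch below. It is a generic TCP reachability test using Python's standard library, not a Cohesity tool, and the hostname in the usage comment is a placeholder.

```python
import socket

# Ports from the table above.
REPLICATION_PORTS = {
    443: "HTTPS / API control plane",
    111: "Portmap / RPC",
    20000: "Replication data channel",
    24444: "Replication control / metadata",
}


def check_ports(host, ports=REPLICATION_PORTS, timeout=3.0,
                connect=socket.create_connection):
    """Return {port: True/False} for TCP reachability of each port on host.

    `connect` is injectable so the function can be exercised without a network.
    """
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:  # refused, timed out, unreachable, DNS failure
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results


# Usage (placeholder hostname):
#   for port, ok in check_ports("dr-cluster.example.com").items():
#       print(port, REPLICATION_PORTS[port], "open" if ok else "BLOCKED")
```

A successful TCP connect only shows that a firewall permits the port; it does not prove that replication itself is healthy.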

Enable jumbo frames (MTU 9000) end-to-end. A single device in the path that does not honor 9000-byte frames will silently fragment or drop them, tanking throughput.
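
One common way to verify an end-to-end MTU is a ping with the don't-fragment bit set and a payload sized to fill the frame exactly. The sketch below only computes that payload size for IPv4, assuming a 20-byte IP header without options and an 8-byte ICMP header; the helper name is illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


jumbo_payload = max_icmp_payload(9000)
standard_payload = max_icmp_payload(1500)

# On Linux (iputils), a don't-fragment test with this payload would be:
#   ping -M do -s <jumbo_payload> <target>
```

If the jumbo-sized ping fails while the standard-sized one succeeds, some hop between the clusters is not passing 9000-byte frames.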

Initial Seed Strategies

  1. Wire seeding with extended window — quiet multi-day window, relaxed throttles.
  2. Physical seed transport — replicate to a portable cluster on-site, ship it, re-home the target.

Replication Failure Handling

Cohesity replication is checkpointed: a partial run resumes from the last committed chunk rather than restarting. Persistent failures generate Helios alerts and pause the policy after configurable retry attempts.
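
The checkpoint-and-resume behavior can be sketched in a few lines. This is a simplified model, not the product's implementation: the class, the `send` callback, and the use of `ConnectionError` to represent a link failure are all assumptions.

```python
class ReplicationRun:
    """Toy model of a checkpointed run: resume continues after the last commit."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.committed = 0  # number of chunks the target has acknowledged

    def resume(self, send):
        """Send the remaining chunks; return True once the run is complete."""
        while self.committed < len(self.chunks):
            try:
                send(self.chunks[self.committed])
            except ConnectionError:
                return False      # checkpoint is kept; the caller retries later
            self.committed += 1   # advance only after a successful send
        return True


# Simulate a link drop on the third send attempt.
sent = []
attempts = {"count": 0}


def flaky_send(chunk):
    attempts["count"] += 1
    if attempts["count"] == 3:
        raise ConnectionError("link dropped")
    sent.append(chunk)


run = ReplicationRun([b"c0", b"c1", b"c2", b"c3"])
first_pass = run.resume(flaky_send)    # stops after two committed chunks
second_pass = run.resume(flaky_send)   # picks up at the third chunk
```

The second pass does not resend the first two chunks, which is the property that keeps a flaky WAN from turning every failure into a full restart.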

Key Takeaway: Replication mechanics are built on dedup, compression, and TLS — but they only work if you've sized 2x10 GbE LACP, opened TCP 443/111/20000/24444, enabled MTU 9000 end-to-end, and remembered SaaS Connector throttles are in bytes per second.

Key Points

Post-Section Check — Section 2

3. The SaaS Connector accepts a throttle value of "100" with a setting of MB/s. What wire-rate does this represent?

  a. 100 Mbps
  b. 100 MBps = 800 Mbps
  c. 12.5 Mbps
  d. 1 Gbps regardless

4. Which of the following is NOT a required TCP port between Cohesity replication source and target?

  a. 443
  b. 111
  c. 20000
  d. 3389

3. Replication Bandwidth Math

Pre-Section Check — Section 3

5. A workload has 100 TB FETB, 4% daily change, 50% on-wire reduction (dedup+compress), and an 8-hour replication window. What is the approximate required throughput in MB/s?

  a. ~25 MB/s
  b. ~70 MB/s
  c. ~140 MB/s
  d. ~560 MB/s

6. What is the recommended planning ceiling for sustainable replication throughput on a WAN link?

  a. 100% of nominal wire speed
  b. 75% of nominal wire speed
  c. 50% of nominal wire speed
  d. 25% of nominal wire speed

The single formula every CCAE must memorize:

Required throughput = (FETB × daily change rate × (1 - on-wire reduction from dedup and compression)) / replication window

Bytes-vs-Bits Reference

| Wire Speed (bits) | Theoretical Bytes | 50% Utilization Target |
| --- | --- | --- |
| 1 Gbps | 125 MB/s | 62.5 MB/s |
| 10 Gbps | 1,250 MB/s (1.25 GB/s) | 625 MB/s |
| 40 Gbps | 5,000 MB/s | 2,500 MB/s |
| 100 Gbps | 12,500 MB/s | 6,250 MB/s |

Worked Example: 50 TB FETB, 5% Daily Change, 4-Hour Window

Step 1 - Daily change in bytes:
  50 TB x 0.05 = 2.5 TB of change per day

Step 2 - Apply dedup/compression (60% reduction):
  2.5 TB x 0.40 = 1.0 TB on the wire

Step 3 - Divide by replication window (4 hr = 14,400 s):
  1,000,000 MB / 14,400 s = ~69.4 MB/s

Step 4 - Convert to bits per second:
  69.4 MB/s x 8 = ~556 Mbps

Step 5 - Apply 50% utilization headroom:
  556 Mbps / 0.50 = ~1.11 Gbps minimum WAN provisioning
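
The five steps above can be captured in one function as a sanity check. This sketch uses decimal units (1 TB = 1,000,000 MB), as the worked example does; the function and key names are illustrative.

```python
def required_throughput(fetb_tb, daily_change, reduction, window_hours,
                        utilization=0.50):
    """Replication sizing: work in bytes first, convert to bits last.

    fetb_tb       front-end terabytes protected
    daily_change  daily change rate as a fraction (0.05 = 5%)
    reduction     on-wire reduction from dedup + compression (0.60 = 60%)
    window_hours  replication window in hours
    utilization   planning ceiling as a fraction of nominal wire speed
    """
    on_wire_mb = fetb_tb * daily_change * (1 - reduction) * 1_000_000
    mb_per_s = on_wire_mb / (window_hours * 3600)
    mbps = mb_per_s * 8
    return {
        "mb_per_s": mb_per_s,
        "mbps": mbps,
        "provision_mbps": mbps / utilization,
    }


# The worked example: 50 TB FETB, 5% daily change, 60% reduction, 4-hour window.
example = required_throughput(50, 0.05, 0.60, 4)
```

Doubling the FETB while doubling the window leaves the required throughput unchanged, which is a useful cross-check when comparing scenarios.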

The RPO / Bandwidth / Dedup Trade-offs

| Knob | Meaning | Cost and Limits |
| --- | --- | --- |
| Increase WAN bandwidth | Buy more circuit | $$$, weeks lead time |
| Lengthen RPO (window) | Replicate every 8 hr instead of 4 | Free, but business decision |
| Reduce change rate | Tighter PG scoping; exclude logs/scratch | Cheap, bounded |
| Improve dedup on wire | Better source filtering, larger target retention pool | Modest gains |

Key Takeaway: Memorize the bandwidth equation, work in bytes first, multiply by 8 last, and budget for 50% of wire speed. The bytes-vs-bits trap has cost more than one architect a six-figure WAN over-provisioning bill.
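
To compare the knobs, it can help to solve the same equation for the window instead of the throughput. The sketch below assumes a fixed link and the 50% planning ceiling; the function name and the scenario numbers are illustrative.

```python
def window_hours_needed(fetb_tb, daily_change, reduction, link_mbps,
                        utilization=0.50):
    """Hours needed to replicate one day's change over a given WAN link."""
    on_wire_mb = fetb_tb * daily_change * (1 - reduction) * 1_000_000
    usable_mb_per_s = link_mbps * utilization / 8  # bits -> bytes, with headroom
    return on_wire_mb / usable_mb_per_s / 3600


# Baseline: 50 TB FETB, 5% change, 60% reduction, 1 Gbps link.
baseline_hours = window_hours_needed(50, 0.05, 0.60, 1000)

# Knob: reduce the change rate to 3% by tighter scoping, same link.
scoped_hours = window_hours_needed(50, 0.03, 0.60, 1000)
```

In this scenario the baseline does not fit a 4-hour window on a 1 Gbps link, while the reduced change rate does, without buying a larger circuit.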

Key Points

Post-Section Check — Section 3

5. A workload has 100 TB FETB, 4% daily change, 50% on-wire reduction (dedup+compress), and an 8-hour replication window. What is the approximate required throughput in MB/s?

  a. ~25 MB/s
  b. ~70 MB/s
  c. ~140 MB/s
  d. ~560 MB/s

6. What is the recommended planning ceiling for sustainable replication throughput on a WAN link?

  a. 100% of nominal wire speed
  b. 75% of nominal wire speed
  c. 50% of nominal wire speed
  d. 25% of nominal wire speed

4. SiteContinuity Orchestration

Pre-Section Check — Section 4

7. After a successful Failover, a DR Plan is in "Failover Complete." The architect attempts to Failback directly. What happens?

  a. Failback proceeds normally
  b. Failback fails; Prepare for Failback must run first to reverse-seed data
  c. Failback runs but with degraded RPO
  d. The plan automatically transitions to Failback Ready

8. Does running a Test Failover on a SiteContinuity DR Plan change the plan's underlying state?

  a. Yes, it transitions to Failover In Progress
  b. Yes, but only briefly
  c. No; Test variants are non-disruptive and do not change state
  d. Only if the test passes

9. Which of the following bounds RPO in a SiteContinuity DR design?

  a. Runbook execution time
  b. Replication frequency
  c. Number of VMs in the DR Application
  d. Boot-order serialization

Replication moves the data; SiteContinuity moves the application. SiteContinuity is Cohesity's DR orchestration engine that turns "we have replicated VMs" into "we have a runnable runbook with a measurable RTO."

The Runbook Analogy

A SiteContinuity runbook is an emergency-evacuation plan for an office building:

  1. Who goes first? — Dependency order (DCs/DNS first, then app servers, then web).
  2. Where do they go? — Resource Profile (target compute, port group, datastore).
  3. What address? — Re-IP and VLAN mapping at the DR site.
  4. How do you know everyone got out? — Validation steps.

Failback is going home after the storm: same plan in reverse, only after the building has been certified safe.
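
The "who goes first" question is a dependency-ordering problem, which can be sketched with a topological sort. The VM names and the dependency map below are hypothetical; SiteContinuity's own ordering mechanism is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical VMs: each key maps to the set of VMs that must be up before it.
dependencies = {
    "dc01": set(),               # domain controller
    "dns01": set(),              # DNS
    "app01": {"dc01", "dns01"},  # app tier needs directory and name services
    "web01": {"app01"},          # web tier needs the app tier
}

# static_order() yields every VM after all of its dependencies.
boot_order = list(TopologicalSorter(dependencies).static_order())
```

The relative order of `dc01` and `dns01` is not fixed because neither depends on the other. A cycle in the map raises `graphlib.CycleError`, which is a cheap way to catch an impossible runbook before a real failover.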

Runbook Building Blocks

Figure 9.2: SiteContinuity DR Plan State Machine

stateDiagram-v2
    [*] --> FailoverReady
    FailoverReady --> FailoverInProgress: Failover
    FailoverInProgress --> FailoverComplete
    FailoverComplete --> PrepareFailbackInProgress: Prepare for Failback
    PrepareFailbackInProgress --> FailbackReady: reverse seed complete
    FailbackReady --> FailbackInProgress: Failback
    FailbackInProgress --> FailbackComplete
    FailbackComplete --> FailoverReady: Prepare for Failover
    FailoverReady --> FailoverReady: Test Failover (no state change)
    FailbackReady --> FailbackReady: Test Failback (no state change)
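
The state machine in Figure 9.2 can be modeled as a transition table. This is a simplified sketch: the transient "In Progress" states are collapsed into their end states, and the class and action strings are illustrative rather than an actual API.

```python
from enum import Enum


class State(Enum):
    FAILOVER_READY = "Failover Ready"
    FAILOVER_COMPLETE = "Failover Complete"
    FAILBACK_READY = "Failback Ready"
    FAILBACK_COMPLETE = "Failback Complete"


# (current state, action) -> next state. Test variants map a state to itself.
TRANSITIONS = {
    (State.FAILOVER_READY, "Failover"): State.FAILOVER_COMPLETE,
    (State.FAILOVER_COMPLETE, "Prepare for Failback"): State.FAILBACK_READY,
    (State.FAILBACK_READY, "Failback"): State.FAILBACK_COMPLETE,
    (State.FAILBACK_COMPLETE, "Prepare for Failover"): State.FAILOVER_READY,
    (State.FAILOVER_READY, "Test Failover"): State.FAILOVER_READY,
    (State.FAILBACK_READY, "Test Failback"): State.FAILBACK_READY,
}


class DRPlan:
    def __init__(self):
        self.state = State.FAILOVER_READY

    def run(self, action: str) -> State:
        """Apply an action, rejecting anything the current state does not allow."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(
                f"{action!r} is not allowed in state {self.state.value!r}"
            )
        self.state = TRANSITIONS[key]
        return self.state


plan = DRPlan()
plan.run("Test Failover")   # state stays Failover Ready
plan.run("Failover")        # state becomes Failover Complete
# plan.run("Failback") here would raise ValueError: Prepare for Failback first.
```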

Failover Procedure (Production Cutover)

  1. DR Plans -> Disaster Recovery Plans.
  2. Select plan -> Actions -> Failover.
  3. Choose Resource Profile (network mapping, IP customization, target compute).
  4. Optionally enable Protect VMs at DR Site — closes the backup gap at moment of failover.
  5. Confirm snapshot (latest or earlier RPO).
  6. Type YES to confirm.
  7. Validate via DR vCenter: boot order, networks, application functionality, PG state.

Figure 9.3: Failover Orchestration Sequence

sequenceDiagram
    participant Op as Operator
    participant Helios as Helios / SiteContinuity
    participant Src as Source Cluster
    participant Tgt as DR Target Cluster
    participant vC as DR vCenter
    participant VM as Recovered VMs
    Op->>Helios: Trigger Failover (DR Plan)
    Helios->>Helios: Validate Resource Profile + snapshot
    Helios->>Src: Quiesce replication (if reachable)
    Helios->>Tgt: Select latest replicated snapshot
    Tgt->>Tgt: Mount SnapTree view
    Helios->>vC: Register VMs from mounted view
    vC->>VM: Apply IP customization + VLAN mapping
    vC->>VM: Power on (boot order: DC, App, Web)
    VM-->>vC: Services up
    vC-->>Helios: Boot validation OK
    Helios->>Tgt: Optionally protect VMs at DR site
    Helios-->>Op: State = Failover Complete

Prepare for Failback

  1. Confirm plan is in Failover Complete.
  2. Actions -> Prepare for Failback — drives reverse replication from DR back to primary.
  3. Plan moves: Prepare for Failback In Progress -> Failback Ready.

Failback Procedure

  1. Actions -> Failback.
  2. Pick failback Resource Profile; confirm snapshot (VADP or CDP).
  3. Type YES.
  4. Validate against primary vCenter.
  5. After Failback Complete -> Actions -> Prepare for Failover to re-seed the DR site.

RTO and RPO Measurement

Key Takeaway: A SiteContinuity DR Plan is a state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> back to Failover Ready. Test variants do not change state. Skipping "Prepare for Failback" is the most common operational error.

Key Points

Post-Section Check — Section 4

7. After a successful Failover, a DR Plan is in "Failover Complete." The architect attempts to Failback directly. What happens?

  a. Failback proceeds normally
  b. Failback fails; Prepare for Failback must run first to reverse-seed data
  c. Failback runs but with degraded RPO
  d. The plan automatically transitions to Failback Ready

8. Does running a Test Failover on a SiteContinuity DR Plan change the plan's underlying state?

  a. Yes, it transitions to Failover In Progress
  b. Yes, but only briefly
  c. No; Test variants are non-disruptive and do not change state
  d. Only if the test passes

9. Which of the following bounds RPO in a SiteContinuity DR design?

  a. Runbook execution time
  b. Replication frequency
  c. Number of VMs in the DR Application
  d. Boot-order serialization

5. Cloud-Based DR Options

Pre-Section Check — Section 5

10. A customer wants both a cloud-based DR target with full SiteContinuity orchestration AND long-term compliance retention in cloud object storage. Which combination is correct?

  a. CloudTier + CloudArchive
  b. CloudReplicate + CloudArchive
  c. CloudSpin only
  d. CloudTier alone

11. What is the most important operational caveat about CloudTier?

  a. It produces a redundant DR copy
  b. Once enabled on a View Box, it cannot be disabled
  c. It only works with Azure Blob
  d. It requires SiteContinuity

12. Which Cohesity feature converts an on-prem backup snapshot into a native AWS EC2 instance with EBS volumes on demand?

  a. CloudArchive Direct
  b. CloudReplicate
  c. CloudSpin
  d. CloudTier

The Four Cloud Features

CloudReplicate — DR Replica as a Working Cluster

Replicates Protection Group snapshots to a Cohesity Cloud Edition cluster in AWS or Azure. Granular file recovery, IMR, View mounting, and SiteContinuity all work on the cloud-side replica. Use when the cloud is your DR site.

CloudSpin — On-Demand Native Cloud VMs

Converts a backup snapshot into a native cloud VM (EC2 + EBS, or Azure VM + Managed Disks). The conversion is triggered on demand by the user. Use for dev/test clones or lightweight cloud DR without a Cloud Edition cluster.

CloudArchive — Long-Term Retention

Separate archival copy in cloud object storage (S3, Azure Blob, GCP), driven by Protection Policy schedules. Source cluster keeps a local index for search.

CloudArchive Direct: streams full data blocks directly to the cloud; only the index stays on-prem.

CloudTier — Capacity Overflow (NOT a DR Tool)

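Auto-tiers cold blocks (default >60 days) to cloud object storage when local capacity exceeds 80%. Two critical facts: once enabled on a View Box it cannot be disabled, and it extends local capacity rather than producing a redundant copy, so it is not a DR tool.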

Decision Matrix

| Feature | Purpose | Result in Cloud | Reversible? | Trigger | Use Case |
| --- | --- | --- | --- | --- | --- |
| CloudSpin | Native cloud VM | EC2 / Azure VM | Yes (ephemeral) | On-demand | Dev/test, light DR |
| CloudReplicate | Cloud DR replica | Full Cohesity cluster | Yes | Policy | Cloud-as-DR-site |
| CloudArchive | LTR / compliance | Cohesity-format archive | Yes | Policy | 7-yr retention |
| CloudArchive Direct | LTR, minimal local | Cohesity archive (data direct) | Yes | Policy | Local archive exhausted |
| CloudTier | Capacity relief | Object-storage extension | No | Auto threshold | Capacity overflow only |

Figure 9.4: Cloud DR Decision Tree

graph TD
    Start[Cloud Use Case?] --> Q1{Primary Goal?}
    Q1 -->|Disaster Recovery| Q2{Need Cohesity features<br/>at recovery time?}
    Q1 -->|Long-Term Retention| Q3{Local archive<br/>capacity available?}
    Q1 -->|Capacity Relief| CT[CloudTier<br/>WARNING: Irreversible<br/>NOT a DR tool]
    Q2 -->|Yes - IMR, search,<br/>SiteContinuity| CR[CloudReplicate<br/>Full Cohesity cluster<br/>in AWS/Azure]
    Q2 -->|No - just need<br/>native cloud VM| CS[CloudSpin<br/>EC2/Azure VM<br/>on-demand conversion]
    Q3 -->|Yes| CA[CloudArchive<br/>Cohesity-format<br/>in object storage]
    Q3 -->|No - exhausted| CAD[CloudArchive Direct<br/>Index local, data<br/>streamed direct to cloud]
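
The decision tree in Figure 9.4 translates directly into a small function, which can be handy for drilling the exam scenarios. The function name, parameter names, and goal strings are illustrative.

```python
def choose_cloud_feature(goal, needs_cohesity_features=False,
                         local_archive_capacity=True):
    """Walk the Figure 9.4 decision tree and return the matching feature name."""
    if goal == "disaster_recovery":
        # IMR, search, or SiteContinuity at recovery time needs a working cluster.
        return "CloudReplicate" if needs_cohesity_features else "CloudSpin"
    if goal == "long_term_retention":
        return "CloudArchive" if local_archive_capacity else "CloudArchive Direct"
    if goal == "capacity_relief":
        return "CloudTier"  # irreversible once enabled; not a DR tool
    raise ValueError(f"unknown goal: {goal!r}")


# A customer who wants orchestrated cloud DR plus compliance retention
# needs two features, one per goal.
combo = (
    choose_cloud_feature("disaster_recovery", needs_cohesity_features=True),
    choose_cloud_feature("long_term_retention"),
)
```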

Recovery into VMware Cloud and Azure VMware Solution

VMC and AVS are first-class targets — native vCenter that SiteContinuity drives directly, no CloudSpin conversion, no Cloud Edition cluster required. Trade-off: SDDC cost vs. Cloud Edition cost.

Comparing Cost vs. Recovery Options

| Recovery Option | Cost (CAPEX / OPEX) | Recovery Speed | Operational Familiarity |
| --- | --- | --- | --- |
| Second physical DC + Cohesity | Highest CAPEX, lowest OPEX | Fastest | Highest |
| Cloud Edition + CloudReplicate + SiteContinuity | Medium | Fast | High |
| VMC / AVS + CloudReplicate | High | Fast | Highest (native vCenter) |
| CloudSpin only | Lowest | Slow (per-VM conversion) | Lower |
| CloudArchive + manual restore | Cheapest | Slowest | Lowest |

Key Takeaway: CloudReplicate gives you a working cluster, CloudSpin gives you a native VM, CloudArchive gives you compliance retention, and CloudTier gives you capacity relief. Only the first three are DR options. CloudTier is irreversible once enabled.

Key Points

Post-Section Check — Section 5

10. A customer wants both a cloud-based DR target with full SiteContinuity orchestration AND long-term compliance retention in cloud object storage. Which combination is correct?

  a. CloudTier + CloudArchive
  b. CloudReplicate + CloudArchive
  c. CloudSpin only
  d. CloudTier alone

11. What is the most important operational caveat about CloudTier?

  a. It produces a redundant DR copy
  b. Once enabled on a View Box, it cannot be disabled
  c. It only works with Azure Blob
  d. It requires SiteContinuity

12. Which Cohesity feature converts an on-prem backup snapshot into a native AWS EC2 instance with EBS volumes on demand?

  a. CloudArchive Direct
  b. CloudReplicate
  c. CloudSpin
  d. CloudTier

Chapter Summary

Answer Explanations