Cohesity Certified Architect Expert (CCAE) Certification Exam Preparation Guide

A comprehensive advanced study guide for architects designing, deploying, and operating Cohesity DataProtect, SmartFiles, and Helios environments at enterprise scale.

Chapter 1: CCAE Exam Overview and Cohesity Platform Architecture
Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics
Chapter 3: Cluster Design, Sizing, and Capacity Planning
Chapter 4: Networking, DNS, and Cluster Connectivity
Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations
Chapter 6: Identity, Access Management, and Multi-Tenancy
Chapter 7: Data Protection: Sources, Policies, and Protection Groups
Chapter 8: Application-Aware Backup and Recovery Patterns
Chapter 9: Replication, Disaster Recovery, and SiteContinuity
Chapter 10: Cloud Integration: Archive, Tier, Replicate, and Spin
Chapter 11: Security, Encryption, and Ransomware Resilience
Chapter 12: SmartFiles: Files, Objects, and Unstructured Data Services
Chapter 13: Helios SaaS, Marketplace Apps, and Automation
Chapter 14: Performance, Monitoring, and Troubleshooting
Chapter 15: End-to-End Architecture Scenarios and Exam Synthesis

Chapter 1: CCAE Exam Overview and Cohesity Platform Architecture

Learning Objectives

Describe the CCAE exam blueprint, domains, weightings, and prerequisites.
Explain the high-level architecture of the Cohesity DataPlatform and its core services.
Differentiate between Cohesity DataProtect, SmartFiles, SiteContinuity, and Helios.
Identify how the CCAE role fits within the Cohesity certification track (CCA, CCSE, CCPE, CCAE).
Map physical, virtual, and cloud form factors to specific architectural use cases.

CCAE Exam Blueprint and Study Strategy

Architects who pursue the Cohesity Certified Architect Expert (CCAE) credential are signaling that they can do more than operate a backup platform — they can size, design, and defend a Cohesity Data Cloud deployment in front of customers, security teams, and CIOs. This first section unpacks what the exam actually tests, who it is for, and how to study for it efficiently.

Domain Weightings and Number of Questions

The CCAE blueprint is divided into four weighted domains. The Solution Discovery and Design domain dominates at 35%, which tells you immediately that this is an architecture exam — not a memorize-the-CLI exam [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Domain	Weight	Focus
1. Cohesity Data Cloud Data Management Platform Architecture	22%	Products, technology use cases and limitations, designing a DataProtect platform
2. Cohesity Architecture Solution Discovery and Design	35%	Sizing, workload-appropriate protection, hybrid/multi-cloud, Helios Self-Managed for dark sites, business alignment
3. Design Security-Focused Solutions	18%	Cyber-resiliency design, immutability, encryption, ransomware patterns
4. Integrate Third-party Solutions with Cohesity	13%	Integration patterns and Cohesity APIs

The remaining 12% spans cross-cutting topics such as licensing, support, and lifecycle. Treat the percentages as time-budget guidance: if you have 30 study days, allocate roughly 10-11 days to Domain 2, 6-7 days to Domain 1, 5-6 days to Domain 3, about 4 days to Domain 4, and the remainder to the cross-cutting topics [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
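
The time-budget arithmetic can be sketched in a few lines of Python. This is only an illustration of a strict proportional split; the dictionary keys and the function name are made up for this example, and the 12% cross-cutting bucket is treated as its own topic.

```python
# Blueprint weights taken from the domain table; "cross_cutting" is the remaining 12%.
# Key names and the function name are illustrative, not official Cohesity terms.
WEIGHTS = {
    "domain_1_platform_architecture": 22,
    "domain_2_discovery_and_design": 35,
    "domain_3_security": 18,
    "domain_4_integration": 13,
    "cross_cutting": 12,
}


def allocate_study_days(total_days: int, weights: dict[str, int]) -> dict[str, float]:
    """Split total_days across topics in proportion to their blueprint weight."""
    total_weight = sum(weights.values())
    return {
        topic: round(total_days * weight / total_weight, 1)
        for topic, weight in weights.items()
    }


plan = allocate_study_days(30, WEIGHTS)
for topic, days in plan.items():
    print(f"{topic}: {days} days")
```

A strict split is only a starting point; shifting days toward a weaker domain is a deliberate deviation from it.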

Worked example: building a study plan. An architect with strong DataProtect operational experience but limited security background should deviate from the strict weighting and over-invest in Domain 3. Security questions are often scenario-based (“the customer has FedRAMP requirements and a 3-2-1-1-0 mandate; which combination of features applies?”), so confidence in DataLock, FortKnox, and DataHawk pays disproportionate returns compared with rote memorization of, say, a backup policy slider [Source: https://www.cohesity.com/academy/certification/].

Recommended Hands-on Prerequisites and Lab Environments

CCAE has no formal prerequisite certification, but the Cohesity Academy explicitly recommends prior hands-on experience with DataProtect, SmartFiles, SiteContinuity, and Helios, plus working knowledge of:

VMware vSphere and Microsoft Hyper-V virtualization
Database protection patterns (Oracle, SQL Server, SAP HANA)
Object storage and S3 semantics
Backup, DR, and hybrid-cloud architecture
Cyber-resiliency and ransomware concepts [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/academy/cohesity-academy-customer-course-catalog.pdf]

The most reliable lab is a Helios sandbox tenant connected to either a Cohesity Virtual Edition cluster running on a workstation-class hypervisor or a Cloud Edition trial in AWS or Azure. The Cohesity Academy course catalog maps specific instructor-led and on-demand modules to each exam section, which is the highest-signal study artifact you can use [Source: https://www.cohesity.com/academy/].

Exam Delivery, Scoring, and Recertification Cycle

Attribute	Value
Duration	90 minutes
Cost	$200 USD
Passing score	60%
Validity	2 years
Retake policy	14-day waiting period between attempts
Delivery	Online proctored

Source data is consistent across the official preparation guide and independent exam-tracking sites [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf] [Source: https://www.nwexam.com/cohesity]. The two-year validity window means you should plan recertification roughly six months before expiry, especially because Cohesity Data Cloud feature velocity is high and the blueprint occasionally adds topics such as DirectIO, FortKnox, and DataHawk [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/netbackup-directio-powering-the-next-era-of-cyber-resilience.pdf].

How CCAE Differs from CCSE and CCPE

The Cohesity certification track is layered. CCAE sits at the top of the architect/expert tier, and the foundational, associate, and professional credentials (CCA, CCSE, CCPE) feed candidates into it [Source: https://www.cohesity.com/academy/certification/].

Certification	Audience	Focus	Depth
CCA (Cohesity Certified Associate)	Operators, junior admins	Cluster basics, daily operations	Foundational
CCSE (Cohesity Certified Sales Engineer)	Pre-sales SEs	Product positioning, demo, light design	Associate
CCPE (Cohesity Certified Professional Engineer)	Senior admins, consultants	Implementation, deployment, Day-2 ops	Professional
CCAE (Cohesity Certified Architect Expert)	Solutions architects, senior consultants	End-to-end design, sizing, security, multi-cloud	Expert

Analogy. Think of the certification path the way an aviation track works: a CCSE is a flight instructor who can demonstrate the cockpit, a CCPE is a line pilot who can fly the aircraft daily, and a CCAE is the aerospace architect who designs the airframe and mission profile. The CCAE exam consequently leans heavily on why you would choose a topology rather than how you click through a wizard [Source: https://www.cohesity.com/academy/].

Figure 1.1: Cohesity certification track progression from foundational to expert tier

flowchart TD
    CCA["CCA - Cohesity Certified Associate<br/>Operators and junior admins<br/>Cluster basics, daily operations"]
    CCSE["CCSE - Cohesity Certified Sales Engineer<br/>Pre-sales engineers<br/>Product positioning, demo, light design"]
    CCPE["CCPE - Cohesity Certified Professional Engineer<br/>Senior admins and consultants<br/>Implementation, deployment, Day-2 ops"]
    CCAE["CCAE - Cohesity Certified Architect Expert<br/>Solutions architects, senior consultants<br/>End-to-end design, sizing, security, multi-cloud"]

    CCA --> CCSE
    CCA --> CCPE
    CCSE --> CCAE
    CCPE --> CCAE

    style CCA fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style CCSE fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style CCPE fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style CCAE fill:#238636,stroke:#58a6ff,color:#ffffff

Key Takeaway: CCAE is a 90-minute, four-domain, design-oriented exam dominated by the 35%-weighted Solution Discovery and Design domain. Plan 30 days of preparation, weight your time against the blueprint, and treat hands-on Helios + Virtual Edition labs as non-negotiable. The certification is the architect-tier capstone above CCA, CCSE, and CCPE.

Cohesity DataPlatform Architecture Pillars

Every CCAE exam scenario eventually traces back to four architectural pillars: a single distributed file system, a hyperconverged scale-out node model, MapReduce-style background services, and strict consistency. If you internalize these pillars, you can derive most design answers from first principles.

SpanFS Distributed File System

SpanFS is Cohesity’s distributed, web-scale file system that consolidates backups, files, objects, dev/test copies, and analytics onto a single tier of storage [Source: https://www.cohesity.com/platform/spanfs/]. Architecturally, SpanFS layers four major subsystems:

Access Layer — exposes industry-standard NFS, SMB, and S3 protocols (with OST and DirectIO for NetBackup integration) on the same volumes via virtual IPs, with no master node and no protocol-specific choke point [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
I/O Engine — chunks data, performs variable-length global deduplication (inline or post-process), compresses, encrypts, indexes, and tiers blocks across SSD, HDD, and cloud based on access pattern [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
Metadata Management — uses a distributed key-value store built on a patented B+ tree, replicated and sharded consistently across every node. SnapTree (paired with SpanFS) implements Distributed Redirect-on-Write (D-ROW) for unlimited snapshots and clones with effectively zero performance penalty [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
Storage and Distribution — fully distributed across hyperconverged x86 nodes, dynamically rebalanced, and protected by erasure coding or replication [Source: https://blogs.vmware.com/affiliates/cohesity-spanfs-the-difference-maker-in-the-enterprise-and-secondary-storage-architectures].
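
To see why redirect-on-write makes snapshots and clones cheap, consider the toy sketch below. It is not SnapTree: it uses a small binary trie instead of a B+ tree, and every name in it (`Node`, `write`, `read`, `DEPTH`) is hypothetical. The point it illustrates is structural sharing: a write to a clone allocates new nodes only along the path to the changed block, while the original root and all untouched subtrees stay shared and unmodified.

```python
from dataclasses import dataclass
from typing import Optional

DEPTH = 4  # toy address space: 2**4 = 16 blocks


@dataclass(frozen=True)
class Node:
    """Immutable tree node; chunk_id is populated on leaf nodes only."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    chunk_id: Optional[str] = None


def write(root: Optional[Node], block: int, chunk_id: str, depth: int = DEPTH) -> Node:
    """Return a NEW root; only nodes on the path to `block` are newly allocated."""
    if depth == 0:
        return Node(chunk_id=chunk_id)
    node = root if root is not None else Node()
    if (block >> (depth - 1)) & 1:
        return Node(left=node.left, right=write(node.right, block, chunk_id, depth - 1))
    return Node(left=write(node.left, block, chunk_id, depth - 1), right=node.right)


def read(root: Optional[Node], block: int) -> Optional[str]:
    """Walk from the root to the leaf for `block`; None if it was never written."""
    node = root
    for depth in range(DEPTH, 0, -1):
        if node is None:
            return None
        node = node.right if (block >> (depth - 1)) & 1 else node.left
    return node.chunk_id if node is not None else None


base = write(write(None, 0, "chunk-a"), 9, "chunk-b")  # root acting as a snapshot
clone = write(base, 9, "chunk-c")                      # redirected write on a clone

print(read(base, 9), read(clone, 9))   # the snapshot still sees its original chunk
print(clone.left is base.left)         # the untouched subtree is shared, not copied
```

Because the old root is never mutated, keeping a snapshot costs nothing beyond retaining a pointer to it, which is the intuition behind the "effectively zero performance penalty" claim.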

Note: A 2015 USENIX paper titled “SpanFS” describes an unrelated academic project. The authoritative source for the Cohesity implementation is Cohesity’s own SpanFS and SnapTree whitepaper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf] [Source: https://www.usenix.org/system/files/conference/atc15/atc15-paper-kang.pdf].

Analogy. SpanFS is to a Cohesity cluster what HDFS is to a Hadoop cluster, except that SpanFS adds strict consistency, multi-protocol access (NFS, SMB, S3), unlimited writable snapshots and clones, and tenant isolation, none of which HDFS offers natively in the same form.

Hyperconverged Scale-out Node Model

A Cohesity cluster is a shared-nothing collection of x86 nodes; each node contributes CPU, memory, NVMe/SSD (used for metadata and write caching), and capacity HDDs to a single SpanFS namespace. There is no separate metadata controller, and no node is more privileged than another. This means:

Throughput scales linearly with node count.
Metadata service capacity grows with each added node.
Failure of any single node degrades but does not stop the cluster (assuming RF or EC policy permits) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-platform-security-white-paper-en.pdf].

Property	Traditional Two-Tier Backup	Cohesity Hyperconverged
Compute / storage scaling	Independent, often imbalanced	Coupled, balanced per node
Metadata controller	Dedicated server, bottleneck risk	Distributed across all nodes
Add-capacity workflow	Re-rack, re-license, migrate	Add ReadyNode, auto-rebalance
Failure blast radius	Often whole-array	Single node, EC-bounded

MapReduce-style Indexing and Global Deduplication

Apollo, the cluster-wide background services engine, runs MapReduce-style jobs across every node to perform garbage collection, post-process deduplication, indexing, file analytics, and integrity scrubbing [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. The deduplication itself is variable-length sliding-window, applied globally across the entire cluster — meaning two backup jobs that ingest the same OS image from different sources still collapse to a single physical chunk.
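
The idea of variable-length, sliding-window, global deduplication can be illustrated with a toy content-defined chunker. This is a minimal sketch under stated assumptions, not Cohesity's implementation: the window size, the boundary mask, the use of CRC32 as the window hash, and the `ChunkStore` class are all choices made for this example.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes in the sliding window (illustrative value)
MASK = 0x3F   # cut when the low 6 hash bits are zero, so chunks average ~64 bytes


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries found by a sliding window."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """One store shared by every source: the 'global' part of global dedupe."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def ingest(self, data: bytes) -> int:
        """Ingest a stream and return how many new physical bytes were stored."""
        new_bytes = 0
        for piece in chunk(data):
            key = hashlib.sha256(piece).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = piece
                new_bytes += len(piece)
        return new_bytes

    @property
    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


rng = random.Random(7)
os_image = bytes(rng.getrandbits(8) for _ in range(20_000))  # stand-in for a shared OS image

store = ChunkStore()
first = store.ingest(os_image)                      # source A
second = store.ingest(b"host-b-header" + os_image)  # source B: same image, shifted by a prefix
print(f"source A stored {first} bytes, source B added only {second} bytes")
```

Because boundaries depend on content rather than on fixed offsets, the 13-byte prefix in source B disturbs only the first chunk; the remaining chunks realign and collapse onto the copies already stored for source A. Fixed-size chunking would lose that alignment entirely.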

Worked example: dedupe ratio for a mixed VM estate. A customer protects 500 Windows VMs averaging 80 GB each, with significant OS overlap. Front-end TB (FETB) is roughly 40 TB. Variable-length global dedupe typically delivers 4-6x on this profile, and inline compression layers another 1.5-2x on top, for a combined reduction of roughly 6-12x. Effective stored capacity therefore lands near 3.3-6.7 TB before applying RF/EC, which is why Cohesity sizing tools can return surprisingly low usable-capacity requirements. We unpack the math in Chapter 3, but the architectural point is that Apollo, not Bridge, is what makes those ratios stick over time through post-process re-deduplication and garbage collection [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
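
The arithmetic behind this worked example can be checked with a short calculation. The sketch assumes decimal units (1 TB = 1000 GB) and that the dedupe and compression ratios multiply independently; the function name is illustrative only.

```python
def stored_capacity_tb(front_end_tb: float, dedupe_ratio: float, compression_ratio: float) -> float:
    """Physical TB after data reduction, before RF/EC overhead is applied."""
    return front_end_tb / (dedupe_ratio * compression_ratio)


vm_count, avg_vm_gb = 500, 80
fetb = vm_count * avg_vm_gb / 1000               # front-end TB

optimistic = stored_capacity_tb(fetb, 6, 2.0)    # 6x dedupe, 2x compression (12x combined)
conservative = stored_capacity_tb(fetb, 4, 1.5)  # 4x dedupe, 1.5x compression (6x combined)

print(f"FETB = {fetb:.0f} TB")
print(f"Stored capacity: {optimistic:.1f} to {conservative:.1f} TB before RF/EC")
```

With the quoted ratios the combined reduction is 6x to 12x, so 40 TB of front-end data reduces to roughly 3.3 to 6.7 TB of stored capacity.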

Strict Consistency and Quorum Semantics

SpanFS is strictly consistent, meaning any node can serve any I/O for any object and clients always see the latest committed state [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. The cluster uses a Paxos-style quorum to maintain metadata agreement; a majority of nodes must remain healthy for the cluster to continue accepting writes. This is the architectural reason that the minimum supported size for a multi-node cluster is three or four nodes depending on form factor: a two-node cluster cannot retain a majority after a single-node failure [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
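
The majority rule can be made concrete with a small calculation. This is generic quorum arithmetic for any majority-based protocol, not a description of Cohesity internals; the function names are illustrative.

```python
def quorum_size(nodes: int) -> int:
    """Smallest strict majority of a cluster with `nodes` members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (1, 2, 3, 4, 5):
    print(f"{n} node(s): quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

A two-node cluster needs both nodes for a majority and therefore tolerates no failure, while three nodes are the first size that survives one. Note that four nodes still tolerate only one failure, so the fourth node adds capacity and throughput rather than additional quorum resilience.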

Key Takeaway: SpanFS combines a distributed-metadata architecture, hyperconverged x86 nodes, MapReduce-style background services in Apollo, and strict consistency to deliver linear scale without master/slave bottlenecks. Quorum requirements drive the three- or four-node minimums you will see in every sizing question.

Core Services and Software Stack

Beneath the CCAE exam’s scenario language are a handful of cooperating services. Memorizing what they do — and what they don’t do — is one of the highest-yield activities for Domain 1.

Bridge, Apollo, Iris, and Magneto Services

Every node runs the same software stack. Four services do most of the architecturally interesting work [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]:

Service	Role	Responsibilities
Bridge	SpanFS data path	Chunking, dedupe, compression, encryption, erasure coding, tiering, NFS/SMB/S3 protocol stacks
Apollo	Background analytics & MapReduce	Garbage collection, post-process dedupe, indexing, scrubbing, file analytics
Magneto	Data protection orchestration	Backups, snapshots, replication, archive, recovery workflows; talks to vCenter, DBs, NAS, cloud
Iris	Management UI/control plane	Web UI, REST API, CLI dispatch, RBAC enforcement

A second tier of services rounds out the platform. ScribeStore is the underlying distributed key-value metadata store that holds inodes, chunk locations, and snapshot trees [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Yoda is the global indexing and search service that powers cross-cluster file and object search.
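
As a study aid, the service roles described above can be captured in a simple lookup. The structure and the `owners_of` helper are hypothetical conveniences for self-quizzing; the responsibilities themselves are copied from the table and the paragraph above.

```python
# Responsibilities transcribed from the service table and the second-tier description.
SERVICE_RESPONSIBILITIES = {
    "Bridge": ("chunking", "dedupe", "compression", "encryption",
               "erasure coding", "tiering", "nfs/smb/s3 protocol stacks"),
    "Apollo": ("garbage collection", "post-process dedupe", "indexing",
               "scrubbing", "file analytics"),
    "Magneto": ("backups", "snapshots", "replication", "archive", "recovery workflows"),
    "Iris": ("web ui", "rest api", "cli dispatch", "rbac enforcement"),
    "ScribeStore": ("inodes", "chunk locations", "snapshot trees"),
    "Yoda": ("global indexing and search",),
}


def owners_of(behavior: str) -> list[str]:
    """Return every service whose listed responsibilities mention `behavior`."""
    needle = behavior.lower()
    return [
        service
        for service, duties in SERVICE_RESPONSIBILITIES.items()
        if any(needle in duty for duty in duties)
    ]


print(owners_of("garbage collection"))
print(owners_of("dedupe"))
print(owners_of("RBAC"))
```

Querying "dedupe" returns both Bridge and Apollo, which is a useful reminder that deduplication happens in the data path as well as in post-process background jobs.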

Analogy. Picture a hospital. Bridge is the operating theatre: it is where actual patient I/O happens. Magneto is the intake and discharge coordinator: it decides when patients (workloads) arrive, how often they get checked, and where they go. Apollo is the night cleaning crew: it sweeps, sorts, and reorganizes the building so the next day’s operations stay efficient. Iris is the front desk: every management interaction goes through it. Yoda is the medical records librarian.

Figure 1.2: Layered Cohesity DataPlatform architecture from hardware to control plane

flowchart LR
    subgraph L1["Hardware Layer"]
        HW["x86 Nodes<br/>CPU + Memory<br/>NVMe/SSD + HDD"]
    end

    subgraph L2["Storage Foundation"]
        SPANFS["SpanFS<br/>Distributed File System<br/>NFS / SMB / S3 / OST"]
        SCRIBE["ScribeStore<br/>Distributed KV Metadata"]
    end

    subgraph L3["Core Services"]
        BRIDGE["Bridge<br/>Data Path"]
        APOLLO["Apollo<br/>MapReduce Background Jobs"]
        MAGNETO["Magneto<br/>Protection Orchestration"]
        IRIS["Iris<br/>UI / API / RBAC"]
        YODA["Yoda<br/>Global Search"]
    end

    subgraph L4["Workload Products"]
        DP["DataProtect<br/>Backup and Recovery"]
        SF["SmartFiles<br/>Files and Objects"]
        SC["SiteContinuity<br/>DR Orchestration"]
    end

    subgraph L5["Control Plane"]
        HELIOS["Helios<br/>SaaS Multicloud Management<br/>and AI Insights"]
    end

    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> L5

    style L1 fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style L2 fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style L3 fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style L4 fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style L5 fill:#0d1117,stroke:#58a6ff,color:#ffffff

Yoda Search Service and Global Indexing

Yoda makes global search possible across an entire fleet of clusters. When a backup completes, Magneto signals Apollo to index file paths and metadata; that index is then surfaced through Yoda for queries originating in either the local Iris UI or the Helios global console. This is what allows a CCAE-exam scenario such as “find every PDF named contract-2024.pdf across 14 clusters in 8 regions” to return in seconds [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].

Helios SaaS Control Plane

Helios is Cohesity’s multicloud SaaS management plane that consolidates DataProtect, SmartFiles, and SiteContinuity into a single operations and reporting surface [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/]. It does not host customer backup data; it hosts control and insight:

Global dashboards, SLA reporting, and capacity forecasting
Cross-cluster search via Yoda
AI-driven anomaly detection and ransomware threat indicators
Policy-based fleet management
Entry point for SaaS-only services such as DataProtect-as-a-Service, FortKnox cyber vault, and DataHawk

For air-gapped or sovereign environments, Cohesity offers Helios Self-Managed, a customer-hosted variant called out explicitly in the CCAE Solution Discovery and Design domain [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Marketplace Apps and the Cohesity App Framework

The Cohesity App Framework allows third parties (and customers) to deploy containerized applications directly onto cluster nodes, sandboxed away from the data path. Typical app categories include compliance scanners, anti-virus engines (often via ICAP), eDiscovery tools, and analytics workloads. The architectural advantage is data gravity — the apps run where the data already lives rather than pulling petabytes across the network [Source: https://www.cohesity.com/blogs/analyst-reports-highlight-the-strength-of-cohesitys-multi-use-platform/].

Key Takeaway: Bridge, Apollo, Magneto, and Iris form the four-service backbone; ScribeStore and Yoda support metadata and global search; Helios delivers SaaS-tier visibility; and the Marketplace plus App Framework let analytics run on top of the data without copying it elsewhere. Most exam scenarios reduce to identifying which of these services owns a given behavior.

Cohesity Product Portfolio: DataProtect, SmartFiles, SiteContinuity, Helios

Domain 1 of the exam expects you to differentiate the four headline products and recognize when each is the right answer. They are best understood as personalities sharing the same DataPlatform body.

DataProtect — Backup and Recovery

DataProtect is Cohesity’s flagship enterprise backup product. It protects:

VMware vSphere, Microsoft Hyper-V, and Nutanix AHV virtual machines
Physical Linux and Windows servers
Network-attached storage via SMB, NFS, and NDMP
Databases including Oracle, SQL Server, SAP HANA, and MongoDB
Microsoft 365 (Exchange, OneDrive, SharePoint, Teams)
Kubernetes workloads
Public cloud workloads (AWS EC2/EBS/RDS, Azure, GCP) [Source: https://www.cohesity.com/cohesity-vs-commvault/]

Internally, DataProtect uses Magneto-driven incremental-forever pipelines that persist recovery points as deduplicated SnapTree snapshots inside SpanFS. This is what enables Instant Mass Restore of thousands of VMs from any historical point in time: the data is already in a queryable, mountable filesystem rather than locked in a proprietary backup or tape format [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-guides/the-essential-guide-to-modern-data-management-en.pdf].

SmartFiles — Files, Objects, and Unstructured Data

SmartFiles is Cohesity’s unified file and object service for primary and secondary unstructured data. Because it shares SpanFS with DataProtect, secondary copies of files can be created without provisioning a separate storage silo. Distinguishing characteristics:

Multi-protocol access: SMB3, NFSv3/v4, S3 — including cross-protocol reads (a file written via SMB can be read via S3)
Immutability for ransomware-resistant archive tiers
Encryption and access-pattern analytics
Application use cases ranging from file-share consolidation to S3 object storage for cloud-native apps and AI/ML pipelines [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/]

SiteContinuity — Disaster Recovery Orchestration

SiteContinuity converges backup and DR onto one platform. Where competitors require separate products (Veeam plus Zerto plus VMware SRM, for instance), SiteContinuity orchestrates DR runbooks directly on top of DataProtect snapshots and replication targets [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf]. Capabilities include:

Boot-order and dependency mapping
Network re-IP and VLAN mapping during failover
Non-disruptive DR test failovers
Failback orchestration
Near-zero RTO with Instant Mass Restore-backed runbooks
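
Boot-order and dependency mapping is, at its core, a topological ordering problem. The sketch below shows the idea with Python's standard-library `graphlib`; it does not represent the SiteContinuity runbook format, and the application names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical application tiers: each key depends on the VMs in its value set.
dependencies = {
    "web-frontend": {"app-server"},
    "app-server": {"sql-db", "domain-controller"},
    "sql-db": {"domain-controller"},
    "domain-controller": set(),
}

# static_order() yields every VM only after all of its dependencies.
boot_order = list(TopologicalSorter(dependencies).static_order())

for step, vm in enumerate(boot_order, start=1):
    print(f"{step}. power on {vm}")
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which mirrors a real design rule: a DR runbook with circular boot dependencies cannot be sequenced and must be fixed at design time.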

Helios — SaaS Management and Insights

Helios stitches everything together. It is the single control plane through which you observe and govern any number of DataProtect, SmartFiles, and SiteContinuity instances, on-premises or in cloud, with optional Helios Self-Managed for air-gapped sites [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/].

Side-by-Side Product Comparison

Dimension	DataProtect	SmartFiles	SiteContinuity	Helios
Primary purpose	Backup and recovery	Primary/secondary file & object storage	DR orchestration & failover	SaaS management & AI/insights
Core workloads	VMs, DBs, NAS, M365, cloud	Unstructured data, archives, S3 apps	Mission-critical apps, RTO/RPO tiers	All clusters & services globally
Underlying tech	Magneto + SpanFS	SpanFS multi-protocol Views	DataProtect snapshots + replication + runbooks	Cloud control plane over all clusters
Replaces	Veeam, Veritas NetBackup, Commvault	Isilon, NetApp, ECS	VMware SRM, Zerto	Per-cluster UIs, separate analytics tools
Delivery	Software on cluster / SaaS	Software on cluster	Software + Helios	SaaS (or Self-Managed)

[Source: https://www.cohesity.com/blogs/analyst-reports-highlight-the-strength-of-cohesitys-multi-use-platform/] [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf] [Source: https://www.peerspot.com/products/comparisons/cohesity-data-cloud_vs_opentext-data-protector]

Worked example: picking the right product mix. A retailer wants to consolidate a Veeam-backed VMware estate, retire an aging Isilon cluster, and add a documented DR plan for its top-20 store-operations apps. The architect-grade answer is: deploy DataProtect for the VMware workloads, repurpose the same cluster’s SpanFS as SmartFiles to absorb the Isilon shares, and add SiteContinuity to author and test failover runbooks for the top-20 apps, all observed through one Helios tenant. Three products, one platform, one license envelope.

Figure 1.3: Cohesity Data Cloud product family taxonomy

graph TD
    DC["Cohesity Data Cloud"]

    DP_PLATFORM["DataPlatform<br/>Workload-facing products<br/>sharing SpanFS"]
    HELIOS["Helios<br/>SaaS Control Plane<br/>(or Helios Self-Managed)"]

    DC --> DP_PLATFORM
    DC --> HELIOS

    DP["DataProtect<br/>Backup and Recovery"]
    SF["SmartFiles<br/>Files and Objects"]
    SCONT["SiteContinuity<br/>DR Orchestration"]
    FK["FortKnox<br/>SaaS Cyber Vault"]
    DH["DataHawk<br/>Threat Detection<br/>and Classification"]

    DP_PLATFORM --> DP
    DP_PLATFORM --> SF
    DP_PLATFORM --> SCONT
    DP_PLATFORM --> FK
    DP_PLATFORM --> DH

    HELIOS -.observes.-> DP
    HELIOS -.observes.-> SF
    HELIOS -.observes.-> SCONT
    HELIOS -.observes.-> FK
    HELIOS -.observes.-> DH

    style DC fill:#238636,stroke:#58a6ff,color:#ffffff
    style DP_PLATFORM fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style HELIOS fill:#a371f7,stroke:#58a6ff,color:#ffffff
    style DP fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style SF fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style SCONT fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style FK fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style DH fill:#0d1117,stroke:#58a6ff,color:#ffffff

Key Takeaway: DataProtect, SmartFiles, and SiteContinuity are the three workload-facing products that share a single SpanFS tier and Magneto orchestration. Helios is the SaaS control plane that observes and governs them all. The exam frequently asks which product replaces a legacy point tool — memorize the replacement column above.

Hardware, Cloud, and Virtual Edition Form Factors

Domain 2 (Solution Discovery and Design) routinely asks you to choose a form factor. The wrong hardware decision can sink an otherwise-correct architecture, so understand the trade-offs.

Cohesity-Branded Appliances vs. ReadyNodes vs. Certified Partners

ReadyNodes are pre-validated 1U or 2U x86 reference appliances sold under a fixed bill of materials. Each ReadyNode ships with a balanced ratio of CPU, memory, NVMe/SSD (used for metadata and write caching), and high-density HDDs [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. ReadyNodes are typically branded by partners — Cisco, HPE, Dell, Lenovo — and sold as turnkey hardware that runs Cohesity software.

Form factor	Typical use	Procurement	Support model
Cohesity-branded appliance	Customers wanting a single throat to choke	Direct from Cohesity	Cohesity TAC end-to-end
ReadyNode (Cisco UCS, HPE Apollo, etc.)	Customers with hardware-vendor preferences or existing partnerships	OEM partner	OEM hardware + Cohesity software TAC
Certified third-party server	Custom sizing or specialized hardware	Customer-procured under HCL	Customer-led integration, Cohesity software TAC

Virtual Edition (VE) and Cloud Edition Deployment Models

Cohesity software also runs as virtual machines:

Virtual Edition (VE) runs on VMware vSphere or Microsoft Hyper-V. It is ideal for ROBO sites, dark-site management clusters, or small environments where deploying a physical appliance is uneconomical [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
Cloud Edition runs as native VMs in AWS, Azure, or GCP and is the foundation of CloudReplicate and CloudSpin recovery patterns [Source: https://blogs.vmware.com/affiliates/cohesity-spanfs-the-difference-maker-in-the-enterprise-and-secondary-storage-architectures].

VE and Cloud Edition share the same Bridge/Apollo/Magneto stack as physical clusters; the architectural difference is that storage is supplied by the underlying hypervisor or cloud-provider block service rather than directly attached HDDs.

Robo Edition for Remote and Branch Offices

Robo Edition is a small-footprint variant tuned for branch offices. It typically runs a one- or three-node cluster on lighter hardware, replicates back to a primary regional cluster, and is centrally managed via Helios so that local IT staff aren’t required.

Choosing the Right Form Factor

Scenario	Recommended form factor	Why
200 TB enterprise data center, mixed VM/DB workloads	ReadyNode (Cisco/HPE) or branded appliance	Predictable performance, dense capacity, partner support
50 TB single dark-site classified environment	Physical cluster + Helios Self-Managed	No SaaS dependency, on-prem control
30 retail branch offices, 2 TB each	Robo Edition replicating to regional hub	Small footprint, central Helios management
AWS-resident DR target for on-prem clusters	Cloud Edition	Native AWS, supports CloudReplicate and CloudSpin
Lab / proof-of-concept	Virtual Edition on existing vSphere	Zero hardware cost, fast spin-up
Production on customer-standard Cisco UCS shop	Cisco-branded ReadyNode	Fits procurement and operations model

Worked example: a hybrid topology. An insurance company runs two primary data centers (East and Central US) and 18 branch offices. The architect should place a large physical ReadyNode cluster at each primary site for production protection, deploy Robo Edition at each branch with replication back to the nearest primary cluster, and stand up a Cloud Edition cluster in AWS us-west-2 as a third-site DR target. Helios SaaS unifies fleet management; the local DC clusters can additionally archive to S3 Glacier for long-term retention. Notice how the form-factor decision is downstream of a deeper requirement (RPO, RTO, branch IT capability, sovereignty), which is exactly how the exam frames its scenarios.

Figure 1.4: Form factor decision tree for Cohesity deployment placement

flowchart TD
    START["New Cohesity Workload<br/>Identify Deployment Location"]

    Q1{"Where will the<br/>cluster physically run?"}

    Q_DC{"Data center<br/>requirements?"}
    Q_BRANCH{"Branch office<br/>footprint?"}
    Q_CLOUD{"Cloud-resident<br/>workload?"}
    Q_LAB{"Lab, POC, or<br/>dark site?"}

    APPLIANCE["Cohesity-Branded Appliance<br/>Single throat to choke<br/>Direct from Cohesity TAC"]
    READYNODE["Partner ReadyNode<br/>(Cisco, HPE, Dell, Lenovo)<br/>OEM hardware + Cohesity software"]
    CERTIFIED["Certified Third-Party Server<br/>Custom sizing under HCL"]
    ROBO["Robo Edition<br/>1- or 3-node small footprint<br/>Replicates to regional hub"]
    CLOUDED["Cloud Edition<br/>Native VMs in AWS / Azure / GCP<br/>CloudReplicate and CloudSpin"]
    VE["Virtual Edition<br/>VMware vSphere or Hyper-V<br/>Zero hardware cost"]

    START --> Q1
    Q1 -->|On-premises DC| Q_DC
    Q1 -->|Remote branch| Q_BRANCH
    Q1 -->|Public cloud| Q_CLOUD
    Q1 -->|Test or air-gapped| Q_LAB

    Q_DC -->|Vendor-agnostic, turnkey| APPLIANCE
    Q_DC -->|Existing OEM partnership| READYNODE
    Q_DC -->|Specialized hardware needed| CERTIFIED

    Q_BRANCH -->|Small site, 1-5 TB| ROBO

    Q_CLOUD -->|DR target or cloud workload| CLOUDED

    Q_LAB -->|Hypervisor available| VE
    Q_LAB -->|Sovereign or air-gapped| READYNODE

    style START fill:#238636,stroke:#58a6ff,color:#ffffff
    style Q1 fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style Q_DC fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style Q_BRANCH fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style Q_CLOUD fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style Q_LAB fill:#1f6feb,stroke:#58a6ff,color:#ffffff
    style APPLIANCE fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style READYNODE fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style CERTIFIED fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style ROBO fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style CLOUDED fill:#0d1117,stroke:#58a6ff,color:#ffffff
    style VE fill:#0d1117,stroke:#58a6ff,color:#ffffff

Key Takeaway: ReadyNodes are the workhorse physical form factor; Virtual Edition addresses ROBO, dark-site, and lab scenarios; Cloud Edition powers cloud-resident DR; and Robo Edition addresses branch deployments. Form-factor choice is downstream of business and security requirements — not the other way around.

Chapter Summary

The Cohesity Certified Architect Expert (CCAE) exam tests an architect’s ability to design, size, secure, and integrate Cohesity Data Cloud deployments. The blueprint is dominated by the 35%-weighted Solution Discovery and Design domain, with substantial weight given to Platform Architecture (22%), Security-focused Solutions (18%), and Third-party Integration (13%). The exam runs 90 minutes, costs $200, requires 60% to pass, and is valid for two years — and it sits at the top of a certification track that includes the CCA, CCSE, and CCPE credentials [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Architecturally, every Cohesity scenario reduces to four pillars. First, SpanFS — the distributed file system that consolidates backup, file, object, and analytics data on one tier with strict consistency, multi-protocol access, and SnapTree-based unlimited snapshots [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Second, the hyperconverged scale-out node model where Bridge, Apollo, Magneto, Iris, ScribeStore, and Yoda services run on every x86 node. Third, MapReduce-style background services — Apollo’s role — that maintain global variable-length deduplication, garbage collection, and indexing without disrupting the data path. Fourth, quorum-driven consistency that imposes the familiar three- or four-node minimums [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].

The four headline products — DataProtect (backup and recovery), SmartFiles (files and objects), SiteContinuity (DR orchestration), and Helios (SaaS management) — are different personalities sharing the same SpanFS body, and the form factors (Cohesity-branded appliances, partner ReadyNodes, Virtual Edition, Cloud Edition, Robo Edition) let architects place that body anywhere from a sovereign air-gapped vault to a public-cloud DR target. Learn which service owns which behavior, which product replaces which legacy tool, and which form factor matches which business constraint, and you will recognize the right answer in roughly 70% of CCAE scenario questions [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-guides/the-essential-guide-to-modern-data-management-en.pdf].

Key Terms

Term	Definition
SpanFS	Cohesity’s distributed, web-scale file system that consolidates backups, files, objects, dev/test, and analytics on a single tier; exposes NFS, SMB, S3, OST, and DirectIO via virtual IPs with no master node.
DataPlatform	The composite Cohesity software stack — SpanFS plus the Bridge, Apollo, Magneto, Iris, ScribeStore, and Yoda services — that runs on every cluster node.
DataProtect	Cohesity’s flagship enterprise backup and recovery product covering VMs, physical servers, databases, NAS, M365, Kubernetes, and public-cloud workloads via Magneto-driven pipelines.
Helios	Cohesity’s SaaS multicloud control plane (with a Helios Self-Managed option for dark sites) that unifies monitoring, policy, global search, and AI-driven anomaly detection across clusters.
SmartFiles	Cohesity’s unified file and object service offering NFS, SMB, and S3 access on the same data, with immutability and analytics, sharing SpanFS with DataProtect.
Bridge	The distributed scale-out file system data-path service that performs chunking, dedupe, compression, encryption, erasure coding, tiering, and protocol serving for SpanFS.
Apollo	The cluster-wide MapReduce-style background services engine that performs garbage collection, post-process dedupe, indexing, file analytics, and integrity scrubbing.
Magneto	The data protection orchestration service that drives backups, snapshots, replication, archive, and recovery workflows on top of SpanFS.
ReadyNode	A pre-validated 1U or 2U x86 reference appliance (Cisco, HPE, Dell, Lenovo, etc.) with balanced CPU, memory, NVMe/SSD, and HDD that is the standard physical form factor for Cohesity clusters.
Iris	The management UI and control-plane service that serves the Cohesity web interface, REST APIs, and CLI; enforces RBAC.
Yoda	The global indexing and search service that powers cross-cluster file and object search via Helios.
ScribeStore	The underlying distributed key-value metadata store that holds inodes, chunk locations, snapshot trees, and indexes; replicated and consistently sharded across nodes.
SnapTree	Cohesity’s snapshot and clone technology built on Distributed Redirect-on-Write (D-ROW), enabling unlimited snaps and clones with no performance impact.
SiteContinuity	Cohesity’s converged backup-plus-DR product providing automated failover, failback, and non-disruptive DR test runbooks on top of DataProtect snapshots.
Virtual Edition (VE)	Cohesity software delivered as VMware or Hyper-V virtual machines for ROBO, dark-site, or lab use.
Cloud Edition	Cohesity software delivered as native cloud VMs in AWS, Azure, or GCP, enabling CloudReplicate and CloudSpin DR patterns.
CCAE	Cohesity Certified Architect Expert — the architect-tier certification covering platform architecture, design, security, and third-party integration.

Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics

If Chapter 1 introduced the cast of services that make up the Cohesity DataPlatform, this chapter lifts the floorboards and exposes the plumbing. Beneath every backup job, every SmartFiles share, and every Helios SLA report sits a single distributed file system: SpanFS. SpanFS is responsible for accepting writes from many protocols, deduplicating and compressing them, replicating or erasure-coding them across nodes, and presenting a strictly consistent global namespace, all without a single master node. For a CCAE candidate, the most common architectural mistakes (under-sized clusters, unrecoverable failure domains, dedup ratios that never materialize) trace back to misunderstandings of how SpanFS actually behaves on the wire and on disk. This chapter dissects that behavior.

Learning Objectives

Explain how SpanFS chunks, fingerprints, and stores data across nodes using chunk files, blob files, and the SnapTree metadata index.
Describe Replication Factor (RF) and Erasure Coding (EC) trade-offs, including stripe placement, minimum cluster sizes, and when each scheme is appropriate.
Trace a write I/O end-to-end from client through the Bridge service, NVRAM journal, IO Engine, and into chunk files on persistent media.
Predict how a cluster behaves under disk failure, node failure, chassis failure, and quorum loss, including rebuild times and management-plane availability.
Choose deduplication and compression policies (inline vs. post-process, variable-length scope) appropriate for VM, NAS, and database workloads.

SpanFS Data Path

The SpanFS data path is the sequence of services and structures that turn a client write into durable, deduplicated, replicated bytes on persistent media. Every Cohesity feature — backup, SmartFiles share, replication target — is ultimately a consumer of this data path.

Chunk files, blob files, and chunk groups

SpanFS stores user data as chunks, not as files in the traditional Linux sense. When data enters the cluster, the IO Engine runs a Rabin rolling hash across the byte stream and slices it at content-defined boundaries. Each resulting chunk is fingerprinted with SHA-1 and looked up in the cluster-wide deduplication hash table [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].

Three on-disk constructs hold these chunks together:

Chunk file: the smallest unit of user data persisted to disk. After dedup and compression, each unique chunk lives in a chunk file, typically tens of kilobytes in size.
Blob file: a container that aggregates many chunk files belonging to a logical object (a VM disk, a SmartFiles file, a database datafile). A blob file exposes a virtual byte range and points into the chunks that back it.
Chunk group: the resiliency unit. Multiple chunk files are bundled into a chunk group that is then replicated (RF) or erasure-coded (EC) as a whole stripe across fault domains. Chunk groups let the cluster amortize the overhead of EC encoding across many small chunks.

The relationship between these three objects is captured in SnapTree, a distributed B+ tree built atop the cluster-wide key-value store. SnapTree’s root nodes represent Views or files; intermediate and leaf nodes ultimately resolve a logical offset within an object to the specific chunk file holding that data [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/]. Because SnapTree uses Distributed Redirect-on-Write (D-ROW) semantics, a snapshot or clone is just a new root pointer that adds effectively zero capacity until divergence occurs.
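The "new root pointer" idea can be illustrated with a toy two-level tree. This is a minimal sketch, not Cohesity's implementation: the names `Leaf`, `Root`, `snapshot`, `write`, and `read` and the `LEAF_SPAN` fan-out are hypothetical, and a real SnapTree is a distributed B+ tree with more levels.

```python
from dataclasses import dataclass

LEAF_SPAN = 4  # hypothetical: logical blocks indexed by one leaf


@dataclass(frozen=True)
class Leaf:
    chunks: tuple  # chunk IDs backing LEAF_SPAN consecutive blocks


@dataclass
class Root:
    leaves: list  # references to Leaf objects; snapshots share them


def snapshot(root: Root) -> Root:
    """Create a snapshot: a new root that points at the same leaves."""
    return Root(leaves=list(root.leaves))


def write(root: Root, block: int, chunk_id: str) -> None:
    """Point one block at a new chunk without touching shared leaves."""
    idx, slot = divmod(block, LEAF_SPAN)
    old = root.leaves[idx]
    updated = old.chunks[:slot] + (chunk_id,) + old.chunks[slot + 1:]
    root.leaves[idx] = Leaf(updated)  # only this root sees the new leaf


def read(root: Root, block: int) -> str:
    idx, slot = divmod(block, LEAF_SPAN)
    return root.leaves[idx].chunks[slot]


live = Root([Leaf(("a", "b", "c", "d")), Leaf(("e", "f", "g", "h"))])
snap = snapshot(live)   # no leaves or chunks are copied
write(live, 5, "f2")    # the live view diverges at block 5
print(read(live, 5), read(snap, 5))
```

After the write, the two roots still share the untouched leaf, so the snapshot costs extra metadata only where the views have diverged.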

Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy

flowchart LR
    Client[Client object<br/>VM disk / file] --> Blob[Blob file<br/>logical container]
    Blob --> ST[SnapTree root<br/>B+ tree index]
    ST --> CF1[Chunk file<br/>deduped + compressed]
    ST --> CF2[Chunk file<br/>deduped + compressed]
    ST --> CF3[Chunk file<br/>deduped + compressed]
    CF1 --> CG[Chunk group<br/>resiliency unit]
    CF2 --> CG
    CF3 --> CG
    CG --> RF[RF2/RF3 copies]
    CG --> EC[EC stripe<br/>across fault domains]

Client object (VM disk)
        |
        v
   Blob file  ---->  SnapTree root
        |                 |
        +--- chunk ref ---+
        +--- chunk ref ---+
        +--- chunk ref ---+
                         |
                         v
                   Chunk file (deduped, compressed)
                         |
                         v
                   Chunk group  ---->  RF2/RF3 copies OR EC stripe

NVRAM journaling and write coalescing

Hyperconverged backup workloads are punishing on disk: thousands of streams arrive concurrently, each with random small writes. To absorb this without driving spinning HDDs into seek-thrashing, every Cohesity node includes an NVRAM region — implemented as a battery- or flash-backed slice of SSD that survives power loss [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].

The mechanism follows a classic journal-then-checkpoint pattern that database engineers will recognize:

The IO Engine appends incoming writes to the NVRAM log sequentially.
The log entry is mirrored to one or two peer nodes (matching the configured Replication Factor) before the client receives the ACK.
Once durable in NVRAM on the required number of nodes, the client gets a low-latency acknowledgement.
A background destage task coalesces many small NVRAM entries into large, sequential disk writes, producing chunk files that play nicely with HDDs.

Analogy: the bank teller’s deposit slip. Imagine a busy bank where customers arrive every few seconds with small deposits. If the teller had to walk to the vault for every transaction, throughput would collapse. Instead the teller scribbles each deposit on a slip (the NVRAM journal), drops the slip into a duplicate carbon-copy file (the mirrored journal on a peer node), and hands the customer a receipt (ACK to the client). At the end of the hour, the back-office staff batches all of that hour’s slips and updates the master ledger in one sweep (destage to chunk files on HDD). If the bank loses power, the carbon copies and slip stack survive, so every transaction can be replayed.
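The journal-then-checkpoint pattern can be sketched in a few lines. This is an illustrative model under stated assumptions, not Cohesity code: `NodeJournal`, `write`, and `destage` are hypothetical names, journals are plain in-memory lists, and writes are assumed not to overlap.

```python
class NodeJournal:
    """Toy journal for one node; stands in for the NVRAM log region."""

    def __init__(self):
        self.entries = []  # (offset, data) tuples in arrival order

    def append(self, offset, data):
        self.entries.append((offset, data))
        return True  # acknowledge that the entry is recorded


def write(local, peers, offset, data):
    """Journal locally, mirror to peers, and only then ACK the client."""
    acks = [local.append(offset, data)]
    acks += [peer.append(offset, data) for peer in peers]
    return "ACK" if all(acks) else "RETRY"


def destage(local, peers):
    """Coalesce journaled writes into large sequential extents.

    Assumes non-overlapping writes, which keeps the merge logic short.
    """
    extents = []
    for offset, data in sorted(local.entries):
        if extents and extents[-1][0] + len(extents[-1][1]) == offset:
            start, merged = extents.pop()
            extents.append((start, merged + data))  # contiguous: merge
        else:
            extents.append((offset, data))
    local.entries.clear()
    for peer in peers:
        peer.entries.clear()  # mirrored copies are no longer needed
    return extents


local, peer = NodeJournal(), NodeJournal()
for off in (8, 0, 4, 100):
    write(local, [peer], off, b"abcd")
print(destage(local, [peer]))
```

Four small out-of-order writes leave the journal as two extents, which is the property that lets the disk tier see sequential I/O.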

Figure 2.2: SpanFS write I/O path from client to durable disk

sequenceDiagram
    participant C as Client
    participant V as Protection VIP
    participant B as Bridge Service
    participant IO as IO Engine
    participant N1 as NVRAM (Local)
    participant N2 as NVRAM (Peer)
    participant KV as Distributed KV
    participant D as Disk (HDD/SSD)

    C->>V: NFS/SMB/S3 WRITE
    V->>B: Route to node
    B->>IO: Hand off byte stream
    IO->>IO: Rabin chunk + SHA-1 fingerprint
    IO->>KV: Dedup hash lookup
    KV-->>IO: Hit/Miss per chunk
    IO->>N1: Append journal entry
    IO->>N2: Mirror journal (RF2)
    N2-->>IO: Mirror ack
    N1-->>IO: Local ack
    IO-->>B: Durable in NVRAM
    B-->>C: ACK to client
    Note over IO,D: Background destage (asynchronous)
    IO->>D: Coalesce + write chunk files
    IO->>KV: Commit SnapTree update (quorum)

Read path and locality optimizations

Reads in SpanFS are deliberately asymmetric to writes. A read request enters via any node’s protocol head (NFS, SMB, S3, or the Bridge service for backup data). That node:

Resolves the logical offset against the SnapTree to obtain the chunk fingerprint.
Looks up the chunk’s location in the distributed key-value store.
Fetches the chunk from local disk if present, or from the peer node holding the replica/EC fragment.
Decompresses and returns the data to the client.
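The four read steps above map onto a short lookup chain. The sketch below is a simplification with hypothetical names (`read_chunk`, `snaptree`, `locations`, `disks`); plain dictionaries stand in for SnapTree and the key-value store, and `zlib` from the Python standard library stands in for whatever compression algorithm the cluster actually uses.

```python
import zlib


def read_chunk(offset, snaptree, locations, disks, local_node):
    """Follow the four read-path steps for one logical offset."""
    fingerprint = snaptree[offset]        # 1. offset -> chunk fingerprint
    holders = locations[fingerprint]      # 2. fingerprint -> holding nodes
    # 3. prefer the local copy, otherwise fetch from a peer
    source = local_node if local_node in holders else holders[0]
    compressed = disks[source][fingerprint]
    return zlib.decompress(compressed), source  # 4. decompress and return


payload = b"backup data " * 8
disks = {"node1": {}, "node2": {"fp-1": zlib.compress(payload)}}
snaptree = {0: "fp-1"}
locations = {"fp-1": ["node2"]}

data, served_from = read_chunk(0, snaptree, locations, disks, "node1")
print(served_from, data == payload)
```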

SpanFS exploits locality in two ways. First, the cluster prefers to ingest, dedup, and persist data on the same node that received it, so that subsequent reads of recent data are local. Second, the read cache (in DRAM and SSD) is per-node and warmed by access patterns; hot chunks stay close to the protocol head serving them. For sequential restore workloads (e.g., Instant Mass Restore of a 1 TB VM), SpanFS issues parallel reads to multiple nodes simultaneously, exploiting the fact that the chunk group’s fragments span the cluster.

Garbage collection and chunk reclamation

Because chunks are deduplicated, deleting a backup or expiring a snapshot does not immediately free space. A chunk is reclaimable only when its reference count in the distributed KV store drops to zero. SpanFS runs a background garbage collection (GC) process that:

Walks the SnapTree to identify orphaned chunk references.
Decrements reference counts for chunks released by deleted snapshots, expired backups, or overwritten regions.
Compacts blob files by rewriting still-live chunks contiguously and freeing the original blob extents.
Re-runs erasure coding on cold chunk groups that initially landed in RF2 (described in the next section).

GC is throttled to avoid contending with foreground I/O, which is why architects often see capacity reclaim lagging deletion events by hours or days. For sizing, always model worst-case GC lag (commonly 24–72 hours) when planning capacity headroom.
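Why deletion does not immediately free space becomes clearer with a reference-count sketch. The function name `expire_snapshot` and the dictionary-based counter are assumptions for illustration; the real bookkeeping lives in the distributed KV store.

```python
def expire_snapshot(refcounts, snapshot_chunks):
    """Drop one snapshot's references; return chunks that became free."""
    reclaimable = []
    for fingerprint in snapshot_chunks:
        refcounts[fingerprint] -= 1
        if refcounts[fingerprint] == 0:
            del refcounts[fingerprint]
            reclaimable.append(fingerprint)
    return reclaimable


# Two snapshots share chunk "b"; only "a" is unique to the expired one.
refcounts = {"a": 1, "b": 2, "c": 1}
freed = expire_snapshot(refcounts, ["a", "b"])
print(freed, refcounts)
```

Expiring a snapshot that referenced two chunks frees only one of them, because the other is still referenced elsewhere.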

Key Takeaway: SpanFS implements a journal-then-checkpoint write path: NVRAM absorbs random writes for low-latency ACKs, the IO Engine deduplicates and compresses chunks, chunk groups apply RF or EC, and SnapTree indexes everything. Reads exploit locality and parallel fetch; garbage collection runs asynchronously, so deleted capacity returns over hours, not seconds.

Resiliency: RF and Erasure Coding

A SpanFS cluster keeps user data safe through two complementary mechanisms: Replication Factor (RF) and Erasure Coding (EC). Both can coexist within a single cluster — even within a single View Box — and the architect’s job is to choose the right policy for each workload.

Replication Factor 2 vs. RF3 trade-offs

Replication is the simplest scheme: store N identical copies of every chunk group on N different fault domains. Cohesity supports RF2 (two copies) and RF3 (three copies) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf].

RF2 survives one simultaneous failure (node, disk, or chassis depending on the configured fault domain). Recovery is a fast block-level copy from the surviving replica. Capacity overhead is 100% — usable capacity is roughly 50% of raw.
RF3 survives two simultaneous failures and is required for the strictest SLAs or small clusters where a second concurrent failure during rebuild is plausible. Overhead is 200%, leaving roughly 33% of raw as usable capacity.

RF is preferred for hot data: NVRAM journaling, freshly destaged chunks, and any workload with low-latency write SLAs. The reason is simple: writing a mirror is cheaper than encoding a stripe, and reading is just a local fetch from any surviving copy.

Erasure coding schemes (2:1, 4:2, 6:2)

Erasure coding stripes data and parity fragments across many fault domains using Reed-Solomon-style mathematics. Cohesity supports several schemes; the most commonly cited are EC 2:1, EC 4:2, and EC 6:2 [Source: http://vastitservices.com/wp-content/uploads/2020/03/VAST-Flyer-Cohesity-DataPlatform.pdf].

Scheme	Data Frags	Parity Frags	Min. Fault Domains	Failures Tolerated	Usable Capacity	Overhead
RF2	1	1 (mirror)	3 nodes	1	~50%	100%
RF3	1	2 (mirrors)	4 nodes	2	~33%	200%
EC 2:1	2	1	4 nodes	1	~67%	50%
EC 4:2	4	2	6 nodes	2	~67%	50%
EC 6:2	6	2	8 nodes	2	~75%	33%

The headline observation: EC 4:2 and RF3 tolerate the same number of failures, but EC 4:2 consumes half the raw capacity for the same data (1.5x versus 3x, or 50% overhead versus 200%) [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]. In a 100 TB raw cluster, RF3 yields ~33 TB usable while EC 4:2 yields ~67 TB usable, and both survive two simultaneous failures. The cost is encoding CPU and a more complex rebuild path.
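The usable-capacity and overhead columns in the table follow directly from the fragment counts. The helper names `replication` and `erasure` below are hypothetical; the arithmetic is simply usable = data / (data + parity) and overhead = parity / data.

```python
def replication(copies):
    """Return (usable fraction of raw, overhead) for N full copies."""
    return 1 / copies, float(copies - 1)


def erasure(data_frags, parity_frags):
    """Return (usable fraction of raw, overhead) for an EC scheme."""
    total = data_frags + parity_frags
    return data_frags / total, parity_frags / data_frags


schemes = {
    "RF2": replication(2),
    "RF3": replication(3),
    "EC 2:1": erasure(2, 1),
    "EC 4:2": erasure(4, 2),
    "EC 6:2": erasure(6, 2),
}
for name, (usable, overhead) in schemes.items():
    print(f"{name}: usable {usable:.0%}, overhead {overhead:.0%}, "
          f"{100 * usable:.0f} TB usable from 100 TB raw")
```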

Inline vs. post-process erasure coding

Cohesity rarely writes EC-protected chunks directly on the hot ingest path. Instead, the platform uses a two-stage resiliency pipeline:

Stage 1 — Inline RF2: Hot, freshly ingested chunks land in NVRAM and SSD with RF2 protection. This minimizes write latency because mirroring is cheap.
Stage 2 — Post-process EC: A background task identifies cold chunk groups, re-encodes them under the View Box’s EC policy (e.g., 4:2), persists the new EC stripe, and frees the RF2 copies.

This staging is the reason Cohesity can advertise both low write latency and high storage efficiency: it pays a capacity cost for hot data by mirroring, and pays the encoding cost for cold data when no client is waiting. Architects must factor in the transient RF2 overhead when sizing: you cannot assume EC 4:2 efficiency on day-one capacity, because freshly ingested data lives at RF2 for some period.
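The sizing consequence can be expressed as a blended multiplier. This is a rough planning sketch, not a substitute for the Cohesity sizing tool: `raw_needed_tb` is a hypothetical helper, and `hot_fraction` (the share of stored data still waiting at RF2) is an assumption the architect has to estimate.

```python
def raw_needed_tb(stored_tb, hot_fraction, data_frags=4, parity_frags=2):
    """Raw capacity for post-reduction data when part is still at RF2."""
    hot = stored_tb * hot_fraction * 2                    # RF2: two copies
    ec_factor = (data_frags + parity_frags) / data_frags  # 1.5 for EC 4:2
    cold = stored_tb * (1 - hot_fraction) * ec_factor
    return hot + cold


for hot_fraction in (0.0, 0.2, 1.0):
    need = raw_needed_tb(100, hot_fraction)
    print(f"{hot_fraction:.0%} still at RF2 -> {need:.0f} TB raw")
```

With 100 TB stored, the raw requirement ranges from 150 TB (everything already erasure-coded) to 200 TB (everything still mirrored), which is the headroom the paragraph above warns about.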

Figure 2.3: Erasure coding stripe placement decision flow

flowchart TD
    Ingest[New write arrives at node] --> RF2Land[Stage 1: Land at RF2<br/>NVRAM mirror + SSD]
    RF2Land --> Ack[Client ACK<br/>low latency]
    Ack --> Cold{Chunk group<br/>cold?}
    Cold -- No --> Stay[Remain at RF2<br/>hot tier]
    Stay --> Cold
    Cold -- Yes --> ECCheck{Enough fault<br/>domains for EC?}
    ECCheck -- No --> StayRF[Keep at RF2/RF3<br/>per View Box policy]
    ECCheck -- Yes --> Encode[Stage 2: Reed-Solomon encode<br/>EC 2:1 / 4:2 / 6:2]
    Encode --> Place[Distribute fragments<br/>across fault domains]
    Place --> Free[Release original<br/>RF2 copies]
    Free --> Reclaim[GC reclaims capacity]

Choosing resiliency policies per View Box

Resiliency policy is configured at the View Box (storage domain) level, which means different workloads can have different policies on the same cluster. Practical guidelines:

Workload	Recommended Policy	Rationale
Production backups, long retention	EC 4:2 (post-process)	Cold data dominates; capacity efficiency matters
Hot SmartFiles primary share	RF2 (or RF3)	Latency-sensitive; minimal encoding overhead
Small cluster (<6 nodes)	RF2 or RF3	EC 4:2 requires 6 fault domains
Compliance / WORM	RF3 or EC 4:2	Two-failure tolerance
CloudArchive tier	Cloud provider’s own	Native S3/Azure durability replaces local RF/EC

Key Takeaway: RF2 is fast and cheap to rebuild but consumes twice the raw capacity of the data it protects; RF3 tolerates a second failure at 3x raw consumption. EC 4:2 matches RF3’s fault tolerance at half the raw consumption (1.5x) but requires at least 6 fault domains and more CPU. Cohesity blends both, with RF2 inline and EC in the background, so plan capacity assuming RF2 for hot data and the View Box’s EC policy for cold.

Deduplication and Compression

Storage efficiency is the difference between selling a customer 100 TB of usable capacity and 500 TB of effective capacity. Cohesity’s headline efficiency multipliers come from global, variable-length deduplication combined with inline compression.

Variable-length and fixed-length sliding-window dedupe

Most storage systems either store data in fixed 4 KiB or 8 KiB blocks (fixed-length dedup) or in chunks aligned to filesystem boundaries. Cohesity uses variable-length chunking driven by Rabin fingerprinting [Source: https://www.cohesity.com/blogs/global-deduplication-matters/]:

The IO Engine slides a rolling hash window across the byte stream.
When the rolling hash matches a content-defined breakpoint pattern, a chunk boundary is declared.
Each resulting chunk gets a SHA-1 fingerprint.
The fingerprint is queried against the global hash table; if found, only a metadata reference is stored.

The advantage is insertion-resilience. If a 1 KB header is prepended to a previously-backed-up file, fixed-block dedup will produce all new fingerprints because every block has shifted. Variable-length dedup will produce one new chunk (containing the header) and reuse every chunk after the next content-defined boundary. For incremental and synthetic-full backup workloads, this difference can be 5–10x in dedup ratio.
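The insertion-resilience argument can be checked with a small experiment. The sketch below uses a simple polynomial rolling hash as a stand-in for true Rabin fingerprinting, and the window size, mask, and function names are illustrative assumptions rather than Cohesity parameters. Boundaries depend only on the last `WINDOW` bytes, so they realign after the inserted header.

```python
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling hash (assumed)
BASE = 257
MOD = (1 << 61) - 1
MASK = (1 << 10) - 1   # cut when the low 10 bits are all set: ~1 KiB chunks
TOP = pow(BASE, WINDOW - 1, MOD)


def variable_chunks(data):
    """Cut where the hash of the last WINDOW bytes matches the pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}


rng = random.Random(7)
original = bytes(rng.getrandbits(8) for _ in range(32 * 1024))
header = bytes(rng.getrandbits(8) for _ in range(1000))
shifted = header + original  # same file with a ~1 KB header prepended

for name, chunker in (("variable", variable_chunks), ("fixed", fixed_chunks)):
    old = fingerprints(chunker(original))
    new = fingerprints(chunker(shifted))
    print(f"{name}: {len(old & new)} of {len(new)} chunks already stored")
```

Every variable-length chunk after the first one in the original reappears unchanged in the shifted copy, while none of the fixed 4 KiB blocks match.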

Global vs. local dedupe domains

The dedup hash table lives in the distributed key-value store and is replicated across nodes. The scope, however, is the View Box. Two View Boxes on the same cluster maintain separate hash tables, which means a chunk written into View Box A is not deduplicated against an identical chunk written into View Box B [Source: https://www.cohesity.com/glossary/data-deduplication/].

This has architectural consequences:

Multi-tenancy: Per-tenant View Boxes provide cryptographic and capacity isolation, at the cost of dedup blindness across tenants.
Workload grouping: Place similar workloads (e.g., all VMware backups) in the same View Box to maximize dedup. Mixing fundamentally different data (encrypted databases plus office documents) in one View Box dilutes ratios.
Encryption: Per-View-Box keys are fine; per-tenant keys break dedup across tenants regardless of View Box settings, because identical plaintext encrypts to different ciphertext.
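The View Box scoping rule amounts to keeping one fingerprint table per View Box. The `DedupIndex` class below is a hypothetical in-memory model of that rule, not the actual key-value schema.

```python
import hashlib
from collections import defaultdict


class DedupIndex:
    """Toy fingerprint index with one hash table per View Box."""

    def __init__(self):
        self.tables = defaultdict(dict)  # view box -> {fingerprint: refs}

    def store(self, view_box, chunk):
        """Record a reference; return True if the chunk was new here."""
        fp = hashlib.sha1(chunk).hexdigest()
        table = self.tables[view_box]
        is_new = fp not in table
        table[fp] = table.get(fp, 0) + 1
        return is_new


index = DedupIndex()
chunk = b"identical bytes"
print(index.store("tenant-a", chunk))  # first copy in tenant-a
print(index.store("tenant-a", chunk))  # deduplicated within tenant-a
print(index.store("tenant-b", chunk))  # stored again: separate scope
```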

Inline vs. post-process compression

Compression in SpanFS is applied after dedup and before persistence to chunk files. Compression mode is configurable per View Box:

Inline compression: Each unique chunk is compressed (typically with an LZ4- or zstd-class algorithm) before it is written to disk. Lower disk usage on first write; small CPU cost on the ingest path.
Post-process compression: Chunks are written uncompressed first (faster ACK), then compressed by a background task. Better latency at the cost of transient capacity.

In practice, inline compression is almost universally enabled for backup workloads — the data is already at RF2 in NVRAM before the client sees the ACK, so the marginal CPU cost is negligible.

Estimating dedupe ratios for different workloads

CCAE candidates are routinely asked to size effective capacity. Use these planning ratios as starting points (always validate with the Cohesity sizing tool):

Workload	Typical Dedup Ratio	Compression Ratio	Combined
VMware VM backups (mixed Windows)	4–6x	1.5–2x	6–12x
Oracle/SQL database fulls	3–5x	1.5–2x	4–10x
Microsoft 365 mailbox / SharePoint	3–5x	1.3–1.5x	4–8x
File shares / SmartFiles	1.5–3x	1.3–2x	2–6x
Already-compressed media (video, ZIP)	~1x	~1x	~1x
Encrypted source data	~1x	~1x	~1x
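The Combined column is the product of the dedup and compression ratios, and effective capacity is usable capacity times that product. The helper names below are hypothetical, and the ratios are the planning figures from the table rather than measured values.

```python
def combined_ratio(dedup, compression):
    """Overall data reduction from dedup followed by compression."""
    return dedup * compression


def effective_tb(usable_tb, dedup, compression):
    """Logical data that fits in a given usable capacity."""
    return usable_tb * combined_ratio(dedup, compression)


# VMware planning range from the table: 4-6x dedup, 1.5-2x compression.
low = effective_tb(100, 4, 1.5)
high = effective_tb(100, 6, 2)
print(f"100 TB usable holds roughly {low:.0f}-{high:.0f} TB of VM backups")
```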

Key Takeaway: SpanFS dedup is variable-length, content-defined, SHA-1 fingerprinted, and global within a View Box. The dedup scope boundary is the View Box, so multi-tenant designs trade dedup efficiency for isolation. Always plan effective capacity using validated workload ratios — never assume the same multiplier across mixed workloads.

Cluster Mechanics and Quorum

A SpanFS cluster is a peer-to-peer system: every node runs the same services and any node can serve any client request. Coordinating peers without a master requires explicit consensus, fault domain awareness, and disciplined upgrade procedures.

Strict consistency and Paxos-based metadata

Unlike eventually-consistent object stores, SpanFS guarantees strict consistency for both data and metadata operations [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]. Reads always observe the most recent durably committed write. Two clients writing to the same offset never see split-brain results.

Strict consistency is implemented via a Paxos-style consensus protocol layered over the distributed key-value store. Metadata writes — directory updates, SnapTree modifications, dedup hash registrations — are committed only when a quorum of replicas acknowledge the change. With three KV replicas per record, a write requires at least two acks.

Node, disk, and chassis fault domains

Cohesity lets architects configure the fault domain at which RF copies and EC fragments must be distributed:

Fault Domain	Description	Required Cluster Shape
Disk	Two copies/fragments may share a node	Default; smallest clusters
Node	Each copy/fragment lands on a different node	Recommended baseline
Chassis	Each copy/fragment lands on a different chassis (block)	Multi-node-per-chassis configurations
Rack	Each copy/fragment lands on a different rack	Stretched / large clusters

For example, a 4-node single-chassis ReadyNode platform cannot meaningfully use chassis-level fault domain — there is only one chassis. Promoting fault-domain to chassis on such a cluster effectively forces the cluster into a state where it cannot place EC stripes. Architects must size for the fault domain count, not just node count.
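The fault-domain constraint reduces to a counting check: a scheme needs at least one distinct fault domain per copy or fragment. The `can_place` helper is a hypothetical illustration that checks placement width only; it does not model the documented minimum cluster sizes.

```python
def can_place(width, domains):
    """One distinct fault domain is needed per copy or fragment."""
    return domains >= width


WIDTHS = {"RF2": 2, "RF3": 3, "EC 2:1": 3, "EC 4:2": 6}
cluster = {"node": 4, "chassis": 1}  # the 4-node, single-chassis example

for level, count in cluster.items():
    ok = [name for name, width in WIDTHS.items() if can_place(width, count)]
    print(f"fault domain = {level} ({count}): placeable schemes {ok}")
```

At the chassis level the list is empty, which is the "cannot place" state described above.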

Quorum loss scenarios and recovery

A SpanFS cluster requires a majority quorum of nodes for the management plane (Apollo / Iris) and for metadata writes. The classic scenarios:

3-node cluster, 1 node down: 2 of 3 nodes online → quorum maintained. Cluster operates with reduced redundancy; rebuilds proceed if RF2.
3-node cluster, 2 nodes down: 1 of 3 online → quorum lost. Cluster goes read-only or offline; manual intervention required.
6-node cluster, 2 nodes down: 4 of 6 online → quorum maintained; EC 4:2 tolerates both failures, even if the second occurs before the rebuild from the first completes.
Network partition splitting cluster 3/3: neither side has majority → both sides go offline to prevent split-brain. Operator must manually bring the designated side back.

This is why even-numbered cluster sizes are gently discouraged: a 4-node cluster splitting 2/2 has no automatic majority. Most production deployments use a 4-node minimum but are architected for failure scenarios assuming odd-numbered effective fault domains.
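The quorum scenarios above, and the two-of-three acknowledgement rule for metadata writes, are all instances of the same strict-majority test. The function names in this sketch are hypothetical.

```python
def majority(total):
    """Smallest count that is strictly more than half of total."""
    return total // 2 + 1


def has_quorum(online, total):
    return online >= majority(total)


def partition_survivors(sides):
    """Return the partition sides that can keep running on their own."""
    total = sum(sides)
    return [side for side in sides if has_quorum(side, total)]


print(majority(3))                  # acks needed with three KV replicas
print(has_quorum(2, 3))             # 3-node cluster, 1 node down
print(has_quorum(1, 3))             # 3-node cluster, 2 nodes down
print(partition_survivors([3, 3]))  # even split: nobody has a majority
print(partition_survivors([3, 2]))  # odd cluster: one side survives
```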

Figure 2.4: Cluster consistency states and quorum transitions

stateDiagram-v2
    [*] --> Healthy
    Healthy: Healthy<br/>All nodes online<br/>full quorum + RF/EC
    Degraded: Degraded<br/>Node/disk down<br/>quorum maintained
    Rebuilding: Rebuilding<br/>Background reconstruction<br/>fragments restored
    QuorumLoss: Quorum Loss<br/>Majority unreachable<br/>read-only / offline
    Recovery: Recovery<br/>Manual intervention<br/>partition resolved

    Healthy --> Degraded: failure within tolerance
    Degraded --> Rebuilding: replacement / spare available
    Rebuilding --> Healthy: rebuild completes
    Degraded --> QuorumLoss: additional failure exceeds tolerance
    Healthy --> QuorumLoss: network partition splits cluster
    QuorumLoss --> Recovery: operator action
    Recovery --> Degraded: quorum restored
    Recovery --> Healthy: all nodes rejoined

Rolling upgrades and maintenance mode

SpanFS upgrades are rolling: one node at a time is placed in maintenance mode, drained of leadership and active VIPs, upgraded, rebooted, and brought back into the cluster before the next node is touched. During the window when a node is in maintenance:

Quorum tightens (must be maintained by the remaining nodes).
Effective resiliency drops by one (a 2-failure scheme tolerates one further failure during the upgrade, not two).
Inflight client connections are migrated to peer nodes via VIP failover.

For RF2 clusters, this means an upgrade temporarily reduces fault tolerance to zero. For mission-critical systems, RF3 or EC 4:2 is preferred so that maintenance plus an unplanned failure does not equal data loss.

Key Takeaway: SpanFS uses Paxos-based strict consistency, requires majority quorum for the management plane, and lets you configure fault domains (disk/node/chassis/rack). Plan for fault-domain count, not just node count; rolling upgrades temporarily reduce fault tolerance by one, so RF2 clusters become single-failure-vulnerable during upgrade windows.

Worked Example: Tracing a 1 MB Write Through SpanFS

Let’s tie everything together by following a single 1 MB write from a backup proxy into a 6-node Cohesity cluster configured with View Box policy EC 4:2 (post-process), inline dedup, inline compression, fault domain = node.

Step 1 — Client connection. The backup proxy opens an NFS mount to the cluster’s protection VIP. SmartDNS resolves the VIP to Node 3. The proxy issues an NFSv3 WRITE for offset 0, length 1,048,576 of /backup/vm-disk-001.vmdk.

Step 2 — Bridge service receives. Node 3’s Bridge service authenticates the request, identifies the target View Box, and hands the byte stream to the IO Engine.

Step 3 — IO Engine chunks the data. The IO Engine runs Rabin fingerprinting across the 1 MB. Suppose this produces 16 variable-length chunks averaging ~64 KiB each.

Step 4 — Dedup lookup. For each chunk, the IO Engine computes a SHA-1 fingerprint and queries the global hash table in the distributed KV store. Suppose 12 of the 16 chunks already exist (this VM has been backed up before); only 4 are unique.

Step 5 — Compress unique chunks. The 4 unique chunks are compressed with LZ4. Assume an average 1.6x compression ratio, yielding ~160 KiB of compressed unique data.

Step 6 — NVRAM journal (RF2). Node 3 appends the 4 compressed chunks plus their metadata pointers to its NVRAM log. The log entry is mirrored synchronously to Node 5’s NVRAM. Once both NVRAMs ack, Node 3 sends the NFS WRITE reply back to the proxy. End of synchronous path — total latency on the order of single-digit milliseconds.

Step 7 — SnapTree update. The IO Engine commits a SnapTree transaction adding 16 chunk references to the blob file representing vm-disk-001.vmdk. The transaction is quorum-committed across three KV replicas (Nodes 1, 3, and 5).

Step 8 — Background destage. Within seconds, the destage task flushes the NVRAM journal entries into chunk files on the SSD/HDD tier of Nodes 3 and 5. The chunks are now persistent on the disk tier at RF2.

Step 9 — Reference count update. For the 12 deduplicated chunks, the dedup hash table increments their reference counts, recording that this new blob also references them.

Step 10 — Post-process EC. Hours later, garbage collection identifies the chunk group containing these chunks as “cold.” It re-encodes them under EC 4:2: 4 data fragments and 2 parity fragments are computed and distributed across Nodes 1, 2, 3, 4, 5, and 6 — one fragment per node, satisfying the node-level fault domain. The original RF2 copies are released, and capacity is reclaimed.

Step 11 — Steady state. With no data reduction, the 1 MB write would have consumed 2 MB at RF2 (the data plus one mirror copy). After dedup and compression it landed as roughly 320 KiB at RF2 (160 KiB of unique compressed data, mirrored once), and after post-process EC it consumes roughly 240 KiB (160 KiB plus 50% EC parity overhead; the 12 deduplicated chunks consume nothing additional). That is the compounding effect of dedup, compression, and EC across a backup retention window.
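The arithmetic behind the trace can be re-derived from the stated assumptions (16 chunks, 4 unique, 1.6x compression, EC 4:2). The function name and dictionary keys are hypothetical; the RF2 landing figure is computed here as the compressed unique data mirrored once.

```python
def trace_footprint(logical_kib=1024, chunks=16, unique=4,
                    compression=1.6, ec_data=4, ec_parity=2):
    """Physical footprint in KiB at each stage of the worked example."""
    avg_chunk = logical_kib / chunks
    compressed = unique * avg_chunk / compression
    return {
        "avg_chunk_kib": avg_chunk,
        "unique_compressed_kib": compressed,
        "rf2_without_reduction_kib": logical_kib * 2,
        "rf2_landed_kib": compressed * 2,
        "ec_steady_state_kib": compressed * (ec_data + ec_parity) / ec_data,
    }


for stage, kib in trace_footprint().items():
    print(f"{stage}: {kib:.0f}")
```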

Chapter Summary

SpanFS is the foundation that makes every other Cohesity feature possible. Its write path is a textbook journal-then-checkpoint design: NVRAM absorbs random writes with mirrored, low-latency durability; the IO Engine performs variable-length Rabin chunking, SHA-1 dedup, and inline compression; SnapTree indexes everything in a copy-on-write B+ tree atop a strictly consistent distributed key-value store. Because every node is a peer, the cluster has no single point of failure — but it depends on Paxos-style consensus and configured fault domains to maintain that property under stress.

Resiliency is layered. RF2 provides single-failure tolerance with low latency at high capacity overhead; RF3 doubles fault tolerance but stores three full copies (200% overhead); EC schemes (2:1, 4:2, 6:2) trade encoding CPU for dramatically lower overhead. Cohesity’s two-stage pipeline (RF2 inline for hot data, EC in the background for cold data) captures both benefits, but architects must size capacity assuming RF2 for recently ingested workloads and account for fault-domain count rather than raw node count. View Box-scoped global deduplication, inline compression, and per-workload policy choice shape the effective capacity multiplier the customer ultimately sees.

For the CCAE exam, the recurring traps are: assuming dedup spans tenants when it does not; sizing EC efficiency without accounting for transient RF2 overhead; ignoring quorum math on small or even-numbered clusters; promoting fault domain to a level the cluster shape cannot satisfy; and forgetting that rolling upgrades temporarily reduce effective fault tolerance by one. Internalize the write-path trace and the RF/EC trade-off table, and most cluster-mechanics questions become straightforward applications of those fundamentals.

Key Terms

Term	Definition
Chunk file	Smallest persisted unit of user data in SpanFS, holding a single deduplicated, compressed chunk.
Blob file	Logical container that aggregates chunk references for a single object (VM disk, file, datafile).
NVRAM	Battery- or flash-backed SSD region used as a mirrored journal for low-latency, durable write ACKs.
View Box	Logical storage domain that defines deduplication scope, resiliency policy, encryption, QoS, and tiering.
Replication Factor (RF)	Number of full copies of each chunk group SpanFS stores; RF2 tolerates 1 failure, RF3 tolerates 2.
Erasure Coding (EC)	Reed-Solomon-style scheme storing data + parity fragments across fault domains for higher efficiency.
Quorum	Majority of cluster nodes/replicas required for metadata writes and management-plane operations.
Fault domain	Configured failure boundary (disk, node, chassis, rack) across which RF copies and EC fragments must spread.
Strict consistency	Guarantee that reads always observe the most recent durably committed write; enforced via Paxos-style consensus.

Chapter 3: Cluster Design, Sizing, and Capacity Planning

If Chapters 1 and 2 explained what Cohesity is and how SpanFS holds data together, this chapter answers the question every architect actually has to answer in a customer meeting: “How big should the cluster be, and what nodes do I buy?” Sizing a Cohesity cluster is part arithmetic, part workload psychology, and part risk management. Get it right and the cluster runs comfortably for three to five years with linear expansion. Get it wrong and you either over-spend by 40% on day one, or you starve the cluster of headroom and watch backup windows blow out within eighteen months.

This chapter walks through the full sizing pipeline: profiling workloads, feeding inputs to the Cohesity Sizer, choosing among the C4000, C5000, C5200, and C6000 ReadyNode families, and planning capacity over a multi-year horizon. We’ll work through concrete examples (including the 500 TB FETB scenario architects encounter constantly on the CCAE exam) and surface the gotchas that separate a passing answer from a great one.

Learning Objectives

By the end of this chapter you will be able to:

Design Cohesity clusters that meet specific RPO, RTO, and retention SLAs derived from a customer’s workload profile.
Apply Cohesity sizing tools and capacity formulas to translate Front-End TB (FETB), change rate, and retention into raw and usable cluster capacity.
Plan node count, fault tolerance margin (RF vs. EC), and growth headroom (N+1, year-3 horizon) for production deployments.
Justify hardware family selection (C4000 hybrid, C5000/C5200 performance, C6000 dense, all-flash variants) based on the workload mix and SLA priority.
Build a capacity plan that survives technology refresh cycles, tiering decisions, and growth shocks.

Workload Profiling Inputs

Sizing always starts with profiling — building a quantitative picture of the data you have to protect. A Cohesity cluster doesn’t care whether the source is “important”; it cares about how many bytes arrive, how often they change, how long they live, and how fast they have to come back. Skip this step and every later number is a guess.

Front-End TB (FETB), Change Rate, and Retention

Three numbers dominate any backup sizing conversation. Memorize them in this order, because every other metric flows downstream:

Front-End TB (FETB) — the total source-side data footprint of the workloads you intend to protect, measured before Cohesity touches it. FETB is the licensing unit and the headline number on every quote. Think of FETB like the headline number on a car spec sheet — the “0 to 60” of backup sizing. It tells you the order of magnitude of the deal but it doesn’t tell you how the cluster will actually behave under load. [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]
Change Rate — the percentage of FETB that mutates between successive backups. Typical values: 1–3% daily for VM and NAS workloads; 5–15% daily for transactional databases; up to 30% daily for log-heavy systems if you protect transaction logs as part of the daily ingest. Change rate drives the incremental ingest size and therefore the back-end TB consumed per retention day.
Retention — how long each recovery point must remain on the cluster. A typical Grandfather-Father-Son (GFS) policy might be 30 daily / 12 weekly / 12 monthly / 7 yearly. Retention transforms a single-day ingest number into a multi-month cumulative consumption number, and retention is where small assumptions blow up into large capacities.

The fundamental back-end math:

BETB ≈ FETB × full_compression_factor + (FETB × change_rate × retention_days × incremental_factor)
Effective_BETB = BETB / (dedupe_ratio × compression_ratio)

In practice the Sizer computes this per-workload and aggregates, because dedupe and compression ratios differ by source type. A 1% change rate sounds small until you multiply by 365 days of retention and realize you’ve stored 3.65 full copies’ worth of incremental change, or 4.65 copies once the initial full is counted. [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]
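
As a rough illustration of the back-end formula, the sketch below computes the retained incremental volume and the resulting BETB for a single workload. It is not the Sizer's implementation: the function names are hypothetical, `full_factor` and `incremental_factor` stand in for the formula's adjustment terms and default to 1.0, and the 3:1 dedupe and 1.5:1 compression values are placeholder assumptions.

```python
def incremental_full_equivalents(change_rate, retention_days):
    """Retained incremental change, expressed in multiples of FETB."""
    return change_rate * retention_days

def logical_backend_tb(fetb_tb, change_rate, retention_days,
                       full_factor=1.0, incremental_factor=1.0):
    """BETB before reduction: one full plus all retained incrementals."""
    full = fetb_tb * full_factor
    multiples = incremental_full_equivalents(change_rate, retention_days)
    incrementals = fetb_tb * multiples * incremental_factor
    return full + incrementals

def effective_betb_tb(betb_tb, dedupe_ratio, compression_ratio):
    """Apply the combined reduction factor to the logical back-end volume."""
    return betb_tb / (dedupe_ratio * compression_ratio)

# 100 TB FETB, 1% daily change, 365 days retained.
betb = logical_backend_tb(fetb_tb=100, change_rate=0.01, retention_days=365)
print(f"incremental fulls: {incremental_full_equivalents(0.01, 365):.2f}")
print(f"logical BETB: {betb:.0f} TB")
print(f"effective BETB: {effective_betb_tb(betb, 3.0, 1.5):.1f} TB")
```

For 100 TB of FETB this yields 3.65 fulls' worth of incrementals on top of the initial full, so 465 TB of logical data before any reduction is applied.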

Figure 3.1: Sizing input-to-output flow (FETB, change rate, retention, dedup → usable capacity → node count)

flowchart TD
    A[FETB by Workload] --> E[Initial Full Backend]
    B[Daily Change Rate] --> F[Incremental Volume]
    C[Retention Policy<br/>D/W/M/Y] --> G[Cumulative Retention Multiplier]
    D[Dedupe + Compression Ratios] --> H[Reduction Factor]
    E --> I[Effective BETB Required]
    F --> I
    G --> I
    H --> I
    I --> J[Apply Growth + N+1 Headroom]
    J --> K[Target Usable Capacity]
    K --> L{Choose EC Scheme}
    L --> M[Compute Raw Capacity]
    M --> N[Select ReadyNode Family]
    N --> O[Final Node Count]

Key Takeaway: FETB is the licensing input, change rate is the daily multiplier, and retention is the time multiplier. All three must be known with reasonable accuracy before any capacity number can be trusted. Treat customer-provided change rates with healthy skepticism — measure when possible, or pad by 20%.

Workload Categorization: VM, NAS, DB, Physical

Not all FETB is equal. The Sizer expects you to break the total down by workload type because each category has different reduction ratios, different ingest patterns, and different performance demands.

Workload Type	Typical Daily Change	Typical Reduction Ratio (dedupe + compression)	Sizing Notes
VMware / Hyper-V / AHV VMs	1–3%	6:1 to 10:1	CBT/RCT incremental forever; OS dedupe is excellent across similar VMs
Physical servers (Linux/Windows agent)	1–5%	3:1 to 5:1	Lower dedupe than VMs because OS images are not as templated
NAS (SMB/NFS/NDMP)	1–3%	2:1 to 3:1	Mixed media (images, video, archives) compress poorly
Databases (Oracle/SQL/HANA)	5–15% (data) + log	3:1 to 5:1	Frequent log backups for low RPO; engine-native compression reduces additional dedupe gains
Microsoft 365	1–2%	2:1 to 3:1	API-rate-limited rather than throughput-limited
Object/S3 sources	varies	1.5:1 to 2:1	Often pre-compressed; treat as low-reduction

[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf] [Source: https://www.cohesity.com/glossary/data-deduplication/]

100 TB of FETB from homogeneous Windows VMs might land on disk as 12 TB after dedupe and compression. The same 100 TB from a video-production NAS might consume 45 TB. Mixing workloads dilutes the average reduction ratio, so when a customer adds a new workload class to an existing cluster, re-run the Sizer rather than extrapolating.

Figure 3.2: Workload categorization and typical dedupe + compression ratios

graph TD
    W[Workload Categorization] --> VM[VMware / Hyper-V / AHV VMs]
    W --> PHY[Physical Servers]
    W --> NAS[NAS SMB/NFS/NDMP]
    W --> DB[Databases<br/>Oracle/SQL/HANA]
    W --> M365[Microsoft 365]
    W --> OBJ[Object / S3 Sources]
    VM --> VMR[6:1 to 10:1<br/>High OS template dedupe]
    PHY --> PHYR[3:1 to 5:1<br/>Lower OS reuse]
    NAS --> NASR[2:1 to 3:1<br/>Mixed media]
    DB --> DBR[3:1 to 5:1<br/>Engine-compressed]
    M365 --> M365R[2:1 to 3:1<br/>API rate-limited]
    OBJ --> OBJR[1.5:1 to 2:1<br/>Pre-compressed]

Daily Ingest, Replication, and Cloud Archive Throughput

Capacity is only half the sizing problem. The other half is flow: how many bytes per hour must traverse network, NVRAM, and SpanFS during the backup window?

The headline formula:

Required_Ingest_Throughput (TB/hr) = (FETB × change_rate) / backup_window_hours

Worked example. A customer has 500 TB of FETB, 2% daily change rate, an 8-hour overnight window. Daily incremental volume = 10 TB. Required ingest = 10 / 8 = 1.25 TB/hr, which is comfortably within a 4-node C5000 cluster’s headroom. Now suppose the customer wants the first full backup completed in the same 8-hour window: 500 / 8 = 62.5 TB/hr, which would require the seed to be staged differently (parallelized first-fulls, longer initial seed window, or a multi-night soft-rollout). Architects routinely miss the “first-full vs. steady-state” distinction in sizing.

Replication throughput must be sized separately. If the customer replicates daily incrementals (10 TB) over a 24-hour window to a DR cluster, required egress = ~0.42 TB/hr ≈ 930 Mbps before any dedupe or compression on the wire. WAN sizing in Chapter 9 builds on this same arithmetic. CloudArchive throughput is bounded by the object-storage endpoint and the cluster’s egress NIC; typical sizings assume 200–800 MB/s sustained per node depending on object size and parallelism.
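
The three throughput figures in this section (steady-state ingest, first-full ingest, and replication egress) come from the same division. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and hypothetical helper names:

```python
def tb_per_hour(data_tb, window_hours):
    """Average rate needed to move data_tb within the window."""
    return data_tb / window_hours

def tb_per_hour_to_mbps(rate_tb_per_hr):
    """Convert TB/hr to megabits per second using decimal units."""
    bits_per_hour = rate_tb_per_hr * 1e12 * 8
    return bits_per_hour / 3600 / 1e6

fetb_tb, change_rate = 500, 0.02
daily_incremental_tb = fetb_tb * change_rate            # 10 TB per day

steady_state = tb_per_hour(daily_incremental_tb, 8)     # nightly incrementals
first_full = tb_per_hour(fetb_tb, 8)                    # seeding in one window
replication = tb_per_hour(daily_incremental_tb, 24)     # spread over a full day

print(f"steady-state ingest: {steady_state:.2f} TB/hr")
print(f"first-full ingest:   {first_full:.1f} TB/hr")
print(f"replication egress:  {replication:.2f} TB/hr "
      f"= {tb_per_hour_to_mbps(replication):.0f} Mbps")
```

Without intermediate rounding the replication rate works out to roughly 926 Mbps; rounding the hourly rate to 0.42 TB first gives a slightly higher figure. Either way the number describes logical data, so any reduction applied on the wire lowers the actual WAN requirement.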

Performance vs. Capacity-Driven Sizing

Cohesity sizings collapse into one of two regimes:

Capacity-driven sizing — when retention × FETB dwarfs ingest throughput, the cluster’s physical disk count is the constraint. Choose dense nodes (C6000), wide EC stripes (6:2), and grow the cluster by adding capacity. Typical fingerprint: long retention (≥1 year), stable workload, modest RPO, modest RTO.
Performance-driven sizing — when ingest throughput, restore SLA, or instant-mount concurrency dominates, the cluster’s CPU/NVRAM/flash count is the constraint. Choose performance or all-flash nodes (C5000/C5200/C6200), narrower EC or RF=2, and grow by adding nodes. Typical fingerprint: short RTO (instant VM mount), database log-shipping at high cadence, large numbers of concurrent restores.

The Sizer detects which regime applies and outputs node count to satisfy both. When in doubt, build for the more demanding axis and let the less-demanding axis benefit. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]

Key Takeaway: Profile workloads by category, change rate, retention, and SLA. Decide whether the design is capacity- or performance-driven before choosing nodes. The same FETB can require very different cluster shapes depending on which axis dominates.

Sizing Tools and Calculators

Cohesity provides a Sizer (the partner/SE-facing planning tool) and a published set of formulas that let you sanity-check or hand-build a sizing in a whiteboard session. Both are valid; the CCAE expects you to understand the formulas and to know when the Sizer’s defaults need to be overridden.

Cohesity Sizing Tool Inputs and Outputs

The Sizer is a structured calculator that takes a workload profile and returns a recommended ReadyNode model, node count, and capacity projection over a multi-year horizon. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]

Sizer Input	Purpose	Typical Default
Total FETB by workload type	Capacity baseline	Customer-provided
Annual growth rate	Multi-year extrapolation	10–25%
Daily change rate by workload	Incremental ingest math	2% (VM), 5% (DB), 1.5% (NAS)
Retention policy (D/W/M/Y)	Cumulative storage demand	30 / 12 / 12 / 7 GFS
RPO (backup frequency)	Backup window concurrency	24h
RTO (recovery target)	Performance node selection	Hours (hybrid) or minutes (all-flash)
Dedupe ratio assumption	Effective capacity	2:1–4:1 by workload
Compression ratio assumption	Effective capacity	1.5:1–2:1
Replication target	Secondary cluster sizing	Optional
Archive target	CloudArchive sizing	Optional
EC scheme or RF	Usable capacity efficiency	EC 4:2 or 6:2

The Sizer outputs a recommended ReadyNode (e.g., C5066 hybrid), a minimum node count satisfying both EC and N+1 constraints, year-1 / year-3 / year-5 capacity projections, and the SKU-level licensing recommendation (DataProtect, SmartFiles, FETB Plus tier). [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]

Effective Capacity, Dedupe Assumptions, and Overheads

The chain of capacity transformations from “raw disk on a pallet” to “FETB I can sell” looks like this:

Raw Capacity              (sum of all HDD/NVMe across all nodes)
   ↓ × (D / (D+P)) for EC, or × (1/RF) for replication
Usable Capacity           (after resiliency overhead)
   ↓ − ~5–10% reserved
Available Capacity        (after metadata, snapshots, rebuild reserve)
   ↓ × dedupe × compression
Effective Capacity        (the number on the data sheet)
   ↓ ÷ retention multiplier
Protectable FETB          (the number on the sales quote)

A worked example. A 10-node cluster of C6000 dense nodes at 192 TB raw per node = 1,920 TB raw. EC 6:2 yields 1,920 × (6/8) = 1,440 TB usable. Subtract 7% for metadata/rebuild reserve = ~1,340 TB available. Apply a typical 4.5x combined reduction for VM workloads = ~6,030 TB effective. Divide by an effective retention multiplier of, say, ~12 (which captures 30 daily incrementals + 12 weekly + 12 monthly + 7 yearly with realistic incremental sizes) = ~500 TB protectable FETB. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/] [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]
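
The same chain can be written as a small function, which makes it easy to see how each assumption moves the final number. This is a sketch of the hand calculation only; the function and key names are hypothetical, and the 7% reserve, 4.5x reduction, and retention multiplier of 12 are the worked example's assumptions.

```python
def capacity_chain(nodes, raw_tb_per_node, ec_data, ec_parity,
                   reserve_fraction, reduction_ratio, retention_multiplier):
    """Walk raw capacity down to protectable FETB, one stage at a time."""
    raw = nodes * raw_tb_per_node
    usable = raw * ec_data / (ec_data + ec_parity)     # resiliency overhead
    available = usable * (1 - reserve_fraction)        # metadata/rebuild reserve
    effective = available * reduction_ratio            # dedupe x compression
    protectable = effective / retention_multiplier     # retention copies
    return {"raw": raw, "usable": usable, "available": available,
            "effective": effective, "protectable_fetb": protectable}

stages = capacity_chain(nodes=10, raw_tb_per_node=192, ec_data=6, ec_parity=2,
                        reserve_fraction=0.07, reduction_ratio=4.5,
                        retention_multiplier=12)
for stage, tb in stages.items():
    print(f"{stage:>16}: {tb:,.1f} TB")
```

The unrounded results (about 1,339 TB available, 6,026 TB effective, 502 TB protectable) differ from the text only by intermediate rounding.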

Figure 3.3: Capacity transformation stack (Raw → Usable → Available → Effective → Practical Ceiling → Protectable FETB)

flowchart LR
    R[Raw Capacity<br/>Sum of all HDD/NVMe] -->|"× D/(D+P) for EC<br/>or × 1/RF"| U[Usable Capacity<br/>After resiliency]
    U -->|"− 5–10%<br/>metadata + rebuild"| A[Available Capacity]
    A -->|"× dedupe<br/>× compression"| E[Effective Capacity]
    E -->|"− N+1 reserve<br/>− 20% headroom"| RES[Practical Ceiling]
    RES -->|"÷ retention multiplier"| F[Protectable FETB]

The compounding here is unforgiving. A 20% optimistic dedupe assumption multiplied by a 20% optimistic retention multiplier produces a 44% capacity miss. Validate dedupe with a pilot whenever the deal is large enough to justify the calendar time.

RF vs. EC capacity comparison table — useful for back-of-envelope checks:

Scheme	Min Nodes	Failures Tolerated	Storage Overhead	Usable / Raw
RF 2	3	1	100%	50.0%
RF 3	4	2	200%	33.3%
EC 2:1	3	1	50%	66.7%
EC 4:1	5	1	25%	80.0%
EC 4:2	6	2	50%	66.7%
EC 5:2	7	2	40%	71.4%
EC 6:2	8	2	33%	75.0%

Note that EC 6:2 delivers the same fault tolerance as RF 3 (2 failures) at less than half the storage cost — but you need at least 8 nodes to use it. Cluster size unlocks EC efficiency, which is one of the most important architectural levers in Cohesity sizing. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]
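
The overhead and usable columns in the table follow directly from the data and parity fragment counts, so they can be recomputed rather than memorized. A minimal sketch with hypothetical function names; the minimum node counts for RF come from the table and are not derived here.

```python
def ec_profile(data, parity):
    """Derive the table columns for an EC D:P scheme."""
    total = data + parity
    return {
        "min_nodes": total,                    # one fragment per node
        "failures_tolerated": parity,
        "overhead_pct": 100 * parity / data,   # parity relative to data
        "usable_pct": 100 * data / total,
    }

def rf_profile(copies):
    """Derive the overhead columns for replication factor RF=copies."""
    return {
        "failures_tolerated": copies - 1,
        "overhead_pct": 100 * (copies - 1),
        "usable_pct": 100 / copies,
    }

for d, p in [(2, 1), (4, 1), (4, 2), (5, 2), (6, 2)]:
    row = ec_profile(d, p)
    print(f"EC {d}:{p}  min nodes {row['min_nodes']}  "
          f"overhead {row['overhead_pct']:.0f}%  usable {row['usable_pct']:.1f}%")

# Raw TB needed per usable TB: EC 6:2 versus RF 3.
ratio = (100 / ec_profile(6, 2)["usable_pct"]) / (100 / rf_profile(3)["usable_pct"])
print(f"EC 6:2 needs {ratio:.0%} of the raw capacity RF 3 needs")
```

The last line quantifies the comparison in the note above: for the same usable capacity, EC 6:2 needs about 44% of the raw capacity that RF 3 does.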

Sizing for SmartFiles Primary Workloads

SmartFiles changes the sizing calculation in two ways. First, the workload is primary rather than backup, so reduction ratios are typically lower (mixed media, no dedupe-friendly OS templates). Second, performance characteristics matter more — clients hit views directly via SMB/NFS/S3, so latency and IOPS feed into node selection.

For SmartFiles you size based on:

Hot vs. cold data ratio (drives tiering)
Concurrent user/share count (drives node count)
Object vs. file vs. mixed access (drives protocol choice)
QoS requirements per view (drives all-flash vs. hybrid)

A typical SmartFiles sizing assumes 2:1 to 3:1 reduction and a usable-capacity target of ~70% of available (leaving 30% headroom because primary workloads are less tolerant of full-cluster behavior than backup workloads). [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]

Cloud Edition and CloudArchive Sizing

Cloud Edition (CE) deploys SpanFS on cloud VMs (AWS EC2, Azure VMs, GCP), and sizing must account for the constraints of the underlying cloud:

Instance types determine NVMe and bandwidth caps; sizings cite specific instance SKUs (e.g., AWS i3en.6xlarge for performance tier)
Object-storage backing for capacity tier (S3, Azure Blob, GCS) adds latency and per-request cost
Egress fees are nonlinear; sizing must include estimated monthly egress for replication or recovery
Minimum cluster size (typically 3 nodes) imposes a fixed monthly floor cost

CloudArchive sizing is comparatively simple — it’s an external object-storage target receiving compressed/encrypted snapshots. Size for: cumulative archived data × storage class price + estimated monthly recall + ingestion bandwidth. Glacier and Azure Archive offer 80–90% cost reduction over hot tiers but introduce 4–12 hour rehydration latency that must align with the customer’s restore SLA. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]

Key Takeaway: The Sizer is a structured wrapper around the same arithmetic you can do by hand. Validate dedupe assumptions, always include retention multipliers, and remember that EC efficiency improves dramatically with cluster size — small clusters pay an inevitable resiliency tax.

Node Selection and Cluster Topology

With a target effective capacity and ingest throughput in hand, the next decision is which physical nodes to deploy. Cohesity ships three main ReadyNode families plus all-flash variants, and each is optimized for a different point on the capacity-vs-performance curve.

All-Flash vs. Hybrid vs. Dense Storage Nodes

Family	Form Factor	CPU	Storage	Power (avg/peak)	Optimized For
C4000	2U single	1× Xeon 8-core 2.1 GHz	8× HDD slots + NVMe metadata tier	Modest	Entry/edge, branch
C5066 (hybrid)	2U	1× Xeon 16-core 2.4 GHz, 128 GB RAM	54 TB HDD + 3.2 TB NVMe per node	605–895W avg / 1690W peak	Performance backup
C5200 (perf/all-flash)	2U, 4-node block	5th-Gen Xeon 16-core 2.0 GHz, PCIe Gen5	216 TB HDD + 12.8 TB flash per block (or all-NVMe)	605–895W avg / 1690W peak	Density + performance
C6000 (dense)	Dense 2U	Sized for streaming I/O	Up to 168–192 TB raw HDD/node + flash	~605W avg / 650W peak	Long-term retention, archive
C6200 (all-flash dense)	2U	Newer all-flash variant	All NVMe	~440W avg	High-perf retention with low power

[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/Cohesity-C4000-Datasheet.pdf] [Source: https://www.cohesity.com/resources/datasheet/c5000-data-sheet/] [Source: https://www.cohesity.com/resources/datasheet/c5200-data-sheet/] [Source: https://www.cohesity.com/resources/datasheet/cohesity-c6000-series-high-density-converged-nodes/] [Source: https://www.networld.co.jp/product_file/file/Cohesity_C6000_.pdf]

Selection rules of thumb:

C4000 when: edge/ROBO site, ≤4 nodes, mostly capacity, low concurrency. Lowest TCO entry point but limited CPU headroom. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/Cohesity-C4000-Datasheet.pdf]
C5000/C5066 when: mainstream datacenter backup, 6+ nodes, mixed workloads, balanced capacity and ingest. The default choice for ~70% of CCAE-style scenarios.
C5200 when: performance density per RU matters, customer wants 4-node-per-2U efficiency, or the design requires PCIe Gen5 NVMe throughput. Common in modernization/refresh deals.
C6000 when: retention dominates (≥1 year archive on cluster), TB-per-watt and TB-per-RU are deciding factors, restore SLA is hours rather than minutes. The classic “deep retention” node.
All-flash (C5200 / C6200) when: instant mass restore at scale, CloneRefresh pipelines, sub-minute RTO, analytics on backup data, MongoDB/Cassandra restore SLAs. Highest $/TB but the only option that satisfies aggressive RTO across many simultaneous mounts. [Source: https://www.cohesity.com/platform/c6000/]

Figure 3.4: ReadyNode selection decision tree

flowchart TD
    S[Start: Workload + SLA Profile] --> Q1{Edge / ROBO site<br/>≤ 4 nodes?}
    Q1 -->|Yes| C4[C4000<br/>Entry / Edge Hybrid]
    Q1 -->|No| Q2{Sub-minute RTO<br/>or instant mass restore?}
    Q2 -->|Yes| Q3{Density per RU<br/>also matters?}
    Q3 -->|Yes| C6200[C6200<br/>All-Flash Dense]
    Q3 -->|No| C5200AF[C5200 All-Flash<br/>PCIe Gen5 NVMe]
    Q2 -->|No| Q4{Retention dominates<br/>≥ 1 year on cluster?}
    Q4 -->|Yes| C6000[C6000<br/>Dense Hybrid]
    Q4 -->|No| Q5{4-node-per-2U<br/>density required?}
    Q5 -->|Yes| C5200H[C5200 Hybrid<br/>Performance Density]
    Q5 -->|No| C5066[C5066<br/>Mainstream Hybrid<br/>Default for ~70% of designs]
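
For readers who prefer code to diagrams, the decision tree in Figure 3.4 can be expressed as a chain of checks in the same order. This is only a transcription of the figure, not a Cohesity tool; the function and parameter names are hypothetical, and the first question is interpreted as "edge/ROBO site and at most 4 nodes".

```python
def select_readynode(node_count, *, edge_site=False, sub_minute_rto=False,
                     density_per_ru=False, retention_dominates=False,
                     block_density=False):
    """Follow Figure 3.4 from top to bottom and return the suggested family."""
    if edge_site and node_count <= 4:
        return "C4000"
    if sub_minute_rto:                       # includes instant mass restore
        return "C6200" if density_per_ru else "C5200 all-flash"
    if retention_dominates:                  # one year or more kept on cluster
        return "C6000"
    if block_density:                        # 4-node-per-2U requirement
        return "C5200 hybrid"
    return "C5066"                           # mainstream hybrid default

print(select_readynode(3, edge_site=True))
print(select_readynode(12, sub_minute_rto=True, density_per_ru=True))
print(select_readynode(10, retention_dominates=True))
print(select_readynode(8))
```

The order of the checks matters: an RTO requirement is evaluated before retention, so a design with both aggressive RTO and long retention lands on an all-flash family.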

Minimum Cluster Sizes and Scaling Increments

The minimum supported cluster size is 3 nodes for production deployments (Robo Edition can run with fewer at smaller scale, with reduced resiliency). The minimum is set by quorum (Chapter 2) and by RF=2 placement rules.

EC schemes constrain minimums further: D + P ≤ node count. The fault-tolerance white paper plots usable capacity by cluster size and EC scheme; the practical implication is:

Cluster Size	Smallest Usable EC	Recommended EC
3 nodes	RF 2 only	RF 2
4–5 nodes	EC 2:1	EC 2:1 or RF 2
6–7 nodes	EC 4:2	EC 4:2 or EC 5:2
8+ nodes	EC 6:2	EC 6:2 (best efficiency at 2-failure tolerance)
12+ nodes	EC 6:2	EC 6:2 (more parallel rebuild domains)

[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]

Scaling beyond the initial cluster is node-by-node, with SpanFS rebalancing chunk files as new nodes come online. Best practice is to add nodes in increments that preserve the EC scheme (e.g., add nodes in pairs if EC 4:2 placement benefits from even balance). Brick-blocks (C5200’s 4-nodes-per-2U) scale in 4-node increments naturally.

Mixed-Node Clusters and Constraints

Cohesity supports heterogeneous clusters — different ReadyNode models (different CPU, capacity, or even families) coexisting in a single cluster, with SpanFS auto-rebalancing data across them. This is the architectural escape hatch for “we need more capacity at year 3 without forklift”: add C6000 dense nodes to an existing C5066 cluster, and SpanFS migrates cold data to the denser nodes while keeping hot data on the performance nodes. [Source: https://www.cohesity.com/platform/c6000/]

Constraints to watch:

All nodes in a cluster must be on a compatible Cohesity software version (rolling upgrade, see Chapter 5)
Mixing all-flash and hybrid nodes produces an effectively tiered cluster — SpanFS routes hot data to flash and cold data to HDD
Per-node storage imbalance (e.g., 192 TB nodes mixed with 36 TB nodes) can lead to capacity-skewed placement; the larger nodes will hit failure-domain constraints first when the smaller nodes fill
Performance is gated by the slowest node class for any given chunk’s I/O: an 8-core C4000 in a cluster of 16-core C5200s will throttle the chunks placed on it

Brick Mode vs. Node Mode Considerations

In standard configuration, each physical node is a single fault domain — a node failure removes one EC stripe member or one RF replica. Brick mode subdivides a single dense node into multiple independent fault domains (“bricks”), each with its own subset of disks and metadata. This is useful when:

A C6000 dense node holds enough TB that losing the entire node would exceed the cluster’s rebuild capacity
The architect wants finer-grained EC stripe placement than per-node placement allows
The cluster has only a few large nodes and would otherwise be unable to satisfy EC node-count requirements

Brick mode trades simplicity for placement flexibility. In production it is uncommon outside of dense-node configurations; most architects accept the per-node fault domain default unless the math forces them off it. [Source: https://www.cohesity.com/resources/datasheet/cohesity-c6000-series-high-density-converged-nodes/]

Key Takeaway: Choose nodes by SLA, not by price. Match C4000 to entry/edge, C5000/C5200 to mainstream datacenter, C6000 to retention-heavy, and all-flash to RTO-driven scenarios. Heterogeneous clusters let you grow capacity-only at year 3 without replacing performance nodes.

Capacity Planning Over Time

A cluster sized for day-one is a cluster that runs out of room in eighteen months. Capacity planning must explicitly model growth, technology refresh, tiering, and reserve capacity, then track those assumptions against reality through Helios reporting.

Modeling Growth and Tech-Refresh Cycles

Standard practice: size for year-3 protected FETB at minimum, ideally year-5. Apply an annual data growth rate (Cohesity sizings typically use 10–25% depending on the customer; financial services and healthcare trend higher, manufacturing lower).

A 5-year compounding model:

Year	FETB (15% YoY growth)	Effective BETB (4.5x reduction, 12x retention factor)
0	500 TB	1,333 TB
1	575 TB	1,533 TB
2	661 TB	1,763 TB
3	760 TB	2,028 TB
4	874 TB	2,332 TB
5	1,005 TB	2,681 TB
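
The table can be regenerated for any growth rate with a short loop. The sketch below uses the table's own assumptions (15% annual growth, 4.5x reduction, retention factor of 12); the function name is hypothetical, and because it compounds without intermediate rounding, individual cells may differ from the table by a terabyte or so.

```python
def growth_projection(fetb_tb, annual_growth, years,
                      reduction_ratio, retention_factor):
    """Return (year, FETB, effective BETB) rows using compound growth."""
    rows = []
    for year in range(years + 1):
        fetb = fetb_tb * (1 + annual_growth) ** year
        betb = fetb * retention_factor / reduction_ratio
        rows.append((year, fetb, betb))
    return rows

projection = growth_projection(500, 0.15, 5,
                               reduction_ratio=4.5, retention_factor=12)
for year, fetb, betb in projection:
    print(f"year {year}: FETB {fetb:7.1f} TB   BETB {betb:7.1f} TB")
```

Changing `annual_growth` to 0.25 shows how quickly the upper end of the typical growth range outruns a day-one sizing.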

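A cluster sized only for the day-one 1,333 TB BETB has no room to grow: even if that figure sits at 80% of cluster capacity (about 1,667 TB), the year-1 demand of 1,533 TB pushes utilization to roughly 92%. The architect must plan the expansion path explicitly. The two viable strategies: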

Buy ahead — deploy enough nodes on day one to satisfy year-3 demand. Higher capex but simpler operations.
Add as needed — deploy a smaller initial cluster and expand annually with additional nodes. Lower capex day one but requires accurate forecasting and predictable supply chain.

Most enterprises blend the two: deploy for year-3 capacity but configure EC and node count such that adding a single node (or a single 4-node block in C5200) at any point shifts headroom by a known amount.

Tech-refresh cycle. Cohesity hardware typically operates on a 5-year refresh, often aligned with the customer’s broader datacenter lifecycle. Refresh strategies include:

Forklift — replace the entire cluster with new nodes, migrate data via replication. Disruptive but clean.
Rolling refresh — add new-generation nodes alongside old, drain old nodes, remove. SpanFS supports this without downtime. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]

Tiering Strategy Across Local, Cloud, and Tape

A well-designed Cohesity cluster doesn’t keep every byte on local NVMe forever. Tiering policies move cold data off expensive media into cheaper storage classes, freeing local capacity for hot recovery points and reducing TCO over multi-year retention windows. Tiering options:

Tier	Media	Latency	Cost	Use Case
Hot	Local NVMe / flash	<1 ms	Highest	Last 7–30 days, instant recovery
Warm	Local HDD	5–10 ms	Medium	30–180 days, routine recovery
Cold (CloudTier)	S3 Standard / Azure Cool	50–200 ms	Low	6–24 months
Archive (CloudArchive)	S3 Glacier / Azure Archive	4–12 hr rehydrate	Lowest	1+ year compliance retention
Tape (rare)	LTO via gateway	Hours	Very low	Air-gap, regulatory archive

The CCAE expects you to know that CloudTier is a transparent capacity extension (cluster manages it as additional storage) while CloudArchive is a logical destination for retention/policy-driven movement. Chapter 10 covers both in depth.

Reserve Capacity for Failures and Rebuilds

Two reserves must always be sized into a cluster:

N+1 capacity reserve — at least one full node’s worth of free space, so SpanFS can redistribute chunks after a node failure without exceeding 100% utilization. For a 10-node cluster of 192 TB nodes, that’s ~192 TB held in reserve. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]
80% utilization ceiling — best practice is to keep cluster utilization below 80% steady-state. Above that threshold, performance degrades, garbage collection contention rises, and rebuild margins shrink. Helios alerts trigger by default at 80% capacity used.

Combined, this means a “1,920 TB raw” cluster realistically targets ~80% × usable capacity − N+1 reserve as its planning ceiling. For a 10-node, 192 TB-per-node, EC 6:2 cluster: usable = 1,440 TB; 80% = 1,152 TB; minus 192 TB N+1 = 960 TB practical usable ceiling (before dedupe and compression). Architects who size to the full 1,440 TB usable number are setting up the customer for failure.
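
A short sketch of the planning-ceiling calculation follows. It mirrors the text's convention of subtracting one node's raw capacity as the N+1 reserve; the function name and the `utilization_ceiling` parameter are hypothetical.

```python
def practical_ceiling_tb(nodes, raw_tb_per_node, ec_data, ec_parity,
                         utilization_ceiling=0.80):
    """Usable capacity at the utilization ceiling, minus a one-node reserve."""
    raw = nodes * raw_tb_per_node
    usable = raw * ec_data / (ec_data + ec_parity)
    n_plus_1_reserve = raw_tb_per_node       # one node's raw capacity
    return usable * utilization_ceiling - n_plus_1_reserve

ceiling = practical_ceiling_tb(nodes=10, raw_tb_per_node=192,
                               ec_data=6, ec_parity=2)
print(f"practical planning ceiling: {ceiling:.0f} TB")
```

Subtracting a full raw node from a usable figure is deliberately conservative. A less strict variant would subtract only that node's usable share (144 TB at EC 6:2), which raises the ceiling to about 1,008 TB.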

Reporting and Forecasting via Helios

Helios provides multi-cluster capacity reporting with trend lines, forecasting, and SLA dashboards. The CCAE-relevant capabilities:

Capacity trend reports — historical growth per cluster, projected exhaustion date based on linear or seasonal trend
Reduction ratio trending — actual dedupe and compression realized per workload, contrasted against sizing assumptions (a sharp drop in dedupe ratio is a leading indicator of workload composition change)
Per-View / Per-Protection-Group consumption — find the workload responsible for unexpected capacity growth
SLA reporting — RPO/RTO compliance per protection group, useful for audit and contract review
Multi-cluster fleet view — fleet-wide capacity for service providers and large enterprises

Helios forecasting converts theoretical capacity planning into operational practice. The architect’s job at deployment is to pick the right starting point; Helios’s job over time is to flag deviations early enough that another node can be ordered before the cluster hits the ceiling. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]

Key Takeaway: Plan capacity for year-3 growth at minimum, reserve N+1 plus 20% headroom, tier cold data to cloud, and use Helios trend reports to catch deviations from your sizing assumptions before they become outages.

Worked Example: 500 TB FETB Sizing Walkthrough

Putting it all together with a CCAE-style scenario: a customer has 500 TB FETB (mixed: 350 TB VMware, 100 TB NAS, 50 TB SQL Server), 3% blended daily change rate, and a 30-day retention policy. Design a cluster.

Step 1: Aggregate ingest math.

Daily incremental = 500 TB × 3% = 15 TB/day
30-day cumulative incremental = ~450 TB (before reduction)
One full FETB sweep at protection-group registration time = 500 TB initial seed

Step 2: Apply blended reduction.

VMware (350 TB) at 7:1 → 50 TB
NAS (100 TB) at 2.5:1 → 40 TB
SQL (50 TB) at 4:1 → 12.5 TB
Initial-full back-end ≈ 102.5 TB
30 days of incrementals at blended ~5:1 → 450 / 5 = 90 TB
Steady-state back-end ≈ ~193 TB
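
Steps 1 and 2 can be reproduced with a short calculation. The sketch below uses only the figures stated in the scenario; the function and variable names are assumptions for illustration, and the reduction ratios are the planning assumptions from the text rather than measured values.

```python
def steady_state_backend_tb(workloads, daily_change_rate, retention_days,
                            incremental_reduction):
    """Return (initial_full_tb, incremental_tb, steady_state_tb) on the back end.

    workloads maps a name to (front-end TB, assumed reduction ratio).
    """
    fetb = sum(tb for tb, _ in workloads.values())
    daily_incremental = fetb * daily_change_rate                  # Step 1
    cumulative_incremental = daily_incremental * retention_days
    initial_full = sum(tb / ratio for tb, ratio in workloads.values())  # Step 2
    incrementals = cumulative_incremental / incremental_reduction
    return initial_full, incrementals, initial_full + incrementals


WORKLOADS = {"VMware": (350, 7.0), "NAS": (100, 2.5), "SQL Server": (50, 4.0)}

if __name__ == "__main__":
    full, incr, total = steady_state_backend_tb(WORKLOADS, 0.03, 30, 5.0)
    print(f"initial full {full:.1f} TB, incrementals {incr:.1f} TB, total {total:.1f} TB")
```

The unrounded total is 192.5 TB, which the walkthrough rounds to ~193 TB.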

Step 3: Add growth and N+1 headroom.

Year-3 with 15% YoY growth: 193 × 1.52 ≈ 293 TB BETB
Plus 20% headroom (the 80% utilization ceiling): 293 / 0.80 = 293 × 1.25 ≈ 366 TB target available capacity
Plus N+1: reserve at least one node’s raw capacity on top of the target

Step 4: Choose EC and compute raw.

EC 6:2 (≥8 nodes) → 75% efficiency → raw = 366 / 0.75 = 488 TB raw minimum
Add N+1 reserve of one node, so we want raw ≈ 488 TB + one node
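
Steps 3 and 4 chain three multipliers: compound growth, the utilization ceiling, and EC efficiency. A minimal sketch follows, with the function name and defaults chosen for this example only. The N+1 node is deliberately left out of the returned figure, since the text adds it on top.

```python
def raw_capacity_target_tb(steady_state_tb, yoy_growth=0.15, years=3,
                           utilization_ceiling=0.80, ec_data=6, ec_parity=2):
    """Return (year_n_betb, target_available_tb, raw_minimum_tb)."""
    year_n_betb = steady_state_tb * (1 + yoy_growth) ** years  # compound growth
    target_available = year_n_betb / utilization_ceiling       # x1.25 at an 80% ceiling
    ec_efficiency = ec_data / (ec_data + ec_parity)             # EC 6:2 -> 0.75
    raw_minimum = target_available / ec_efficiency
    return year_n_betb, target_available, raw_minimum


if __name__ == "__main__":
    year3, available, raw = raw_capacity_target_tb(193)
    print(f"year-3 BETB {year3:.1f} TB, available {available:.1f} TB, raw {raw:.1f} TB")
```

Carrying full precision gives roughly 293.5, 366.9, and 489.2 TB; the walkthrough rounds down at each step and lands on 488 TB, a difference that does not change the node count.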

Step 5: Pick nodes.

C5066 hybrid at 54 TB HDD per node: 8 nodes × 54 TB = 432 TB raw — slightly under target
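9 nodes × 54 TB = 486 TB raw, effectively at the 488 TB target; 8 of those nodes satisfy the EC 6:2 minimum and the ninth provides N+1
Pragmatic answer: 8× C5066 nodes for EC 6:2 plus 1× C5066 for N+1 = 9-node cluster, 486 TB raw (about 364 TB usable at 75% efficiency), leaving ~70 TB of usable headroom above the 293 TB year-3 estimate before nodes must be added. Note that this counts the ninth node toward both the capacity target and the N+1 reserve; holding a full node in reserve on top of 488 TB raw, as Step 4 describes, would take roughly 542 TB raw, or 10 to 11 of these nodes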
Alternatively for retention-heavy customer: 6× C6000 dense nodes at 96 TB each = 576 TB raw, EC 4:2 (since 6 nodes), with one-node N+1 → adequate but locks the cluster into a less efficient EC scheme

The “9-node C5066” answer is the typical exam-correct response: it satisfies EC 6:2 minimum (8) + N+1 (1), uses the mainstream performance node, and leaves growth headroom. The C6000 alternative is acceptable when retention dominates and the customer prefers density. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]

Chapter Summary

Cluster design and sizing is the most arithmetic-heavy domain on the CCAE exam, but the arithmetic is simple once the conceptual model is clear. Start by profiling the workload (FETB, change rate, retention, RPO, RTO, workload mix), feed those inputs into the Cohesity Sizer or do the math by hand using the formulas in this chapter, and translate the answer into a node count and ReadyNode family. EC scheme selection is the largest lever for capacity efficiency on clusters of 8+ nodes; node-family selection is the largest lever for performance.

Always reserve N+1 capacity, target 80% utilization as a ceiling, and size for year-3 protected FETB at minimum. Helios reporting will tell you if your assumptions held up, and when they don’t, heterogeneous clusters let you add capacity-only nodes without a forklift upgrade.

The architect who memorizes the RF-vs-EC table, the four ReadyNode families, the FETB-to-BETB transformation chain, and the N+1 / 80% rules will pass every sizing question on the CCAE. The architect who understands why each rule exists will design clusters that customers thank them for five years later.

Key Terms

Term	Definition
FETB	Front-End TB — the source-side data footprint of protected workloads, measured pre-deduplication and pre-compression. The licensing unit and primary Sizer input.
Change rate	The percentage of FETB that mutates between successive backups, driving incremental ingest size and back-end capacity per retention day. Typically 1–3% for VMs/NAS, 5–15% for transactional databases.
Effective capacity	Cluster capacity after applying the full transformation chain: raw → usable (resiliency) → available (overhead) → effective (dedupe × compression). The number used to compute protectable FETB.
All-flash node	ReadyNode variant (C5200, C6200) using NVMe exclusively, optimized for low-latency / high-IOPS workloads such as instant mass restore and analytics on backup data.
Hybrid node	ReadyNode (C4000, C5066, C5200 hybrid) combining HDD bulk storage with NVMe flash tier for metadata, dedupe index, and write coalescing. The mainstream choice for backup.
Brick mode	Per-node configuration that subdivides a single dense node into multiple independent fault domains (“bricks”), giving SpanFS finer-grained EC stripe placement on dense C6000 hardware.
ReadyNode	Cohesity-validated, partner-supplied hardware SKU (Cisco, HPE, Dell) certified for the SpanFS stack. Distinct from Cohesity-branded appliances and from generic third-party hardware.
Sizing tool	Cohesity Sizer — partner/SE-facing calculator that converts workload profile (FETB, change rate, retention, growth, EC) into a recommended ReadyNode model, node count, and multi-year capacity projection.

Chapter 4: Networking, DNS, and Cluster Connectivity

Networking is the discipline that makes or breaks a Cohesity deployment. A perfectly sized cluster with a beautifully tuned protection policy will still miss its backup window if a switch is misconfigured, an MTU mismatch is silently dropping jumbo frames, or DNS is handing clients the IP address of a node that has been down for three hours. Architects sitting the CCAE exam are expected to design the physical, logical, and service-layer network of a Cohesity cluster end-to-end — bonded NICs, VLANs, VIPs, partitions, BGP, SmartDNS, NTP, certificates, and the firewall matrix that ties it all together.

This chapter walks through that stack from the wire up. We start with physical bonding and VLAN tagging on the node, climb into VIPs and the SmartDNS load-balancing service, work outward to external dependencies (DNS, NTP, AD, CAs, SMTP, Syslog), and close with the firewall port matrix every architect must memorize.

Learning Objectives

By the end of this chapter you will be able to:

Design Cohesity network topologies including bonded interfaces, VLAN tagging, and 10/25/40/100 GbE selection.
Configure VIPs, cluster partitions, and BGP/static routing for partition-aware client access.
Plan DNS, NTP, certificate, and identity prerequisites that must be in place before the cluster is bootstrapped.
Build a firewall port matrix that covers inter-node, source-to-cluster, replication, Helios, and IPMI traffic.
Troubleshoot common failure modes: LACP half-bonded, MTU mismatch, SmartDNS delegation broken, partition split-brain.

Physical and Logical Network Layout

Figure 4.1: End-to-end network topology from source clients through the bonded cluster interface to a partitioned VIP pool.

flowchart LR
    Source[Source Clients<br/>Backup Agents / NAS / VMware]
    ToR[ToR Switch Pair<br/>MLAG / vPC]
    Bond[bond0<br/>LACP mode 4]
    Node1[Node 01]
    Node2[Node 02]
    Node3[Node 03]
    Node4[Node 04]
    VIP[VIP Pool<br/>1 VIP per node]
    Part[Cluster Partition<br/>mgmt / smb / s3]

    Source --> ToR
    ToR --> Bond
    Bond --> Node1
    Bond --> Node2
    Bond --> Node3
    Bond --> Node4
    Node1 --> VIP
    Node2 --> VIP
    Node3 --> VIP
    Node4 --> VIP
    VIP --> Part

Every Cohesity node ships with at least two 10GbE+ NICs that are intended to be bonded into a single logical interface — bond0 — that becomes the cluster’s primary network. The primary network is the surface that carries node-to-node traffic, VIPs (and therefore client backup/restore traffic), management UI/API, and replication. A separate IPMI interface handles out-of-band hardware management. Some designs add a secondary network — usually a tagged VLAN on the same bond, occasionally a physically separate bond — to isolate replication or NAS protocol traffic from backup ingest.

Bond Modes: Active-Backup vs. LACP

Cohesity supports exactly two Linux bonding modes on the primary interface — there is no balance-tlb, no balance-alb, and no Cisco PAgP. The choice is binary, and architects should be able to defend it on exam day.

Figure 4.2: Bond mode selection decision tree — Active-Backup (mode 1) versus LACP (mode 4).

flowchart TD
    Start[Choose Bond Mode for bond0]
    Q1{Switches support<br/>LACP / 802.3ad?}
    Q2{Dual-switch<br/>resiliency required?}
    Q3{MLAG / vPC<br/>configured?}
    Mode1[Mode 1: Active-Backup<br/>One NIC active<br/>Sub-second failover<br/>Branch / ROBO / Lab]
    Mode4Single[Mode 4: LACP<br/>Single switch port-channel<br/>Aggregate throughput]
    Mode4MLAG[Mode 4: LACP<br/>MLAG / vPC pair<br/>Production default]
    Fallback[Fall back to Mode 1<br/>or fix switch config]

    Start --> Q1
    Q1 -->|No| Mode1
    Q1 -->|Yes| Q2
    Q2 -->|No| Mode4Single
    Q2 -->|Yes| Q3
    Q3 -->|Yes| Mode4MLAG
    Q3 -->|No| Fallback

Bond mode	Linux name	Switch requirement	Throughput	Failover	Typical use
Mode 1	Active-Backup	None — any switch	One NIC at a time	Sub-second on link loss	Branch / ROBO, lab clusters, switches without LACP/MLAG
Mode 4	802.3ad LACP	Port-channel/LAG with matching LACP timers; MLAG/vPC for dual-switch	Aggregate (L3+L4 hash)	Sub-second on link loss; on LACPDU loss, follows the LACP timeout (about 3 s at fast rate, 90 s at slow rate)	Production default [Source: https://www.cohesity.com/blogs/optimizing-cohesity-and-vsphere-networking/]

Mode 4 is the production-recommended default because it gives both link redundancy and active load distribution across NICs. The trade-off is that the upstream switches must support LACP and be configured as a port-channel (Cisco), LAG (Arista/Juniper), or — for cross-switch resiliency — an MLAG/vPC pair so the two NICs in the bond can land on two physically separate switches and still appear to LACP as a single peer [Source: https://iworknthecloud.wordpress.com/2018/04/22/how-to-configure-lacp-vlans-and-jumbo-frames-on-cohesity-and-cisco-nexus/].

Cohesity’s terminology can confuse people: when a Cohesity engineer says “active-active,” they specifically mean LACP mode 4. There is no separate “active-active without LACP” option for the primary bond [Source: https://cypresscollege.a2hosted.com/files/ISER-Evidence/IIC-Student-Support/IIC8/IIC8-15_ISDataProtectionCohesityPlatform.pdf].

The bond mode is set in two places, and forgetting either is a classic deployment bug:

Per-node, in /etc/sysconfig/network-scripts/ifcfg-bond0, set BONDING_OPTS="mode=4 miimon=100".
Cluster-wide, via the iris CLI: iris_cli cluster -username=<user> -password=<pass> edit-bm bonding-mode=4 [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].

After applying the change, restart networking and the Nexus services (sudo systemctl stop nexus nexus_proxy && sudo systemctl start nexus nexus_proxy). Verify with cat /sys/class/net/bond0/bonding/mode — you should see 802.3ad 4.
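
The verification step lends itself to a small script that can run on every node. The sketch below parses the standard Linux sysfs bonding file named above; the helper names are assumptions, and the expected string is the "802.3ad 4" value quoted in the text.

```python
from pathlib import Path


def parse_bond_mode(text):
    """Split sysfs output such as '802.3ad 4' into (name, number)."""
    name, number = text.split()
    return name, int(number)


def bond_is_lacp(path="/sys/class/net/bond0/bonding/mode"):
    """Return True when the bond reports 802.3ad (mode 4)."""
    return parse_bond_mode(Path(path).read_text()) == ("802.3ad", 4)


if __name__ == "__main__":
    mode_file = Path("/sys/class/net/bond0/bonding/mode")
    if mode_file.exists():
        print("bond0 LACP:", bond_is_lacp(str(mode_file)))
    else:
        print("bond0 not present on this host")
```

Running the same check across all nodes is a quick way to catch the case where the per-node file and the cluster-wide setting disagree.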

Primary, Secondary, and IPMI Interfaces

Three logical interfaces matter to the architect:

Primary network — the bonded interface (bond0) carrying node management IPs, all VIPs, and (via tagged VLANs) backup/NFS/SMB/replication traffic. 10GbE is the floor; 25GbE and 100GbE are increasingly common on dense or all-flash nodes [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].
Secondary network — optional. Often a tagged VLAN on bond0 used to segregate replication traffic to a DR cluster, or a physically separate bond on dense nodes that need WAN replication kept off the ingest interface.
IPMI — out-of-band, 1GbE, on a dedicated management VLAN. Uses TCP/UDP 623 (IPMI) plus 80/443 for the Redfish/web UI per node. Keep IPMI off the primary network — it has its own lifecycle, its own credentials, and should not share a broadcast domain with backup traffic.

IP allocation rule of thumb per node:

Primary network: 1 node management IP + 1 VIP (recommended one VIP per node) = 2 IPs.
IPMI: 1 IP per node on the OOB network.

A 4-node cluster therefore needs 8 primary-network IPs (4 node + 4 VIP) plus 4 IPMI IPs, all carved out of pools your network team should reserve before bring-up.

VLAN Tagging and Untagged Native VLANs

The recommended segmentation pattern is untagged management on the native VLAN of bond0, with tagged VLANs on top of bond0 for protocol traffic. So:

bond0 native (untagged) → node management IPs, default gateway, primary VIPs.
bond0.100 (tagged VLAN 100) → SMB/NFS client traffic for SmartFiles.
bond0.200 (tagged VLAN 200) → replication to DR cluster.
bond0.300 (tagged VLAN 300) → tenant-A NAS access.

Each tagged VLAN can carry its own VIP pool, its own SmartDNS subdomain, and its own gateway, allowing per-protocol or per-tenant isolation without buying additional NICs. The switch ports are configured as trunk ports with the appropriate native VLAN and allowed VLAN list.

Speed Selection: 10/25/40/100 GbE

NIC speed should follow the workload, not fashion:

Speed	Use case
1 GbE	IPMI only. Never the primary bond in production.
10 GbE	Hybrid nodes, mid-size clusters, replication-heavy ROBO. Production minimum.
25 GbE	Dense storage and modern hybrid nodes; common production choice in 2024+.
40/100 GbE	All-flash nodes, large SmartFiles/primary workloads, 10+ node clusters.

Whatever speed you pick, all NICs in a single bond must run at the same speed — mixing 10 and 25 GbE in bond0 is unsupported. And whatever speed you pick, enable jumbo frames (MTU 9000) end-to-end: Cohesity nodes, switches, uplinks, and any L3 hops. An MTU mismatch with an intermediate L3 hop set at 1500 will silently drop large backup writes and produce bizarre, intermittent slowness that’s miserable to debug [Source: https://iworknthecloud.wordpress.com/2018/04/22/how-to-configure-lacp-vlans-and-jumbo-frames-on-cohesity-and-cisco-nexus/].
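
One common way to prove jumbo frames work end-to-end is a don't-fragment ping sized to fill the MTU exactly. The sketch below computes that payload for IPv4 (MTU minus a 20-byte IP header and an 8-byte ICMP header) and builds the corresponding Linux iputils ping command. The helper names and the target address are assumptions for illustration.

```python
def icmp_payload_for_mtu(mtu, ip_header=20, icmp_header=8):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - icmp_header


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build a Linux ping command with the don't-fragment bit set (-M do)."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


if __name__ == "__main__":
    # 10.10.0.12 is a placeholder peer address; substitute a real node or gateway.
    print(" ".join(jumbo_ping_command("10.10.0.12")))
```

If the 8,972-byte ping fails while a 1,472-byte ping (the MTU 1500 equivalent) succeeds, some hop on the path is not passing jumbo frames.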

Key Takeaway: Bond two same-speed 10/25 GbE NICs into bond0 using LACP mode 4 with MLAG/vPC to dual switches, set the cluster-wide bonding mode through iris_cli, run jumbo frames end-to-end, and reserve 2 primary-network IPs plus 1 IPMI IP per node before you ever rack the gear.

Cluster Partitions and VIPs

Once bond0 is up and IPs are reserved, the next layer is the cluster’s VIP pool and how clients reach it. This is where Cohesity’s most distinctive networking feature — SmartDNS — and its concept of cluster partitions come into play.

The Cluster Partition Concept

A cluster partition in Cohesity is a logical grouping of nodes within a single physical cluster. Partitions are commonly used in three scenarios:

Stretch / multi-rack designs — nodes in rack A in one partition, nodes in rack B in another, so SmartDNS can return only locally-reachable VIPs to clients that are network-close.
Multi-tenant traffic separation — different tenants are pinned to different partitions, with each partition advertising its own VIPs and FQDN.
Mixed-protocol routing — SmartFiles SMB clients steered to one partition, S3 clients to another, so a hot SMB workload can’t starve S3 throughput.

A partition is not a separate cluster — quorum, metadata, and SpanFS still span the whole cluster. The partition only governs which VIPs a given client sees and which nodes can serve a given DNS-delegated FQDN.

Virtual IPs (VIPs)

A VIP is an IP address that clients connect to but which is not pinned to a specific node’s hardware. In Cohesity, the recommended pattern is one VIP per node, all VIPs in the same subnet/VLAN as the node management IPs and the gateway, statically assigned (no DHCP) [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].

VIPs serve every client-facing protocol the cluster speaks:

SMB (TCP 445) — Windows backup clients, SmartFiles SMB shares.
NFS (TCP/UDP 2049 + 111) — Linux clients, ESXi NFS datastores for instant recovery.
S3 (TCP 443) — object access to SmartFiles views.
Management UI/API (TCP 80/443) — operators, automation, Helios proxying.
Backup/agent traffic (TCP 50051, 11111, etc.) — physical agents and inter-cluster replication land on VIPs.

Because all VIPs live in the same L2 subnet, when a node fails its VIP can be re-homed to another node within the partition without changing routing — clients just see a brief reset and reconnect.

DNS Round-Robin vs. SmartDNS

There are two patterns for steering clients across the VIP pool. The exam will test both.

Aspect	Classic DNS Round-Robin	SmartDNS (preferred)
Where DNS records live	Corporate DNS server	Cohesity cluster (delegated subdomain)
A-record management	Static, manually maintained	Dynamic, cluster-managed
Health awareness	None — failed nodes still get returned	Health-checked — failed VIPs auto-removed
TTL	Whatever corporate DNS sets (often 1 hour)	Short (seconds), driving fast client failover
Partition awareness	None	Returns only VIPs from healthy nodes in the relevant partition
Failover	Manual (admin removes A record)	Automatic (next query gets healthy set)
Use case	Lab, ROBO, environments where DNS team won’t delegate	Production, multi-VIP, partitioned, multi-VLAN

In classic DNS round-robin, you pre-create A records for each VIP under a single cluster hostname (for example cohesity.acme.com) on the corporate authoritative DNS, and the DNS server rotates A-record order on each query. It works, but it’s static — when a node dies the corporate DNS keeps cheerfully handing out its VIP until a human intervenes [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].

Figure 4.3: SmartDNS resolution flow — client query traverses corporate DNS, the delegated Cohesity authoritative DNS, and returns a healthy VIP.

sequenceDiagram
    participant Client as Backup Client
    participant SiteDNS as Corporate DNS
    participant ClusterDNS as Cohesity SmartDNS<br/>(Authoritative for cohesity.acme.com)
    participant Health as Cluster Health Check
    participant VIP as Healthy VIP

    Client->>SiteDNS: Query cohesity.acme.com
    SiteDNS->>SiteDNS: Lookup NS records
    SiteDNS->>ClusterDNS: Forward via NS delegation
    ClusterDNS->>Health: Get currently healthy nodes
    Health-->>ClusterDNS: Healthy VIP set (excludes failed)
    ClusterDNS-->>SiteDNS: A records (rotated, short TTL)
    SiteDNS-->>Client: Healthy VIP address
    Client->>VIP: Connect (SMB / NFS / S3 / API)
    VIP-->>Client: Service response

In SmartDNS, a subdomain is delegated from corporate DNS to the Cohesity cluster, and the cluster runs an authoritative DNS service for that zone. On each query the cluster returns the currently healthy VIP set, rotated round-robin. Dead nodes are pulled from rotation by internal health checks; clients get a short TTL so retries quickly land on healthy nodes during failover [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].
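
The difference between static round-robin and health-aware rotation can be shown with a toy model. The class below is purely illustrative and does not represent Cohesity's implementation: it keeps a rotating ring of VIPs and filters out any that have been marked unhealthy before answering.

```python
from collections import deque


class ToySmartDns:
    """Toy health-aware round-robin resolver (illustration only)."""

    def __init__(self, vips):
        self._ring = deque(vips)
        self._healthy = set(vips)

    def mark_down(self, vip):
        self._healthy.discard(vip)

    def mark_up(self, vip):
        if vip in self._ring:
            self._healthy.add(vip)

    def resolve(self):
        """Return healthy VIPs in the current rotation, then advance the ring."""
        answer = [vip for vip in self._ring if vip in self._healthy]
        self._ring.rotate(-1)
        return answer


if __name__ == "__main__":
    dns = ToySmartDns(["10.10.0.21", "10.10.0.22", "10.10.0.23", "10.10.0.24"])
    print(dns.resolve())
    dns.mark_down("10.10.0.23")   # simulate a failed node
    print(dns.resolve())
```

A classic round-robin server corresponds to never calling mark_down: the failed VIP stays in every answer until someone edits the record set by hand.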

Analogy — SmartDNS as Maitre d’. Picture a busy restaurant with eight tables. Classic DNS round-robin is a printed seating chart taped to the door: it cycles through tables 1 through 8 regardless of whether table 5 has a broken leg or table 7 is on fire. SmartDNS is a maitre d’ who knows the live state of every table — when you walk in they glance across the room, see which tables are clean, lit, and ready for service, and they seat you there. When table 5’s leg breaks, the maitre d’ simply stops sending guests to it until the busser fixes it; the printed chart-holder, by contrast, would happily seat you at the broken table until someone updates the laminated sheet.

Configuring SmartDNS — Architect Steps

Pick a delegated subdomain — cohesity.<company>.com is conventional.
On the corporate authoritative DNS, create NS records delegating that subdomain to two or more Cohesity VIPs.
Create matching A/glue records for the NS hostnames so corporate DNS can reach those VIPs on UDP/TCP 53.
In the Cohesity UI under Cluster → Network → VIPs / Hostname, set the Cluster Hostname to the delegated FQDN and assign one VIP per node on the primary bond.
(Optional) Configure additional VLAN-tagged SmartDNS subdomains for SmartFiles, per-tenant traffic, or replication. Each VLAN gets its own VIP pool and FQDN.
Validate with dig cohesity.acme.com @<cluster-VIP> — you should see the A-record set rotating across queries. Take a node offline (maintenance mode) and re-run; the dead VIP should disappear within seconds.

A couple of network rules accompany SmartDNS:

All VIPs in a given pool must live in the same subnet/VLAN as that network’s node IPs and gateway.
VIPs must be statically assigned — DHCP is unsupported.
Multicast must be enabled on the primary network for cluster auto-discovery during initial bring-up. From release 6.6+, unicast modes are available for environments that prohibit multicast [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].

BGP for Multi-Subnet VIP Failover

Layer-2 round-robin VIPs are fine when every client can ARP for them in the same broadcast domain. But for multi-subnet primary networks (supported from 6.6+) and stretch-cluster designs that cross a router, BGP route advertisement lets the cluster announce VIP /32 routes from individual nodes upstream, so that when a VIP moves between partitions or subnets the routing table converges to the new owner.

BGP design points the architect needs:

The Cohesity cluster speaks BGP to one or more upstream ToR switches or route reflectors, advertising each VIP as a /32.
When a node fails, its /32 is withdrawn; when the cluster re-homes the VIP to a healthy node, that node advertises the /32.
BGP communities can be used to steer traffic across partitions — for example, a partition tagged “rack-A” advertises with a community that the upstream prefers when the source IP is in rack-A’s subnet.
BGP and SmartDNS are complementary — SmartDNS picks the VIP, BGP makes the VIP routable across the L3 fabric.

Worked Example — VIP Plan for a 4-Node Cluster with 3 Partitions

Suppose you’re designing a Cohesity cluster for a customer with the following requirements:

4 physical nodes, each with bond0 = 2 × 25 GbE LACP to a Cisco Nexus vPC pair.
Three logical partitions: mgmt (UI, API, replication), smartfiles-smb (SMB views), and smartfiles-s3 (S3 views).
Production VLANs: VLAN 10 for mgmt (untagged on bond0), VLAN 100 for SMB, VLAN 200 for S3.
Subnets: 10.10.0.0/24 for mgmt, 10.100.0.0/24 for SMB, 10.200.0.0/24 for S3.
Corporate DNS administered by the network team, willing to delegate cohesity.acme.com.

Step 1 — Node management IPs (mgmt VLAN, untagged on bond0).

Node	Management IP	IPMI IP
node01	10.10.0.11	10.99.0.11
node02	10.10.0.12	10.99.0.12
node03	10.10.0.13	10.99.0.13
node04	10.10.0.14	10.99.0.14

Step 2 — VIP pools, one per node per partition.

Partition	VLAN	Subnet	node01 VIP	node02 VIP	node03 VIP	node04 VIP
mgmt	10 (untagged)	10.10.0.0/24	10.10.0.21	10.10.0.22	10.10.0.23	10.10.0.24
smartfiles-smb	100	10.100.0.0/24	10.100.0.21	10.100.0.22	10.100.0.23	10.100.0.24
smartfiles-s3	200	10.200.0.0/24	10.200.0.21	10.200.0.22	10.200.0.23	10.200.0.24

That’s 4 VIPs per partition × 3 partitions = 12 VIPs total, plus 4 node IPs and 4 IPMI IPs = 20 IPs the network team must reserve before bring-up.
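
The VIP table and the IP count can be generated rather than typed by hand, which helps when the node count changes. This sketch uses Python's standard ipaddress module; the function names and the convention of starting VIPs at host .21 are taken from this example only and are not a Cohesity requirement.

```python
import ipaddress


def vip_plan(partition_subnets, node_count, first_host=21):
    """Map each partition to one VIP per node, starting at host offset first_host."""
    plan = {}
    for partition, cidr in partition_subnets.items():
        network = ipaddress.ip_network(cidr)
        plan[partition] = [str(network.network_address + first_host + i)
                           for i in range(node_count)]
    return plan


def total_ips(plan, node_count):
    """VIPs plus one management IP and one IPMI IP per node."""
    vips = sum(len(addresses) for addresses in plan.values())
    return vips + node_count + node_count


SUBNETS = {
    "mgmt": "10.10.0.0/24",
    "smartfiles-smb": "10.100.0.0/24",
    "smartfiles-s3": "10.200.0.0/24",
}

if __name__ == "__main__":
    plan = vip_plan(SUBNETS, node_count=4)
    for partition, vips in plan.items():
        print(partition, vips)
    print("IPs to reserve:", total_ips(plan, node_count=4))
```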

Step 3 — SmartDNS subdomains.

cohesity.acme.com → delegated to VIPs 10.10.0.21 and 10.10.0.22 (mgmt partition). UI and API live here.
smb.cohesity.acme.com → delegated to VIPs 10.100.0.21 and 10.100.0.22 (smartfiles-smb partition). SMB clients use \\smb.cohesity.acme.com\share.
s3.cohesity.acme.com → delegated to VIPs 10.200.0.21 and 10.200.0.22 (smartfiles-s3 partition). S3 SDKs use https://s3.cohesity.acme.com.

Step 4 — NS delegation in corporate DNS.

cohesity.acme.com.        IN NS  ns1.cohesity.acme.com.
cohesity.acme.com.        IN NS  ns2.cohesity.acme.com.
ns1.cohesity.acme.com.    IN A   10.10.0.21
ns2.cohesity.acme.com.    IN A   10.10.0.22
smb.cohesity.acme.com.    IN NS  ns1-smb.cohesity.acme.com.
smb.cohesity.acme.com.    IN NS  ns2-smb.cohesity.acme.com.
ns1-smb.cohesity.acme.com. IN A  10.100.0.21
ns2-smb.cohesity.acme.com. IN A  10.100.0.22
... (same pattern for s3)
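
Because every delegated subdomain follows the same NS-plus-glue pattern, the records can be produced by a small helper. The sketch below is an assumption-laden illustration: the function name is invented, and the output is plain zone-file text that would still need review by the DNS team. It is demonstrated with the smb subdomain values shown above; other subdomains would use whatever name-server hostnames your DNS team chooses.

```python
def delegation_records(zone, ns_hosts):
    """Return NS and glue A records delegating zone to the given name servers.

    ns_hosts maps each name-server hostname to the VIP it should resolve to.
    """
    lines = [f"{zone}. IN NS {host}." for host in ns_hosts]
    lines += [f"{host}. IN A {vip}" for host, vip in ns_hosts.items()]
    return lines


if __name__ == "__main__":
    records = delegation_records(
        "smb.cohesity.acme.com",
        {
            "ns1-smb.cohesity.acme.com": "10.100.0.21",
            "ns2-smb.cohesity.acme.com": "10.100.0.22",
        },
    )
    print("\n".join(records))
```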

Step 5 — Validation.

dig cohesity.acme.com           # expect 4 A records, rotating
dig smb.cohesity.acme.com       # expect 4 A records on VLAN 100
dig s3.cohesity.acme.com        # expect 4 A records on VLAN 200

Take node03 into maintenance mode and re-run — its VIP should disappear from all three responses within seconds.

Key Takeaway: Treat VIPs, partitions, and SmartDNS as one design problem. Reserve one VIP per node per partition in the same subnet/VLAN as the gateway, delegate a per-partition subdomain from corporate DNS, and let SmartDNS health-check VIPs in/out of rotation so client failover happens in seconds without operator intervention.

External Service Dependencies

A Cohesity cluster does not live in isolation. Before the bootstrap wizard ever runs, the architect must align with the network, identity, security, and operations teams on a shopping list of external services. Get any of these wrong and bring-up will stall — or worse, succeed in a way that leaves a security gap you’ll only discover during the first audit.

DNS, NTP, and Reverse DNS

DNS (TCP 53 / UDP 53) — at least two upstream resolvers. The cluster needs forward and reverse DNS for itself, every node, every VIP, every external target (object store, replication peer, AD domain), and every Helios endpoint. Reverse DNS (PTR records) is mandatory for AD/Kerberos to work; many bootstrap failures trace back to missing PTRs.
NTP (UDP 123) — at least two NTP sources, ideally three for quorum. Cohesity’s strict-consistency metadata layer is intolerant of clock skew; nodes more than ~30 seconds out can be ejected from quorum. Use the same NTP source as your AD domain controllers — Kerberos has a 5-minute clock skew tolerance and AD itself drifts if NTP is wrong.

AD/LDAP, Kerberos, and SSO Endpoints

LDAP (TCP 389) / LDAPS (TCP 636) — Active Directory bind for user lookup. LDAPS is preferred; if you use LDAP+StartTLS or plain LDAP, expect security review pushback.
Kerberos (TCP 88, UDP 88) — required for AD-joined SMB shares and for the cluster’s own machine account.
SAML / OIDC endpoints (TCP 443 outbound) — for SSO with Okta, Azure AD/Entra ID, Ping, etc. The cluster needs outbound HTTPS to the IdP’s metadata URL and SAML assertion endpoint.
DNS prerequisite — the AD domain’s _ldap._tcp.<domain> SRV records must resolve. Architects often forget that SRV-record lookups go through DNS, not LDAP.

The cluster joins AD as a machine account; the service account used to perform the join needs Domain Join rights and the ability to create computer objects in the target OU.

Certificate Authority and TLS Chains

Out of the box the cluster ships with a self-signed cert. For production you must replace it with a CA-signed cert covering:

The cluster hostname (cohesity.acme.com).
Every SmartDNS-delegated subdomain in active use (smb.cohesity.acme.com, s3.cohesity.acme.com).
Optionally, individual node FQDNs.

Use a SAN certificate (multiple subjectAltNames) so a single cert covers all FQDNs. Internal CAs are fine; the cluster trusts certs whose chain it can verify, so the issuing CA’s root and intermediates must be uploaded to the cluster.
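
Before submitting a certificate request, it helps to confirm that the planned SAN list actually covers every FQDN clients will use. The following sketch is a plain string check with invented helper names; it treats a wildcard as covering exactly one leftmost label, which is the usual matching rule, and it does not inspect a real certificate.

```python
def san_matches(fqdn, san):
    """True if a single SAN entry covers fqdn (wildcard = one leftmost label)."""
    fqdn = fqdn.lower().rstrip(".")
    san = san.lower().rstrip(".")
    if san.startswith("*."):
        head, _, rest = fqdn.partition(".")
        return bool(head) and rest == san[2:]
    return fqdn == san


def missing_sans(required_fqdns, cert_sans):
    """Return the required FQDNs that no SAN entry covers."""
    return [fqdn for fqdn in required_fqdns
            if not any(san_matches(fqdn, san) for san in cert_sans)]


if __name__ == "__main__":
    required = ["cohesity.acme.com", "smb.cohesity.acme.com", "s3.cohesity.acme.com"]
    planned = ["cohesity.acme.com", "smb.cohesity.acme.com"]
    print("Not covered:", missing_sans(required, planned))
```

Note that a wildcard such as *.acme.com would cover cohesity.acme.com but not smb.cohesity.acme.com, which is one reason an explicit SAN list is the safer choice for delegated subdomains.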

For replication and Helios connectivity, the cluster needs to trust the peer cluster’s CA (for cluster-to-cluster replication) and trust public CAs (for Helios over TCP 443). Most enterprises route outbound HTTPS through a forward proxy that does TLS interception — the proxy’s root CA must be uploaded to the cluster or Helios calls will fail with cert validation errors.

SMTP, SNMP, and Syslog Targets

SMTP (TCP 25 / 465 / 587 outbound) — for email alerts. Most enterprises use an authenticated relay on 587 with STARTTLS.
SNMP (UDP 161 inbound for polling, UDP 162 outbound for traps) — for monitoring system integration.
Syslog (UDP 514 / TCP 6514 outbound) — for SIEM integration. TCP 6514 with TLS is preferred for compliance environments.

These don’t block bring-up, but they are usually in scope for the security review and absolutely should be configured before the cluster takes production traffic.

Key Takeaway: Treat DNS forward+reverse records, redundant NTP sources, AD/SSO endpoints, a SAN TLS cert from a known CA, and SMTP/SNMP/Syslog targets as Day-0 prerequisites — not Day-2 polish. A cluster that boots without them is a cluster that will fail its first audit.

Firewalls and Port Requirements

The CCAE exam loves port-matrix questions. Memorize the table below; an architect who can rattle off “TCP 50051 is the physical-agent channel, TCP 11114 is replication-inbound, UDP 123 is NTP” will answer several scenario questions in seconds.

Figure 4.4: Cohesity port matrix as a hierarchy — source-to-cluster, inter-node, inter-cluster, external services, and management.

graph TD
    Root[Cohesity Port Matrix]
    Source[Source to Cluster]
    Inter[Inter-Node / Inter-Cluster]
    External[External Services]
    Mgmt[Management / OOB]

    Root --> Source
    Root --> Inter
    Root --> External
    Root --> Mgmt

    Source --> Agent[TCP 50051<br/>Physical Agent<br/>Win/Linux/SQL/Oracle]
    Source --> VMware[TCP 443<br/>vCenter / ESXi]
    Source --> SMB[TCP 445<br/>NAS SMB]
    Source --> NFS[TCP/UDP 2049 + 111<br/>NAS NFS + RPC]
    Source --> WinRM[TCP 5986<br/>Hyper-V WinRM-HTTPS]

    Inter --> IO[TCP 11111<br/>I/O Operations]
    Inter --> Repl[TCP 11114<br/>Replication]
    Inter --> Mgmt24[TCP 24444<br/>Cluster Mgmt]
    Inter --> API[TCP 443<br/>Mgmt API]

    External --> DNS[TCP/UDP 53<br/>DNS / SmartDNS]
    External --> NTP[UDP 123<br/>NTP]
    External --> LDAP[TCP 389/636<br/>LDAP / LDAPS]
    External --> Krb[TCP/UDP 88<br/>Kerberos]
    External --> Helios[TCP 443<br/>Helios outbound]
    External --> SMTP[TCP 25/465/587<br/>SMTP]
    External --> Syslog[UDP 514 / TCP 6514<br/>Syslog]

    Mgmt --> SSH[TCP 22<br/>SSH / iris_cli]
    Mgmt --> UI[TCP 80/443<br/>UI / API]
    Mgmt --> IPMI[TCP/UDP 623 + 80/443<br/>IPMI / Redfish]

The Cohesity Port Matrix

Direction	Port	Protocol	Purpose	Source
Inter-node (within cluster, on bond0)	11111	TCP	I/O Operations Service	[Source: https://docs.cohesity.com/baas/data-protect/dataprotect-firewall-ports.htm]
Inter-cluster (replication)	11111, 11114, 24444, 443	TCP	Replication, IO ops, mgmt API	[Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/firewall-ports.htm]
Source → cluster (physical Linux/Windows)	50051	TCP	Cohesity agent channel (also SQL VSS, Hyper-V, Oracle RMAN coord)	[Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm]
Source → cluster (Hyper-V/SCVMM granular)	5986	TCP	WinRM over HTTPS to guest VM	[Source: https://kb.expedient.com/docs/firewall-and-port-requirements-1]
Source → cluster (NAS SMB)	445	TCP/UDP	SMB to NAS source	[Source: https://docs.cohesity.com/baas/data-protect/protect-nas-sources.htm]
Source → cluster (NAS NFS)	2049, 111	TCP/UDP	NFS + portmapper/RPC	[Source: https://docs.cohesity.com/baas/data-protect/protect-nas-sources.htm]
Source → cluster (NAS aux)	8080	TCP/UDP	Aux NAS workflows	[Source: https://docs.cohesity.com/baas/data-protect/protect-nas-sources.htm]
Cluster → vCenter / ESXi	443	TCP	VMware API, snapshot mgmt, VMware Tools file recovery	[Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/firewall-ports.htm]
Cluster → Hybrid Extender (SMB recovery)	445	TCP	SMB recovery operations	[Source: https://docs.cohesity.com/baas/data-protect/saas-connection-requirements-cohesity-deployed.htm]
Cluster ↔ Oracle Hybrid Extender	29991	TCP	NFS mount of Cohesity views for instant recover/clone	[Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm]
Cluster ↔ DNS	53	TCP/UDP	DNS lookups; SmartDNS responses if cluster is authoritative	[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf]
Cluster → NTP	123	UDP	Time sync	[Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements]
Cluster → AD/LDAP	389, 636	TCP	LDAP / LDAPS	[Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements]
Cluster → AD Kerberos	88	TCP/UDP	Kerberos tickets	[Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements]
Cluster → Helios (outbound)	443	TCP	Helios SaaS control plane	[Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm]
Admin → cluster	22	TCP	SSH (support / iris_cli)	[Source: https://kb.expedient.com/docs/firewall-and-port-requirements-1]
Admin → cluster	80, 443	TCP	UI and API	[Source: https://kb.expedient.com/docs/firewall-and-port-requirements-1]
Admin → IPMI (OOB)	623, 80, 443	TCP/UDP	IPMI + Redfish web UI	[Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm]

Inter-Node and Intra-Cluster Ports

Within a single cluster, all node-to-node traffic flows on bond0 and is generally not firewalled — the assumption is that the cluster’s primary network is a flat, trusted L2 segment. If your security policy puts an internal firewall between nodes (rare but seen in some service-provider topologies), the same inter-cluster ports — 11111, 11114, 24444, 443 — must be opened, plus 53 and 123 for DNS/NTP if those services are partition-local.

Crucially, multicast must be enabled on the primary network for the bring-up auto-discovery phase. If the network team can’t allow multicast on the primary VLAN, plan to use the unicast bring-up path on releases 6.6+ [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].

Source-to-Cluster Ports

The single most-tested fact in this section: all physical agent traffic (Linux, Windows, Hyper-V, SQL VSS, Oracle RMAN coordination) shares TCP 50051. If you remember nothing else, remember that one [Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm].

After that:

VMware flows over TCP 443 from the cluster to vCenter and ESXi hosts.
NAS uses 445 (SMB), 2049 (NFS), 111 (RPC), occasionally 8080.
Linux NFS-based restores also need TCP/UDP 111 (rpcbind) for mount.
Hyper-V / SCVMM granular file recovery uses TCP 5986 (WinRM-HTTPS) into the guest VM.
Oracle Hybrid Extender to cluster uses TCP 29991 for the NFS mount of Cohesity views during instant recover and clone workflows.
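
The port list above lends itself to a quick pre-deployment reachability test. The following is a minimal Python sketch, not a Cohesity tool: the `PORT_MATRIX` labels and the `check_tcp` and `unreachable_ports` helpers are hypothetical names, and only the port numbers come from the matrix. It probes TCP only, so UDP services need a different check.

```python
import socket

# Port numbers taken from the matrix above; the dictionary keys are
# illustrative labels, not Cohesity terminology.
PORT_MATRIX = {
    "agent": [50051],
    "vmware": [443],
    "nas": [445, 2049, 111],
    "replication": [11111, 11114, 24444, 443],
}


def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_ports(host: str, flow: str, timeout: float = 2.0) -> list[int]:
    """List the ports for a flow that could not be reached on host."""
    return [p for p in PORT_MATRIX[flow] if not check_tcp(host, p, timeout)]
```

An empty list from `unreachable_ports(host, "agent")` means every TCP port for that flow answered from wherever the script ran. A firewall may treat other source addresses differently, so run it from a host on the same segment as the side that initiates the flow.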

Helios Outbound Connectivity

Helios is a SaaS control plane. The cluster initiates an outbound HTTPS (TCP 443) tunnel to Helios — Helios never reaches in. This means:

The cluster needs outbound TCP 443 to helios.cohesity.com and the regional Helios endpoints.
If a forward proxy intercepts TLS, its CA must be trusted by the cluster.
If proxy authentication is required, configure proxy credentials in the cluster.

In air-gapped environments (FedRAMP, classified networks), Helios is not available and operations rely on the cluster’s local UI/API only.

Common Firewall Misconfigurations

TCP 50051 not opened from cluster VIPs to physical agents — backups fail with “agent unreachable.” Open from each VIP, not just node IPs, because the connection can come from any VIP in the partition.
Asymmetric replication ports — opening 11111 outbound on the source side but not inbound on the DR side. Replication needs bidirectional flows; map the table direction carefully.
MTU 1500 on a transit switch between Cohesity and a NAS source while both endpoints are MTU 9000 — silent packet drops, intermittent slowness.
Missing reverse DNS — Kerberos breaks, AD-joined SMB protections fail with cryptic auth errors.
Helios proxy CA not trusted — cluster shows “disconnected from Helios” while local UI looks fine. Always test outbound TLS to the Helios endpoint with the cluster’s trust store, not a generic curl from a node.
IPMI on the primary VLAN — security finding waiting to happen; IPMI must be on a separate OOB management VLAN.
SmartDNS NS delegation pointed at node IPs instead of VIPs — when the named node fails, the entire delegated zone goes dark. Always delegate to two or more VIPs, never node IPs.
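
The MTU mismatch in the list above is easy to miss because small packets still pass. A common way to expose it is a ping with the don't-fragment bit set and a payload sized to fill the frame. The sketch below shows only the arithmetic, assuming IPv4 with a 20-byte header and no IP options; the function name is illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER_BYTES + ICMP_HEADER_BYTES:
        raise ValueError("MTU too small to carry an ICMP echo")
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


for mtu in (1500, 9000):
    print(f"MTU {mtu}: ping payload {max_icmp_payload(mtu)} bytes")
```

On Linux, `ping -M do -s 8972 <nas-ip>` sends such a packet. If it fails while `-s 1472` succeeds, some hop in the path is not passing jumbo frames.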

Key Takeaway: Memorize the port matrix — TCP 50051 for agents, 443 for VMware and Helios, 11111/11114/24444 for replication, 445/2049/111 for NAS — and treat reverse DNS, MTU 9000 end-to-end, and IPMI VLAN isolation as non-negotiable prerequisites.

Chapter Summary

Cohesity networking is a layered design problem. At the bottom is the physical bond — two same-speed 10/25 GbE NICs in LACP mode 4 to an MLAG/vPC switch pair, jumbo frames everywhere, configured both in ifcfg-bond0 per node and cluster-wide via iris_cli edit-bm bonding-mode=4. On top of bond0 ride untagged management traffic on the native VLAN and tagged VLANs for protocols, replication, and tenant separation. Each node carries one node IP and one VIP per partition on the primary network, plus one IPMI IP on a separate OOB VLAN.

Above the IPs is the SmartDNS service. By delegating a subdomain from corporate DNS to the cluster, the cluster becomes an authoritative DNS server that returns currently healthy VIPs in round-robin order, with partition awareness and second-scale TTLs that give clients near-instant failover when a node dies — the maitre d’ replacing the laminated seating chart. For multi-subnet primaries and stretch designs, BGP /32 advertisement makes VIP failover work across L3 boundaries.

Cluster bring-up depends on a roster of external services: redundant DNS with forward and reverse records, redundant NTP within seconds of AD time, AD/LDAPS+Kerberos endpoints, a SAN TLS cert from a trusted CA, SMTP/SNMP/Syslog targets, and outbound HTTPS to Helios. And the whole thing rides on a port matrix the architect must memorize: TCP 50051 for physical agents, 443 for VMware and Helios outbound, 11111/11114/24444 for inter-cluster replication, 445/2049/111 for NAS, 88/389/636 for AD, 53/123 for DNS/NTP, and 623 + 80/443 for IPMI on the OOB network.

Get the bond mode wrong and you cap throughput at one NIC. Get the MTU wrong and large writes silently drop. Get reverse DNS wrong and Kerberos fails. Get NS delegation pointed at node IPs and SmartDNS dies with the first failure. None of these are subtle — they are the architect’s checklist on Day 0, every time.

Key Terms

VIP (Virtual IP) — A statically assigned IP on the cluster’s primary network, used by clients to reach SMB, NFS, S3, or management endpoints. Recommended one VIP per node, all in the same subnet/VLAN as the gateway, no DHCP.
Cluster partition — A logical grouping of nodes within a single cluster, used for stretch/multi-rack designs, multi-tenant isolation, or per-protocol traffic separation. SmartDNS returns VIPs only from healthy nodes within the relevant partition pool.
LACP (Link Aggregation Control Protocol, IEEE 802.3ad / Linux bonding mode 4) — Active-active bonding mode where all NICs in the bond carry traffic; LACPDUs negotiate the aggregate with the upstream switch. Requires switch port-channel/LAG support and (for dual-switch resiliency) MLAG/vPC. The production-recommended bond mode for Cohesity.
SmartDNS — Cohesity’s authoritative DNS service that runs on the cluster itself for a delegated subdomain. Returns currently healthy VIPs in round-robin order with partition awareness, automatically removing failed VIPs from rotation. The preferred load-distribution mechanism for production clusters.
BGP (Border Gateway Protocol) — Used by Cohesity (6.6+) to advertise VIP /32 routes to upstream switches/route reflectors so VIP failover works across L3 boundaries. Complements SmartDNS for multi-subnet and stretch designs.
Bond (bond0) — The Linux logical interface formed by aggregating two or more physical NICs. Cohesity supports mode 1 (Active-Backup) and mode 4 (LACP). The bond is the cluster’s primary network and carries node IPs, VIPs, and tagged VLANs.
Primary network — The bonded interface (bond0) carrying node management IPs, VIPs, and (via tagged VLANs) backup, NFS, SMB, S3, and replication traffic. 10 GbE is the production minimum.
IPMI (Intelligent Platform Management Interface) — The out-of-band hardware management interface on each Cohesity node. 1 GbE, on a dedicated OOB management VLAN, using TCP/UDP 623 plus 80/443 for Redfish/web UI. Must be kept off the primary network.

Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations

A Cohesity cluster spends roughly 30 minutes being born and the rest of its life in maintenance mode. The CCAE exam reflects that ratio: it expects you to bootstrap a cluster cleanly the first time, then operate it for years without unplanned downtime. This chapter walks through the full lifecycle: imaging and bootstrapping a physical cluster with iris_cli, deploying Virtual Edition (VE) on VMware, deploying Cloud Edition (CE) on AWS and Azure, and then performing the Day-2 work — rolling upgrades, node additions, disk replacements, and automation via REST, PowerShell, Ansible, and Terraform.

Learning Objectives

By the end of this chapter, you will be able to:

Bootstrap a new physical, virtual, or cloud Cohesity cluster end-to-end using both the web wizard and iris_cli.
Apply best-practice configuration (DNS, NTP, AD/SSO, licensing, VIPs) to a brownfield enterprise environment.
Perform Day-2 operations including rolling upgrades, node additions/removals, and disk/node replacement procedures.
Use the Cohesity CLI (iris_cli), REST API v1/v2, Helios API, the PowerShell module, the Ansible collection, and the Terraform provider to automate cluster lifecycle.
Choose appropriately between physical, Virtual Edition, and Cloud Edition deployment models for a given architecture requirement.

Bootstrapping a New Cluster

A Cohesity cluster begins life as a stack of imaged but unconfigured nodes. Bootstrap is the act of giving the first node an IP, telling it about its peers, and forming a quorum. Until quorum forms, there is no SpanFS, no Bridge service accepting backups, and no Helios footprint — just a handful of Linux boxes waiting for instructions.

Out-of-the-Box Experience and First-Time Wizard

When a Cohesity-branded appliance, ReadyNode, or certified partner platform arrives from the factory, each node ships pre-imaged with the Cohesity OS but without IP addresses. The bootstrap entry point is IPMI (Intelligent Platform Management Interface), the out-of-band management controller every server provides. The architect or field engineer connects to each node’s IPMI IP — typically supplied on a sticker by the hardware vendor — powers the node on, and verifies the Cohesity imaging splash screen [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].

Each node carries a unique node-ID (e.g., 181140266786854) printed on the IPMI console or visible at the boot screen. Architects must record these IDs because they are required arguments to iris_cli cluster create when nodes do not yet have IPs assigned [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].

Analogy: Think of the node-ID like a baby’s hospital wristband. Before the cluster gives the node a name and an IP “address” to live at, the node-ID is the only handle that uniquely identifies which physical box you mean.

IP Allocation and Partition Creation

The first node is the bootstrap target. Console into it (via IPMI or directly) and run the network configuration script:

cd /home/cohesity/bin/network
./configure_network.sh

The script prompts for:

Bond selection: bond0 is the default node-to-node bond (used by bridge0); bond1 is reserved for redundancy or a separate management/storage plane [Source: https://www.youtube.com/watch?v=oXFVOG6AO0o].
IP, prefix, gateway: a routable address on the management network (e.g., 10.1.4.16/20, gateway 10.1.0.1).
MTU: 1500 by default; raise to 9000 for jumbo frames if the switch fabric supports them end-to-end.
DNS servers: required for cluster name resolution and AD integration.

Activating the changes resets interfaces and incurs roughly 30–60 seconds of downtime on the bootstrap node — fine, since the cluster is not yet serving traffic. As an alternative to the script, architects can use iris_cli directly after logging in (default credentials: admin/admin):

iris_cli interface list
iris_cli node status

These commands enumerate physical interfaces and bond memberships so the architect can confirm that bond0 is up before pushing the configuration [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].

A key exam-relevant fact: partitions are created automatically during cluster formation. There is no manual partitioning step before iris_cli cluster create. Cluster partitions logically group nodes for VIP segregation and tenancy, but the initial partition is allocated as part of the create call [Source: http://pdamien58.blogspot.com/2016/01/cohesity-initial-cluster-setup.html].

Joining Nodes and Forming Quorum

Once the first node has an IP, it can see its peers on the same L2 domain via Cohesity’s discovery protocol. Run:

iris_cli discover free-nodes

This lists all imaged-but-uncommitted nodes the bootstrap node can reach. Architects review the list and select which nodes to enroll — important when, for example, a 4-node chassis is being split into two 2-node clusters [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].

With the free-node list in hand, the cluster is created in a single iris_cli invocation:

iris_cli cluster create \
  domain-names=eng.cohesity.com \
  ntp-servers=pool.ntp.org \
  name="haswell2" \
  hostname=haswell2.eng.cohesity.com \
  subnet-gateway=10.1.0.1 \
  subnet-mask=255.255.240.0 \
  dns-server-ips=10.2.0.1 \
  node-ips=10.1.4.16,10.1.4.17,10.1.4.18 \
  node-ids=181140266786854,181140264583348,181140264822986 \
  vips=10.1.4.20,10.1.4.21,10.1.4.22 \
  enable-encryption=true

The command bundles every prerequisite into one atomic operation: domain, NTP, DNS, gateway, mask, node IPs, node IDs, VIPs, and encryption posture [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf]. If nodes have already been assigned IPs (a common brownfield pattern when the network team pre-stages addresses), the node-ids= parameter can be omitted; Cohesity discovers the nodes by IP instead [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].
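
Because the create call is a single shot, it can help to sanity-check the arguments before running it. The following Python sketch is one possible pre-flight check, not part of iris_cli: the `preflight` function and its rules are assumptions made for illustration, and the sample values are the ones from the command above.

```python
import ipaddress


def preflight(node_ips, node_ids, vips, gateway, netmask):
    """Return a list of problems found in the planned cluster-create arguments."""
    problems = []
    # Derive the subnet from the gateway address and the mask.
    network = ipaddress.ip_network(f"{gateway}/{netmask}", strict=False)
    for label, addrs in (("node IP", node_ips), ("VIP", vips)):
        for addr in addrs:
            if ipaddress.ip_address(addr) not in network:
                problems.append(f"{label} {addr} is outside {network}")
    # node-ids may be omitted when nodes already have IPs; if given, counts must match.
    if node_ids and len(node_ids) != len(node_ips):
        problems.append("node-ids and node-ips counts differ")
    overlap = set(node_ips) & set(vips)
    if overlap:
        problems.append(f"addresses used as both node IP and VIP: {sorted(overlap)}")
    if len(set(node_ips)) != len(node_ips) or len(set(vips)) != len(vips):
        problems.append("duplicate addresses supplied")
    return problems


issues = preflight(
    node_ips=["10.1.4.16", "10.1.4.17", "10.1.4.18"],
    node_ids=["181140266786854", "181140264583348", "181140264822986"],
    vips=["10.1.4.20", "10.1.4.21", "10.1.4.22"],
    gateway="10.1.0.1",
    netmask="255.255.240.0",
)
print(issues or "arguments look consistent")
```

With the mask 255.255.240.0 the subnet is 10.1.0.0/20, so the gateway, node IPs, and VIPs in the example all share one network and the check reports no problems.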

Quorum forms once a majority of nodes acknowledge the create, but every node named in the command must be reachable at that moment. For a 3-node cluster, that means all three must be present; the cluster cannot tolerate a missing node during initial formation, only after quorum is established and SpanFS metadata replicas are placed.
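
The majority rule mentioned above reduces to simple arithmetic. The sketch below assumes a plain majority quorum and illustrates steady-state tolerance only. It does not model the all-nodes-present requirement at initial formation, nor the resiliency settings (RF or EC) that also bound how many failures data can survive.

```python
def majority(n_nodes: int) -> int:
    """Smallest number of nodes that constitutes a strict majority."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """Nodes that can be lost while a majority remains."""
    return n_nodes - majority(n_nodes)


for n in (3, 4, 5):
    print(f"{n} nodes: majority {majority(n)}, tolerates {tolerated_failures(n)}")
```

Note that a fourth node adds no quorum tolerance over three; the tolerated count rises only at five.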

Exam tip: Multicast is frequently disabled in enterprise networks. Cohesity supports a unicast-discovery path; the architect must ensure the same node count, VIP count, and IPMI count are supplied and that ARP-based peer discovery works on the bootstrap subnet [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].

Initial AD/SSO and Licensing Setup

With quorum formed, the cluster boots its services and presents a web UI on each node IP and on each VIP. The architect’s first post-create tasks are:

Verify health: iris_cli cluster status should show all nodes Online, services Up, and SpanFS mounted.
DNS registration: add A records for the cluster FQDN that round-robin across all VIPs (e.g., haswell2.eng.cohesity.com -> 10.1.4.20, 10.1.4.21, 10.1.4.22) [Source: https://cypresscollege.a2hosted.com/files/ISER-Evidence/IIC-Student-Support/IIC8/IIC8-15_ISDataProtectionCohesityPlatform.pdf].
License application: upload the entitlement file (capacity-based or subscription) under Settings > Licensing.
Active Directory join: bind the cluster to the AD forest for user authentication; map AD groups to built-in roles (Admin, Operator, Viewer).
SSO/SAML: optionally federate with Okta, Azure AD, or Ping for browser-based login.
Helios registration: link the cluster to Helios to enable global dashboards, multi-cluster reports, and SaaS-only features (DataHawk, FortKnox, SiteContinuity orchestration).

A classic brownfield pitfall is forgetting reverse DNS. Cohesity uses PTR records during AD join and Kerberos ticket validation; missing PTRs cause AD join failures that look mysterious until the architect runs dig -x <cluster IP>.
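
The PTR check can also be scripted across every node IP and VIP before the AD join. A minimal Python sketch using only the standard library; the helper names are illustrative, and `socket.gethostbyaddr` uses the local resolver, so a hosts-file entry can mask a missing DNS record.

```python
import ipaddress
import socket


def ptr_name(ip: str) -> str:
    """Return the reverse-lookup name that must hold a PTR record for ip."""
    return ipaddress.ip_address(ip).reverse_pointer


def has_ptr(ip: str) -> bool:
    """True if the local resolver returns a hostname for ip."""
    try:
        socket.gethostbyaddr(ip)
        return True
    except OSError:
        return False


print(ptr_name("10.1.4.20"))  # 20.4.1.10.in-addr.arpa
```

Looping `has_ptr` over the planned addresses and printing `ptr_name` for each failure gives the DNS team the exact records to create.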

Figure 5.1: Bootstrap workflow from IPMI power-on to AD/SSO integration

flowchart TD
    A[Power on imaged nodes via IPMI] --> B[Console to first node]
    B --> C[Run configure_network.sh<br/>set bond0 IP, mask, gateway, DNS, MTU]
    C --> D[iris_cli discover free-nodes]
    D --> E{Select nodes<br/>to enroll}
    E --> F[iris_cli cluster create<br/>domain, NTP, DNS, node-ips, node-ids, VIPs]
    F --> G[Quorum forms<br/>initial partition auto-created]
    G --> H[iris_cli cluster status<br/>verify health]
    H --> I[DNS A/PTR records for VIPs]
    I --> J[Apply license]
    J --> K[Active Directory join<br/>map AD groups to roles]
    K --> L[Optional SAML/SSO federation]
    L --> M[Helios registration]

Key Takeaway: Bootstrap is a one-shot, atomic operation: imaged nodes get IPs, the first node discovers peers via iris_cli discover free-nodes, and a single iris_cli cluster create call passes domain, NTP, DNS, gateway, mask, node IPs/IDs, and VIPs to form quorum and auto-create the initial partition.

Virtual and Cloud Edition Deployment

Not every Cohesity cluster lives on Cohesity-branded steel. Virtual Edition (VE) packages the same SpanFS stack as a virtual appliance for VMware ESXi, Hyper-V, Nutanix AHV, and KVM. Cloud Edition (CE) packages it for AWS, Azure, and GCP. The deployment mechanics differ, but the post-deploy operational model (Helios registration, policies, protection groups) is identical.

VE Prerequisites on VMware and Hyper-V

Virtual Edition ships as an OVA (VMware) or VHDX (Hyper-V) and runs as a single-node or multi-node cluster. A VE node consumes considerably fewer resources than its physical counterpart but still has hard floors: typically 8+ vCPU, 64+ GB RAM, a dedicated performance-tier disk on flash, and a separate capacity-tier disk on HDD or capacity SSD [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/].

Critical sizing constraints unique to VE:

Single-node VE expands by disk resize, not node addition. To add capacity, the architect shuts down the VM, extends the capacity-tier disk in the hypervisor (never the boot disk), powers on, and runs:
```
iris_cli cluster stop
iris_cli node list disk extend
iris_cli disk list
iris_cli cluster start
```
The sequence completes in minutes and adds the new capacity to SpanFS [Source: https://www.youtube.com/watch?v=BYM4u4NfvaI] [Source: http://demitasse.co.nz/2018/12/expanding-storage-on-a-cohesity-virtual-edition-appliance/].
Multi-node VE clusters (3+ nodes) behave like physical clusters: iris_cli discover free-nodes and iris_cli cluster create are used identically.
Anti-affinity rules must be configured in vSphere DRS so VE nodes never co-reside on the same ESXi host; otherwise a single host failure takes the cluster down.
VMDK provisioning must be Eager Zeroed Thick on the performance tier to guarantee NVRAM-equivalent commit latency. Thin-provisioned performance disks introduce write-stall behavior under load.
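
The anti-affinity requirement above can be audited from any inventory export that maps VMs to hosts. The Python sketch below assumes such a mapping is already available as a dictionary; the node and host names are hypothetical, and collecting the inventory from vCenter is out of scope here.

```python
from collections import defaultdict


def colocated_nodes(placement: dict[str, str]) -> dict[str, list[str]]:
    """Map each ESXi host running more than one VE node to those nodes."""
    by_host = defaultdict(list)
    for ve_node, host in placement.items():
        by_host[host].append(ve_node)
    return {h: sorted(v) for h, v in by_host.items() if len(v) > 1}


# Hypothetical inventory: VE node name -> ESXi host it currently runs on.
placement = {"ve-1": "esx-a", "ve-2": "esx-b", "ve-3": "esx-a"}
print(colocated_nodes(placement))  # {'esx-a': ['ve-1', 've-3']}
```

An empty result means no two VE nodes share a host at the moment of the check. Only a DRS rule keeps it that way after the next vMotion.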

Cohesity Cloud Edition on AWS, Azure, GCP

Cloud Edition is the same software re-packaged as cloud VMs. It runs as a minimum 3-node production cluster (single-node CE is supported only for lab use) and registers to Helios for management.

Azure deployment uses the Cohesity Marketplace image with a documented VM size of Standard_DS5_v2 (16 vCPU per node). A 3-node minimum cluster therefore consumes 48 vCPU cores; new Azure subscriptions often default to a 10-core regional quota, so the architect’s first call is to request a quota increase before deploying [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/]. Azure resource-group caps allow up to 64 VMs per resource group, which sets an effective ceiling on a single CE cluster’s node count [Source: https://docs.cohesity.com/baas/data-protect/azure-vm/azure-prereq.htm].
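
The quota arithmetic above is worth scripting when several clusters share a subscription. A minimal Python sketch, assuming the 16-vCPU node size cited above; the function name and the `in_use` parameter are illustrative.

```python
VCPUS_PER_NODE = 16  # Standard_DS5_v2, as cited above


def quota_shortfall(nodes: int, regional_quota: int, in_use: int = 0) -> int:
    """vCPU cores that must be requested before the cluster can deploy (0 if none)."""
    needed = nodes * VCPUS_PER_NODE
    available = regional_quota - in_use
    return max(0, needed - available)


print(quota_shortfall(nodes=3, regional_quota=10))  # 38
```

A 3-node cluster against a 10-core quota is 38 cores short. Azure also enforces quotas per VM family, so check that limit alongside the regional total.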

Azure Managed Disks must be sized in multiples of 1 MB. The capacity tier is resizable in place (e.g., 594 GiB → 700 GiB, yielding ~690 GiB usable after formatting), while the performance tier is provisioned on premium SSD and is not normally resized [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/].

AWS deployment uses the Cohesity DataPlatform Cloud Edition AMI from the AWS Marketplace [Source: https://aws.amazon.com/marketplace/pp/prodview-m3tozzczpsmqe]. Sizing approximates the m5/m6i large-instance family; specific recommendations appear in the Cohesity & AWS Solution Brief [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-and-AWS-Solution-Brief.pdf]. AWS CE supports S3 tiering for overflow capacity, allowing the cluster to spill cold blocks to lower-cost object storage [Source: https://www.cohesity.com/blogs/deploying-cohesity-dataplatform-cloud-edition-aws/].

Networking prerequisites (AWS and Azure): a dedicated subnet, security-group rules permitting inter-node traffic, a VPN or Direct Connect/ExpressRoute back to on-premises for replication, and IAM roles granting the cluster permission to read/write S3 (AWS) or Blob Storage (Azure) for CloudArchive and CloudTier targets.

Robo Edition and Edge Deployments

Robo Edition is a thin VE variant tuned for remote/branch offices (ROBO). A typical Robo cluster is a single-node VE running on existing branch-office hypervisor capacity and configured to replicate to a central CE or physical hub cluster. Robo Edition trades clustered resiliency (no SpanFS quorum across nodes) for a tiny footprint and manages risk by replicating critical backups upstream within minutes.

Comparing Deployment Models

Attribute	Physical (Branded/ReadyNode)	Virtual Edition (VE)	Cloud Edition (CE)
Form factor	1U/2U appliance	OVA/VHDX virtual appliance	Cloud Marketplace image (AMI/Azure)
Min nodes (production)	3	3 (single-node = lab/Robo only)	3
Bootstrap entry point	IPMI + console	Hypervisor console	Cloud console + SSH
Node-ID source	Hardware sticker / IPMI	OVF deployment	VM metadata
Capacity expansion	Add nodes / disks	Resize VMDK, then `disk extend`	Add VMs or resize managed disks
Networking	Bonded NICs, VLAN trunks	vSwitch port group, MTU	VPC/VNet subnet, security groups
Storage backing	NVMe (perf) + HDD/SSD (cap)	VMDK on flash + capacity datastore	Premium SSD + standard SSD/HDD
External overflow	Local only	Local only	S3 / Blob tier supported
Helios	Optional	Optional	Strongly recommended
Typical use case	Primary on-prem fabric	Lab, ROBO, dev/test	Cloud DR, cloud-native protection

Figure 5.2: Deployment form factor matrix across Physical, VE, Cloud Edition, and Robo Edition

flowchart LR
    subgraph Physical[Physical / ReadyNode]
        P1[1U or 2U appliance]
        P2[IPMI bootstrap]
        P3[NVMe perf + HDD/SSD cap]
    end
    subgraph VE[Virtual Edition]
        V1[OVA / VHDX]
        V2[VMware, Hyper-V, AHV, KVM]
        V3[Eager Zeroed Thick VMDKs<br/>DRS anti-affinity]
    end
    subgraph CE[Cloud Edition]
        C1[AWS AMI / Azure Marketplace]
        C2[3-node minimum production]
        C3[S3 / Blob tiering]
    end
    subgraph Robo[Robo Edition]
        R1[Single-node VE variant]
        R2[Branch office hypervisor]
        R3[Replicates to hub cluster]
    end
    Physical -->|Primary on-prem fabric| Hub[Helios Fleet Management]
    VE -->|Lab, dev/test| Hub
    CE -->|Cloud DR + cloud-native| Hub
    Robo -->|Edge protection| Hub

Key Takeaway: Physical, VE, and CE share the same iris_cli bootstrap mechanics but differ in their resiliency model: physical and CE use clustered SpanFS; single-node VE expands via in-place disk resize and depends on hypervisor-level HA for availability.

Day-2 Operations

Once the cluster is alive and protecting data, the architect’s job pivots from creation to stewardship. Day-2 operations cover the recurring activities: software upgrades, capacity expansions, hardware replacement, and proactive health management.

Cluster Upgrades and Rolling Reboots

Cohesity ships a one-click rolling upgrade. The cluster pulls candidate releases automatically from Cohesity’s public release service — there is no manual download step — and the architect chooses a target version from a UI-filtered list of compatible packages [Source: https://www.cohesity.com/blogs/cohesity-cluster-upgrades/].

The mechanics are elegant. A distributed lock manager hands a single token from node to node:

The token-holder pauses local services and migrates its VIPs to peers.
Active client connections continue against the surviving VIPs (UI sessions, SMB mounts, NFS exports, ongoing backups).
The node atomically swaps its active and passive root partitions and reboots into the new image.
Once healthy, it releases the token to the next node.

Backups, replication, and indexing keep running throughout, and RPO/RTO/security posture is preserved [Source: https://www.cohesity.com/blogs/cohesity-cluster-upgrades/]. The atomic root-partition swap also enables fast rollback at the boot level: if a node fails to come up on the new image, it boots back into the previous root.
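
The token-passing steps above can be modeled in a few lines to show why client-facing VIPs stay reachable. This Python sketch is a toy simulation under stated assumptions, not Cohesity's implementation: VIPs are spread round-robin across peers, and they implicitly return to the upgraded node once it is healthy.

```python
def rolling_upgrade(vips_by_node: dict[str, list[str]], target: str) -> dict[str, str]:
    """Simulate a one-node-at-a-time upgrade and return each node's final version."""
    if len(vips_by_node) < 2:
        raise ValueError("a rolling upgrade needs at least one peer to hold VIPs")
    versions = {node: "old" for node in vips_by_node}
    all_vips = {vip for vips in vips_by_node.values() for vip in vips}
    for holder in vips_by_node:  # the token visits each node exactly once
        peers = [n for n in vips_by_node if n != holder]
        # The token-holder migrates its VIPs to peers (round-robin here).
        live = {n: list(vips_by_node[n]) for n in peers}
        for i, vip in enumerate(vips_by_node[holder]):
            live[peers[i % len(peers)]].append(vip)
        # While the holder reboots, every VIP must still be served by a peer.
        hosted = {vip for vips in live.values() for vip in vips}
        if hosted != all_vips:
            raise RuntimeError("a VIP went dark during the upgrade")
        versions[holder] = target  # root-partition swap and reboot
    return versions


cluster = {"node1": ["10.1.4.20"], "node2": ["10.1.4.21"], "node3": ["10.1.4.22"]}
print(rolling_upgrade(cluster, "new"))
```

Because only the token-holder is ever out of rotation, the set of hosted VIPs never shrinks, which is the property that keeps SMB, NFS, and UI sessions alive during the upgrade.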

Figure 5.3: Rolling upgrade token-passing sequence across cluster nodes

sequenceDiagram
    participant Helios
    participant Cluster as Cluster Coordinator
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    Helios->>Cluster: Initiate upgrade to target version
    Cluster->>Cluster: Run pre-upgrade checks
    Cluster->>N1: Hand upgrade token
    N1->>N2: Migrate VIPs to peers
    N1->>N1: Pause services, swap root partition, reboot
    N1->>Cluster: Healthy on new image, release token
    Cluster->>N2: Hand upgrade token
    N2->>N3: Migrate VIPs to peers
    N2->>N2: Pause, swap root partition, reboot
    N2->>Cluster: Healthy, release token
    Cluster->>N3: Hand upgrade token
    N3->>N1: Migrate VIPs to peers
    N3->>N3: Pause, swap root partition, reboot
    N3->>Cluster: Healthy, release token
    Cluster->>Helios: Upgrade complete, all nodes on new version

Analogy: A Cohesity rolling upgrade is like a Roomba working through a house full of guests. It handles one room (node) at a time while the party (client I/O) carries on in every other room. The Roomba doesn’t ask everyone to leave the house; it just works around them.

The 7.x UI introduces explicit pre-upgrade checks under the Upgrade tab and supports uploading a CRL (Certificate Revocation List) file when needed [Source: https://www.youtube.com/watch?v=-HjnGFgU_uA]. Architects should always:

Open Platform > Cluster, confirm green health.
Run pre-upgrade checks; fix any flagged issues (clock drift, certificate expiry, disk warnings).
Initiate the upgrade from Platform > Admin > Upgrade Cluster.
After completion, verify with iris_cli cluster get-version and spot-check a few backup jobs and a restore.

A noteworthy historical event: Cohesity 6.8.2_u1 migrated the underlying OS from CentOS 7.9 to RHEL 7.9 because of CentOS’s June 30, 2024 EOL. The upgrade was self-driven and low-risk because RHEL 7.9 is binary-compatible with CentOS 7.9 [Source: https://www.xiologix.com/20240701-cohesity-rhel-update/].

Adding and Removing Nodes

Cluster expansion is symmetric to bootstrap. New nodes are imaged, racked, cabled, and presented to the cluster via iris_cli discover free-nodes. The expansion command (UI-driven or CLI) appends the discovered nodes to the existing cluster, after which SpanFS rebalances chunk placement to spread load.

Removing a node is a multi-stage drain:

Mark the node for removal in the UI or via iris_cli.
SpanFS migrates chunk replicas off the node to maintain Replication Factor (RF) or Erasure Coding (EC) parity.
Once drained, the node leaves the quorum and can be physically removed.

Architects must size for rebuild headroom: when a node fails or is removed, the cluster needs free capacity equal to the failed node’s data footprint to re-protect blocks. Without headroom, the cluster operates with degraded resiliency until capacity is added.
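
The headroom rule above can be checked per node from capacity figures. The Python sketch below is a simplified model with hypothetical numbers: it compares free space on the surviving nodes with the failed node's used space and ignores RF/EC overhead, deduplication, and any reserved capacity.

```python
def can_reprotect(used_tib: dict[str, float], usable_tib: dict[str, float], failed: str) -> bool:
    """True if the surviving nodes have enough free space to absorb the failed node's data."""
    free_on_survivors = sum(
        usable_tib[n] - used_tib[n] for n in used_tib if n != failed
    )
    return free_on_survivors >= used_tib[failed]


# Hypothetical 4-node cluster with 100 TiB usable per node.
usable = {f"node{i}": 100.0 for i in range(1, 5)}
at_70 = {n: 70.0 for n in usable}
at_80 = {n: 80.0 for n in usable}
print(can_reprotect(at_70, usable, "node1"))  # True: 90 TiB free vs 70 TiB to re-protect
print(can_reprotect(at_80, usable, "node1"))  # False: 60 TiB free vs 80 TiB to re-protect
```

In this model a 4-node cluster loses its ability to self-heal somewhere between 70% and 80% utilization, which is why the threshold belongs in the sizing exercise rather than in an incident review.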

Disk and Node Replacement Procedures

Disk failures are common; node failures are rare. Both are handled via Cohesity’s hardware replacement workflow:

Disk replacement: the failed disk is identified via a Helios alert or iris_cli cluster status and located in the chassis by its fault LED. A field engineer hot-swaps the disk; the cluster auto-formats it, integrates it into SpanFS, and rebuilds chunk replicas onto it.
Node replacement: more involved. The replacement node arrives pre-imaged at the same OS version. The architect drains the failed node, physically swaps it, and uses iris_cli (or the UI) to re-add the new node. SpanFS rebuilds replicas; rebuild time scales with node capacity and cluster network throughput.

For appliance customers, the Cohesity Hardware Refresh Service handles tech-refresh cycles (5-7 year lifespan typical) by overlapping old and new clusters during data migration [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-hardware-refresh-service-data-sheet-en.pdf].

Health Checks and Proactive Support

Cohesity emits a daily Heartbeat log bundle that uploads to Cohesity Support over HTTPS. Heartbeat includes cluster configuration, service health, recent alerts, and capacity metrics — enough for support engineers to triage proactively without requesting fresh log bundles. Helios then surfaces alerts globally; the operator’s role is to triage alerts in priority order:

Severity	Examples	Action
Critical	Quorum loss, full capacity, multiple disk failures	Page on-call; engage support
Major	Single disk failure, replication lag, license expiry	Same-day investigation
Warning	Job failures, certificate expiry < 30 days	Plan resolution
Info	Successful upgrade, scheduled maintenance	Acknowledge
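
Triage in priority order is straightforward to automate once alerts are exported. A minimal Python sketch, assuming each alert is a dictionary with a `severity` field matching the table above; the alert structure and sample names are hypothetical, not the Helios schema.

```python
SEVERITY_ORDER = {"Critical": 0, "Major": 1, "Warning": 2, "Info": 3}


def triage(alerts: list[dict]) -> list[dict]:
    """Sort alerts so the most severe come first; unknown severities go last."""
    return sorted(
        alerts,
        key=lambda a: SEVERITY_ORDER.get(a["severity"], len(SEVERITY_ORDER)),
    )


alerts = [
    {"name": "certificate expires in 20 days", "severity": "Warning"},
    {"name": "quorum loss", "severity": "Critical"},
    {"name": "upgrade succeeded", "severity": "Info"},
    {"name": "single disk failure", "severity": "Major"},
]
for alert in triage(alerts):
    print(alert["severity"], "-", alert["name"])
```

Python's sort is stable, so alerts of equal severity keep their original (for example, chronological) order.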

Key Takeaway: Day-2 is dominated by rolling upgrades (one-click, zero-downtime via VIP failover and atomic root-partition swaps), node and disk replacement (drain-rebuild cycles requiring rebuild-capacity headroom), and proactive monitoring through the Heartbeat log bundle and Helios alerts.

Automation and APIs

Click-ops works for one cluster. At fleet scale (dozens of clusters across regions and clouds), every operation must be expressible as code. Cohesity exposes several programmable interfaces, each with its own audience and idioms.

iris_cli Command Groups

iris_cli is the on-cluster shell binary. Architects invoke it from any node’s CLI (or remotely via SSH) for tasks where the UI is too slow or scripting is required. Major command groups include:

Group	Purpose	Example
`cluster`	Bootstrap, status, version	`iris_cli cluster get-version`
`node`	Per-node operations	`iris_cli node list disk extend`
`disk`	Disk inventory, extend	`iris_cli disk list`
`interface`	Network config	`iris_cli interface list`
`vlan`	VLAN/VIP management	`iris_cli vlan list`
`partition`	Cluster partition admin	`iris_cli partition list`
`protection-job`	Backup job control	`iris_cli protection-job list`
`recovery`	Restore operations	`iris_cli recovery list`

The CLI Reference Guide remains the authoritative source, with versioned PDFs published per release (e.g., 7.3.2) [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].

REST API v1 vs. v2 and Helios APIs

Cohesity exposes three REST surface areas:

REST API v1: cluster-local, organized around legacy resource names (jobs, runs, sources). Still in heavy use; some endpoints have no v2 equivalent.
REST API v2: cluster-local, redesigned around object-oriented resources (protection groups, recoveries, sources). Preferred for new automation.
Helios API: SaaS-side, scoped to a customer tenant and aggregated across all registered clusters. Enables fleet-level operations (create a policy across 50 clusters) and is the only path to Helios-only features (DataHawk, FortKnox, SiteContinuity orchestration).

Authentication patterns differ: cluster APIs accept username/password or API key against the cluster directly; Helios uses a per-tenant API key issued from the Helios console and includes a clusterId filter for fan-out calls.

Architects choose v2 for new code, v1 only when an endpoint is missing in v2, and Helios for any cross-cluster orchestration.
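The selection rule above can be expressed as a small decision function. This is a minimal sketch; `ApiSurface` and `choose_api_surface` are hypothetical names for illustration and are not part of any Cohesity SDK.

```python
from enum import Enum


class ApiSurface(Enum):
    V1 = "REST API v1"
    V2 = "REST API v2"
    HELIOS = "Helios API"


def choose_api_surface(cross_cluster: bool, available_in_v2: bool) -> ApiSurface:
    """Helios for cross-cluster work, v2 by default, v1 only as a fallback."""
    if cross_cluster:
        return ApiSurface.HELIOS
    if available_in_v2:
        return ApiSurface.V2
    # The endpoint has no v2 equivalent, so fall back to the legacy surface.
    return ApiSurface.V1


if __name__ == "__main__":
    print(choose_api_surface(cross_cluster=False, available_in_v2=True).value)
```

Encoding the rule once keeps every automation script in a fleet making the same choice, which simplifies a later migration off v1.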

PowerShell Module and Ansible Collection

The Cohesity PowerShell module wraps both v1 and v2 endpoints with idiomatic PowerShell verbs (Get-CohesityProtectionJob, New-CohesityRecoveryRequest). Windows-centric teams often standardize on it for Windows Server, SQL, and M365 backup automation.

The Cohesity Ansible collection delivers the same automation as Ansible modules consumed in playbooks. It fits naturally into Linux-heavy and infrastructure-as-code shops where Ansible already orchestrates VMware, AD, and network changes.

A representative Ansible task:

- name: Create VMware protection group
  cohesity.dataprotect.cohesity_protection_group:
    cluster: "{{ cohesity_vip }}"
    username: "{{ cohesity_user }}"
    password: "{{ cohesity_pass }}"
    name: "tier1-vms-daily"
    environment: "VMware"
    policy: "Gold"
    sources:
      - vcenter01
    vm_tags:
      - "tier:1"
    state: present

Terraform Provider for Cohesity

The Terraform provider treats Cohesity resources (policies, protection groups, sources, views, RBAC roles) as declarative state. It is the right tool when:

Backup configuration must be reproducible across dev/test/prod environments.
Infrastructure pipelines (Terraform Cloud, GitHub Actions, GitLab CI) gate Cohesity changes alongside VMware and cloud changes.
An audit trail of “who changed which protection policy when” is required — Terraform commits live in Git.

A typical Terraform stanza:

resource "cohesity_protection_policy" "gold" {
  name = "Gold"
  retention {
    unit     = "Days"
    duration = 30
  }
  incremental_schedule {
    unit      = "Hours"
    frequency = 4
  }
  full_schedule {
    unit = "Weeks"
    day  = "Sunday"
  }
}

Figure 5.4: Cohesity automation stack layered above REST API surfaces

graph TD
    subgraph Tools[Operator-Facing Automation]
        TF[Terraform Provider<br/>declarative state]
        ANS[Ansible Collection<br/>playbook tasks]
        PS[PowerShell Module<br/>Windows-centric]
        IRIS[iris_cli<br/>on-cluster shell]
    end
    subgraph APIs[REST Surface Areas]
        V1[REST API v1<br/>legacy resources]
        V2[REST API v2<br/>object-oriented]
        HEL[Helios API<br/>fleet-wide tenant scope]
    end
    subgraph Targets[Cluster Targets]
        CL1[Cluster A]
        CL2[Cluster B]
        CL3[Cluster C]
    end
    TF --> V2
    TF --> HEL
    ANS --> V2
    ANS --> V1
    PS --> V2
    PS --> V1
    IRIS --> CL1
    V1 --> CL1
    V1 --> CL2
    V2 --> CL1
    V2 --> CL2
    V2 --> CL3
    HEL --> CL1
    HEL --> CL2
    HEL --> CL3

Analogy: Pick the API like you pick a kitchen tool. iris_cli is the chef’s knife — sharp, fast, on-cluster. REST is the food processor — bulk operations from outside. PowerShell is the rice cooker for Windows shops. Ansible and Terraform are the meal-prep system for the whole week.

Key Takeaway: Cohesity’s automation stack pairs iris_cli (on-cluster shell) with REST APIs (v1 legacy, v2 modern, Helios fleet-wide), and adds PowerShell, Ansible, and Terraform wrappers so that each team can match the tool to its configuration-management style.

Chapter Summary

Bootstrapping a Cohesity cluster is a tightly choreographed sequence: image nodes, access them via IPMI, configure the first node’s network with configure_network.sh or iris_cli, discover peers with iris_cli discover free-nodes, and atomically form the cluster (and its first partition) with iris_cli cluster create. The same mechanics apply across physical, Virtual Edition, and Cloud Edition deployments, but each form factor brings unique constraints — VMDK provisioning and DRS anti-affinity for VE, Azure core quotas and managed-disk multiples for CE, AWS Marketplace AMIs and S3 tiering for AWS CE.

Day-2 operations rely on Cohesity’s hallmark rolling upgrade: a distributed lock manager serializes per-node reboots, VIPs migrate so backups and replication never pause, and atomic root-partition swaps allow fast rollback. Capacity grows by adding nodes (or by disk extend on single-node VE), hardware replacement follows drain-rebuild semantics, and the daily Heartbeat log bundle plus Helios alerts give operators proactive visibility.

For automation at scale, architects choose among iris_cli, REST API v1/v2, the Helios API, PowerShell, Ansible, and Terraform — matching the tool to the team’s existing operating model rather than imposing a new one. The CCAE exam expects fluency across all six.

Key Terms

Bootstrap: the one-shot operation that gives nodes IPs, discovers peers, and forms the initial cluster quorum and partition via iris_cli cluster create.
iris_cli: Cohesity’s on-cluster command-line shell, organized into command groups (cluster, node, disk, interface, vlan, partition, protection-job, recovery) and the authoritative scripting interface for cluster operations.
Virtual Edition (VE): Cohesity SpanFS packaged as a virtual appliance (OVA/VHDX) for VMware, Hyper-V, AHV, KVM; runs as single-node (lab/ROBO, expand by disk resize) or 3+ node clusters.
Cloud Edition (CE): Cohesity SpanFS packaged as cloud Marketplace VMs for AWS, Azure, GCP; minimum 3 nodes for production, with cloud-specific sizing constraints (e.g., Azure Standard_DS5_v2, 64 VMs/resource group).
Helios API: SaaS-side REST surface scoped to a customer tenant, enabling fleet-wide operations across all registered clusters and exclusive access to DataHawk, FortKnox, and SiteContinuity orchestration.
Rolling upgrade: Cohesity’s zero-downtime upgrade mechanism. A distributed lock token serializes per-node reboots while VIPs migrate to keep services online; atomic active/passive root-partition swaps enable fast rollback.
Brick: a node-level fault domain used in chassis-aware placement; in dense multi-node-per-chassis hardware, brick mode tells SpanFS to treat each node as an independent fault domain so chunk replicas survive a chassis power failure.

Chapter 6: Identity, Access Management, and Multi-Tenancy

Securing a Cohesity DataPlatform is fundamentally a problem of identity. Who can log in? With what privileges? Against which resources? In a service-provider deployment, how do you guarantee that Tenant A cannot even see Tenant B’s backups? This chapter examines how Cohesity authenticates users (local, AD/LDAP, SAML SSO, MFA, API keys), authorizes them through role-based access control (RBAC) layered with access scopes, and how Organizations, View Boxes, and per-tenant VLANs combine to deliver multi-tenancy on a shared cluster.

The CCAE exam emphasizes architecture-level decisions: when to share a Storage Domain, when to dedicate one, which SAML attribute Cohesity uses for role mapping, and what AD FS cannot do that Okta and Azure AD can. The pitfalls are subtle and silent — we will mark each one explicitly.

Learning Objectives

By the end of this chapter you will be able to:

Design RBAC models using Cohesity’s built-in roles and custom roles, layered with access scopes for least-privilege enforcement.
Integrate Cohesity DataProtect and Helios with Active Directory, LDAP, and SAML 2.0 identity providers including Microsoft Entra ID (Azure AD), Okta, AD FS, and Ping.
Architect multi-tenant deployments using Organizations, View Box (Storage Domain) isolation, per-tenant VLANs, and hierarchical Organizations.
Apply the principle of least privilege across operators, tenant administrators, automation service accounts, and API consumers.
Recognize and avoid common pitfalls — Login-vs-Email attribute precedence, nested-AD-group limitations, AD FS signed-auth-request restrictions, and silent SSO login rejections caused by missing default roles.

Authentication Sources

Authentication answers “are you who you claim to be?” Cohesity supports four paths: local users, AD/LDAP, SAML SSO, and API keys (with optional MFA layered on top). Most enterprise designs use AD or SAML for humans, local accounts for break-glass, and API keys for automation.

Local Users vs. AD/LDAP

A local user is an account whose password is stored directly in Cohesity. Local accounts provide a recovery path when the corporate IdP is unavailable — you do not want a domain controller outage to lock you out of your backup system. The original admin account created during bootstrap is local. Best practice is to keep one or two named break-glass admins, enforce strong passwords and MFA, and audit them closely [Source: https://www.cohesity.com/blogs/role-based-access-control-rbac-cohesity-dataprotect-4-0/].

Active Directory and LDAP are the workhorses for daily authentication. When joined to AD, Cohesity validates users against the domain controller. Group membership drives role assignment: assign roles to AD groups (e.g., cohesity-operators) rather than to individuals. Hybrid Azure AD environments running AAD Connect typically resolve users by sAMAccountName to keep on-premises and cloud names aligned [Source: https://docs.cohesity.com/baas/Helios/azure.htm].

SAML SSO with Modern Identity Providers

For organizations that have standardized on a cloud IdP — Microsoft Entra ID (formerly Azure AD), Okta, Ping Identity, JumpCloud, OneLogin, Duo SSO, RSA SecurID Access, ADSelfService Plus, IBM Security Verify, CyberArk Workforce Identity, or Thales SafeNet — Cohesity speaks SAML 2.0 [Source: https://docs.cohesity.com/baas/Helios/SingleSignOn.htm]. SAML establishes a trust triangle:

Component	Role	Cohesity equivalent
Identity Provider (IdP)	Authenticates the user, issues a signed SAML assertion	Azure AD, Okta, Ping, AD FS
Service Provider (SP)	Consumes the assertion, makes authorization decisions	Cohesity cluster or Helios
User	Presents credentials to the IdP, redirected to SP	Backup admin, tenant operator

Authentication can be IdP-initiated (user starts at the IdP portal and clicks the Cohesity tile) or SP-initiated (user clicks “Sign in with SSO” on the Cohesity login page) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf].

When Cohesity receives a SAML assertion it performs four validations: signature verification, temporal validity (NotBefore / NotOnOrAfter), recipient validation against the cluster’s ACS URL, and identity-attribute extraction.
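The four checks can be sketched as a single validation function. This is an illustrative model, not Cohesity’s implementation: the `Assertion` type and `validate_assertion` are hypothetical, XML signature verification is assumed to have been done by a SAML library and is represented as a boolean, and the identity is assumed to arrive in a `Login` or `Email` attribute.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class Assertion:
    signature_valid: bool       # outcome of signature verification done elsewhere
    not_before: datetime        # start of the validity window
    not_on_or_after: datetime   # end of the validity window (exclusive)
    recipient: str              # URL the IdP addressed the response to
    attributes: Dict[str, str]  # attribute statements from the assertion


def validate_assertion(a: Assertion, acs_url: str, now: datetime) -> Optional[str]:
    """Return None if every check passes, else the name of the first failed check."""
    if not a.signature_valid:
        return "signature"
    if not (a.not_before <= now < a.not_on_or_after):
        return "temporal"
    if a.recipient != acs_url:
        return "recipient"
    if not (a.attributes.get("Login") or a.attributes.get("Email")):
        return "identity"
    return None


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    acs = "https://cluster.example.com/idps/authenticate"
    sample = Assertion(True, now - timedelta(minutes=1), now + timedelta(minutes=5),
                       acs, {"Login": "alice"})
    print(validate_assertion(sample, acs, now))
```

Returning the name of the failed check, rather than a bare boolean, mirrors how SSO troubleshooting usually proceeds: clock skew, URL mismatch, and missing attributes each need a different fix.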

Figure 6.1: SAML SSO authentication flow (SP-initiated) between user, Cohesity, and the IdP.

sequenceDiagram
    autonumber
    participant U as User Browser
    participant C as Cohesity (SP)
    participant I as IdP (Okta / Azure AD)
    U->>C: GET /login (Sign in with SSO)
    C-->>U: 302 Redirect with SAMLRequest
    U->>I: SAMLRequest + user credentials
    I->>I: Authenticate user + MFA
    I-->>U: Signed SAML Response (assertion)
    U->>C: POST /idps/authenticate (SAMLResponse)
    C->>C: Validate signature, NotBefore/NotOnOrAfter, ACS URL
    C->>C: Extract Login/Email + Groups, map to Role + Scope
    C-->>U: Session cookie / Helios JWT

Configuration lives at Settings > Access Management > Single Sign-On > Configure SSO:

Field	Source	Notes
Protocol	Cohesity	Choose SAML
SSO Domain	Architect	e.g., `corp.example.com`; routes users by email domain when multiple IdPs are configured
SSO Provider	Cohesity	Dropdown — Microsoft Entra ID, Okta, JumpCloud, OneLogin, Ping, Duo SSO, RSA SecurID, etc.
Single Sign-On URL	IdP	The IdP’s SSO endpoint
Provider Issuer ID	IdP	The IdP’s Entity ID
X.509 Certificate	IdP	Must be PEM format — Cohesity rejects DER/CER
Sign Auth Request	Optional	Requires uploading Cohesity’s public cert to IdP. AD FS does not support this with Cohesity
Default Role	Architect	Fallback for users not in any mapped SSO group; if neither default role nor SSO groups are configured, login is rejected
Access to Clusters	Architect	All clusters or limited subset
Assign to Organization	Optional	For multi-tenant scoping

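The IdP needs two values that describe Cohesity: the Identifier (Entity ID) and the Reply URL (ACS URL). Both are set to the same endpoint, which depends on the deployment: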

Self-managed cluster: https://<cluster_fqdn>/idps/authenticate
Helios: https://helios.cohesity.com/v2/mcm/idp/authenticate

The Identifier (Entity ID) and Reply URL (ACS URL) must both equal that endpoint exactly. A mismatch produces the dreaded “Subject confirmation validation failed” error, which is by far the most common SAML setup failure [Source: https://www.veritas.com/support/en_US/article.100053273].

SAML pitfall #1: Login attribute beats Email. Cohesity expects either an Email or a Login SAML attribute for user identity. If both are present, Cohesity uses Login for role mapping and ignores Email entirely [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf]. Architects who assume Email “wins” are surprised when role bindings appear to be applied against the wrong identity. Cohesity does not treat SAML attribute names as case-sensitive, but the precedence of Login over Email is strict.

SAML pitfall #2: Nested AD groups are not supported. When you configure the Groups claim in Azure AD, you must select “Groups assigned to the application.” This restricts the SAML response to the groups directly assigned to the Cohesity app. Nested groups (for example, a user in cohesity-ops-na, which is itself a member of cohesity-ops-global) are not expanded by Azure AD into the assertion [Source: https://docs.cohesity.com/baas/Helios/azure.htm]. The fix is to assign the leaf groups the user actually belongs to directly to the application. The same advice applies to Okta: use a flat group filter (e.g., “Starts with cohesity_”) and avoid relying on group nesting.

SAML pitfall #3: AD FS does not support signed auth requests. Cohesity offers a “Sign Auth Request” option that signs the SP’s authentication request before sending it to the IdP, providing additional integrity. Okta and Azure AD support this (in Okta, set Signature Algorithm = RSA-SHA256 and Digest Algorithm = SHA256); AD FS does not support it with Cohesity [Source: https://docs.cohesity.com/baas/Helios/adfs.htm]. If your IdP is AD FS, leave this option off.
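The identity and role-mapping behavior described in the pitfalls and in the Default Role row of the configuration table can be modeled in a few lines. This is a hedged sketch of the logic, not Cohesity code: `pick_identity` and `resolve_roles` are hypothetical names, and the rejection of a user who matches no mapped group when no default role exists is an assumption drawn from the table’s description.

```python
from typing import Dict, Iterable, Optional, Set


def pick_identity(attributes: Dict[str, str]) -> Optional[str]:
    """Login takes precedence over Email when both attributes are present."""
    return attributes.get("Login") or attributes.get("Email")


def resolve_roles(groups: Iterable[str],
                  group_to_role: Dict[str, str],
                  default_role: Optional[str]) -> Set[str]:
    """Map the groups literally present in the assertion to roles.

    No nesting is expanded: a parent group counts only if the IdP sent it.
    Falls back to the default role, and rejects the login if nothing applies.
    """
    roles = {group_to_role[g] for g in groups if g in group_to_role}
    if roles:
        return roles
    if default_role:
        return {default_role}
    raise PermissionError("SSO login rejected: no mapped group and no default role")


if __name__ == "__main__":
    attrs = {"Login": "alice", "Email": "a.smith@acme.com"}
    print(pick_identity(attrs))
    print(resolve_roles(["cohesity_acme_ops"], {"cohesity_acme_ops": "Operator"}, None))
```

Running a model like this against a captured assertion is a quick way to predict which identity and roles a user will receive before testing a live login.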

Azure AD integration starts at Azure AD > Enterprise applications > Create your own application > Integrate any other application [Source: https://docs.cohesity.com/baas/Helios/azure.htm]. Group-claim source attribute depends on posture:

Hybrid Azure AD with AAD Connect (v1.2.70.0+): use sAMAccountName to match on-prem names.
Cloud-only Azure AD: use Group ID (object identifiers).

Okta integration is at Okta admin > Applications > Create App Integration > SAML 2.0 [Source: https://docs.cohesity.com/baas/Helios/okta.htm]. Set both Single Sign-On URL and Audience URI to the Cohesity ACS URL. Attribute Statements map Email -> user.email and Login -> user.login. The Group Attribute Statement is named groups and uses a regex or “Starts with cohesity_” filter. Okta delivers the certificate in .cert format, which must be converted to .pem before upload.
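Because the X.509 Certificate field accepts only PEM, a small normalization step before upload avoids a rejected configuration. The sketch below uses Python’s standard `ssl` module; the helper name `ensure_pem` is hypothetical, and it assumes the downloaded file is either already PEM text or raw DER bytes.

```python
import ssl

PEM_HEADER = b"-----BEGIN CERTIFICATE-----"


def ensure_pem(cert_bytes: bytes) -> str:
    """Return the certificate as PEM text, converting from DER when needed."""
    if cert_bytes.lstrip().startswith(PEM_HEADER):
        # Already PEM (some IdPs ship PEM text under a .cert or .cer extension).
        return cert_bytes.decode("ascii")
    # Otherwise treat the input as binary DER and wrap it as base64 PEM.
    return ssl.DER_cert_to_PEM_cert(cert_bytes)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        print(ensure_pem(handle.read()), end="")
```

Invoked with a certificate path as its only argument, the script prints PEM text that can be pasted into the SSO configuration form.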

MFA and API Key Authentication

MFA is best enforced at the IdP — Azure Conditional Access, Okta MFA, or Duo SSO. Cohesity also supports MFA for local users via TOTP. Compliance-driven deployments (HIPAA, PCI, FedRAMP) combine MFA with quorum approval (Chapter 11) for destructive operations.

API keys authenticate automation: Ansible, Terraform, PowerShell, CI/CD. An API key inherits exactly the privileges of the issuing user; there is no separate per-key permission set [Source: https://docs.cohesity.com/baas/data-protect/access-managment/manage-users-and-groups.htm]. For least-privilege automation, create a dedicated service-account user with a narrowly scoped custom role, then issue the key from that account. For example, a nightly recovery-validation script runs as svc-recover-validate with a Recover-only role, not as an admin.

Certificate-based authentication exists for cluster-to-cluster and cluster-to-IdP trust. Avoid wildcard certificates, use individual certs per cluster, and track expiry — IdP signing cert rotation without a Cohesity update is the #2 most common SSO failure mode [Source: https://www.cohesity.com/blogs/updating-ssl-certificates-on-cohesity-clusters/].

Key Takeaway: Authentication design is a layered choice. Local accounts exist for break-glass; AD/LDAP and SAML SSO handle daily human access; API keys handle automation and inherit the issuing user’s role. The dangerous defaults to remember are: SAML uses Login over Email when both are present, nested AD groups do not expand in SAML assertions, AD FS cannot sign auth requests, and an SSO login with no default role and no group mapping is rejected outright rather than allowed in.

RBAC and Roles

Authorization answers “what may you do?” Cohesity’s model has three primitives: principals (users/groups), roles (privilege sets), and access scopes (resource boundaries). They combine multiplicatively — an identity has a role, and that role applies only to resources within the assigned scope.

Built-in Roles

Cohesity DataProtect ships with a comprehensive set of built-in roles. The table below shows the canonical set [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/access-management.htm] [Source: https://docs.cohesity.com/baas/data-protect/access-managment/manage-users-and-groups.htm]:

Role	Description	Typical Persona
Super Admin	Full access to all actions and workflows; manages other admins, roles, identity providers, and cluster configuration.	Cluster owner, platform team lead
Admin	Equivalent to Super Admin in many contexts; full management privileges.	Backup operations lead
Viewer	Read-only access across all workflows. Cannot run jobs, recover, or change configuration.	Auditor, security reviewer, observability tool
Operator	Viewer privileges plus the ability to run Protection Groups and create Recover Tasks. Cannot create/edit policies.	Daily backup operator
Data Security	Self-Service Data Protection plus the ability to create DataLock Views and set retention/expirations.	Compliance officer, ransomware response lead
Gaia Admin	Self-Service Gaia (search/AI) privileges: view and manage details and results.	Data discovery / e-discovery lead
Gaia Viewer	Query/read-only access in Gaia.	Investigator
High Classified	Privileged read access to fetch cluster details for specific API calls.	Auditing automation
SMB Backup Operator	SMB backup and restore privileges only.	Windows file-services admin
Self-Service	Viewer privileges plus the ability to manage Clones, Protection Groups, Policies, and Recover Tasks.	App-team self-service user
DR Admin	Viewer privileges plus the ability to create and manage DR workflows and tasks (failover, failback, runbooks).	DR architect, BCP team
Replication	Limited to setting up and replicating data to other clusters.	Cross-cluster service account
Cohesity Support Admin	Used by Cohesity Support to create a Super Admin if customer admin access is lost.	Vendor support break-glass

The rough ordering by breadth of privilege is Viewer < Operator < Self-Service < DR Admin / Data Security < Admin / Super Admin. It is not a strict chain: per the table, DR Admin extends Viewer with DR workflows rather than extending Self-Service. Operator can run existing groups; Self-Service can also create them. DR Admin specializes in failover/failback without permitting protection-policy edits, which is useful for issuing a DR runbook engineer a role that excludes backup-frequency changes.

Custom Roles and Granular Privileges

Architects designing for least privilege almost always create custom roles via Settings > Access Management > Roles > Add Custom Role and select privileges from a checklist [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/access-management.htm]. Access Management privileges are pre-selected by default; un-check them for non-admin roles to prevent privilege escalation.

Canonical custom-role examples:

vSphere Backup Operator — Operator privileges scoped to VMware only.
M365 Recovery Operator — Recovery-only on Microsoft 365 sources.
NAS Self-Service — Self-Service on NAS protection groups only.
Audit Read — Read-only plus audit-log download.
Replication Service Account — Replication role only.

Custom roles are editable later and auditable like built-in roles [Source: https://www.cohesity.com/blogs/role-based-access-control-rbac-cohesity-dataprotect-4-0/].

Access Scopes — The Layer That Makes Least-Privilege Real

Roles alone tell Cohesity what an identity may do, but not where. Access Scopes layer on top of roles and constrain the resources a role applies to [Source: https://docs.cohesity.com/baas/data-protect/access-managment/access-scope.htm]:

Scope Dimension	Purpose	Example
Source Level	A specific item: one vCenter, one SQL host, or one NAS share.	“Operator on vCenter `prod-vc01` only”
Source Type	A class of source.	“Operator on Microsoft 365 only”
Region	A geographic or logical region.	“DR Admin on Region `eu-west` only”
Service Level	A service: DataProtect, DR, etc.	“Self-Service on DataProtect only, no SmartFiles”

Multiple scopes combine (“Operator on VMware AND NAS in Region us-east”). Auto Assign brings newly added matching resources into scope automatically — useful for tenants with growing inventories.

The combination produces real least privilege. Example: Role Operator + Access Scope (Source Type = VMware, Source Level = vCenter-A only) — the principal runs backups and recoveries only on vCenter-A and cannot browse anything else. This is the foundation of MSP tenant isolation.
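The role-and-scope check in this example can be modeled as a conjunction: the action must be in the role’s privilege set and the resource must fall inside the scope. The sketch below is illustrative only; the privilege names, `AccessScope` fields, and `is_allowed` are hypothetical and do not reflect Cohesity’s internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    source_type: str   # e.g. "VMware"
    source_name: str   # e.g. "vCenter-A"


@dataclass(frozen=True)
class AccessScope:
    source_type: Optional[str] = None   # None means "any source type"
    source_name: Optional[str] = None   # None means "any source of that type"

    def contains(self, resource: Resource) -> bool:
        type_ok = self.source_type is None or self.source_type == resource.source_type
        name_ok = self.source_name is None or self.source_name == resource.source_name
        return type_ok and name_ok


# Simplified privilege sets derived from the built-in role descriptions.
ROLE_PRIVILEGES = {
    "Viewer": frozenset({"view"}),
    "Operator": frozenset({"view", "run_protection_group", "recover"}),
}


def is_allowed(role: str, scope: AccessScope, action: str, resource: Resource) -> bool:
    """Permit the action only if the role grants it AND the resource is in scope."""
    return action in ROLE_PRIVILEGES.get(role, frozenset()) and scope.contains(resource)


if __name__ == "__main__":
    scope = AccessScope(source_type="VMware", source_name="vCenter-A")
    print(is_allowed("Operator", scope, "recover", Resource("VMware", "vCenter-A")))
    print(is_allowed("Operator", scope, "recover", Resource("VMware", "vCenter-B")))
```

Because both conditions must hold, widening a role never exposes out-of-scope resources, and widening a scope never grants new actions.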

Figure 6.2: RBAC hierarchy — principals bind to roles, roles carry privileges, access scopes constrain where the role applies.

graph TD
    U1[User: alice@acme.com]
    U2[User: svc-recover-validate]
    G1[AD/SSO Group: cohesity_acme_ops]
    G2[AD/SSO Group: cohesity_acme_dr]
    R1[Role: Operator]
    R2[Role: DR Admin]
    R3[Custom Role: Recover-only]
    P1[Privileges: Run PG, Recover, View]
    P2[Privileges: Failover, Failback, Runbooks]
    P3[Privileges: Recover only]
    S1[Access Scope: Source Type = VMware]
    S2[Access Scope: Region = us-east]
    S3[Access Scope: Org = acme-corp]
    U1 --> G1
    U1 --> G2
    G1 --> R1
    G2 --> R2
    U2 --> R3
    R1 --> P1
    R2 --> P2
    R3 --> P3
    R1 --> S1
    R1 --> S3
    R2 --> S2
    R2 --> S3
    R3 --> S1
    R3 --> S3

Auditing Role Assignments

All role assignments and authentication events are logged and can be exported via Syslog or REST API to a SIEM [Source: https://docs.cohesity.com/baas/data-protect/audit-logs-dataprotect.htm]. Review assignments quarterly: identify Super Admin-equivalents, dormant service accounts, and drifted AD groups. A user with both individual and group role assignments inherits the union of both [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf] — useful for exceptions, but a privilege-creep risk if not reviewed.
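A quarterly review can start from an export of users, groups, and role bindings. The following sketch computes each user’s effective roles as the union of direct and group-inherited assignments and flags Super Admin equivalents. The input structure and function names are hypothetical; a real review would populate them from the audit or user-management APIs.

```python
from typing import Dict, Iterable, List, Set

# Roles treated as Super Admin equivalents for review purposes (an assumption).
ADMIN_EQUIVALENT = {"Super Admin", "Admin"}


def effective_roles(direct: Iterable[str],
                    groups: Iterable[str],
                    group_roles: Dict[str, Set[str]]) -> Set[str]:
    """Union of directly assigned roles and roles inherited from groups."""
    roles = set(direct)
    for group in groups:
        roles |= group_roles.get(group, set())
    return roles


def admin_equivalents(users: Dict[str, dict],
                      group_roles: Dict[str, Set[str]]) -> List[str]:
    """Return sorted user names whose effective roles include an admin role."""
    flagged = []
    for name, record in users.items():
        roles = effective_roles(record.get("roles", ()), record.get("groups", ()), group_roles)
        if roles & ADMIN_EQUIVALENT:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    group_roles = {"cohesity_acme_ops": {"Operator"}, "cohesity_platform": {"Super Admin"}}
    users = {
        "alice": {"roles": ["Viewer"], "groups": ["cohesity_acme_ops"]},
        "bob": {"roles": [], "groups": ["cohesity_platform"]},
    }
    print(admin_equivalents(users, group_roles))
```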

Key Takeaway: Cohesity authorization is role × access scope. Built-in roles cover common personas; custom roles and granular privileges cover the rest. Access scopes (Source Level, Source Type, Region, Service Level) layer on roles to restrict where the role applies, and they are the mechanism that makes both least privilege and MSP tenant isolation possible. A user with both individual and group role assignments inherits the union — review periodically.

Multi-Tenancy with Organizations

Multi-tenancy runs independent workloads on shared infrastructure with strict isolation. Cohesity’s primitive is the Organization (Org), surrounded by Storage Domains (View Boxes), Views, VLANs, and quotas — together delivering isolation at application, storage, network, and resource layers.

The Apartment Building Analogy

A Cohesity cluster is an apartment building. Each Organization is a unit with its own door and keys. Helios is the building manager, holding the floor plan and seeing every unit but ordinarily not entering them. View Boxes are the plumbing and electrical risers — multiple units may share a riser (shared View Box: lower cost, residual risk) or have dedicated risers (dedicated View Box: higher cost, stronger isolation). VLANs are the building’s intercom wiring — each unit gets a private channel so traffic for 3B never crosses 7A’s wires. Quotas are per-unit utility metering. Tenants see only their own apartment; the manager sees the building.

Organizations as the Isolation Primitive

An Organization is a logical container for tenant resources [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. The Cohesity blog “Multi-tenancy meets simplicity” emphasizes that Organizations let MSPs serve multiple customers from a single cluster without sacrificing the security boundary [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/].

Critical principle: Organizations remain logically isolated regardless of Storage Domain sharing settings [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. Even with shared View Boxes, a tenant admin sees only their own Org’s Storage Domains, Views, sources, Protection Groups, jobs, and reports. The UI is scoped.

Hierarchical Organizations let a parent Org contain child Orgs, each with their own Storage Domains, Views, and admins — common for large MSPs with reseller-managed customers under direct ones.

View Box Isolation — Shared vs. Dedicated

The Storage Domain (View Box) is the unit of storage policy: encryption keys, dedup scope, and compression are all configured per View Box, so View Box separation enforces cryptographic and deduplication boundaries [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf].

Shared vs. dedicated View Boxes per tenant:

Dimension	Shared View Box	Dedicated View Box
Logical isolation between Orgs	Yes (Cohesity guarantee)	Yes
Physical co-residency	Tenants share underlying chunks	Tenant data is on its own domain
Encryption-key separation	Shared key per View Box	Per-tenant key (KMIP/KMS isolation possible)
Deduplication scope	Cross-tenant dedup → highest storage savings	Within-tenant dedup only → lower savings, no cross-tenant blob fingerprints
Compression	Shared compression policy	Per-tenant compression policy
Compliance posture	Suitable for low-sensitivity tenants	Suitable for regulated, sensitive, or competitor-tenant scenarios
Cost per tenant	Lower	Higher
Cluster-wide capacity efficiency	Higher	Lower
Use case	Internal departments, low-risk MSP tenants	Compliance-bound tenants, competing tenants, separate-key requirements

MSPs typically use dedicated Storage Domains when tenants require physical separation, separate encryption keys, or compliance-driven data adjacency rules [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. A tiered model: Bronze on shared View Boxes, Silver on shared with per-tenant View-level encryption, Gold on dedicated View Boxes with KMIP-backed per-tenant keys.

Resource Hierarchy and Quota Inheritance

The resource hierarchy runs Organization -> Storage Domain -> View [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]:

Organization: Assigned Storage Domains and Views; tenant admins see only what is explicitly assigned.
Storage Domain: Quotas (physical limits) and default logical View limits cascade to Views.
View: Inherits or overrides Storage Domain quotas; configures protocols (SMB, NFS, S3) and permissions.

Quotas are hard: writes exceeding a quota are rejected, not just alerted [Source: https://developers.cohesity.com/v1-helios-latest/reference/createstoragedomain-1]. A tenant that exceeds its allocation (for example, 50 TB) gets write failures, not a surprise bill.
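The inherit-or-override behavior and the hard-rejection semantics can be illustrated with a toy model. This is a sketch under simplifying assumptions (a single default View quota per Storage Domain, decimal terabytes); the classes are hypothetical and not Cohesity objects.

```python
from dataclasses import dataclass
from typing import Optional

TB = 10 ** 12  # decimal terabyte, for illustration


class QuotaExceeded(Exception):
    """Raised when a write would push a View past its hard quota."""


@dataclass
class StorageDomain:
    default_view_quota_bytes: int


@dataclass
class View:
    domain: StorageDomain
    used_bytes: int = 0
    quota_override_bytes: Optional[int] = None

    @property
    def quota_bytes(self) -> int:
        # A View-level override wins; otherwise inherit the domain default.
        if self.quota_override_bytes is not None:
            return self.quota_override_bytes
        return self.domain.default_view_quota_bytes

    def write(self, nbytes: int) -> None:
        # Hard quota: reject the write outright instead of merely alerting.
        if self.used_bytes + nbytes > self.quota_bytes:
            raise QuotaExceeded(f"write of {nbytes} bytes exceeds quota {self.quota_bytes}")
        self.used_bytes += nbytes


if __name__ == "__main__":
    view = View(StorageDomain(default_view_quota_bytes=50 * TB))
    view.write(49 * TB)
    try:
        view.write(2 * TB)
    except QuotaExceeded as err:
        print("rejected:", err)
```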

Network Isolation — VLAN per Organization

Application-layer isolation alone does not satisfy regulated tenants. VLAN-per-Organization extends isolation to the network layer [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/]. Tenant A’s traffic rides VLAN 100; Tenant B’s rides VLAN 200. The cluster supports multiple VIPs across VLANs so each tenant gets a DNS-resolvable management endpoint reachable only from their network.

This matters when source environments are themselves segregated — a HIPAA-bound VMware environment should send backup traffic over its regulated VLAN end-to-end, never crossing into a shared management network.

Figure 6.3: Multi-tenant isolation layers — Organization scopes UI, View Box scopes storage policy, VLAN scopes network, quotas scope capacity.

flowchart LR
    subgraph Cluster[Cohesity Cluster - Shared Infrastructure]
        direction LR
        subgraph OrgA[Organization: acme-corp]
            VBA[View Box: sd-shared-silver<br/>encryption + dedup scope]
            VLA[VLAN 412 + Tenant VIP]
            QA[Quotas: 25 TB FETB hard limit]
            VWA[Views: acme-vmware<br/>acme-nas]
        end
        subgraph OrgB[Organization: globex-inc]
            VBB[View Box: sd-dedicated-globex<br/>per-tenant KMIP key]
            VLB[VLAN 520 + Tenant VIP]
            QB[Quotas: 100 TB FETB hard limit]
            VWB[Views: globex-sql<br/>globex-m365]
        end
    end
    OrgA --> VBA --> VWA
    OrgA --> VLA
    OrgA --> QA
    OrgB --> VBB --> VWB
    OrgB --> VLB
    OrgB --> QB

The multi-tenancy isolation summary table:

Isolation Dimension	Mechanism	Trade-off
Identity / UI	Organization scoping	Always on; tenant admins see only their Org’s resources
Authorization	RBAC + Access Scopes	Scope by Org assignment, source, region, service
Storage policy / encryption	Dedicated View Box	Higher cost; lower dedup; stronger crypto isolation
Storage capacity efficiency	Shared View Box with quotas	Cross-tenant dedup; logical isolation only
Network	VLAN-per-Org + dedicated VIPs	Configuration overhead; required for regulated tenants
DNS	Per-Org management FQDN	Aligns with VLAN isolation
Quotas	Hard quotas at Storage Domain & View	Predictable backpressure; required for chargeback
Reporting	Per-Org reports and Helios scoping	Built-in; supports tenant-facing dashboards

Per-Tenant Policies and Reporting

Each Organization carries its own policies, retention, and reports. Helios scopes per-Org reporting so tenant admins see their own SLA dashboard while the MSP sees aggregate fleet metrics — essential for chargeback.

Key Takeaway: Multi-tenancy in Cohesity rests on four isolation layers: Organizations (logical/UI scope, always enforced), Storage Domains (storage policy, encryption keys, dedup boundary — shared for efficiency, dedicated for isolation), VLANs-per-Organization (network-layer separation), and quotas (hard backpressure for chargeback). Architects choose the depth of isolation per tenant tier — Bronze tenants share, Gold tenants get dedicated everything.

Service Provider Patterns

Multi-tenancy is most stress-tested in the MSP/CSP world. This section synthesizes the patterns service providers use.

MSP/CSP Deployment Patterns

Two dominant patterns:

Shared cluster, multiple Organizations: One cluster hosts many tenants. The MSP gets economies of scale, tenants get logical isolation, and the design relies on RBAC + Access Scopes plus, optionally, dedicated View Boxes and VLANs. Most common for SMB/mid-market MSPs.
Dedicated cluster per tenant: Often a Cloud Edition cluster per tenant. Eliminates shared-infrastructure concerns but loses scale economies. Used for the largest or most regulated tenants.

Hybrid patterns combine both — shared clusters for Bronze/Silver, dedicated for Gold. Helios unifies the fleet view across both [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/].

Self-Service Portals via Helios

The Self-Service role (or a custom equivalent) lets tenant admins create Protection Groups, modify Policies within MSP-imposed bounds, run on-demand backups, and recover — without ticketing the MSP.

Chargeback and Usage Metering

Quotas plus Helios reporting drive chargeback. Typical metering:

FETB (Front-end TB) protected per Organization
Back-end consumed capacity (post dedup/compression) per Storage Domain
Protection Groups, Views, and recoveries as activity metrics
Egress for cross-cluster replication or cloud archive

Pricing is typically base $/TB FETB by tier (Bronze/Silver/Gold) plus add-ons for replication, DataLock, and DataHawk. Helios reports export to billing via API.
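
The pricing model above (a base $/TB FETB rate by tier, plus add-ons) can be sketched in a few lines. This is a minimal illustration only: the rates, the `OrgUsage` record, and the `monthly_charge` function are hypothetical names and numbers, not Cohesity or Helios API objects, and real figures would come from the MSP's contract and exported usage reports.

```python
from dataclasses import dataclass

# Hypothetical price list (USD per FETB per month); real rates come from the MSP contract.
TIER_RATE_PER_TB = {"bronze": 20.0, "silver": 35.0, "gold": 60.0}
ADDON_RATE_PER_TB = {"replication": 8.0, "datalock": 5.0, "datahawk": 10.0}


@dataclass
class OrgUsage:
    org: str
    tier: str
    fetb_tb: float        # front-end TB protected for this Organization
    addons: tuple = ()    # add-ons enabled for this Organization


def monthly_charge(usage: OrgUsage) -> float:
    """Base $/TB by tier plus per-TB add-ons, all billed on FETB."""
    rate = TIER_RATE_PER_TB[usage.tier]
    rate += sum(ADDON_RATE_PER_TB[addon] for addon in usage.addons)
    return round(usage.fetb_tb * rate, 2)


acme = OrgUsage("acme-corp", "silver", 25.0, ("replication", "datalock"))
print(monthly_charge(acme))
```

Billing on FETB rather than back-end capacity keeps the invoice independent of dedup ratios, which vary with what else shares the Storage Domain.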

Tenant Onboarding Workflow

Figure 6.4: MSP tenant onboarding workflow — from contract to self-service portal in twelve steps.

flowchart TD
    A[Tenant signs contract<br/>tier: Bronze/Silver/Gold] --> B[Day -2: Network prep<br/>VLAN + VIPs + firewall rules]
    B --> C[Day -1: Storage Domain assignment<br/>shared sd-silver or dedicated sd-tenant]
    C --> D[Day 0: Create Organization<br/>assign Storage Domain + VLAN + VIPs]
    D --> E[Day 0: Create Views with quotas<br/>per-source-type Views]
    E --> F[Day 0: Configure SAML SSO<br/>IdP metadata + Default Role]
    F --> G[Day 0: Define custom roles<br/>+ Access Scopes + Auto Assign]
    G --> H[Day 0: Map IdP groups to roles<br/>flat groups, no nesting]
    H --> I[Day 1: Source registration<br/>vCenter, NAS, M365 over tenant VLAN]
    I --> J[Day 1: Protection policies + groups<br/>from MSP templates]
    J --> K[Day 1: Helios self-service portal<br/>scoped tenant view]
    K --> L[Day 2: API key for automation<br/>scoped service account]
    L --> M[Day 2: Document off-boarding plan<br/>revoke SAML, export audit, terminate Org]

Worked Example: MSP Onboards “Acme Corp”

Scenario. ProtectIT (MSP) runs a multi-tenant Cohesity cluster. New customer Acme Corp signs a Silver-tier contract: 25 TB FETB, dedicated VLAN, shared View Box, group-based RBAC via Acme’s Okta tenant. Acme has 150 VMs in vCenter and a 10 TB NAS share.

Network prep (Day -2). Provision VLAN 412; allocate three IPs from 10.84.12.0/24 on VLAN 412 for tenant VIPs (management UI, backup VIP, reserved); configure firewall rules for Acme’s source IPs and Helios outbound HTTPS.
Storage Domain assignment (Day -1). Place Acme on shared Storage Domain sd-shared-silver (Silver shares; Gold dedicates). Set a 30 TB physical quota at the Storage Domain level (25 TB sold + 20% buffer).
Create Organization (Day 0). Via Settings > Multi-Tenancy > Organizations > Create Organization, create acme-corp. Assign sd-shared-silver, associate VLAN 412, assign tenant VIPs.
Create Views with quotas (Day 0). Pre-create acme-vmware-backups (25 TB) and acme-nas-backups (5 TB), both on sd-shared-silver. Explicit View-level quotas prevent one View from consuming the other’s allocation.
Configure SAML SSO (Day 0). Settings > Access Management > SSO > Configure SSO: Protocol = SAML; SSO Domain = acme.com; Provider = Okta; paste Acme’s SSO URL, Provider Issuer ID, and PEM cert. Check Assign to Organization = acme-corp. Set Default Role to a restrictive acme-default-deny so unmapped users get no access.
Define custom roles and access scopes (Day 0). Create acme-backup-operator (Operator, scope = Source Types VMware + NAS in Acme Org) and acme-dr-admin (DR Admin, scoped to Acme Org). Enable Auto Assign so new Acme sources fall into scope.
Map Okta groups (Day 0). Users > Add SSO Users & Groups, SSO Domain acme.com: map cohesity_acme_ops -> acme-backup-operator and cohesity_acme_dr -> acme-dr-admin. Acme’s Okta admin assigns these groups directly (not nested) to the SAML app.
Source registration (Day 1). Acme operator (SAML-authenticated, scoped) registers vCenter and NAS sources; traffic rides VLAN 412 to the Acme VIP.
Protection policies and groups (Day 1). Operator creates Protection Groups using MSP policy templates; backups land in the Acme Views.
Helios self-service (Day 1). Acme operator logs into Helios; view is scoped to Acme. MSP sees aggregate fleet metrics.
API key for automation (Day 2). Create service account acme-svc-recover-validate with Recover-only role scoped to VMware in Acme Org; issue API key from that user. Key inherits exactly that ceiling.
Off-boarding plan (Day 2). Document the reverse-onboarding: revoke SAML assignment, export audit logs, terminate Org (cascades to Views per contracted retention), reclaim VLAN 412, remove Okta group mappings.

This threads every primitive — Organizations, Storage Domains, Views, VLANs, quotas, custom roles, access scopes, SAML SSO, group mapping, default roles, API keys, and Helios — into a repeatable workflow.
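
The quota arithmetic in the Storage Domain and View steps is worth checking before onboarding begins. The sketch below reproduces the Acme numbers (25 TB sold plus a 20% buffer, split across two Views); the function names are hypothetical helpers, not part of any Cohesity tooling.

```python
def storage_domain_quota_tb(sold_fetb_tb: float, buffer_pct: float = 20.0) -> float:
    """Sold capacity plus a percentage buffer, as in the Acme example."""
    return sold_fetb_tb * (100 + buffer_pct) / 100


def check_view_quotas(domain_quota_tb: float, view_quotas_tb: dict) -> list:
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    total = sum(view_quotas_tb.values())
    if total > domain_quota_tb:
        problems.append(
            f"View quotas total {total} TB, above the {domain_quota_tb} TB domain quota"
        )
    for name, quota in view_quotas_tb.items():
        if quota <= 0:
            problems.append(f"View {name} has no positive quota")
    return problems


quota = storage_domain_quota_tb(25)
print(quota, check_view_quotas(quota, {"acme-vmware-backups": 25, "acme-nas-backups": 5}))
```

A check like this could run as a pre-flight step in onboarding automation so that over-allocated View quotas are caught on Day -1 rather than when a backup fails.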

Key Takeaway: Onboarding is the integration test of every chapter concept: VLAN, Storage Domain, Organization, Views with quotas, SAML SSO, custom roles + access scopes, group-based mapping, source registration, Protection Groups, Helios self-service, scoped API keys, and documented off-boarding. Skip any step and you have a silent isolation gap.

Chapter Summary

This chapter covered the lifecycle of identity and access in Cohesity. Authentication spans local users (break-glass), AD/LDAP (daily access), SAML SSO (Azure AD, Okta, Ping, AD FS, Duo), MFA at the IdP, and API keys that inherit the issuing user’s role. Three SAML pitfalls to remember: Cohesity uses Login over Email when both are present; nested AD groups do not expand in SAML assertions; and AD FS does not support signed auth requests with Cohesity.

Authorization rests on principals, roles (Super Admin, Admin, Operator, Viewer, Data Security, DR Admin, Replication, Self-Service, Gaia, SMB Backup Operator, plus custom), and access scopes (Source Level, Source Type, Region, Service Level). Role × scope produces real least privilege and MSP tenant isolation.

Multi-tenancy adds Organizations (logical tenant primitive), View Boxes (encryption/dedup boundary — shared for efficiency, dedicated for isolation), per-Org VLANs (network separation), and hard quotas (predictable backpressure and chargeback). Hierarchical Organizations let MSPs nest customer Orgs under reseller Orgs.

Service-provider patterns include shared-cluster-multi-Organization (common), dedicated-cluster-per-tenant (regulated/large), Helios self-service, and FETB-based chargeback. The twelve-step Acme onboarding example threads every primitive into a repeatable workflow.

For the CCAE exam, expect scenarios that combine these primitives: choosing between shared and dedicated View Boxes for regulated tenants; diagnosing SSO failures (missing default role, Login-vs-Email, nested groups, AD FS signed-auth-request); designing role-plus-scope combinations; and ordering MSP onboarding steps with VLAN-per-Org isolation.

Key Terms

RBAC — Role-Based Access Control. Cohesity’s authorization model where principals (users/groups) are assigned roles, and roles are constrained by access scopes.
Organization — Cohesity’s multi-tenancy primitive. A logical container for a tenant’s resources (Storage Domains, Views, sources, Protection Groups, users). Tenants in different Organizations are logically isolated regardless of underlying View Box sharing. Hierarchical Organizations are supported.
Tenant — A customer (in MSP/CSP context) or a department (in enterprise context) hosted as an Organization on a shared Cohesity cluster.
View Box (Storage Domain) — The unit of storage policy in Cohesity. Encryption keys, deduplication scope, and compression behavior are configured per View Box. Dedicated View Boxes give cryptographic and dedup isolation between tenants; shared View Boxes give cross-tenant deduplication and lower cost.
View — A logical share within a View Box, exposing SMB, NFS, or S3 protocols. Views inherit Storage-Domain-level quotas and configuration but can override at the View level.
Access Scope — The resource-boundary layer that constrains where a role applies. Scopes can be Source Level (a specific item), Source Type (a class), Region (geographic/logical), or Service Level (DataProtect, DR, etc.). Combined with roles to produce least privilege.
SAML — Security Assertion Markup Language 2.0. The protocol Cohesity speaks to integrate with cloud IdPs (Azure AD, Okta, Ping, AD FS, etc.). Establishes a trust triangle between IdP, SP, and user; relies on PEM-format X.509 certificates.
SSO — Single Sign-On. The user-experience outcome of SAML integration: one IdP login grants access to many SPs including Cohesity.
MFA — Multi-Factor Authentication. Best applied at the IdP layer (Azure Conditional Access, Okta MFA, Duo) or via TOTP for local users. Combined with quorum approval for highly destructive actions.
API key — A programmatic credential issued from a Cohesity user account. Inherits exactly the privileges of the issuing user, including the user’s role and access scopes. Least-privilege automation requires dedicated service-account users with narrowly-scoped custom roles.
VLAN per Organization — Network-layer isolation pattern where each tenant Organization is assigned a dedicated VLAN and its own VIPs, ensuring tenant traffic never crosses into shared management or peer-tenant networks.
Default Role — The fallback role assigned to SSO-authenticated users who match no SSO group mapping. If neither default role nor SSO groups are configured, login is rejected outright.
Hierarchical Organization — A parent Organization containing child Organizations, enabling MSPs to host customer Organizations under a top-level service-provider Organization with cascading quota and policy enforcement.
Helios — Cohesity’s SaaS control plane and the “building manager” in the multi-tenancy analogy. Provides global fleet view for MSPs and per-Organization scoped self-service for tenants.

Chapter 7: Data Protection: Sources, Policies, and Protection Groups

If the previous chapters built the platform — clusters, networks, identity — this chapter is where Cohesity finally earns its keep. Data Protection is the day-job: pulling backups from a sprawling, heterogeneous estate of hypervisors, file servers, databases, and SaaS tenants; storing those copies efficiently; and proving, at 3 a.m. on the worst day of someone’s career, that the data can come back. As an architect, your job is not to click “Protect” — it is to design a system where the right thing gets backed up, at the right cadence, with the right retention, automatically, even as the production estate churns underneath you.

The CCAE exam tests three intersecting constructs: Sources (what you protect), Policies (how often, how long, where copies go), and Protection Groups (the binding object that stitches the two together). Master those three nouns, and most of the data-protection blueprint falls into place.

Figure 7.1: End-to-end data protection object model — Source through Snapshots

flowchart TD
    A[Source<br/>vCenter / Hyper-V / Prism / Physical / NAS / DB] --> B[Protection Group<br/>Membership: Static / Container / Tag]
    B --> C[Policy<br/>SLA Contract]
    C --> D[Schedule<br/>Frequency / RPO]
    C --> E[Retention<br/>GFS Hierarchy + DataLock]
    D --> F[Snapshots<br/>Local Cluster Storage]
    E --> F
    F --> G[Replication<br/>DR Cluster]
    F --> H[Archive<br/>CloudArchive / S3 Glacier]
    style A fill:#1f6feb,color:#fff
    style C fill:#238636,color:#fff
    style F fill:#8957e5,color:#fff

Learning Objectives

By the end of this chapter, you will be able to:

Register and protect heterogeneous sources, including VMware vSphere, Microsoft Hyper-V, Nutanix AHV, physical Linux/Windows hosts, NAS systems via SMB/NFS/NDMP, and database engines such as Oracle, SQL Server, SAP HANA, and Exchange.
Design protection policies that align with explicit RPO, RTO, and retention SLAs, using GFS-style hierarchical retention and a tiered Gold/Silver/Bronze model.
Build Protection Groups using static membership, container-based auto-protection, and vSphere tag-based auto-protection — and choose between them based on operational maturity and audit requirements.
Optimize backup performance and reduce production impact using SmartCopy storage-snapshot integration, Changed Block Tracking (CBT), proxy distribution, and per-datastore stream throttling.
Differentiate application-consistent from crash-consistent backups and select the appropriate quiescing path per workload.

Source Registration and Discovery

Before Cohesity can protect anything, it must know that the source exists, hold credentials to talk to it, and understand its API surface. Source registration is the moment a “production system” becomes a “discoverable, protectable inventory” inside Cohesity.

vCenter, SCVMM, and Nutanix Prism Integration

For VMware environments, the primary handshake is at the vCenter level. You register vCenter once, and Cohesity walks the entire managed inventory — datacenters, clusters, hosts, resource pools, folders, datastores, tags, and individual VMs [Source: https://docs.cohesity.com/baas/data-protect/register-vmware-sources.htm]. Registration requires a service account with sufficient privileges to read inventory, snapshot VMs, and (for some restore paths) attach virtual disks. Most architects create a dedicated svc-cohesity account in vCenter rather than reusing a domain admin.

Critically, registration is the moment to set per-datastore stream caps. After Cohesity discovers all datastores, you can override global stream limits by enabling a Cap and setting a maximum number of concurrent backup streams per datastore [Source: https://docs.cohesity.com/baas/data-protect/register-vmware-sources.htm]. This single setting is one of the most commonly overlooked exam topics: a small, hot all-flash datastore hosting tier-1 transactional workloads should not be saturated by a 32-stream backup job hammering its queues. Cap the streams, and Cohesity self-throttles.

For Microsoft Hyper-V, registration goes through SCVMM (or directly to standalone Hyper-V hosts), and Cohesity uses Resilient Change Tracking (RCT) instead of CBT for incremental detection. Nutanix AHV registers via Prism Element or Prism Central; Cohesity then uses Nutanix’s native snapshot APIs.
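
To make the effect of a per-datastore stream cap concrete, the following sketch models how a cap bounds concurrency. It assumes, for illustration only, that the global limit acts as a default per-datastore ceiling; `plan_streams` and the datastore names are hypothetical and do not reflect Cohesity's internal scheduler.

```python
def plan_streams(pending_by_datastore: dict, global_limit: int, caps: dict) -> dict:
    """Allocate concurrent backup streams per datastore.

    pending_by_datastore: datastore name -> number of VMs waiting to be backed up
    global_limit:         assumed default per-datastore stream limit
    caps:                 datastore name -> override cap (capped datastores only)
    """
    plan = {}
    for datastore, pending in pending_by_datastore.items():
        limit = caps.get(datastore, global_limit)   # a cap overrides the global limit
        plan[datastore] = min(pending, limit)
    return plan


print(plan_streams(
    {"ds-flash-tier1": 40, "ds-capacity": 10},
    global_limit=32,
    caps={"ds-flash-tier1": 4},
))
```

In this model the hot all-flash datastore never sees more than four concurrent streams, while the uncapped datastore is limited only by how much work is pending.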

Physical Agent for Linux and Windows

Not everything is virtualized, and Cohesity’s physical agent handles the rest. The Cohesity Agent is a lightweight binary that runs on Linux or Windows and provides three modes:

File-based backup for individual filesets and directories.
Volume-based (block) backup for full-system imaging, including bare-metal recovery.
Application-aware backup for SQL Server, Oracle, SAP HANA, Exchange, SharePoint, and Active Directory, where the agent coordinates with the application’s native backup or quiescing interface (VSS on Windows, RMAN for Oracle, VDI for SQL Server, Backint for SAP HANA).

Architects should plan agent rollout via configuration management (Ansible, SCCM, Puppet) rather than manual installs; in a fleet of 5,000 servers, a manual approach simply does not scale.

NAS Sources via SMB/NFS and NDMP

NAS protection has two main flavors. For modern NAS (NetApp ONTAP, Dell PowerScale/Isilon, Pure FlashBlade, generic Linux NFS exporters, Windows file servers), Cohesity registers the share over SMB or NFS and walks the namespace. For legacy or large enterprise NAS where snapshot-and-stream is preferable, Cohesity drives backups via NDMP — talking directly to the array’s tape-out protocol but redirecting the stream into Cohesity instead of physical tape. NDMP backups are typically faster and lighter on the array than client-side share crawls.

A useful design pattern: register NetApp filers via the array’s snapshot-API integration (similar in spirit to the Pure SmartCopy pattern discussed below) so Cohesity ingests from a SnapMirror or array snapshot rather than competing with production NFS clients.

Database Sources: Oracle, SQL, SAP HANA, Exchange

Database protection is its own discipline (and Chapter 8 dives deeper), but at the source-registration layer the architect’s job is to ensure:

Oracle: The Cohesity agent integrates with RMAN as a media-management library; you register the Oracle host and Cohesity provides the channel target.
SQL Server: Registration uses VDI (Virtual Device Interface) for stream backups and is AAG-aware (Always-On Availability Groups) so Cohesity can target the preferred replica.
SAP HANA: Registration plugs into the Backint API; HANA writes its backup stream directly to the Cohesity-provided endpoint.
Exchange: Registration uses the VSS Exchange writer for application-consistent mailbox backups.

In each case, source registration captures the credentials and connection metadata; the protection logic — log backups, point-in-time recovery, granular item recovery — is configured later at the Protection Group and Policy layers.

Key Takeaway: Source registration is the inventory step. Register vCenter once and let auto-discovery handle the rest; for physical hosts, NAS, and databases, plan agent deployment and credential management as a fleet operation, and use per-datastore stream caps during registration to protect production I/O.

Policies and Schedules

A Cohesity Protection Policy is the SLA expressed in code. It encapsulates how often a backup runs (RPO), how long copies are kept (retention), where copies go (replication and archival targets), and any immutability rules. Critically, a single policy can express the entire lifecycle of a backup — from the first snapshot on local cluster storage all the way through replication to a DR site and archival to S3 Glacier seven years later [Source: https://docs.cohesity.com/baas/data-protect/policies.htm].

Frequency, Retention, and Lock Attributes

The minimum RPO Cohesity can express in a standard policy is 15 minutes for hypervisor-based backups using Redirect-on-Write (RoW) snapshots [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. Tighter RPOs (below 15 minutes) are achievable when backups are integrated with primary array snapshots through SmartCopy, because Cohesity is no longer limited by hypervisor snapshot overhead; it is simply orchestrating the array’s native snapshot engine.

Retention is configured as “Keep for N days/weeks/months/years” and supports DataLock attributes for compliance and ransomware resilience. DataLock makes a backup immutable until its retention expires. Two flavors exist: Compliance Lock (truly immutable and legally enforceable; not even a cluster admin can delete the backup early) and Governance Lock (soft-immutable; can be overridden by a quorum of admins). For SOX, HIPAA, and PCI workloads, treat Compliance Lock as the required setting.

Hierarchical Retention (Daily/Weekly/Monthly/Yearly)

Cohesity policies natively support the Grandfather-Father-Son (GFS) retention model. Beyond the base “Keep for” period, you can promote specific snapshots to extended retention buckets:

The first successful snapshot of each day, retained for N days.
The first successful snapshot of each week, retained for N weeks.
The first successful snapshot of each month, retained for N months.
The first successful snapshot of each year, retained for N years.

This is exactly how seasoned backup admins have thought for decades; Cohesity simply makes it a checkbox rather than a custom script. Combined with global variable-length deduplication, the storage cost of a 7-year monthly retention is far lower than naive arithmetic suggests, because the unchanged blocks are stored once across the entire chain.
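
The GFS promotion rule (first successful snapshot of each day, week, month, and year) can be sketched as follows. This is an illustrative model under stated assumptions, not Cohesity's implementation: `gfs_promotions` is a hypothetical function, weeks follow ISO week numbering, and months and years are approximated as 30 and 365 days.

```python
from datetime import datetime, timedelta


def gfs_promotions(snapshots, keep_daily_days, keep_weekly_weeks,
                   keep_monthly_months, keep_yearly_years):
    """Map each promoted snapshot to (bucket, expiry).

    snapshots: datetimes of successful snapshots, in any order.
    A snapshot that is first in several buckets keeps the longest-lived one.
    """
    buckets = [
        ("daily", lambda t: t.date(), timedelta(days=keep_daily_days)),
        ("weekly", lambda t: tuple(t.isocalendar())[:2], timedelta(weeks=keep_weekly_weeks)),
        ("monthly", lambda t: (t.year, t.month), timedelta(days=30 * keep_monthly_months)),
        ("yearly", lambda t: t.year, timedelta(days=365 * keep_yearly_years)),
    ]
    result = {}
    ordered = sorted(snapshots)
    for name, period_of, keep in buckets:
        seen = set()
        for snap in ordered:
            period = period_of(snap)
            if period in seen:
                continue
            seen.add(period)              # first successful snapshot of this period
            expiry = snap + keep
            if snap not in result or expiry > result[snap][1]:
                result[snap] = (name, expiry)
    return result


snaps = [datetime(2024, 1, 1), datetime(2024, 1, 1, 12), datetime(2024, 1, 2),
         datetime(2024, 1, 8), datetime(2024, 2, 1)]
for snap, (bucket, expiry) in sorted(gfs_promotions(snaps, 7, 4, 12, 7).items()):
    print(snap, bucket, expiry)
```

Snapshots that are not the first of any period (the second run on 1 January here) do not appear in the result; they simply follow the policy's base retention.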

Policy Templates and Re-Use

A core architectural principle: one policy per SLA tier, not per workload. If you have 50 SQL servers, 200 file shares, and 1,200 VMs all in the “Gold” tier, they should all reference the same Gold policy. When the SLA changes — and it will — you edit one object instead of 1,450. Cohesity allows policies to be cloned and templated through the Helios global policy framework, so you can apply a single Gold definition across multiple clusters consistently [Source: https://docs.cohesity.com/baas/data-protect/policies.htm].

Time Zones and Blackout Windows

Schedules run in the cluster’s configured time zone; for global enterprises with clusters in Frankfurt, Singapore, and Virginia, ensure local schedules respect local maintenance windows. Cohesity supports blackout windows where backups are paused — for example, suspending VM backups during the nightly business close or a SAN firmware upgrade. Architects should document blackout policy in the runbook and confirm RPO is still achievable around the suspension.
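
Confirming that RPO is still achievable around a blackout window is simple arithmetic. The sketch below assumes a daily window expressed in the cluster's local time and assumes a deferred run starts as soon as the window closes; both function names are hypothetical.

```python
from datetime import time


def in_blackout(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside a daily window; handles windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def worst_case_gap_minutes(frequency_min: int, start: time, end: time) -> int:
    """Upper bound on the gap between backups when a run falls due just inside the window."""
    window = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)
    if window <= 0:
        window += 24 * 60             # window crosses midnight
    return frequency_min + window


print(in_blackout(time(23, 30), time(22, 0), time(2, 0)))
print(worst_case_gap_minutes(240, time(22, 0), time(2, 0)))
```

With a 4-hour schedule and a 22:00 to 02:00 blackout, the effective worst-case RPO is 8 hours, which is the figure the runbook should record rather than the nominal 4.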

Key Takeaway: Treat protection policies like SLA contracts: define one per service tier (Gold/Silver/Bronze), use GFS hierarchical retention, apply DataLock for immutability where compliance dictates, and reuse the same policy across many workloads instead of cloning per application.

The SLA Analogy

Think of a protection policy like an SLA contract that your backup service signs with the application teams. The contract specifies the deliverables (RPO, RTO), the warranty period (retention), the geographic redundancy clause (replication), and the long-term archive (compliance). A Protection Group, by contrast, is the customer roster: the list of accounts subscribed to that contract. The same Gold contract can serve a hundred rosters (Protection Groups), and changing the contract terms automatically updates every subscriber on them. This separation is what makes Cohesity scale operationally: contracts and rosters are decoupled.
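
The contract-and-roster decoupling can be shown with a small data model. The classes below are illustrative only (hypothetical names, not Cohesity API objects); the point is that each Protection Group holds a reference to a shared policy rather than a copy of it.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    name: str
    frequency_min: int
    retention_days: int


@dataclass
class ProtectionGroup:
    name: str
    policy: Policy                      # a reference to the shared policy, not a copy
    members: list = field(default_factory=list)


gold = Policy("Gold", frequency_min=15, retention_days=30)
groups = [
    ProtectionGroup("pg-sql-gold", gold, ["sql-01", "sql-02"]),
    ProtectionGroup("pg-erp-gold", gold, ["erp-app-01"]),
]

# One edit to the contract changes the terms for every roster that references it.
gold.retention_days = 45
print([g.policy.retention_days for g in groups])
```

Had each group carried its own cloned policy, the same SLA change would require one edit per group, which is exactly the per-workload sprawl the one-policy-per-tier principle avoids.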

Reference SLA Tier Design (Gold/Silver/Bronze)

The following table captures a battle-tested tiered policy reference that maps directly to common enterprise SLAs [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].

Tier	Frequency (RPO)	Local Retention	Replication	Archive	Target RTO	Typical Workloads
Gold (mission-critical)	Every 15 min via storage snapshot integration (SmartCopy)	30 days, app-consistent	Async replication to DR cluster, every cycle	CloudArchive Direct, monthly, 7+ years, DataLock Compliance	Minutes (Instant Mass Restore)	Tier-1 OLTP databases, ERP, EHR, payment systems
Silver (business-important)	Every 4–6 hours, hypervisor CBT + app quiesce	14–30 days	Async daily to DR cluster	CloudArchive monthly, 3–5 years	Less than 1 hour	Internal apps, file shares, VDI gold images, mid-tier DBs
Bronze (tier-3, dev/test)	Daily, crash-consistent acceptable	7–14 days	None or weekly to DR	CloudArchive Direct quarterly, 1 year	Less than 4 hours	Dev/test VMs, sandbox, ephemeral workloads

A common architect mistake is to over-engineer Bronze (giving every dev VM a 15-minute RPO “just in case”) or under-engineer Gold (relying on hypervisor backups for a database whose change rate makes a 6-hour RPO meaningless). Match the tier to the actual business cost of data loss.
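
Matching a workload to a tier amounts to picking the cheapest tier whose RPO and RTO both meet the requirement. The sketch below uses worst-case figures read from the reference table; the 15-minute Gold RTO is an assumed stand-in for "Minutes", and `select_tier` is a hypothetical helper.

```python
# (tier, RPO in minutes, RTO in minutes), ordered cheapest first.
# The Gold RTO of 15 minutes is an assumption standing in for "Minutes".
TIERS = [
    ("Bronze", 24 * 60, 4 * 60),
    ("Silver", 6 * 60, 60),
    ("Gold", 15, 15),
]


def select_tier(required_rpo_min: int, required_rto_min: int) -> str:
    """Return the cheapest tier whose RPO and RTO both meet the requirement."""
    for name, tier_rpo, tier_rto in TIERS:
        if tier_rpo <= required_rpo_min and tier_rto <= required_rto_min:
            return name
    raise ValueError("No reference tier meets this SLA; a custom design is needed")


print(select_tier(15, 30))        # tight RPO forces the top tier
print(select_tier(24 * 60, 30))   # relaxed RPO, but the RTO still forces the top tier
```

Evaluating RPO and RTO together matters: a workload with a daily RPO but a 30-minute RTO still lands in Gold, because neither Bronze nor Silver can restore it fast enough.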

Figure 7.2: Tiered policy decision tree — RPO/RTO requirements drive Gold/Silver/Bronze selection

graph TD
    A[Workload SLA Requirement] --> B{RPO needed?}
    B -->|<= 15 min| C{RTO needed?}
    B -->|4-6 hours| D[Silver Tier]
    B -->|>= 24 hours| E[Bronze Tier]
    C -->|Minutes<br/>Instant Mass Restore| F[Gold Tier]
    C -->|< 1 hour| D
    F --> G[SmartCopy + Pure<br/>30d local + DR replication<br/>7y archive + Compliance Lock]
    D --> H[CBT/RCT Hypervisor<br/>14-30d local + DR daily<br/>3-5y archive]
    E --> I[Daily crash-consistent<br/>7-14d local<br/>1y archive]
    style F fill:#d4af37,color:#000
    style D fill:#c0c0c0,color:#000
    style E fill:#cd7f32,color:#fff

Protection Groups

A Protection Group is the binding object that connects a set of source objects to a single policy. It also holds operational settings such as proxy assignment, indexing options, pre/post scripts, application-quiesce flags, and exclude lists. The architectural decision that dominates Protection Group design is how membership is determined: statically, by container, or by tag.

Static Membership vs. Tag-Based Auto-Protection

Static membership means the administrator hand-picks individual VMs (or volumes, or shares, or databases) at job-creation time. The Protection Group’s scope never changes unless someone edits it. This is highly deterministic — only the explicitly listed objects are ever backed up — and it is auditable: a regulator can ask “is VM pci-cardvault-01 protected?” and the answer is a binary yes/no based on the membership list.

The risk is silent under-protection. A new VM provisioned by a junior engineer in a regulated environment may not appear in any Protection Group for weeks until someone notices. For PCI cardholder data, HIPAA-regulated systems, or any workload with a documented compliance scope, static membership combined with an out-of-band reconciliation report (provisioned VMs minus protected VMs) is the safer pattern [Source: https://www.penguinpunk.net/blog/cohesity-basics-auto-protect/].

Auto-protect automatically protects new VMs added to selected parent objects (datacenters, folders, clusters, hosts, resource pools) and supports vSphere tags for inclusion and exclusion [Source: https://knowledge.broadcom.com/external/article/324666/cohesity-data-protection-solution-for-vm.html]. New VMs added to a selected container are automatically swept into the next backup run using the assigned policy. Removed or deleted VMs naturally fall out of scope. This is operationally elegant: provisioning automation can drop a VM into the right vCenter folder and protection “just happens.”

Auto-Protect with vSphere Tags

Tag-based auto-protect is the most powerful and the most surprising. Cohesity’s tag logic has a non-obvious quirk that frequently appears on the CCAE exam:

Adding tags one-by-one with exclude selected produces an OR operation — Cohesity excludes any VM with any of the listed tags.
Adding multiple tags simultaneously in a single operation produces an AND operation — Cohesity excludes only VMs with all the listed tags [Source: https://www.youtube.com/watch?v=_067ubHLEt4].

This matters in practice. If you want to exclude everything tagged dev or lab from your Gold Protection Group, add them one-by-one (OR). If you want to exclude only VMs that are both dev and decommissioned, add them together (AND). Misreading this distinction has caused architects to either over-protect (tagging an entire dev fleet into Gold) or under-protect (silently excluding production workloads).
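
The OR-versus-AND behavior described above is easy to model with set operations. The functions below are a hypothetical illustration of the two semantics, not Cohesity code.

```python
def excluded_one_by_one(vm_tags: set, exclude_tags: set) -> bool:
    """Tags added individually: OR semantics, any listed tag excludes the VM."""
    return bool(vm_tags & exclude_tags)


def excluded_together(vm_tags: set, exclude_tags: set) -> bool:
    """Tags added in one operation: AND semantics, every listed tag must be present."""
    return bool(exclude_tags) and exclude_tags <= vm_tags


vms = {
    "dev-web-01": {"dev"},
    "dev-old-02": {"dev", "decommissioned"},
    "prod-sql-01": {"prod"},
}
rule = {"dev", "decommissioned"}
for name, tags in vms.items():
    print(name, excluded_one_by_one(tags, rule), excluded_together(tags, rule))
```

With the same two tags, the OR form excludes both dev VMs while the AND form excludes only `dev-old-02`, which is the difference between the two outcomes the exam scenarios probe.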

VM Folders have been supported as parent objects since Cohesity DataProtect 4.0, propagating settings hierarchically [Source: https://www.cohesity.com/blogs/cohesity-dataprotect-4-0-extends-management-integration-vmware-vsphere/]. This is useful when vSphere admins encode tenancy or app boundaries into folder hierarchy.

Figure 7.3: Auto-protect via vSphere tags — dynamic membership update flow

flowchart LR
    A[vCenter Tag<br/>tier=gold] --> B[Cohesity Inventory Sync]
    B --> C{Tag Filter<br/>Include / Exclude}
    C -->|Match| D[Dynamic Membership<br/>Update]
    C -->|No Match| E[Excluded from PG]
    D --> F[Protection Group<br/>pg-gold-auto]
    F --> G[Next Backup Run<br/>New VMs Auto-Swept]
    H[New VM<br/>Provisioned + Tagged] -.-> A
    I[Untagged VM<br/>Removed] -.-> E
    style A fill:#1f6feb,color:#fff
    style D fill:#238636,color:#fff
    style F fill:#8957e5,color:#fff

When to Use Each Membership Model

Membership Model	Best For	Risks	Audit Posture
Static	PCI/HIPAA/SOX-scoped VMs; small, slow-changing high-value sets	New VMs silently unprotected	Strongest — explicit list
Container auto-protect (folder/cluster/RP)	Well-organized vSphere with folder-per-business-unit	Folder reorganization can shift scope	Good — provided folder hygiene
Tag auto-protect	Cross-cutting concerns (tier=gold, app=sql) where folder structure is contested	Tag drift, AND/OR confusion, double-coverage	Moderate — requires tag governance

A common hybrid pattern: use static membership for regulated workloads and tag-based auto-protect for everything else, with a nightly Helios report flagging any provisioned VM not covered by any Protection Group.
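
The reconciliation report (provisioned VMs minus protected VMs) is a set difference, and the double-coverage risk from the table can be flagged in the same pass. This sketch assumes the inventory and group memberships have already been exported as plain names; the function names are hypothetical.

```python
from collections import Counter


def unprotected_vms(provisioned: set, protection_groups: dict) -> set:
    """Provisioned VMs minus the union of all Protection Group members."""
    protected = set().union(*protection_groups.values())
    return provisioned - protected


def double_covered(protection_groups: dict) -> set:
    """VMs that appear in more than one Protection Group."""
    counts = Counter(vm for members in protection_groups.values() for vm in set(members))
    return {vm for vm, n in counts.items() if n > 1}


inventory = {"pci-cardvault-01", "web-01", "web-02", "dev-07"}
groups = {
    "pg-pci-static": {"pci-cardvault-01"},
    "pg-gold-auto": {"web-01", "pci-cardvault-01"},
}
print(sorted(unprotected_vms(inventory, groups)))
print(sorted(double_covered(groups)))
```

Running both checks nightly surfaces the two failure modes of the hybrid pattern: VMs that no group covers and VMs that a static group and a tag rule both claim.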

Indexing Options and Search Implications

When a Protection Group runs, Cohesity can index the contents of the backup at the file level. Indexing enables global search (“find me every file named customer.csv across every backup, anywhere”) and self-service granular restore. The cost is metadata storage and ingest CPU. For VMs containing only block-level databases, indexing has minimal value and can be disabled. For file servers, NAS, and end-user VMs, indexing is essential. Treat indexing as a deliberate per-Protection-Group decision: indexing everything by default tends to overspend on metadata, while selective indexing keeps search where it is actually used.

App-Consistent vs. Crash-Consistent Backups

A crash-consistent backup captures the disk state as if the system had been suddenly powered off — the filesystem and any applications must rely on their own crash-recovery mechanisms (journal replay, transaction rollback) to come back cleanly. For modern journaled filesystems and databases this usually works, but it is not guaranteed.

An app-consistent backup pauses the application briefly so it can flush buffers, checkpoint state, and quiesce I/O before the snapshot is taken. On Windows, this is VSS (Volume Shadow Copy Service); on Linux, it is typically pre-/post-freeze scripts; for databases, it is the database’s own quiesce API (RMAN, VDI, Backint). The result is a backup the application can mount cleanly without recovery work — essential for databases where transaction-log replay from backup must be precise.

Property	Crash-Consistent	App-Consistent
Application quiesce	None	Yes (VSS/agent/script)
Performance impact on source	Minimal	Brief pause (sub-second to several seconds)
Recovery cleanliness	Depends on app crash-recovery	Guaranteed clean
Databases	Not recommended (risky)	Required
General VMs	Often acceptable	Recommended
Ephemeral/dev workloads	Acceptable	Optional

Architects should default to app-consistent for any VM running a database or transactional system, and crash-consistent only where quiesce overhead is unacceptable or the workload is genuinely stateless.

Key Takeaway: Static membership is auditable but brittle; auto-protect is operationally elegant but requires tag/folder governance. Pair static membership with regulated workloads and auto-protect with everything else, and always choose app-consistent backups for database and transactional workloads.

Performance and Concurrency

A perfectly designed policy is worthless if backups run hot enough to crash production. Performance tuning sits at the intersection of source impact, network bandwidth, proxy capacity, and cluster ingest throughput.

SmartCopy and Storage Snapshot Integration

SmartCopy is Cohesity’s snapshot-based copy and replication mechanism that integrates directly with primary storage arrays — most prominently Pure Storage FlashArray, but also NetApp, HPE Nimble/Primera, and Dell PowerStore via partner integrations [Source: https://www.cohesity.com/newsroom/press/cohesity-unveils-native-integration-pure/]. Rather than running a hypervisor-side or in-guest backup that competes with production I/O, Cohesity drives the array’s own snapshot APIs and ingests data from those snapshots.

The architecture flow is elegant [Source: https://www.cohesity.com/blogs/cohesity-pure-storage-data-protections-speed-flash/]:

Discovery — Register the Pure FlashArray as a source in Cohesity; the cluster enumerates volumes and volume groups.
Policy assignment — Assign a Protection Policy to the chosen Pure volumes; the policy defines snapshot frequency, retention on the array, retention on Cohesity, and optional replication and archive.
Snapshot creation — At schedule time, Cohesity calls the Pure REST API to take a snapshot. Optional pre/post scripts quiesce SQL/Oracle/Exchange to make the snapshot app-consistent.
Mount and read — Cohesity mounts the snapshot (typically via iSCSI) to the cluster, reads only changed blocks (using array-side change tracking), and ingests through inline dedupe and compression.
Retention tiering — A few recent snapshots remain on Pure for instant restore at flash speed; older snapshots are aged out on the array but retained on Cohesity for long-term recovery.
Recovery — Restore is volume-level back to any Pure FlashArray (same or DR site), file-level via SmartFiles mount, or cross-platform to native cloud VMs.

The exam-relevant point: SmartCopy enables sub-15-minute RPOs with zero hypervisor overhead and is the canonical Gold-tier mechanism for transactional databases sitting on Pure. It also means the Cohesity cluster does not need the bandwidth or proxy capacity to ingest a full hypervisor stream every 15 minutes — it ingests only the snapshot delta.
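The bandwidth argument can be made concrete with a little arithmetic. The sketch below compares the sustained throughput needed to move a whole volume every 15 minutes with the throughput needed for the delta alone. The 2 TB volume size and the 1% change rate per interval are illustrative assumptions, not Cohesity figures, and the calculation ignores dedupe and compression.

```python
def required_mbps(data_gb: float, window_minutes: float) -> float:
    """Sustained throughput in megabits/s needed to move data_gb within the window."""
    bits = data_gb * 8 * 1000**3        # decimal gigabytes to bits
    return bits / (window_minutes * 60) / 1_000_000

# Illustrative numbers only: a 2 TB volume with 1% of blocks changing per interval.
VOLUME_GB = 2000
CHANGE_RATE = 0.01

full_copy = required_mbps(VOLUME_GB, 15)
delta_only = required_mbps(VOLUME_GB * CHANGE_RATE, 15)
print(f"full copy every 15 min:  {full_copy:,.0f} Mbps")
print(f"delta only every 15 min: {delta_only:,.0f} Mbps")
```

Under these assumptions the delta needs roughly one hundredth of the bandwidth, which is why a 15-minute cadence is practical on array snapshots but not on full streams.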

Figure 7.4: SmartCopy with Pure FlashArray — orchestration sequence

sequenceDiagram
    participant App as SQL/Oracle App
    participant Cohesity as Cohesity Cluster
    participant Pure as Pure FlashArray
    participant Archive as CloudArchive (S3)
    Cohesity->>App: Pre-script: Quiesce (VSS/RMAN)
    App-->>Cohesity: Quiesce ACK
    Cohesity->>Pure: REST API: Take Snapshot
    Pure-->>Pure: Native array snapshot created
    Pure-->>Cohesity: Snapshot ID
    Cohesity->>App: Post-script: Release quiesce
    Cohesity->>Pure: Mount snapshot (iSCSI)
    Pure-->>Cohesity: Stream changed blocks only
    Cohesity-->>Cohesity: Inline dedupe + compression
    Note over Pure: Recent snapshots retained<br/>on flash for instant restore
    Note over Cohesity: Older snapshots tier<br/>to Cohesity DataPlatform
    Cohesity->>Archive: Tier monthly to S3 Glacier

A subtle but important note: “SmartFiles SmartCopy” is not a distinct Pure feature; the integration is implemented through Cohesity’s Protection Policies [Source: https://www.cohesity.com/blogs/cohesity-pure-storage-data-protections-speed-flash/]. SmartFiles can act as the immutable backup target with DataLock/WORM, while SmartCopy is the orchestration layer.

CBT/RCT and Incremental Forever

For non-array-integrated VM backups, Cohesity uses Changed Block Tracking (CBT) on VMware and Resilient Change Tracking (RCT) on Hyper-V. The hypervisor maintains a bitmap of changed blocks since the last backup; Cohesity reads only those changed blocks. Combined with global variable-length deduplication on ingest, this delivers an Incremental Forever model: a single full backup at job inception, then deltas only — and even those deltas are deduped against existing cluster data.

CBT can occasionally desynchronize (after a storage vMotion, a snapshot consolidation failure, or certain VMware patches), and Cohesity will then need to fall back to a full read or a CBT reset. Architects should monitor cbt_reset events and budget for occasional full reads on a small percentage of jobs.
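The changed-block bitmap and the full-read fallback can be modeled in a few lines. This is a simplified sketch, not the VMware or Hyper-V API: the function name and the convention that a missing bitmap means "tracking was reset" are assumptions made for illustration.

```python
from typing import Optional

def blocks_to_read(total_blocks: int, bitmap: Optional[list[bool]]) -> list[int]:
    """Pick the block indexes a backup run must read.

    A bitmap of None (or one whose length does not match the disk) models
    lost or invalid change tracking, which forces a full read of every block.
    """
    if bitmap is None or len(bitmap) != total_blocks:
        return list(range(total_blocks))
    return [index for index, dirty in enumerate(bitmap) if dirty]

# Normal incremental: only blocks 1 and 3 changed since the last backup.
print(blocks_to_read(5, [False, True, False, True, False]))
# After a tracking reset there is no trustworthy bitmap, so every block is read.
print(blocks_to_read(5, None))
```

The second call is the case to budget for: the run still succeeds, but it costs a full read of the disk.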

Job Concurrency and Proxy Distribution

Cohesity backups run as parallel streams. Concurrency is governed at multiple layers:

Policy-level concurrency — how many objects in a Protection Group run simultaneously.
Cluster-level concurrency — total simultaneous streams across the cluster.
Source-level concurrency — typically governed by the per-datastore stream cap discussed in the registration section.
Proxy-level concurrency — for environments using physical or virtual backup proxies (rare in modern Cohesity, more common in legacy hybrid deployments).

Tuning concurrency is iterative: start with defaults, watch for source saturation (vCenter task queues, datastore latency, NAS array CPU), and adjust caps where bottlenecks appear. The most common mistake is to crank concurrency to maximize throughput on day one and then spend three months chasing latency complaints from the storage team.
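Because the layers stack, the number of streams that actually run is set by the tightest cap. A minimal sketch of that reasoning follows; the layer names and numbers are hypothetical and do not correspond to Cohesity settings.

```python
def effective_streams(queued_objects: int, **caps: int) -> tuple[int, str]:
    """Return the stream count that will actually run and the layer that limits it."""
    limiting_layer = min(caps, key=caps.get)   # layer with the smallest cap
    limit = caps[limiting_layer]
    if queued_objects <= limit:
        return queued_objects, "queue"         # nothing is being throttled
    return limit, limiting_layer

# Hypothetical caps: 40 VMs queued, but the per-datastore cap is the bottleneck.
print(effective_streams(40, policy=20, cluster=64, source=8, proxy=16))
```

Knowing which layer is limiting tells you which cap to adjust first; raising any other cap changes nothing.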

Throttling and QoS

Cohesity supports time-windowed bandwidth throttling for replication and archive traffic — for example, capping replication to 200 Mbps during business hours and lifting the cap overnight. Per-policy QoS lets you mark Gold backups higher-priority than Bronze on a shared cluster, so when contention arises, low-tier jobs slow first. This is essential when a single cluster serves multiple SLA tiers, which is the norm in real enterprise deployments.
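A time-windowed cap reduces to a lookup on the current time. The sketch below models the 200 Mbps business-hours example; the 08:00 to 18:00 window and the function name are assumptions for illustration, not a Cohesity configuration format.

```python
from datetime import time
from typing import Optional

# Hypothetical schedule: (window start, window end, cap in Mbps).
BUSINESS_HOURS = (time(8, 0), time(18, 0), 200)

def replication_cap_mbps(now: time, window=BUSINESS_HOURS) -> Optional[int]:
    """Return the bandwidth cap in effect at `now`, or None when uncapped."""
    start, end, cap = window
    return cap if start <= now < end else None

print(replication_cap_mbps(time(10, 30)))   # inside the window
print(replication_cap_mbps(time(23, 0)))    # overnight, no cap
```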

Key Takeaway: Use SmartCopy with primary array snapshots for sub-15-minute RPOs without hypervisor overhead; use CBT/RCT incremental-forever for everything else; cap streams at the source (per-datastore) and tune concurrency iteratively rather than maximizing day-one.

Worked Example: Designing a Gold Policy

Let us design a Gold policy end-to-end against a concrete requirement.

Requirement. A financial-services customer runs SQL Server transactional databases on Pure FlashArray volumes. The application owner demands:

15-minute RPO for the database.
30 days of daily on-cluster recovery points.
1 year of monthly archives for regulatory review.
Cross-region replication to a DR cluster in a paired region.
App-consistent backups (no crash-consistent compromises).
Immutable backups for ransomware resilience.

Design.

Source registration: Register the Pure FlashArray as a source in Cohesity. Register the SQL Server hosts as physical sources with the Cohesity Agent so pre/post scripts can run VSS-coordinated quiesce.
Membership: Create a Protection Group pg-sql-gold-payments with static membership of the specific Pure volumes hosting the payment database files and logs. (Static, because PCI scope demands deterministic membership.)
Frequency: Configure the policy pol-gold-15min with a 15-minute snapshot frequency using SmartCopy against the Pure FlashArray. Pre-script triggers SQL VSS quiesce; post-script releases.
Local retention: 30 days, daily-extended retention pinned (the first successful snapshot of each day held for 30 days; intra-day snapshots aged out after 24 hours to control storage growth).
Monthly archive: Add CloudArchive to S3 with a Glacier lifecycle, monthly cadence, 1 year retention, Compliance DataLock enabled.
Replication: Add an async replication target to the DR Cohesity cluster, replicating every cycle; DR cluster retains 7 days of recovery points for failover.
Quiesce: App-consistent — pre-script triggers SQL VSS quiesce, post-script releases; if the script fails, the policy is configured to fail the run rather than fall back to crash-consistent (a deliberate choice for Gold).
Indexing: Disabled at the volume level (block backups don’t benefit from file-level index); SQL granular recovery handled at the database layer through Cohesity’s SQL integration.
Concurrency: Cap the Pure datastore at 8 streams during business hours, 16 overnight; replication throttled to 500 Mbps daytime, uncapped overnight.

The result: a single named policy (pol-gold-15min) bound to a single Protection Group (pg-sql-gold-payments) that delivers a 15-minute RPO, near-instant recovery by mounting a snapshot, a 1-year compliance archive, cross-region DR, and ransomware resilience through DataLock immutability. Adding the next Gold workload (say, the Oracle ERP database) is a matter of registering its Pure volumes and creating a second Protection Group bound to the same pol-gold-15min policy. The contract is reused; only the customer roster grows.
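It helps to check how many recovery points this design keeps. The sketch below captures the worked example as a small data class and derives the counts. The field names are illustrative and are not Cohesity API fields, and the DR count assumes every 15-minute run is replicated and kept for the full 7 days.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldPolicy:
    # Illustrative field names; values come from the worked example.
    name: str = "pol-gold-15min"
    snapshot_interval_min: int = 15
    intraday_retention_hours: int = 24
    daily_retention_days: int = 30
    dr_retention_days: int = 7

    def intraday_points(self) -> int:
        """Snapshots held in the rolling intra-day window."""
        return self.intraday_retention_hours * 60 // self.snapshot_interval_min

    def max_local_points(self) -> int:
        """Upper bound on local recovery points: intra-day window plus daily pins."""
        return self.intraday_points() + self.daily_retention_days

    def dr_points(self) -> int:
        """Recovery points on the DR cluster if every run replicates."""
        return self.dr_retention_days * 24 * 60 // self.snapshot_interval_min

policy = GoldPolicy()
print(policy.intraday_points(), policy.max_local_points(), policy.dr_points())
```

Aging out intra-day snapshots after 24 hours is what keeps the local count near 126 rather than the 2,880 that 30 days of 15-minute snapshots would produce.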

Chapter Summary

This chapter unpacked the trio of objects that drive Cohesity data protection: Sources (registered hypervisors, hosts, NAS, databases), Policies (RPO/retention/replication/archive contracts), and Protection Groups (the binding object that subscribes a set of sources to a policy).

You learned how to register vCenter, SCVMM, Prism, physical hosts, NAS, and databases — and the importance of per-datastore stream caps set at registration time. You walked through the policy structure: GFS hierarchical retention, DataLock immutability, blackout windows, and the tiered Gold/Silver/Bronze SLA model. You compared static membership against container- and tag-based auto-protect, including the AND/OR tag logic that catches careless architects on the exam. You learned why app-consistent backups are non-negotiable for databases. And you traced how SmartCopy with Pure FlashArray (and similar primary-array integrations) enables 15-minute RPOs without hypervisor overhead, while CBT/RCT incremental-forever handles the rest of the estate.

Hold the analogy in mind: a policy is the SLA contract; a protection group is the customer roster. Design contracts once per service tier, reuse them across many rosters, and let the platform’s dedupe, CBT, and SmartCopy mechanics make the math work. In Chapter 8 we move from generic protection into application-aware backup and recovery, where the database engines and SaaS endpoints have their own quirks and quiesce paths.

Key Terms

Protection Group: The binding object that associates a set of source objects (VMs, volumes, shares, databases) with a Protection Policy and execution settings (proxy, indexing, scripts).
Policy: A named, reusable SLA definition specifying frequency (RPO), retention, replication, archive, and immutability rules. One policy is typically applied to many Protection Groups.
RPO (Recovery Point Objective): The maximum acceptable data loss measured in time. Cohesity supports as low as 15 minutes for hypervisor backups and tighter for storage-snapshot-integrated sources.
RTO (Recovery Time Objective): The maximum acceptable downtime. Cohesity targets minutes for Instant Mass Restore, scaling longer for archive recoveries.
CBT (Changed Block Tracking): VMware’s mechanism for identifying blocks that have changed since the last snapshot, enabling incremental-forever backups. Hyper-V’s equivalent is RCT (Resilient Change Tracking).
Auto-protect: Cohesity feature that dynamically tracks vCenter inventory containers (folders, clusters, hosts, resource pools) or vSphere tags so newly added VMs are automatically protected without administrator intervention.
App-consistent: A backup that captures the source after applications have been quiesced (via VSS, RMAN, VDI, Backint, or pre/post scripts), guaranteeing clean recovery without crash-recovery work.
SmartCopy: Cohesity’s snapshot-based copy and replication mechanism that integrates with primary storage arrays (Pure FlashArray, NetApp, HPE, Dell) to drive native array snapshots into the Cohesity DataPlatform with minimal production impact.

Chapter 8: Application-Aware Backup and Recovery Patterns

The CCAE exam expects an architect to do far more than schedule a nightly snapshot. Real-world Cohesity designs must speak the native protocols of Oracle, SQL Server, SAP HANA, Microsoft Exchange, and Microsoft 365 — and they must turn those backups into instantly usable recovery products: bootable VMs, mounted NFS exports, writable clones, and individually restorable mailbox items. This chapter walks through how Cohesity wires itself into each application stack, why those wires matter for RPO/RTO, and how the Instant Mass Restore and clone primitives transform a passive backup repository into an active recovery and dev/test platform.

Learning Objectives

By the end of this chapter, you should be able to:

Design backup strategies for Oracle (RMAN), Microsoft SQL Server (VDI/AAG), SAP HANA (Backint), Exchange, and Microsoft 365 (Mailbox/OneDrive/SharePoint/Teams).
Choose between target-side and source-side deduplication for database workloads and configure the corresponding RMAN channels, ports, and SBT libraries.
Implement Cohesity Instant Mass Restore and contrast it with VMware-native vSphere Instant Recovery on scale, performance, and orchestration.
Recover individual files, mailboxes, and database objects using indexed search and point-in-time recovery (PITR) workflows.
Validate recoveries with run-books, automated test failovers, and clone-based dev/test pipelines that consume storage only for changed blocks.

8.1 Database Workloads: Oracle, SQL Server, and SAP HANA

Database protection on Cohesity is fundamentally about meeting the database engine on its own terms. Oracle wants to drive its own backup via RMAN. SQL Server expects an application to call its VDI (Virtual Device Interface). SAP HANA mandates a Backint-compliant target. The common thread is that Cohesity does not pretend to be a generic file system to these engines — it presents itself through each engine’s native API so backups and restores remain application-consistent and supportable by the database vendor.

8.1.1 Oracle RMAN Integration

Oracle’s Recovery Manager (RMAN) is the canonical backup driver for Oracle databases. Cohesity integrates with RMAN by registering itself as an SBT (System Backup to Tape) target — the same interface RMAN uses for tape libraries — while still providing disk-class performance and global deduplication on the back end [Source: https://www.cohesity.com/resources/solution-brief/Cohesity-Oracle-Databases-Solution/].

The Cohesity Remote Adapter built into the DataPlatform consolidates RMAN scripts, schedules, and alerts under a single console, eliminating the per-host crontab sprawl that plagues legacy Oracle backup operations [Source: https://www.cohesity.com/blogs/title-streamlining-oracle-database-protection-recovery-cohesity-oracle-rman/]. When you create a protection job, Cohesity auto-selects an active single-instance Oracle node and configures the number of RMAN channels for the database object. For RAC clusters, the architect can manually pin specific nodes and tune the channel count and SBT library path. Channel count is the primary throughput lever: more channels increase parallelism but also drive up CPU and network load on the Oracle host.

Cohesity supports two deduplication paths for RMAN, and choosing between them is a CCAE-level design decision:

Path	How RMAN Sees It	Where Dedupe Happens	When to Use
Target-side dedupe	Cohesity exports an NFS view; the Oracle host mounts it. RMAN writes backup pieces to the mount.	Inline on the Cohesity cluster as data lands.	Default for low-latency LANs, when CPU budget on the Oracle host is constrained, or when DBAs want zero source-side software.
Source-side dedupe	The Cohesity Oracle Source-Side Dedupe plugin is an SBT library installed on the Oracle host.	On the Oracle host before bytes traverse the network.	WAN-attached database servers, bandwidth-constrained sites, or when the Oracle host has spare CPU and the network is the bottleneck.

Both paths leverage Cohesity’s global variable-length deduplication and compression, so the on-cluster footprint is identical regardless of which side did the fingerprinting [Source: https://www.cohesity.com/blogs/explaining-cohesitys-space-efficient-target-source-side-dedupe-integration-oracle-rman/].

Network requirements are precise and exam-relevant: the Cohesity Linux Agent on the Oracle host requires inbound TCP 50051 for backup operations and 59999 for self-monitoring. Miss either port and discovery silently degrades or RMAN sessions hang [Source: https://docs.cohesity.com/baas/data-protect/oracle-requirements.htm].
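A quick pre-flight check of both ports can save a failed registration. The following is a minimal sketch using only the Python standard library; it tests TCP reachability and says nothing about whether the agent itself is healthy. Run it from a machine on the same network path the cluster will use.

```python
import socket

REQUIRED_PORTS = (50051, 59999)   # agent ports named in the text

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_agent_ports(host: str, ports=REQUIRED_PORTS) -> dict[int, bool]:
    """Map each required port to its reachability result."""
    return {port: port_open(host, port) for port in ports}
```

For example, `check_agent_ports("oracle-db01.example.com")` (a hypothetical host name) returns a dictionary such as `{50051: True, 59999: False}`, which points directly at the firewall rule that is missing.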

Figure 8.1: Oracle RMAN backup data flow through SBT library to Cohesity cluster

sequenceDiagram
    participant DBA as DBA / Scheduler
    participant RMAN as Oracle RMAN
    participant SBT as SBT Library<br/>(Cohesity plugin)
    participant Agent as Cohesity Linux Agent<br/>(ports 50051/59999)
    participant Cluster as Cohesity Cluster<br/>(SnapTree view)

    DBA->>RMAN: BACKUP DATABASE PLUS ARCHIVELOG
    RMAN->>RMAN: Allocate N channels<br/>(parallelism lever)
    RMAN->>SBT: sbtopen / sbtwrite (backup pieces)
    alt Source-side dedupe
        SBT->>SBT: Variable-length fingerprint
        SBT->>Agent: Send unique blocks only
    else Target-side dedupe (NFS mount)
        SBT->>Agent: Stream all blocks via NFS
        Note over Cluster: Inline dedupe on the cluster at landing
    end
    Agent->>Cluster: Write to protection view
    Cluster-->>Agent: Ack + catalog metadata
    Agent-->>SBT: sbtwrite OK
    SBT-->>RMAN: Piece complete
    RMAN-->>DBA: Backup successful (catalog updated)

The integration supports full and incremental backups whether or not Oracle Block Change Tracking (BCT) is enabled, but BCT is strongly recommended for large databases because it limits incremental reads to changed blocks, dramatically shrinking the backup window. Archive log backups can be scheduled independently from datafile backups inside the same protection policy. This lets you hit aggressive RPO targets (e.g., 15-minute archive log backups) without running datafile backups at the same frequency.

Key Takeaway: Cohesity speaks RMAN natively via the SBT interface. Architects choose between target-side dedupe (NFS mount, simple, LAN-friendly) and source-side dedupe (host-side plugin, WAN-friendly), tune RMAN channel count for parallelism, and always open ports 50051 and 59999 to the Cohesity Linux Agent.

8.1.2 SQL Server: VDI and Always On Availability Groups

For Microsoft SQL Server, Cohesity uses the Virtual Device Interface (VDI) — the same native API that Microsoft Backup, Veeam, and other enterprise products use. The Cohesity Windows Agent registers as a VDI client; SQL Server streams its own backup to the agent, which then writes to a Cohesity view. This produces a SQL-consistent backup that includes the transaction log chain needed for point-in-time recovery.

Always On Availability Groups (AAG) add a wrinkle: the same database exists on multiple replicas. Cohesity’s AAG-aware protection can target the preferred backup replica configured in the AAG, the primary replica, or a secondary replica; among eligible secondaries, SQL Server prefers the one with the highest backup priority value. The protection policy understands that backing up from a secondary offloads I/O from the primary while still producing a usable backup chain. For point-in-time recovery, log backups are taken from whichever replica is currently selected for backups, and Cohesity reconstructs the chain across replica failovers.

Granular SQL recovery options include:

Full database restore — to original or alternate instance, with optional file relocation.
Point-in-time restore — apply log backups up to a specific timestamp.
Instant volume mount — present the backup as an iSCSI/SMB share so DBAs can attach it as a database without copying data, the SQL equivalent of VM Instant Recovery.
Object-level recovery — extract individual tables or schemas via Cohesity’s database object-level recovery workflow.

A common architectural pattern: tier 1 OLTP databases run on AAG with a Cohesity-backed secondary replica taking log backups every 5 minutes; tier 2 reporting databases run as standalone instances with daily full + hourly log policies. Both use the same Cohesity protection policy template with frequency and retention overrides.

8.1.3 SAP HANA Backint Integration

SAP HANA’s official backup interface is Backint, an SAP-certified shared library that HANA dynamically loads when triggered to back up. Cohesity provides a Backint agent that HANA loads, and HANA streams its native backup format directly to the Cohesity cluster. Because Backint is the only SAP-supported third-party backup path for HANA, this is non-negotiable for production SAP support: protecting HANA solely through crash-consistent VM snapshots will technically work but is not supported by SAP for production restores.

Backint integration supports:

Data backups — full and incremental backups of HANA tenants and the system database.
Log backups — continuous redo log shipping for PITR.
Catalog backups — HANA’s backup catalog is replicated to Cohesity so it can be reconstructed during DR.
Multi-tenant container databases (MDC) — each tenant is protected independently.

The same target-side vs. source-side dedupe distinction applies as with Oracle, though most HANA deployments use the target-side path because HANA hosts already run hot and DBAs are reluctant to add CPU load with source-side fingerprinting.

8.1.4 Log Backups and Point-in-Time Recovery

PITR is the differentiator between “I have a backup” and “I can restore my business to the moment before the bad transaction.” For Oracle, SQL Server, and HANA, Cohesity protection policies expose log backup frequency as an independent dial from full/incremental cadence. A typical tier-1 database policy looks like:

Backup Type	Frequency	Retention
Full	Weekly (Sunday 22:00)	4 weeks
Incremental	Daily (22:00)	14 days
Log	Every 15 minutes	7 days
Archive copy	Monthly	7 years

This pattern hits a 15-minute RPO with 7 days of fine-grained PITR, weekly full reset for chain hygiene, and a monthly archive copy for long-term compliance retention.
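Working out which backups a given restore needs is a useful way to internalize the table. The sketch below selects a restore chain for a target time. It assumes each incremental captures changes since the previous backup (so all incrementals after the full are applied in order) and that each log backup is identified by its completion time; both are simplifying assumptions, and the function is not a Cohesity API.

```python
from datetime import datetime

def restore_chain(target, fulls, incrementals, logs):
    """Pick the backups needed to recover to `target`.

    Each list holds backup completion timestamps.
    Returns (full, incrementals_to_apply, logs_to_replay).
    """
    candidates = [f for f in fulls if f <= target]
    if not candidates:
        raise ValueError("no full backup precedes the target time")
    full = max(candidates)
    incs = sorted(i for i in incrementals if full < i <= target)
    replay_from = incs[-1] if incs else full
    needed_logs = []
    for log in sorted(l for l in logs if l > replay_from):
        needed_logs.append(log)
        if log >= target:          # first log that covers the target time
            break
    return full, incs, needed_logs

full, incs, logs = restore_chain(
    target=datetime(2024, 6, 5, 10, 7),
    fulls=[datetime(2024, 6, 2, 22, 0)],
    incrementals=[datetime(2024, 6, 3, 22, 0), datetime(2024, 6, 4, 22, 0)],
    logs=[datetime(2024, 6, 5, 10, 0), datetime(2024, 6, 5, 10, 15)],
)
print(full, len(incs), len(logs))
```

The longer the gap since the last data backup, the more logs must be replayed, which is the trade-off between log frequency, incremental cadence, and recovery time.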

Key Takeaway: Each database engine has a canonical native API — RMAN for Oracle, VDI for SQL Server, Backint for HANA. Cohesity speaks all three. Independent log backup cadence is the lever that converts a daily backup into a minute-grained point-in-time recovery capability.

8.2 Microsoft 365 and SaaS Workloads

Microsoft 365 occupies a unique architectural position: the data lives entirely in Microsoft’s cloud, accessed only through Graph API and EWS, with platform-imposed throttling and a shared-responsibility model that explicitly puts third-party backup on the customer. Cohesity protects M365 through Cohesity DataProtect on-premises, Cohesity DataProtect as a Service (the SaaS offering), or a hybrid that integrates with Microsoft 365 Backup Storage (MBS) for high-throughput recovery [Source: https://www.cohesity.com/solutions/microsoft-365/].

8.2.1 Mailbox, OneDrive, SharePoint, and Teams Coverage

Cohesity protects four primary M365 workloads, each with workload-specific recovery semantics:

Workload	Granularity of Recovery	Notable Behavior
Exchange Online (mailboxes)	Mailbox, folder, single message, attachment	Independent retention from Microsoft’s native; global keyword search across the mailbox corpus [Source: https://exchangesavvy.com/cohesitys-microsoft-365-backup-your-solution-for-rapid-recovery/].
OneDrive for Business	File or folder with full ACL/permission fidelity	MBS integration delivers up to 3 TB/hour restore throughput by bypassing Graph throttling [Source: https://www.cohesity.com/resources/solution-brief/microsoft-365/].
SharePoint Online	Site, document library, list item, page	Supports protecting all child objects of a site and restoring to original or alternate location.
Microsoft Teams	Channel messages, files, tabs, underlying SharePoint site	Important caveat: restoring a fully deleted Team requires that an admin manually re-create the Teams/Groups container in M365 first. Graph API does not let third-party apps recreate the Group object [Source: https://www.cohesity.com/blogs/the-practitioners-guide-to-microsoft-365-teams-and-groups-data-protection/].

Figure 8.3: Microsoft 365 protection via Graph API across Exchange, OneDrive, SharePoint, and Teams

sequenceDiagram
    participant Cohesity as Cohesity DataProtect
    participant Entra as Entra ID<br/>(service principal)
    participant Graph as Microsoft Graph API
    participant MBS as M365 Backup Storage<br/>(MBS APIs)
    participant Tenant as M365 Workloads

    Cohesity->>Entra: Acquire app-permission token
    Entra-->>Cohesity: OAuth2 access token
    Cohesity->>Graph: Enumerate users / sites / teams
    Graph-->>Cohesity: Object inventory
    par Mailbox protection
        Cohesity->>Graph: Read messages (per-user parallel)
        Graph->>Tenant: Exchange Online mailbox
        Tenant-->>Graph: Items + attachments
        Graph-->>Cohesity: Indexed mailbox data
    and OneDrive / SharePoint
        Cohesity->>MBS: Snapshot file content
        MBS->>Tenant: OneDrive / SharePoint stores
        Tenant-->>MBS: Files + ACLs (3 TB/hr)
        MBS-->>Cohesity: Bulk content (bypasses throttling)
    and Teams
        Cohesity->>Graph: Channel messages + tabs
        Graph->>Tenant: Teams + underlying SharePoint
        Tenant-->>Graph: Messages, files, metadata
        Graph-->>Cohesity: Teams payload
    end
    Cohesity->>Cohesity: Index + dedupe + write to view
    Note over Cohesity,Graph: 429 backoff with exponential delay<br/>parallelize across users not requests

8.2.2 MFA, Graph API Limits, and Authentication

Cohesity registers the M365 tenant as a source using a registered Entra ID application (service principal) with the appropriate Graph API permissions. Modern authentication and MFA are mandatory for the consent flow, but the running protection job itself uses application-permission tokens that are not subject to interactive MFA — which is exactly what you want for unattended nightly backups.

Graph API throttling is the most common operational pain point. Microsoft applies per-tenant and per-app throttles measured in requests per minute. Cohesity mitigates throttling by:

Parallelizing across users rather than across requests for a single user — Graph throttles per-user mailbox access more aggressively than fan-out enumeration.
Backing off and retrying when 429 responses arrive, with exponential delays.
Using Microsoft 365 Backup Storage (MBS) for OneDrive, SharePoint, and Teams files — MBS uses Microsoft’s storage-side restore APIs that bypass Graph throttling entirely. This is where the 3 TB/hour figure originates.
Indexing on the Cohesity side so that search and selective restore do not require additional Graph calls during recovery.
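The back-off-and-retry behavior is a standard pattern that is easy to sketch. The code below is a generic illustration, not Cohesity's implementation: `Throttled` and `call_with_backoff` are hypothetical names, and the caller supplies a function that raises `Throttled` when it receives an HTTP 429, optionally carrying the server's Retry-After hint.

```python
import random
import time
from typing import Optional

class Throttled(Exception):
    """Raised by the caller-supplied request function on an HTTP 429."""
    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after

def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request() and retry on Throttled with exponential delay plus jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Throttled as exc:
            if attempt == max_attempts - 1:
                raise                                  # out of retries
            delay = min(max_delay, base_delay * 2 ** attempt)
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)    # honor the server's hint
            sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter keeps many parallel workers from retrying in lockstep, which matters when the backup is fanned out across thousands of users.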

8.2.3 Auto-Protection and Granular Restore

A critical design pattern is policy-driven auto-protection. When a new user is provisioned in the tenant, their mailbox and OneDrive are automatically discovered and added to the protection group — no manual onboarding ticket. This eliminates the operational drift problem that plagues static include-list backup designs [Source: https://www.cohesity.com/dm/tip-sheets/4-ways-to-back-up-microsoft-365/].
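Conceptually, auto-protection is a reconciliation between the tenant's current inventory and the protected set. A minimal sketch, with hypothetical names and no claim about how Cohesity implements discovery:

```python
def reconcile(tenant_users: set[str], protected: set[str]) -> tuple[set[str], set[str]]:
    """Return (users to add to protection, protected users no longer in the tenant)."""
    to_add = tenant_users - protected
    departed = protected - tenant_users
    return to_add, departed

to_add, departed = reconcile(
    tenant_users={"alice", "bob", "carol"},
    protected={"alice", "bob", "dave"},
)
print(sorted(to_add), sorted(departed))
```

What happens to the departed set is a retention decision rather than a discovery one: existing backups for a removed user should follow the policy's retention, not be dropped the moment the account disappears.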

Granular restore options match the workload’s natural granularity. For Exchange, an admin can search globally for “subject contains ‘Q4 forecast’”, select the matching message from a specific user’s mailbox at a specific point in time, and restore it back to the original folder, an alternate folder, or an alternate mailbox entirely.

8.2.4 Salesforce and Other SaaS Adapters

Beyond M365, Cohesity adapters cover Salesforce (object-level metadata and data backup), Microsoft Entra ID (users, groups, configuration), and additional SaaS targets. These adapters share architectural traits: API-driven enumeration, indexed backups for search, throttling-aware schedulers, and granular restore to original or alternate tenants.

Key Takeaway: M365 protection lives or dies on three things: handling Graph API throttling (use MBS where possible), auto-protecting new users via policy, and remembering that Teams/Groups containers cannot be re-created by third-party tools — admin must recreate the shell before granular Teams restore can land.

8.3 Instant Recovery Mechanics

Backups become recovery products through one of two primitives: Instant Mass Restore (IMR) for VMs and Clone for VMs and databases. Both rely on Cohesity’s SnapTree metadata structure, which provides O(1) snapshot access regardless of snapshot depth — every snapshot is a fully hydrated, instantly mountable view, not a delta chain that requires walking [Source: https://www.cohesity.com/blogs/instant-recovery-unlimited-vms-point-time-distributed-resilient-data-store/].

8.3.1 Instant Mass Restore for VMs

Instant Mass Restore is the marquee feature for VMware bulk recovery. It works by presenting an NFS datastore from the Cohesity cluster directly to the ESX/ESXi hosts, registering VMs from the backup metadata, powering them on, and then Storage vMotion-ing them back to primary storage in the background [Source: https://www.cohesity.com/blogs/cohesity-instant-mass-restore-better-solution-to-an-old-problem/].

The five-step automated workflow:

Present an NFS datastore to ESX/ESXi hosts — the Cohesity cluster acts as a scale-out NFS server exporting a view of the backup snapshot.
Create new VMs from backup metadata, registering them with vCenter.
Power on VMs from the temporary NFS datastore — workloads are live and serving users.
Storage vMotion the running VMs back to primary storage at the customer’s chosen pace.
Clean up the temporary NFS datastore after migration completes.

From clicking “Recover” to having recoveries in progress takes approximately 30 seconds. Subsequent steps are fully automated [Source: https://www.cohesity.com/blogs/cohesity-instant-mass-restore-better-solution-to-an-old-problem/].

Figure 8.2: Instant Mass Restore flow from Cohesity NFS export to primary storage

flowchart LR
    A[Cohesity Cluster<br/>SnapTree snapshot] -->|NFS export<br/>scale-out| B[ESXi Hosts<br/>mount datastore]
    B -->|register from<br/>backup metadata| C[vCenter<br/>VM inventory]
    C -->|power on VMs<br/>~30 seconds| D[Live Workloads<br/>serving users]
    D -->|Storage vMotion<br/>staggered| E[Primary Storage<br/>production array]
    E -->|migration complete| F[Auto-cleanup<br/>NFS export removed]

    style A fill:#1f6feb,stroke:#58a6ff,color:#fff
    style D fill:#238636,stroke:#3fb950,color:#fff
    style F fill:#6e40c9,stroke:#a371f7,color:#fff

8.3.2 IMR vs. VMware vSphere Native Instant Recovery

CCAE candidates must be able to articulate the difference clearly. Both approaches boot VMs from a backup-side surface, but the architectural assumptions differ:

Dimension	Cohesity Instant Mass Restore	VMware vSphere Instant Recovery
Scale	Unlimited concurrent VMs; demonstrated to 200 VMs simultaneously [Source: https://www.cohesity.com/blogs/instant-recovery-unlimited-vms-point-time-distributed-resilient-data-store/]	Designed for one or a handful of VMs
Storage backing	Distributed scale-out cluster — runs production load directly	Single replica appliance — performance cliff under load
Migration	Automated Storage vMotion orchestrated by Cohesity	Manual Storage vMotion by operator
Cleanup	Automatic NFS export removal post-migration	Manual cleanup
Point-in-time	Any snapshot, O(1) access via SnapTree	Latest replica only
Performance	Cohesity-published testing showed 3x transactions/minute vs. Veeam-from-target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Accelerating-Instant-Recovery-with-Cohesity.pdf]	Backup target was not designed for production load

The “mass” in Instant Mass Restore is the differentiator. If a ransomware event encrypts 200 VMs, IMR has all 200 recoveries in progress about 30 seconds after the request and Storage vMotions them back as primary storage capacity allows. vSphere’s native Instant Recovery is fine for a single test restore but cannot orchestrate a 200-VM mass recovery.

Strict consistency in Cohesity’s distributed file system ensures that even under concurrent recovery, every ESXi host sees a consistent view of the NFS export — a non-trivial requirement when dozens of hosts are mounting the same export simultaneously [Source: https://www.cohesity.com/blogs/strict-consistency-must-capability-vmware-instant-restores/].

8.3.3 VM Cloning and Dev/Test Pipelines

Clone is the lighter-weight cousin of IMR. Where IMR is intended for production recovery (and ends with Storage vMotion back to primary), Clone spins up a writable copy of any backup snapshot for non-production use — dev, test, training, forensics — and never migrates back. The clone consumes only the metadata footprint plus changed-block writes, courtesy of SnapTree.

Analogy: A Cohesity clone is to a backup what a git branch is to a commit: a cheap, isolated, fully writable workspace forked off a known-good point-in-time, where changes are tracked separately from the source. IMR is one step further — like a git checkout of that branch into a running production environment, with the migration step being the eventual git merge back into primary storage.
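The storage behavior behind the analogy is copy-on-write: a clone shares every block with its source snapshot and stores only what is written afterward. The toy model below illustrates that idea; it is not how SnapTree is implemented, and the class and method names are invented for the example.

```python
class Clone:
    """Writable overlay on top of a read-only snapshot (copy-on-write sketch)."""

    def __init__(self, snapshot: dict[int, bytes]):
        self._base = snapshot                 # shared with the snapshot, never modified
        self._delta: dict[int, bytes] = {}    # blocks written in this clone only

    def read(self, block: int) -> bytes:
        return self._delta.get(block, self._base[block])

    def write(self, block: int, data: bytes) -> None:
        self._delta[block] = data

    def extra_blocks(self) -> int:
        """Blocks this clone stores beyond the shared snapshot."""
        return len(self._delta)

snapshot = {0: b"customers", 1: b"ssn=123-45-6789"}
dev = Clone(snapshot)
dev.write(1, b"ssn=REDACTED")
print(dev.read(1), snapshot[1], dev.extra_blocks())
```

The snapshot is untouched and the clone costs one block, which is why many clones of the same backup are cheap until they diverge heavily.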

Common clone use cases:

Dev/test database refresh: DBAs clone last night’s production Oracle backup into a dev instance, redact PII, and ship it to developers, consuming storage only for the blocks that redaction and testing change.
Forensic investigation: the security team clones a VM at the moment of suspected compromise and isolates it on a quarantine network for analysis.
Patch testing: clone a production VM, apply the patch, validate, then discard the clone.
Training environments: clone a representative slice of production for new-hire onboarding labs.

8.3.4 Mounting Backups for Application Teams

For application teams that need read-only access to historical data without a full restore, Cohesity can mount a backup snapshot directly:

NFS/SMB mount of a file or VM backup view — analysts walk the file tree and grab specific files.
Volume mount of a SQL or Oracle backup — DBAs attach the backup as a database for read-only queries.
Browse-and-search via the Cohesity UI — indexed search across protected sources, with download or push-to-target options.

These mount workflows are read-only by default; if write access is needed, the team uses Clone instead.

8.3.5 Storage vMotion-Out from Cohesity

The final step of IMR — Storage vMotion-out — is where the architect’s design choices interact with VMware operations. The vMotion is executed by VMware, not Cohesity, which means migration speed is bounded by ESXi host CPU, vMotion network bandwidth, and primary storage write throughput. Cohesity orchestrates and monitors; it does not accelerate the underlying vMotion.

Best practices:

Stagger the migration rather than moving all VMs simultaneously; vSphere’s default limits allow 2 concurrent Storage vMotions per host and 8 per datastore, and additional requests queue behind them.
Reserve vMotion bandwidth on dedicated VMkernel ports.
Plan primary storage capacity before initiating IMR — a 200-VM IMR will land 200 VMs’ worth of writes on primary storage during the migration window.
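
The staggering advice can be sketched as a simple wave planner. This is an illustrative Python sketch, not a Cohesity or VMware API: plan_waves, the inventory mapping, and the host names are hypothetical, and per_host_limit is a parameter so it can be set to whatever concurrency the environment enforces.

```python
from collections import defaultdict


def plan_waves(vm_to_host, per_host_limit):
    """Group VMs into migration waves so no host exceeds per_host_limit per wave."""
    if per_host_limit < 1:
        raise ValueError("per_host_limit must be at least 1")
    by_host = defaultdict(list)
    for vm, host in vm_to_host.items():
        by_host[host].append(vm)

    waves = []
    index = 0
    while True:
        wave = []
        for host in sorted(by_host):
            # Take the next slice of this host's VMs for the current wave.
            wave.extend(by_host[host][index:index + per_host_limit])
        if not wave:
            break
        waves.append(wave)
        index += per_host_limit
    return waves


# Hypothetical inventory: 5 VMs on esx01, 2 VMs on esx02.
inventory = {
    "vm01": "esx01", "vm02": "esx01", "vm03": "esx01",
    "vm04": "esx01", "vm05": "esx01",
    "vm06": "esx02", "vm07": "esx02",
}
for number, wave in enumerate(plan_waves(inventory, per_host_limit=2), start=1):
    print(number, wave)
```

With a limit of 2 per host, the seven VMs fall into three waves. A real plan would also weigh datastore limits and primary storage write throughput, which this sketch ignores.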

Key Takeaway: Instant Mass Restore differs from VMware’s native Instant Recovery in three architectural dimensions: scale (hundreds of VMs vs. one), storage performance (distributed cluster runs production load vs. appliance performance cliff), and orchestration (automated end-to-end vs. manual). Clones are the dev/test sibling: a git branch for backups.

8.4 Granular File and Item Recovery

Figure 8.4: Granular recovery decision tree — choosing the right Cohesity primitive

flowchart TD
    Start[Recovery Request] --> Q1{Scope of loss?}
    Q1 -->|Whole VM<br/>or many VMs| VM[Instant Mass Restore<br/>NFS mount + power-on<br/>+ Storage vMotion]
    Q1 -->|Single file or<br/>file version| File[Indexed Search<br/>Yoda service<br/>restore to original/alt]
    Q1 -->|Mailbox / message /<br/>attachment| Item{Exchange type?}
    Q1 -->|Database object<br/>table / schema| DB{Engine?}

    Item -->|On-prem Exchange| ItemOn[VSS + Exchange API<br/>mailbox/folder/message]
    Item -->|Exchange Online| ItemCloud[Graph API + EWS<br/>restore to original or PST]

    DB -->|SQL Server| SQL[Mount backup as DB<br/>Cohesity object browser<br/>bcp / INSERT-SELECT]
    DB -->|Oracle| Oracle[RMAN RECOVER TABLE<br/>or clone + expdp/impdp]
    DB -->|SAP HANA| HANA[Clone tenant<br/>extract via SQL export]

    VM --> Done[Recovery complete]
    File --> Done
    ItemOn --> Done
    ItemCloud --> Done
    SQL --> Done
    Oracle --> Done
    HANA --> Done

    style Start fill:#1f6feb,stroke:#58a6ff,color:#fff
    style Done fill:#238636,stroke:#3fb950,color:#fff
    style Q1 fill:#6e40c9,stroke:#a371f7,color:#fff
    style Item fill:#6e40c9,stroke:#a371f7,color:#fff
    style DB fill:#6e40c9,stroke:#a371f7,color:#fff

Bulk recovery — restoring a VM, mounting a database, presenting an NFS export — is only half the recovery story. The other half is granular: a single email, a specific file version, an individual database table. Cohesity’s indexed search architecture makes granular recovery a first-class workflow rather than an afterthought.

8.4.1 Indexed Search Across VMs and NAS

When indexing is enabled on a protection group, Cohesity walks file-system metadata at backup time and pushes filenames, paths, sizes, timestamps, and (optionally) full-text content into the Yoda search service that runs across the cluster. Administrators can then query:

“Find all files matching *.pst modified in the last 30 days across all protected VMs.”
“Locate salary_2025.xlsx in any backup of any VM in the Finance protection group.”
“Show all versions of appsettings.json from the WebApp01 VM, with timestamps.”

Selected files restore directly to the original VM, an alternate VM, or download to the admin’s workstation. No full VM restore is required.

The indexing trade-off: full-text indexing adds CPU and storage overhead at backup time and inflates metadata footprint. Architects typically enable full indexing on user file shares and selective metadata-only indexing on database-heavy or system VMs.
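
The example queries above can be pictured as filters over a metadata index. The sketch below is a toy in-memory model in Python, not the Yoda service or its API: the IndexEntry fields, the find function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatch


@dataclass(frozen=True)
class IndexEntry:
    vm: str
    path: str
    modified: datetime
    snapshot: str


def find(index, pattern, vm=None, modified_since=None):
    """Return entries whose file name matches a glob, with optional filters."""
    hits = []
    for entry in index:
        name = entry.path.rsplit("/", 1)[-1]
        if not fnmatch(name.lower(), pattern.lower()):
            continue
        if vm is not None and entry.vm != vm:
            continue
        if modified_since is not None and entry.modified < modified_since:
            continue
        hits.append(entry)
    return sorted(hits, key=lambda e: e.modified)


now = datetime(2025, 6, 30)
index = [
    IndexEntry("WebApp01", "/app/appsettings.json", datetime(2025, 6, 1), "snap-101"),
    IndexEntry("WebApp01", "/app/appsettings.json", datetime(2025, 6, 20), "snap-120"),
    IndexEntry("Fin02", "/users/a/mail.pst", datetime(2025, 6, 25), "snap-125"),
    IndexEntry("Fin02", "/users/b/old.pst", datetime(2025, 1, 5), "snap-005"),
]

recent_pst = find(index, "*.pst", modified_since=now - timedelta(days=30))
versions = find(index, "appsettings.json", vm="WebApp01")
print([e.path for e in recent_pst])
print([e.snapshot for e in versions])
```

Metadata-only indexing corresponds to storing just these fields. Full-text indexing would add file contents to each entry, which is where the extra CPU and storage cost comes from.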

8.4.2 Item-Level Recovery for Exchange

For Exchange (both on-premises and Online), item-level recovery operates at a finer granularity than file-level. The Cohesity Exchange agent reads the backup at the message-store level and exposes:

Mailbox — restore an entire mailbox.
Folder — restore a single folder (e.g., “Inbox/Archive/2024”).
Message — restore an individual email and its attachments.
Attachment — extract a specific attachment without restoring the message.

Restore destinations include the original mailbox, an alternate mailbox in the same Exchange organization, or a PST export for download.

8.4.3 Database Object-Level Recovery

For SQL Server, Cohesity supports object-level recovery to extract individual tables, schemas, or stored procedures from a database backup without restoring the entire database. The workflow:

Mount the database backup as an attachable database on a recovery instance.
Use the Cohesity object browser to navigate tables and schemas.
Select the objects to extract.
Cohesity scripts the extraction (typically via bcp or INSERT ... SELECT from the mounted source) into the target database.
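
The mount-then-copy pattern in the steps above can be illustrated with a small runnable analogy. This Python sketch uses SQLite in place of SQL Server, so ATTACH DATABASE stands in for mounting the backup and is not T-SQL; the orders table and its rows are hypothetical, and nothing here represents Cohesity's generated scripts.

```python
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
backup_path = os.path.join(workdir, "backup.db")

# Stand-in for the mounted backup: contains the rows as of backup time.
backup = sqlite3.connect(backup_path)
backup.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
backup.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "disk"), (2, "nic"), (3, "cable")])
backup.commit()
backup.close()

# Stand-in for the live target database, which has lost rows 2 and 3.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
target.execute("INSERT INTO orders VALUES (1, 'disk')")
target.commit()

# "Mount" the backup next to the target, then copy only the missing rows.
target.execute("ATTACH DATABASE ? AS mounted", (backup_path,))
target.execute(
    "INSERT INTO main.orders (id, item) "
    "SELECT id, item FROM mounted.orders "
    "WHERE id NOT IN (SELECT id FROM main.orders)"
)
target.commit()
target.execute("DETACH DATABASE mounted")

rows = target.execute("SELECT id, item FROM orders ORDER BY id").fetchall()
print(rows)

target.close()
shutil.rmtree(workdir, ignore_errors=True)
```

The key property is that the live database is never overwritten: only the selected rows move from the mounted copy into the target.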

For Oracle, object-level recovery typically uses RMAN’s RECOVER TABLE command or a clone-and-extract pattern: clone the backup into a temporary instance, export the desired tables with expdp, and import into the target.

8.4.4 Self-Service Restore Portals

For end users, Cohesity exposes self-service restore via Helios. A user can:

Browse their own VM’s backup history.
Restore a single file to their workstation without filing a ticket.
View restore audit logs scoped to their own data.

RBAC ensures users only see their own VMs/files. Self-service is heavily used by dev teams wanting to roll back a single config file without involving infrastructure ops.

Key Takeaway: Indexed search converts the backup repository into a search engine. Combined with item-level recovery for Exchange, object-level recovery for SQL/Oracle, and self-service portals via Helios, granular recovery becomes a routine helpdesk task — not a 4-hour full-restore project.

8.5 Worked Example: Three Recovery Scenarios

To cement the patterns, work through three CCAE-style recovery scenarios that map to the same protection groups but exercise very different recovery paths.

Scenario A: Single Exchange mailbox recovery. A finance manager accidentally deleted her mailbox folder containing audit-related emails three weeks ago, past Exchange’s native retention.

Open Cohesity Helios; navigate to Search.
Filter by Source = tenant.onmicrosoft.com, User = manager@example.com, Type = Mailbox.
Select the snapshot from 22 days ago (one day before the deletion).
Browse to the deleted folder; select all messages.
Restore to original mailbox, “Recovered_Audit” subfolder.
Total time: ~5 minutes; data movement: ~80 MB; impact on other users: zero.

Scenario B: Full VM recovery after ransomware. An overnight ransomware event encrypted 47 VMs in the production cluster.

Helios alerting (DataHawk anomaly detection on the protection groups) flags abnormal change rate at 03:47.
SOC declares the incident at 04:15. Last known good backup = 22:00 the prior night.
Architect triggers Instant Mass Restore on all 47 VMs from the 22:00 snapshot.
~30 seconds later, the 47 VMs are powering on from the Cohesity NFS datastore on a quarantine VLAN.
Validation team confirms VMs are clean; production traffic is cut over.
Storage vMotion to primary array begins, staggered at 2 concurrent migrations per host.
Migration completes over the next 8 hours; NFS export auto-removes.
RTO: under 1 hour for service restoration; RPO: 6 hours (last backup at 22:00).

Scenario C: Oracle PITR after a bad transaction. A developer ran DELETE FROM orders against production at 14:32. The database is 8 TB.

Architect identifies last full backup (Sunday 22:00), nightly incrementals, and 15-minute archive log backups.
Decision: PITR restore to the original host, recovering up to 14:31.
Cohesity job invokes RMAN with the SBT library, restores the Sunday full backup, applies the nightly incrementals taken since then, and rolls forward through archive logs up to 14:31:59.
Channel count: 8 (matches host CPU and network capacity).
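Restore completes in 1 hour 12 minutes for 8 TB, a sustained rate of roughly 1.85 GB/s that requires more than a single 10 GbE link (for example, 25 GbE or a 2x 10 GbE bond).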
Database opens at 14:31 state; deleted rows are present; the bad DELETE is gone.
RPO for this incident: 1 minute; RTO: 1 hour 12 minutes.

The same Cohesity protection environment delivered three radically different recovery products: a 5-minute item restore, a 1-hour mass VM recovery, and a 1-hour 8 TB database PITR.

8.6 Native App Integration Comparison

A consolidated table comparing how Cohesity integrates with each major application stack:

Application	Native Interface	Agent/Plugin	Key Ports	Granular Recovery	PITR Support
Oracle	RMAN via SBT library	Cohesity Linux Agent + optional source-side dedupe plugin	50051, 59999	Object via RECOVER TABLE or clone-and-extract	Yes — archive log backups
SQL Server	VDI	Cohesity Windows Agent	50051, 59999	Table/schema via object-level recovery	Yes — log backups across AAG replicas
SAP HANA	Backint shared library	Cohesity Backint agent	50051, 59999	Table via clone-and-extract	Yes — log backups + catalog
Exchange (on-prem)	VSS + Exchange APIs	Cohesity Windows Agent	SMB/RPC	Mailbox/folder/message/attachment	Per-database log replay
Exchange Online	Graph API + EWS	Cloud-side service principal	HTTPS to Graph	Mailbox/folder/message/attachment	Snapshot granularity
OneDrive/SharePoint	Graph API + MBS	Service principal	HTTPS; MBS APIs	File/site/list-item with ACLs	Snapshot granularity
Microsoft Teams	Graph API + SharePoint	Service principal	HTTPS	Channel/message/file (Group must pre-exist for full-team restore)	Snapshot granularity
VMware VMs	VADP, CBT	None on guest (agentless) or VMware Tools quiescing	443 (vCenter and ESXi), 902 (ESXi NFC)	File-level via indexed search	Per-snapshot

Chapter Summary

Application-aware backup is the place where architectural choices map directly onto user-visible outcomes. Cohesity integrates with Oracle through RMAN’s SBT interface (with target-side or source-side dedupe), with SQL Server through VDI (AAG-aware), and with SAP HANA through the SAP-mandated Backint API. Microsoft 365 protection covers Exchange Online, OneDrive, SharePoint, and Teams via Graph API and Microsoft 365 Backup Storage, with the critical caveat that fully deleted Teams require manual Group recreation before restore. Instant Mass Restore turns the backup cluster into a temporary primary datastore, recovering up to hundreds of VMs in approximately 30 seconds with full automation through Storage vMotion-back; this differs from VMware’s native Instant Recovery in scale, performance, and orchestration. Clones provide cheap, writable forks of any backup for dev/test, forensics, and patch validation. Granular recovery — file, mailbox item, database object — is enabled by Cohesity’s indexed search and self-service Helios portals. CCAE candidates should be able to map any recovery requirement (single email, mass VM event, database PITR) to the appropriate Cohesity primitive and explain the network, port, and policy prerequisites.

Key Terms

RMAN — Oracle Recovery Manager. The native Oracle backup driver. Cohesity registers as an SBT (System Backup to Tape) target; RMAN channels stream backup pieces via either NFS mount (target-side dedupe) or the Cohesity source-side dedupe plugin.
VDI — Virtual Device Interface. Microsoft SQL Server’s native API for third-party backup. Cohesity’s Windows Agent registers as a VDI client; SQL Server pushes consistent backup streams to the agent.
AAG — Always On Availability Group. SQL Server’s HA/DR feature. Cohesity AAG-aware protection can target preferred, primary, or secondary replicas and reconstruct log chains across replica failovers.
Backint — SAP-certified shared library interface for HANA backup. Backint is the only SAP-supported interface for third-party backup tools in HANA production environments; Cohesity’s Backint agent implements it.
Instant Mass Restore (IMR) — Cohesity’s bulk VM recovery primitive. Presents an NFS datastore to ESXi hosts, registers and powers on VMs from backup metadata, and Storage vMotions them back to primary storage. Demonstrated to recover 200 VMs concurrently.
Clone — A writable, point-in-time fork of a backup snapshot. Consumes only metadata plus changed blocks. Used for dev/test, forensics, training, and patch validation. Conceptually analogous to a git branch.
Granular recovery — Recovery at sub-object granularity: a single email, a specific file, an individual database table. Enabled by Cohesity’s indexed search and Yoda service.
M365 — Microsoft 365 (Exchange Online, OneDrive, SharePoint Online, Teams, Entra ID). Protected through Graph API, EWS, and Microsoft 365 Backup Storage (MBS), which enables up to 3 TB/hour restore throughput.
MBS — Microsoft 365 Backup Storage. Microsoft’s storage-side backup/restore APIs that bypass Graph API throttling for OneDrive, SharePoint, and Teams files.
SBT — System Backup to Tape. RMAN’s interface specification for tape-class backup targets. Cohesity uses SBT to integrate without requiring custom RMAN scripting.
SnapTree — Cohesity’s metadata structure providing O(1) snapshot access regardless of snapshot depth. The foundation of IMR and Clone — every snapshot is a fully hydrated, instantly mountable view rather than a delta chain.
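CBT (Oracle) — Changed block tracking for Oracle, which Oracle itself calls Block Change Tracking (BCT). It records changed blocks in a tracking file so that RMAN incremental backups read only those blocks instead of scanning every datafile. Strongly recommended for large Oracle databases.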
PITR — Point-in-Time Recovery. Restoring a database to an arbitrary moment using full/incremental backups plus replayed transaction or archive logs.

Chapter 9: Replication, Disaster Recovery, and SiteContinuity

Disaster recovery is where backup architecture stops being a storage problem and becomes a business continuity problem. A backup that cannot be replicated, orchestrated, and recovered within the agreed Recovery Time Objective (RTO) is, from the point of view of a regulator or a CFO, no backup at all. This chapter walks an aspiring CCAE through the four levers Cohesity gives you to meet RPO, RTO, and recovery-site requirements: cluster-to-cluster replication, the SiteContinuity orchestration engine, cloud-native recovery options (CloudReplicate, CloudSpin, CloudArchive, CloudTier), and the bandwidth math that ties them all together.

Learning Objectives

By the end of this chapter you will be able to:

Design active-passive and active-active replication topologies for one-to-one, one-to-many, many-to-one, and cross-cloud patterns.
Calculate replication bandwidth requirements from front-end TB (FETB), daily change rate, deduplication efficiency, and replication window.
Architect orchestrated DR with Cohesity SiteContinuity runbooks, including the full Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete state machine.
Differentiate CloudSpin, CloudReplicate, CloudArchive, and CloudTier and justify the right option for a given recovery scenario.
Identify the network ports, bonding modes, and MTU settings that must be present for replication to succeed at scale.

Replication Topologies

Cohesity replication is a Protection-Group-aware operation: the source cluster sends only unique, deduplicated, compressed chunks to the target cluster, and the target cluster keeps a fully addressable copy that can be searched, recovered, mounted, or orchestrated independently of the source [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. That property — the replica is a working cluster, not a tape — is what makes the four classic topologies viable.

One-to-One, One-to-Many, Many-to-One, and Cross-Cloud

The architect’s first decision is direction and fan-out. Cohesity supports four canonical topologies, summarized below.

Topology	Pattern	Typical Use Case	Pros	Cons
One-to-one	Cluster A -> Cluster B	Two-site enterprise DR (primary + DR site)	Simple to operate; symmetric failback; predictable bandwidth	No tertiary copy; single point of DR failure
One-to-many	Cluster A -> Cluster B + Cluster C	Tier-1 workloads requiring an in-region DR copy plus a cross-region cyber-vault	Multiple recovery options; geographic diversity	Higher WAN cost; more policy maintenance
Many-to-one (fan-in)	Clusters A, B, C, D -> Hub Cluster	ROBO and branch consolidation to a central enterprise data center	Centralized retention, dedup, and reporting; lower licensing per spoke	Hub is a single failure domain; aggregate ingest must be sized carefully
Cross-cloud	On-prem -> AWS-hosted Cohesity Cloud Edition (or Azure)	Cloud-as-DR; eliminates secondary physical site	OPEX model; pay-as-you-grow; tight integration with CloudSpin	Egress costs on recall; cloud licensing premium

A typical large enterprise blends all four: a hub-and-spoke many-to-one for ROBO consolidation, a one-to-one between primary data centers for orchestrated DR, and a cross-cloud one-to-many leg into AWS or Azure to provide a third “air-gapped” copy for ransomware resiliency.

Figure 9.1: Replication topology variants (1:1, 1:Many, Many:1, Cross-cloud)

flowchart LR
    subgraph OneToOne["1:1 (Active-Passive DR)"]
        A1[Cluster A<br/>Primary] -->|replicate| B1[Cluster B<br/>DR Site]
    end

    subgraph OneToMany["1:Many (Geographic Diversity)"]
        A2[Cluster A<br/>Primary] -->|replicate| B2[Cluster B<br/>In-Region DR]
        A2 -->|replicate| C2[Cluster C<br/>Cross-Region Vault]
    end

    subgraph ManyToOne["Many:1 (ROBO Fan-In)"]
        S1[Spoke A<br/>ROBO] -->|replicate| H[Hub Cluster<br/>Central DC]
        S2[Spoke B<br/>ROBO] -->|replicate| H
        S3[Spoke C<br/>ROBO] -->|replicate| H
    end

    subgraph CrossCloud["Cross-Cloud (Cloud-as-DR)"]
        OP[On-Prem Cluster] -->|replicate| CE[Cloud Edition<br/>AWS / Azure]
    end

Replication Policies and Retention

Replication is configured on the Protection Policy, not on the Protection Group. Alongside the local snapshot, a policy can specify multiple “external” targets (replication target cluster, CloudArchive target, CloudTier), each with its own retention. The result is a single source-of-truth schedule that drives all secondary copies in lockstep [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].

Retention on the replication target is independent of the source: it is common to keep 14 daily on the primary cluster, 30 daily plus 12 monthly on the DR cluster, and 7 yearly on the CloudArchive target, all driven by one policy.

Key Takeaway: Cohesity replication topology decisions are driven by failure domain rather than bandwidth. Pick the topology — one-to-one, one-to-many, many-to-one, cross-cloud — that aligns with your blast-radius assumptions, then size WAN and retention to match.

Replication Mechanics and Tuning

Encrypted, Compressed, Deduplicated Wire Format

Cohesity replication is always encrypted in flight (TLS) and is deduplicated and compressed at the chunk level before the wire. The source cluster queries the target’s chunk fingerprint database and only ships chunks the target does not already have [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. This “global dedup on the wire” is the dominant reason Cohesity can hit aggressive RPOs over modest WAN links.
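
The "ship only what the target lacks" exchange can be sketched in a few lines of Python. This is a simplified illustration, not Cohesity's wire protocol: the tiny fixed-size chunks, SHA-256 fingerprints, and the TargetCluster class are assumptions chosen for readability, and compression and TLS are left out.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks for readability (an assumption)


def chunk(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(piece):
    return hashlib.sha256(piece).hexdigest()


class TargetCluster:
    """Stand-in for the replication target's chunk store."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        return {fp for fp in fingerprints if fp not in self.store}

    def receive(self, chunks_by_fp):
        self.store.update(chunks_by_fp)


def replicate(data, target):
    """Ship only chunks the target lacks; return the number of bytes sent."""
    pieces = {fingerprint(p): p for p in chunk(data)}
    needed = target.missing(pieces.keys())
    payload = {fp: pieces[fp] for fp in needed}
    target.receive(payload)
    return sum(len(p) for p in payload.values())


target = TargetCluster()
first = replicate(b"AAAABBBBCCCCDDDD", target)   # seed: every chunk is new
second = replicate(b"AAAABBBBCCCCEEEE", target)  # only the last chunk changed
print(first, second)
```

The first run sends all 16 bytes, and the second sends only the 4 bytes of the changed chunk, which is the effect that keeps steady-state replication small.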

Bandwidth Throttling and Windowing

Three throttling surfaces exist [Source: https://docs.cohesity.com/baas/data-protect/manage-network-saas.htm] [Source: https://www.youtube.com/shorts/EPXqfK6qxF4]:

SaaS Connector throttling for DataProtect-as-a-Service, configured per-connector with day/time windows. Direction can be split (upload vs. download).
Source-agent throttling for individual physical hosts whose CPU or NIC cannot absorb the agent’s full streaming rate.
Cluster-to-cluster Protection Group / policy windows that align replication runs to off-business hours.

A common gotcha: the SaaS Connector accepts throttle values in bytes per second, not bits per second, which is the opposite of the convention used by tools like Veritas AIR [Source: https://docs.cohesity.com/baas/data-protect/manage-network-saas.htm] [Source: https://www.veritas.com/support/en_US/article.100051869]. A “100 MB/s” throttle is therefore 800 Mbps on the wire; entering “100” while thinking in megabits will under-throttle by a factor of eight and can saturate the WAN.
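
A pair of one-line helpers makes the conversion hard to get wrong. This is a minimal Python sketch; the function names are hypothetical and decimal units (1 MB = 10^6 bytes) are assumed.

```python
def mbps_to_mb_per_s(megabits_per_second):
    """Convert a WAN figure in Mbps to the MB/s value a byte-based throttle expects."""
    return megabits_per_second / 8


def mb_per_s_to_mbps(megabytes_per_second):
    """Convert a byte-based throttle in MB/s to the Mbps it allows on the wire."""
    return megabytes_per_second * 8


wan_budget_mbps = 100
print(mbps_to_mb_per_s(wan_budget_mbps))  # throttle to enter, in MB/s
print(mb_per_s_to_mbps(100))              # what a 100 MB/s throttle allows, in Mbps
```

To hold replication to 100 Mbps, the value to enter is 12.5 MB/s, not 100.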

Network Architecture and Required Ports

For replication-target clusters and ROBO nodes, Cohesity recommends 2x 10 GbE LACP Bond Mode 4 with MLAG/VPC, providing 20 Gbps of combined management plus replication bandwidth per node [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements] [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf]. Higher-throughput sites use 25 GbE or 40/100 GbE, but 2x10 LACP is the baseline expectation.

Two distinct traffic classes must be sized:

Traffic	Description	Sizing Rule
North-South	WAN replication egress/ingress, dedup-reduced	Throttle to fit available WAN; throughput scales linearly with node count
East-West	Intra-cluster RF/EC rebuild, metadata gossip	Non-blocking, non-oversubscribed switch fabric mandatory

When multiple VLANs exist, a dedicated Interface Group / VLAN for replication isolates WAN traffic from front-end backup ingest [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].

The required firewall ports between source and target clusters are [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf] [Source: https://www.youtube.com/watch?v=BkzPxpq7Swg]:

Port	Protocol	Purpose
443	TCP	HTTPS / API control plane
111	TCP	Portmap / RPC
20000	TCP	Replication data channel
24444	TCP	Replication control / metadata
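
Before the first replication run, a quick TCP reachability check of the ports listed above can catch a missing firewall rule. The Python sketch below uses only the standard library; check_ports and blocked_ports are hypothetical helper names, and a successful connect only shows that something is listening, not that replication itself will succeed.

```python
import socket

REPLICATION_PORTS = (443, 111, 20000, 24444)  # taken from the table above


def check_ports(host, ports=REPLICATION_PORTS, timeout=3.0):
    """Return {port: True/False} for TCP connect attempts to host."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


def blocked_ports(results):
    """List the ports whose connect attempt failed."""
    return sorted(port for port, is_open in results.items() if not is_open)


# Deterministic demo on canned results; no network traffic is generated here.
print(blocked_ports({443: True, 111: False, 20000: True, 24444: False}))
```

In practice the call would look like check_ports("dr-cluster.example.com"), where the hostname is a placeholder for the target cluster's VIP or node address, run from a host on the source cluster's replication VLAN.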

Enable jumbo frames (MTU 9000) end-to-end on the replication path; a single device along the path that does not honor 9000-byte frames will silently fragment or drop and tank effective throughput [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].

Initial Seed Strategies

The first replication cycle (“seed”) is, by definition, full — there is nothing on the target to dedup against. For 50 TB+ datasets over a constrained WAN this can take days or weeks. Two seed strategies exist:

Wire seeding with extended window. Run the initial replication during a multi-day quiet window with relaxed throttles. Acceptable for moderate datasets and adequate WAN.
Physical seed transport. Replicate to a portable or temporary cluster physically located at the source site, ship it, then re-home the replication target. Used when the dataset-to-WAN ratio makes wire seeding infeasible.

After the initial seed, daily replication shrinks to the daily change that survives dedup and compression, typically low single-digit percentages of FETB.
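
A rough estimate of wire-seed duration helps decide between the two strategies. The Python sketch below is a back-of-the-envelope model, not a Cohesity sizing tool: seed_days is a hypothetical helper, decimal units are assumed, the default 50 percent utilization is a planning assumption, and reduction defaults to zero because the first pass has nothing on the target to dedup against.

```python
def seed_days(dataset_tb, wan_gbps, utilization=0.5, reduction=0.0):
    """Estimate days needed to push an initial seed over the WAN.

    dataset_tb  - data to seed, in decimal terabytes
    wan_gbps    - provisioned WAN speed in gigabits per second
    utilization - fraction of the link replication may use (assumed 0.5)
    reduction   - on-wire savings from compression, as a fraction (assumed 0)
    """
    if wan_gbps <= 0 or not 0 < utilization <= 1 or not 0 <= reduction < 1:
        raise ValueError("inputs out of range")
    bytes_to_send = dataset_tb * 1e12 * (1 - reduction)
    bytes_per_second = wan_gbps * 1e9 / 8 * utilization
    return bytes_to_send / bytes_per_second / 86_400


print(round(seed_days(50, 1.0), 1))  # 50 TB over a 1 Gbps circuit
print(round(seed_days(50, 0.2), 1))  # 50 TB over a 200 Mbps circuit
```

Under these assumptions a 50 TB seed takes a little over 9 days at 1 Gbps and about 46 days at 200 Mbps, which is the kind of gap that pushes a design toward physical seed transport.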

Replication Failure Handling

Cohesity replication is checkpointed: a partial run resumes from the last committed chunk rather than restarting. Persistent failures generate Helios alerts and pause the policy after configurable retry attempts. The architect’s job is to ensure the monitoring path (Helios -> SNMP/Syslog/Email) actually reaches an on-call engineer before replication lag exceeds RPO.

Key Takeaway: Replication mechanics are built on dedup, compression, and TLS, but they only work if you’ve sized 2x10 GbE LACP, opened TCP 443/111/20000/24444, enabled MTU 9000 end-to-end, and remembered that SaaS Connector throttles are in bytes per second.

Replication Bandwidth Math

The single formula every CCAE must memorize:

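Required throughput = (FETB x daily change rate x (1 - on-wire reduction)) / replication window

Here on-wire reduction is the combined dedup and compression savings expressed as a fraction (0.60 for a 60 percent reduction).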

All four inputs are negotiable, and the architect’s job is to balance them against the WAN budget.

The Bytes-vs-Bits Gotcha

Network engineers speak in bits per second (Mbps, Gbps); storage engineers and Cohesity throttles speak in bytes per second (MB/s, GB/s). The conversion factor is 8 (plus a small percentage of TCP/IP overhead, conventionally ignored at this level).

Wire Speed (bits)	Theoretical Bytes	50% Utilization Target
1 Gbps	125 MB/s	62.5 MB/s
10 Gbps	1,250 MB/s (1.25 GB/s)	625 MB/s
40 Gbps	5,000 MB/s	2,500 MB/s
100 Gbps	12,500 MB/s	6,250 MB/s

Cohesity guidance is to plan for 50 percent of nominal wire speed as the sustainable replication ceiling, leaving headroom for retransmits, other tenants on the link, and the inevitable noisy neighbor [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements] [Source: https://www.cohesity.com/blogs/demonstrating-linear-scalability-cohesity-data-platform/].

Worked Example: 50 TB FETB, 5% Daily Change, 4-Hour Window

A common CCAE-style scenario. Inputs:

FETB: 50 TB of front-end protected data.
Daily change rate: 5 percent (typical mixed VM/file workload).
Replication dedup/compression efficiency: 60 percent reduction on the wire (i.e., we keep 40 percent of the change).
Replication window: 4 hours = 14,400 seconds.

Step 1 — Daily change in bytes:

50 TB x 0.05 = 2.5 TB of change per day
2.5 TB = 2,500 GB = 2,500,000 MB = 2,500,000,000,000 bytes (2.5 x 10^12)

Step 2 — Apply dedup/compression on the wire:

2.5 TB x (1 - 0.60) = 2.5 TB x 0.40 = 1.0 TB on the wire
1.0 TB = 1,000 GB = 1,000,000 MB

Step 3 — Divide by replication window in seconds:

1,000,000 MB / 14,400 s = ~69.4 MB/s required

Step 4 — Convert to bits per second for the WAN team:

69.4 MB/s x 8 = ~556 Mbps

Step 5 — Apply 50 percent utilization headroom:

556 Mbps / 0.50 = ~1.11 Gbps minimum WAN provisioning

So 50 TB FETB at 5 percent change with 60 percent on-wire reduction and a 4-hour window requires roughly 1.1 Gbps of provisioned WAN, which one 10 GbE link comfortably absorbs. Halving the window to 2 hours doubles the requirement to ~2.2 Gbps. Doubling the change rate to 10 percent (e.g., a database-heavy workload) doubles it again to ~4.4 Gbps — at which point a single 10 GbE link is operating uncomfortably close to its 50 percent ceiling and the architect should either move to 25 GbE, lengthen the RPO, or add a second LACP bond.
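
The five steps above fold into a single function, which makes the sensitivity to each input easy to explore. This is a minimal Python sketch using decimal units; required_wan_gbps is a hypothetical helper name, and the inputs are the ones from the worked example.

```python
def required_wan_gbps(fetb_tb, change_rate, reduction, window_hours, utilization=0.5):
    """Provisioned WAN (Gbps) needed to replicate one day's change in the window."""
    wire_bytes = fetb_tb * 1e12 * change_rate * (1 - reduction)  # steps 1 and 2
    bytes_per_second = wire_bytes / (window_hours * 3600)        # step 3
    sustained_gbps = bytes_per_second * 8 / 1e9                  # step 4
    return sustained_gbps / utilization                          # step 5


base = required_wan_gbps(50, 0.05, 0.60, 4)
short_window = required_wan_gbps(50, 0.05, 0.60, 2)
busy_db = required_wan_gbps(50, 0.10, 0.60, 2)
print(round(base, 2), round(short_window, 2), round(busy_db, 2))
```

The three calls reproduce the figures in the text: roughly 1.11, 2.22, and 4.44 Gbps.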

The RPO / Bandwidth / Dedup Equation

Rearranging the formula gives four knobs the architect can turn when the math doesn’t close:

Knob	What It Means	Cost / Trade-off
Increase WAN bandwidth	Buy more circuit	$$$, lead time of weeks
Lengthen RPO (replication window)	Replicate every 8 hr instead of every 4	Free, but business must accept
Reduce change rate	Tighter Protection Group scoping; exclude logs/scratch	Cheap, but bounded
Improve dedup on wire	Better source filtering, larger target retention pool	Modest gains, slow to realize

If, after exhausting these knobs, the inequality (FETB x change x (1 - dedup)) <= (WAN bandwidth x window) still fails, the architecture must shift to a closer replication target (regional rather than transcontinental) or to a cloud-native pipe.
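
The inequality can be checked directly, and solved for the shortest window a given circuit supports. The Python sketch below uses hypothetical helper names and decimal units; it also applies the chapter's 50 percent utilization guidance to the WAN side, which is an added assumption beyond the bare inequality.

```python
def wire_bytes(fetb_tb, change_rate, reduction):
    """Bytes that must cross the WAN per replication cycle."""
    return fetb_tb * 1e12 * change_rate * (1 - reduction)


def fits(fetb_tb, change_rate, reduction, wan_gbps, window_hours, utilization=0.5):
    """True when the cycle's data fits through the usable share of the link."""
    capacity = wan_gbps * 1e9 / 8 * utilization * window_hours * 3600
    return wire_bytes(fetb_tb, change_rate, reduction) <= capacity


def min_window_hours(fetb_tb, change_rate, reduction, wan_gbps, utilization=0.5):
    """Shortest window (hours) in which the cycle fits on the given link."""
    usable_bytes_per_second = wan_gbps * 1e9 / 8 * utilization
    return wire_bytes(fetb_tb, change_rate, reduction) / usable_bytes_per_second / 3600


print(fits(50, 0.05, 0.60, wan_gbps=1.0, window_hours=4))
print(round(min_window_hours(50, 0.05, 0.60, wan_gbps=1.0), 2))
```

For the 50 TB example, a 1 Gbps circuit does not close a 4-hour window at 50 percent utilization; the shortest window it supports is about 4.44 hours, so the choice is a slightly longer window or a larger circuit.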

Key Takeaway: Memorize the bandwidth equation, always do the math in bytes first, multiply by 8 last, and budget for 50 percent of wire speed. The bytes-vs-bits trap has cost more than one architect a six-figure WAN over-provisioning bill.

SiteContinuity Orchestration

Replication moves the data; SiteContinuity moves the application. SiteContinuity is Cohesity’s DR orchestration engine that turns “we have replicated VMs” into “we have a runnable runbook with a measurable RTO.”

The Runbook Analogy

Think of a SiteContinuity runbook as an emergency-evacuation plan for an office building. A good evacuation plan answers four questions in advance:

Who goes first? (Dependency order — domain controllers and DNS before app servers; app servers before web tiers.)
Where do they go? (Resource Profile — which target compute, which port group, which datastore.)
What address do they get when they arrive? (Re-IP and VLAN mapping at the DR site.)
How do you know everyone got out? (Validation steps — VM up, services running, smoke tests passing.)

Failback is going home after the storm: the same plan in reverse, but only after the building has been certified safe. You don’t sprint back into the office while the roof is still on fire — you “Prepare for Failback” first, which seeds the data back, then you actually move people.

Runbook Building Blocks

A SiteContinuity DR Plan is composed of [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failback.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/prepare-for-failback.htm]:

DR Applications — logical groups of VMs that recover together with a defined boot order.
Resource Profiles — reusable mappings of target vCenter, datastore, port group, and IP-customization rules at the recovery site.
Failback Resource Set — a separate resource definition added via Edit > Add Resource Set on a DR Plan, used when failing back to the primary or to a brand-new cluster.
Snapshot Selection — at execution time, the operator can accept the latest snapshot or override with a specific recovery point for explicit RPO control.
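
The boot-order idea behind DR Applications is a dependency graph, and a topological sort turns it into power-on waves. The Python sketch below uses the standard-library graphlib module (Python 3.9+); the VM names and the depends_on mapping are hypothetical, and this is not how SiteContinuity stores or evaluates a plan.

```python
from graphlib import TopologicalSorter

# Hypothetical DR Application: each VM maps to the VMs it must wait for.
depends_on = {
    "dc01": set(),
    "dns01": set(),
    "db01": {"dc01", "dns01"},
    "app01": {"db01"},
    "app02": {"db01"},
    "web01": {"app01", "app02"},
}


def boot_waves(dependencies):
    """Return lists of VMs that can power on together, in dependency order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(boot_waves(depends_on), start=1):
    print(number, wave)
```

Infrastructure services land in the first wave, the database in the second, application servers in the third, and the web tier last, matching the "who goes first" question in the evacuation analogy.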

The SiteContinuity State Machine

Every DR Plan moves through a discrete set of states that gate which actions are available [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failover.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm] [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf]:

[Failover Ready] --(Failover)--> [Failover In Progress] --> [Failover Complete]
                                                                    |
                                                          (Prepare for Failback)
                                                                    v
                                                      [Prepare for Failback In Progress]
                                                                    |
                                                                    v
                                                            [Failback Ready]
                                                                    |
                                                              (Failback)
                                                                    v
                                                       [Failback In Progress]
                                                                    |
                                                                    v
                                                          [Failback Complete]
                                                                    |
                                                          (Prepare for Failover)
                                                                    v
                                                            [Failover Ready] <- back to start

Test variants (Test Failover, Test Failback) operate non-disruptively and do not change the underlying state of the production plan [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf].
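
The gating behavior of the state machine can be modeled as a lookup table of allowed transitions. The Python sketch below is a study aid, not Cohesity code: the DRPlan class is hypothetical, the in-progress states are collapsed into their end states, and the test actions are tied to the Failover Ready and Failback Ready states as an assumption.

```python
TRANSITIONS = {
    ("Failover Ready", "Failover"): "Failover Complete",
    ("Failover Complete", "Prepare for Failback"): "Failback Ready",
    ("Failback Ready", "Failback"): "Failback Complete",
    ("Failback Complete", "Prepare for Failover"): "Failover Ready",
}
# Test actions and the state in which each is assumed to be available.
TEST_ACTIONS = {"Test Failover": "Failover Ready", "Test Failback": "Failback Ready"}


class DRPlan:
    """Toy model of the plan states; in-progress states are collapsed."""

    def __init__(self):
        self.state = "Failover Ready"

    def allowed_actions(self):
        actions = [a for (s, a) in TRANSITIONS if s == self.state]
        actions += [a for a, s in TEST_ACTIONS.items() if s == self.state]
        return sorted(actions)

    def run(self, action):
        if TEST_ACTIONS.get(action) == self.state:
            return self.state  # test runs leave the plan state untouched
        try:
            self.state = TRANSITIONS[(self.state, action)]
        except KeyError:
            raise ValueError(
                f"{action!r} is not allowed in state {self.state!r}"
            ) from None
        return self.state


plan = DRPlan()
print(plan.allowed_actions())
for action in ("Test Failover", "Failover", "Prepare for Failback",
               "Failback", "Prepare for Failover"):
    print(action, "->", plan.run(action))
```

Walking the full cycle returns the plan to Failover Ready, and any out-of-order action, such as Failback before Prepare for Failback, raises an error instead of changing state.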

Figure 9.2: SiteContinuity DR Plan state machine

stateDiagram-v2
    [*] --> FailoverReady
    FailoverReady --> FailoverInProgress: Failover
    FailoverInProgress --> FailoverComplete
    FailoverComplete --> PrepareFailbackInProgress: Prepare for Failback
    PrepareFailbackInProgress --> FailbackReady: reverse seed complete
    FailbackReady --> FailbackInProgress: Failback
    FailbackInProgress --> FailbackComplete
    FailbackComplete --> FailoverReady: Prepare for Failover

    FailoverReady --> FailoverReady: Test Failover (no state change)
    FailbackReady --> FailbackReady: Test Failback (no state change)

    note right of FailoverReady
        Steady state - DR site
        seeded and ready
    end note
    note right of FailbackReady
        Reverse seed complete -
        ready to return home
    end note
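
The state machine in Figure 9.2 can be sketched as a small transition table. This is only an illustrative model for study purposes, not Cohesity code: the names `PlanState`, `apply_action`, and `complete` are hypothetical, and the transitions are taken directly from the figure.

```python
from enum import Enum


class PlanState(Enum):
    FAILOVER_READY = "Failover Ready"
    FAILOVER_IN_PROGRESS = "Failover In Progress"
    FAILOVER_COMPLETE = "Failover Complete"
    PREPARE_FAILBACK_IN_PROGRESS = "Prepare for Failback In Progress"
    FAILBACK_READY = "Failback Ready"
    FAILBACK_IN_PROGRESS = "Failback In Progress"
    FAILBACK_COMPLETE = "Failback Complete"


# Operator actions: (current state, action) -> next state.
ACTIONS = {
    (PlanState.FAILOVER_READY, "Failover"): PlanState.FAILOVER_IN_PROGRESS,
    (PlanState.FAILOVER_COMPLETE, "Prepare for Failback"): PlanState.PREPARE_FAILBACK_IN_PROGRESS,
    (PlanState.FAILBACK_READY, "Failback"): PlanState.FAILBACK_IN_PROGRESS,
    (PlanState.FAILBACK_COMPLETE, "Prepare for Failover"): PlanState.FAILOVER_READY,
}

# In-progress states advance on their own once the workflow succeeds.
COMPLETIONS = {
    PlanState.FAILOVER_IN_PROGRESS: PlanState.FAILOVER_COMPLETE,
    PlanState.PREPARE_FAILBACK_IN_PROGRESS: PlanState.FAILBACK_READY,
    PlanState.FAILBACK_IN_PROGRESS: PlanState.FAILBACK_COMPLETE,
}

# Test variants are valid only in a ready state and never change it.
TEST_ACTIONS = {
    "Test Failover": PlanState.FAILOVER_READY,
    "Test Failback": PlanState.FAILBACK_READY,
}


def apply_action(state: PlanState, action: str) -> PlanState:
    """Return the state the plan enters after an operator action."""
    if TEST_ACTIONS.get(action) is state:
        return state
    if (state, action) not in ACTIONS:
        raise ValueError(f"'{action}' is not allowed from '{state.value}'")
    return ACTIONS[(state, action)]


def complete(state: PlanState) -> PlanState:
    """Advance an in-progress state to its completed counterpart."""
    if state not in COMPLETIONS:
        raise ValueError(f"'{state.value}' is not an in-progress state")
    return COMPLETIONS[state]
```

In this model, calling `apply_action(PlanState.FAILOVER_COMPLETE, "Failback")` raises `ValueError`, which mirrors the rule that Prepare for Failback has to run before Failback.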

Failover Procedure (Production Cutover)

The SiteContinuity workflow for an actual failover [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failover.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failover.htm]:

Navigate to DR Plans > Disaster Recovery Plans.
Select the desired plan, then Actions (kebab) > Failover.
Choose a Resource Profile with the network mapping, IP customization, and target compute.
Optionally enable Protect VMs at DR Site so recovered VMs are immediately added to a Protection Group on the DR Cohesity cluster — closing the backup gap that otherwise opens at the moment of failover.
Confirm the snapshot (latest by default; can be overridden with an earlier RPO).
Type YES to confirm and start the orchestrated workflow.
Validate via DR vCenter: VM startup order, resource allocation, network connectivity, application functionality, and Protection Group state on the DR cluster.

Figure 9.3: Failover orchestration sequence (Helios -> Source -> Target -> VM Power-On)

sequenceDiagram
    participant Op as Operator
    participant Helios as Helios / SiteContinuity
    participant Src as Source Cluster
    participant Tgt as DR Target Cluster
    participant vC as DR vCenter
    participant VM as Recovered VMs

    Op->>Helios: Trigger Failover (DR Plan)
    Helios->>Helios: Validate Resource Profile + snapshot
    Helios->>Src: Quiesce replication (if reachable)
    Helios->>Tgt: Select latest replicated snapshot
    Tgt->>Tgt: Mount SnapTree view
    Helios->>vC: Register VMs from mounted view
    vC->>VM: Apply IP customization + VLAN mapping
    vC->>VM: Power on (boot order: DC, App, Web)
    VM-->>vC: Services up
    vC-->>Helios: Boot validation OK
    Helios->>Tgt: Optionally protect VMs at DR site
    Helios-->>Op: State = Failover Complete

Prepare for Failback

Before any production failback, the plan must transition cleanly back to a failback-ready state [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/prepare-for-failback.htm]:

Confirm the plan is in Failover Complete.
Choose Actions > Prepare for Failback, which drives reverse replication from the DR cluster back to the primary cluster.
The plan moves through Prepare for Failback In Progress to Failback Ready when the reverse seed completes.

Failback Procedure

From the Failback Ready state [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm]:

Select Actions > Failback.
Pick the failback Resource Profile; confirm or override the snapshot (VADP or CDP).
Type YES to confirm.
Validate against the primary vCenter — VM order, resource attachments, networks, applications — and confirm the Protection Group on the primary cluster has resumed normal backups.

After Failback Complete, run Actions > Prepare for Failover to re-seed the DR site and return the plan to Failover Ready for the next event [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm].

For new-target failbacks (e.g., a rebuilt primary cluster after a true site loss), existing DR Plans must be deleted and recreated against the new target, and DR Applications must be redefined [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failback.htm].

Network Re-IP and VLAN Mapping

At the DR site, VMs almost always need new IP addresses (different subnet) and possibly different VLANs. SiteContinuity’s IP customization runs guest-OS scripts during the failover boot to apply the new address. Architects must pre-stage:

DR-site VLAN port groups that match production naming conventions (or define explicit mappings).
Guest customization specs validated against current OS versions.
DNS update strategy (dynamic DNS, scripted A-record updates, or GSLB-based geo-redirection).
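
When pre-staging re-IP mappings, one common planning convention is to keep each VM's host offset and swap only the subnet. The sketch below illustrates that convention with Python's standard `ipaddress` module. It is planning-side tooling under stated assumptions, not how SiteContinuity applies customization internally; the function name `remap_ip` and the example subnets are hypothetical.

```python
import ipaddress


def remap_ip(prod_ip: str, prod_subnet: str, dr_subnet: str) -> str:
    """Return the DR-site address that keeps the same host offset as prod_ip."""
    prod_net = ipaddress.ip_network(prod_subnet)
    dr_net = ipaddress.ip_network(dr_subnet)
    addr = ipaddress.ip_address(prod_ip)

    if addr not in prod_net:
        raise ValueError(f"{prod_ip} is not inside {prod_subnet}")
    if prod_net.prefixlen != dr_net.prefixlen:
        raise ValueError("production and DR subnets must be the same size")

    # Offset of the host within its subnet, reapplied to the DR subnet.
    offset = int(addr) - int(prod_net.network_address)
    return str(dr_net.network_address + offset)


# Example: a production app server mapped into the DR subnet.
print(remap_ip("10.10.20.45", "10.10.20.0/24", "172.16.20.0/24"))
```

Generating the mapping table this way makes it easy to review every address before a Test Failover rather than discovering a collision during a real event.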

RTO and RPO Measurement

RPO is bounded by replication frequency: if you replicate every 4 hours, worst-case RPO is about 4 hours, plus however long the replication run takes to land on the target.
RTO is bounded by runbook execution time — boot order serialization, IP customization, application warmup.
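
The two bounds above can be turned into rough planning numbers. The sketch below is a simplification under stated assumptions: the in-flight replication duration added to RPO, the idea that VMs inside one boot tier start in parallel, and the per-tier overhead for IP customization are modeling choices, not figures from Cohesity documentation. All names are hypothetical.

```python
def worst_case_rpo_hours(interval_h: float, run_duration_h: float = 0.0) -> float:
    """Oldest data that could be lost: one full interval plus the in-flight run."""
    return interval_h + run_duration_h


def estimated_rto_minutes(boot_tiers, per_tier_overhead_min: float = 0.0) -> float:
    """Sum the slowest VM of each tier, because tiers boot one after another.

    boot_tiers is an ordered list of dicts mapping VM name to the minutes
    that VM needs to boot and warm up (assumed to run in parallel per tier).
    """
    total = 0.0
    for tier in boot_tiers:
        total += max(tier.values()) + per_tier_overhead_min
    return total


# Hypothetical three-tier application: domain controller, then DB/app, then web.
tiers = [
    {"dc01": 5},
    {"db01": 12, "app01": 8},
    {"web01": 4, "web02": 4},
]
rpo = worst_case_rpo_hours(4, run_duration_h=0.5)
rto = estimated_rto_minutes(tiers, per_tier_overhead_min=2)
```

With these inputs the estimate is 4.5 hours of worst-case RPO and 27 minutes of RTO. Measured times from Test Failovers should replace such estimates as soon as they are available.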

SiteContinuity reports actual recovery times for each Test Failover, which is how architects prove RTO compliance to auditors. Regular Test Failovers are the single dominant predictor of successful real-world recoveries [Source: https://www.cohesity.com/blogs/the-disaster-recovery-reality-check/] [Source: https://www.cohesity.com/glossary/disaster-recovery/]. SiteContinuity is licensed and deployed alongside DataProtect, sharing the underlying Protection Groups so the same backup snapshot used for granular recovery is also the source of the orchestrated failover [Source: https://www.dataguardworks.com/SiteContinuity.asp].

Key Takeaway: A SiteContinuity DR Plan is a state machine — Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> back to Failover Ready. Test Failover does not change the state. Skipping “Prepare for Failback” is the most common operational error and will cause Failback to fail.

Cloud-Based DR Options

Three Cohesity features deliver DR-style outcomes into hyperscaler clouds, and a fourth (CloudTier) is frequently confused with them. The CCAE exam will test the differences directly.

CloudReplicate — DR Replica as a Working Cluster

CloudReplicate replicates Protection Group snapshots from an on-prem Cohesity cluster to a Cohesity cluster running in the cloud (Cloud Edition in AWS or Azure) [Source: https://tekhead.it/blog/2016/04/cohesity-announces-cloud-integration-services/]. The destination is a fully functional Cohesity cluster: granular file recovery, Instant Mass Restore, View mounting, and SiteContinuity orchestration are all available on the cloud-side replica.

Use CloudReplicate when:

The cloud is the DR site and you want to skip building a second physical data center.
You want SiteContinuity-driven failover into the cloud with the same runbook semantics as on-prem-to-on-prem DR.
You may eventually CloudSpin specific VMs to native EC2/Azure VMs at recovery time, but want to keep the option open.

CloudSpin — On-Demand Conversion to Native Cloud VMs

CloudSpin converts an on-prem (or cloud-resident) backup snapshot into a native cloud VM — an AWS EC2 instance with EBS volumes, or an Azure VM with Managed Disks [Source: https://www.cohesity.com/resources/solution-brief/manage-data-thats-fragmented-across-cloud/]. The conversion is an active operation: Cohesity rewrites disk format from its native SnapTree representation into the hypervisor format the cloud provider requires.

Use CloudSpin when:

You need a dev/test clone running natively in the cloud (no Cohesity required at recovery time).
You want lightweight cloud DR without standing up a full Cloud Edition cluster.
You want to test cloud-failover scenarios without committing to permanent cloud infrastructure.

CloudSpin is on-demand (you trigger each conversion); CloudReplicate is continuous (the Protection Policy drives it).

CloudArchive — Long-Term Retention to Object Storage

CloudArchive creates a separate archival copy in cloud object storage (S3, Azure Blob, GCP) for compliance and long-term retention [Source: https://www.cohesity.com/blogs/cloud-clear-cohesity-cloud-archival/] [Source: https://www.cohesity.com/solutions/long-term-retention-and-archival/]. It is driven by Protection Policy schedules (e.g., monthly archives retained for 7 years). The data stays in Cohesity’s deduplicated, compressed format; the source cluster keeps a full index/metadata copy locally so search and selective restore work without re-ingesting [Source: https://www.cohesity.com/resources/solution-brief/simplify-long-term-data-retention-and-archival/].

CloudArchive Direct is a variant that streams archives directly to cloud storage with only a minimal local footprint: the index stays on-prem while parallel uploads send full data blocks to the cloud target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/archive-your-data-directly-with-cohesity-cloudarchive-direct-white-paper.pdf]. It is intended for organizations that have outgrown local archive capacity.

CloudTier — Capacity Overflow (Not a DR Tool)

CloudTier automatically tiers cold blocks (by default, those older than 60 days; configurable) from on-prem nodes into cloud object storage when local capacity utilization exceeds 80 percent [Source: https://www.cohesity.com/glossary/cloud-tiering/] [Source: https://www.penguinpunk.net/blog/cohesity-basics-cloud-tier/]. It acts as a transparent extension of the View Box: the data is moved, not duplicated.

Two critical operational facts:

Once enabled on a View Box, CloudTier cannot be disabled. It is irreversible.
CloudTier is not a DR copy. The cluster is still the only authoritative copy of the data; if the cluster dies, the cloud-tiered blocks are unrecoverable on their own.

Decision Matrix: CloudSpin vs. CloudReplicate vs. CloudArchive vs. CloudTier

Feature	Purpose	Result in Cloud	Reversible?	Trigger	Recovery Speed	Primary Use Case
CloudSpin	Dev/test or DR with native cloud VMs	EC2 / Azure VM (hyperscaler-native)	Yes — VM is ephemeral	On-demand by user	Minutes (single VM)	Quick cloud spin-up; dev/test
CloudReplicate	DR replica as a working Cohesity cluster	Full Cohesity cluster in cloud	Yes	Policy schedule	Standard restore + IMR	Cloud-as-DR-site
CloudArchive	Compliance / long-term retention	Cohesity-format archive in object storage	Yes — separate copy	Policy schedule	Slow (rehydrate from object)	7-year compliance retention
CloudArchive Direct	LTR with minimal local footprint	Cohesity archive in object storage; index local only	Yes	Policy schedule	Slow	LTR when local archive exhausted
CloudTier	Capacity relief for cold local data	Object-storage extension of View Box	No	Auto when local capacity > threshold	Transparent (cluster fetches)	Capacity overflow only

The CCAE-style trick question: “Customer wants the cloud to be both their archive target and their DR target with minimum cost.” Wrong answer: CloudTier (not a DR tool). Right answer: CloudReplicate for DR + CloudArchive for compliance retention, kept in the same cloud provider and region where possible.

Figure 9.4: Cloud DR option decision tree

graph TD
    Start[Cloud Use Case?] --> Q1{Primary Goal?}

    Q1 -->|Disaster Recovery| Q2{Need Cohesity features<br/>at recovery time?}
    Q1 -->|Long-Term Retention| Q3{Local archive<br/>capacity available?}
    Q1 -->|Capacity Relief| CT[CloudTier<br/>WARNING: Irreversible<br/>NOT a DR tool]

    Q2 -->|Yes - IMR, search,<br/>SiteContinuity| CR[CloudReplicate<br/>Full Cohesity cluster<br/>in AWS/Azure]
    Q2 -->|No - just need<br/>native cloud VM| CS[CloudSpin<br/>EC2/Azure VM<br/>on-demand conversion]

    Q3 -->|Yes| CA[CloudArchive<br/>Cohesity-format<br/>in object storage]
    Q3 -->|No - exhausted| CAD[CloudArchive Direct<br/>Index local, data<br/>streamed direct to cloud]

    CR --> Tip1[Use with SiteContinuity<br/>for full DR runbook]
    CS --> Tip2[Active trigger; ideal<br/>for dev/test clones]
    CA --> Tip3[Policy-driven; slow<br/>rehydrate on restore]

    style CT fill:#5a1f1f,stroke:#ff6b6b,color:#fff
    style CR fill:#1f3a5a,stroke:#58a6ff,color:#fff
    style CS fill:#1f3a5a,stroke:#58a6ff,color:#fff
    style CA fill:#1f3a5a,stroke:#58a6ff,color:#fff
    style CAD fill:#1f3a5a,stroke:#58a6ff,color:#fff

Recovery Into VMware Cloud and Azure VMware Solution

VMware Cloud on AWS (VMC) and Azure VMware Solution (AVS) are first-class targets for Cohesity DR because they expose a native vCenter that SiteContinuity can drive directly — no CloudSpin conversion required, no Cloud Edition cluster required. The trade-off is the cost of running the VMC/AVS SDDC versus the cost of a Cloud Edition cluster.

Comparing Cost / Recovery Options

Recovery Option	Cost Profile	Recovery Speed	Operational Familiarity
Second physical DC + Cohesity	Highest CAPEX, lowest OPEX	Fastest	Highest (same as production)
Cloud Edition + CloudReplicate + SiteContinuity	Medium OPEX	Fast (SiteContinuity-driven)	High (same SiteContinuity workflow)
VMC / AVS + CloudReplicate	High OPEX	Fast	Highest (native vCenter)
CloudSpin only (no Cloud Edition)	Lowest OPEX	Slow (per-VM conversion)	Lower (different from production)
CloudArchive + manual restore	Cheapest	Slowest	Lowest

Key Takeaway: CloudReplicate gives you a working cluster, CloudSpin gives you a native VM, CloudArchive gives you compliance retention, and CloudTier gives you capacity relief — only the first three are DR options, and CloudTier is irreversible once enabled.

Chapter Summary

Topology choice — one-to-one, one-to-many, many-to-one (fan-in), or cross-cloud — is driven by failure-domain assumptions, not by bandwidth.
Replication is encrypted, deduplicated, and compressed on the wire. Throughput scales linearly with node count.
The 2x10 GbE LACP baseline, jumbo frames end-to-end, and the four required ports (TCP 443, 111, 20000, 24444) are non-negotiable.
The bandwidth formula — (FETB x change rate x (1 - dedup)) / window — is the single most important calculation in this domain. Always work in bytes first, then multiply by 8. Plan for 50 percent of wire speed.
SaaS Connector throttles are specified in bytes per second, not bits per second; confusing the two is a common configuration error.
SiteContinuity orchestrates DR via a state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete. Test variants do not change the state.
A runbook is an evacuation plan: define DR Applications (boot order), Resource Profiles (target compute/network), and a Failback Resource Set. Failback is going home after the storm — only after Prepare for Failback completes the reverse seed.
CloudReplicate, CloudSpin, CloudArchive, and CloudTier are four distinct tools for four distinct problems: DR replica cluster, on-demand native VM, compliance retention, and capacity overflow respectively. CloudTier is irreversible and is not a DR tool.
Test Failovers are the dominant predictor of real-world recovery success. Run them on a schedule, not just before audits.
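
As a worked illustration of the bandwidth formula in the summary above, the sketch below computes the link size for one hypothetical environment. It assumes decimal terabytes (10^12 bytes), treats "dedup" as the fraction of changed data eliminated before it goes on the wire, and applies the 50 percent planning factor by dividing the sustained rate by 0.5. The function name and the input values are illustrative only.

```python
def required_link_mbps(
    fetb_tb: float,
    change_rate: float,
    dedup_savings: float,
    window_hours: float,
    usable_fraction: float = 0.5,
) -> float:
    """Link speed (Mbps) needed to replicate the changed data inside the window."""
    # Work in bytes first: front-end TB x change rate x (1 - dedup).
    changed_bytes = fetb_tb * 1e12 * change_rate * (1 - dedup_savings)
    # Then multiply by 8 to get bits and divide by the window in seconds.
    sustained_bps = changed_bytes * 8 / (window_hours * 3600)
    # Plan for only a fraction of wire speed being usable.
    return sustained_bps / usable_fraction / 1e6


# Hypothetical site: 100 TB front-end, 3% change, 50% dedup savings, 8-hour window.
link = required_link_mbps(100, 0.03, 0.5, 8)
```

For these inputs the sustained rate is roughly 417 Mbps, so the link should be sized at roughly 833 Mbps once the 50 percent planning factor is applied.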

Key Terms

Replication — Cluster-to-cluster movement of deduplicated, compressed, encrypted Protection Group snapshots, driven by Protection Policy schedules and consumed by DR, retention, and orchestration workflows.
SiteContinuity — Cohesity’s DR orchestration product that drives runbook-based failover and failback for VMware VMs, consuming the same underlying snapshots used for granular recovery.
Runbook — A SiteContinuity DR Plan composed of DR Applications (VM groups with boot order), Resource Profiles (target compute/network/IP mappings), and optional Failback Resource Sets; the operational analog to an emergency evacuation plan.
CloudSpin — On-demand conversion of a Cohesity backup snapshot into a native cloud VM (AWS EC2 with EBS, or Azure VM with Managed Disks) for dev/test or lightweight cloud DR.
CloudReplicate — Continuous policy-driven replication from an on-prem Cohesity cluster to a Cohesity Cloud Edition cluster running in AWS or Azure; the destination remains a fully functional Cohesity cluster.
Failover — The orchestrated cutover of a DR Plan from the primary site to the DR site, transitioning the plan from Failover Ready through Failover In Progress to Failover Complete.
Failback — The orchestrated return of a DR Plan from the DR site to the primary site, executed only after Prepare for Failback has successfully reverse-seeded data and moved the plan to Failback Ready.
RTO (Recovery Time Objective) — The maximum acceptable elapsed time between disaster declaration and full application recovery; bounded by SiteContinuity runbook execution time including boot order and IP customization.
RPO (Recovery Point Objective) — The maximum acceptable amount of data loss measured in time; bounded by replication frequency (e.g., 4-hour replication = 4-hour RPO worst case).

Chapter 10: Cloud Integration: Archive, Tier, Replicate, and Spin

Cohesity exposes four distinct cloud integration patterns: CloudArchive for long-term retention, CloudTier for capacity extension, CloudReplicate for cluster-to-cluster replication into a Cohesity Cloud Edition, and CloudSpin for converting on-prem backups into native cloud VMs. CloudArchive and CloudTier share the same plumbing (the External Target abstraction), whereas CloudReplicate is configured against a remote cluster and CloudSpin from the recovery workflow. All four solve different architectural problems with different cost profiles, recovery semantics, and IAM surfaces.

This chapter walks an architect through choosing among the four, configuring the AWS S3 Glacier and Azure Blob targets that most CCAE candidates will see on the exam, and modeling the egress and recall costs that quietly dominate cloud TCO.

Learning Objectives

By the end of this chapter, you will be able to:

Differentiate CloudArchive, CloudArchive Direct, CloudTier, CloudReplicate, and CloudSpin by purpose, data movement model, and recovery semantics.
Configure External Targets to AWS S3 (including Glacier and Deep Archive), Azure Blob (including Archive tier), GCP, and S3-compatible on-prem object stores.
Apply storage-class lifecycle policies correctly — keeping bucket-side rules and cluster-side retention separated and aligned.
Estimate egress, retrieval-request, and rehydration charges, and design retention and recall scenarios that avoid surprise invoices.
Recognize the IAM and RBAC minimums required for each target — the CloudFormation Template for AWS, Storage Blob Data Contributor for Azure, and the role of setBlobTier/action in tier transitions.

Analogy: Garage cleanup vs. offsite storage rental. Think of your cluster as a two-car garage. CloudTier is the garage cleanup: you mark anything older than 90 days and haul those boxes to a self-storage unit. The boxes are gone from the garage — you got the floor space back — but you can drive over and fetch them when needed (slowly, for a small fee). CloudArchive is the offsite records vault: you photocopy critical documents and ship the copies to an archival facility. The originals stay in the garage; the archival facility is the durable, regulator-friendly second copy. The most common CCAE exam mistake is conflating these two.

Section 10.1: CloudArchive — Long-Term Retention to Object Storage

10.1.1 Long-Term Retention to Object Storage

CloudArchive is a copy-out mechanism. The Cohesity cluster keeps a complete local snapshot — chunk files, blob files, and metadata — and additionally writes a deduplicated, compressed copy to a registered external target. The local cluster remains authoritative for indices and metadata so that catalog operations (browse, search, restore) can be answered without rehydrating cloud objects unnecessarily [Source: https://www.cohesity.com/solutions/long-term-retention-and-archival/].

CloudArchive is driven by Protection Policies: each policy may attach one or more Archival actions, each referencing an External Target with its own retention horizon. A typical pattern: daily incrementals retained 30 days on cluster, weekly fulls 90 days on cluster, and monthly fulls retained 7 years in S3 Glacier Deep Archive via CloudArchive. Only the third tier leaves the cluster [Source: https://www.cohesity.com/resources/solution-brief/simplify-long-term-data-retention-and-archival/].
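
The typical pattern above can be written down as data to make the "only the third tier leaves the cluster" point concrete. This is a hedged modeling sketch, not a Cohesity API payload: `RetentionRule`, its fields, and the target name `S3-Glacier-Deep-Archive` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RetentionRule:
    name: str
    frequency: str
    retain_days: int
    # None means the copy lives only on the cluster (no Archival action).
    external_target: Optional[str] = None


POLICY = [
    RetentionRule("daily-incremental", "daily", 30),
    RetentionRule("weekly-full", "weekly", 90),
    RetentionRule("monthly-full-archive", "monthly", 7 * 365, "S3-Glacier-Deep-Archive"),
]


def archived_rules(policy: List[RetentionRule]) -> List[RetentionRule]:
    """Return only the rules whose copies are written to an external target."""
    return [rule for rule in policy if rule.external_target is not None]


for rule in archived_rules(POLICY):
    print(f"{rule.name}: {rule.retain_days} days on {rule.external_target}")
```

Listing the rules this way during design reviews helps confirm that each retention horizon is attached to the storage that is meant to hold it.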

10.1.2 Encryption and Immutability Options

CloudArchive honors the cluster’s encryption posture end-to-end. Data leaves the cluster over TLS 1.2+ and is written at rest as AES-256. When the target is AWS S3, Cohesity can also use SSE-S3 or SSE-KMS server-side encryption, the latter requiring kms:Encrypt, kms:Decrypt, and kms:GenerateDataKey permissions on the customer-managed key [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].

Immutability for ransomware-resistant archives uses S3 Object Lock (AWS) or the equivalent Immutable Blob Storage (Azure). Cohesity sets per-object retention via the PutObjectRetention API (authorized by the s3:PutObjectRetention permission) as objects are written. Plan to enable Object Lock, which also requires bucket versioning, when the bucket is created; retrofitting it onto an existing bucket has historically required contacting AWS Support [Source: https://aws.amazon.com/blogs/apn/how-to-turn-archive-data-into-actionable-insights-with-cohesity-and-aws/].

10.1.3 Indexing for Cloud-Archived Snapshots

The Cohesity index (handled by Yoda) stays on the cluster. That has two consequences architects must internalize:

You can browse and search archived snapshots without paying retrieval fees. The metadata is local; only the actual chunk data lives in Glacier.
If the originating cluster is destroyed, a replacement cluster must first rebuild the index from cloud-resident metadata before it can browse and restore quickly. This is one of the principal differences between CloudArchive (cluster-authoritative) and FortKnox cyber vaulting (Cohesity-managed, see Chapter 11).

10.1.4 Direct Archive vs. Archive on Policy

Two operational variants exist:

CloudArchive (standard) — the default; archive copy is created via a Protection Policy attached to a Protection Group.
CloudArchive Direct — a streaming variant for pure archival workloads. Data flows through the cluster but is not retained as a full local copy — only metadata and index live on cluster, while bulk data is streamed directly to the external target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/archive-your-data-directly-with-cohesity-cloudarchive-direct-white-paper.pdf].

CloudArchive Direct is appropriate when the local cluster does not need to be the recovery hot tier — e.g., decommissioned apps retained only for compliance.

Key Takeaway: CloudArchive is your long-term retention copy to cheap object storage. The cluster keeps the local copy; the cloud holds the second copy. Use it when retention horizons exceed the cluster’s economic capacity sweet spot, or when compliance demands a separate, off-cluster copy.

Section 10.2: CloudTier — Capacity Extension for Cold Blocks

10.2.1 Capacity Tiering for Cold Blocks

CloudTier is a move operation, not a copy. The cluster’s tiering engine continuously profiles block heat: blocks not accessed for the configured threshold are migrated out to cloud object storage, freeing local capacity. The cluster retains a pointer (a stub) so the namespace appears unchanged — when the data is needed, it is rehydrated transparently [Source: https://www.cohesity.com/glossary/cloud-tiering/].

This is the garage-cleanup behavior from the chapter analogy. The block was in the garage. Now it is in the storage unit. There is no second copy.

Because tiering moves data, CloudTier is irreversible without a recall operation. If you tier 500 TB out to S3 Standard-IA, you cannot simply “untier” by toggling a setting — you must recall the data, which counts as a full read against the object store and incurs egress and request charges.

10.2.2 Tiering Thresholds and Recall Behavior

Tiering is policy-driven on a per-View Box basis. Common thresholds:

Age-based: “tier any block not read in 90 days.”
Capacity-based: “begin tiering when the View Box exceeds 80% utilization.”
Hybrid: combine both — tier opportunistically by age, urgently by capacity pressure.
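
The hybrid threshold logic above can be sketched as a single decision function. This is a simplified illustration of the idea, not the cluster's actual tiering engine: the function name, the return labels, and in particular the lower age floor used under capacity pressure (`pressure_age_days`) are assumptions made for the example.

```python
def tiering_decision(
    days_since_last_read: int,
    viewbox_utilization: float,
    age_threshold_days: int = 90,
    capacity_threshold: float = 0.80,
    pressure_age_days: int = 30,
) -> str:
    """Classify one block as 'tier:age', 'tier:capacity', or 'keep'."""
    # Opportunistic path: the block is simply old enough.
    if days_since_last_read >= age_threshold_days:
        return "tier:age"
    # Urgent path: the View Box is over its capacity threshold, so a
    # younger (but not hot) block also becomes eligible.
    if (viewbox_utilization > capacity_threshold
            and days_since_last_read >= pressure_age_days):
        return "tier:capacity"
    return "keep"


print(tiering_decision(120, 0.50))  # old block, plenty of space
print(tiering_decision(45, 0.85))   # younger block, View Box under pressure
print(tiering_decision(10, 0.95))   # hot block stays local even under pressure
```

Keeping a floor under the capacity path is one way to avoid tiering blocks that an upcoming backup or restore is likely to read again.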

When a recall is required (e.g., a restore operation reads a tiered block), the cluster fetches the object, repopulates it locally, and returns the read. The recall completes transparently from the application’s perspective, but the latency profile changes — tiered blocks pay a one-time round trip to the cloud.

10.2.3 Performance Impact Considerations

Architects must reserve local capacity headroom to avoid tiering hot data. Common pitfalls:

Aggressive thresholds (e.g., “tier anything older than 7 days”) tier blocks that will be read by month-end backups, causing recall storms.
Disaster recall storms — rehydrating 100 TB of tiered data through WAN bandwidth and cloud egress quotas can dominate RTO. Plan this scenario explicitly.
Cluster cache effect — recalled blocks repopulate the local tier, so recurring workloads stop hitting the cloud after the first recall.
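
To plan the disaster recall scenario explicitly, a back-of-the-envelope estimate of transfer time and egress cost is usually enough to start. The sketch below assumes decimal units, reuses the Chapter 9 guidance of planning for 50 percent of wire speed, and uses a placeholder egress price of 0.09 USD per GB that is not a quote from any provider. Request charges are ignored, and all names are hypothetical.

```python
from typing import Tuple


def recall_estimate(
    tiered_tb: float,
    wan_gbps: float,
    usable_fraction: float = 0.5,
    egress_usd_per_gb: float = 0.09,
) -> Tuple[float, float]:
    """Return (hours to recall, egress cost in USD) for a full recall."""
    total_bits = tiered_tb * 1e12 * 8
    effective_bps = wan_gbps * 1e9 * usable_fraction
    hours = total_bits / effective_bps / 3600
    # 1 TB = 1000 GB in decimal units.
    cost = tiered_tb * 1000 * egress_usd_per_gb
    return hours, cost


# Hypothetical case: 100 TB of tiered data recalled over a 10 Gbps WAN link.
hours, cost = recall_estimate(100, 10)
print(f"{hours:.1f} hours, about {cost:,.0f} USD egress")
```

With these assumptions the recall alone takes roughly 44 hours, which shows why a recall storm can dominate RTO before any VM is powered on.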

10.2.4 Tier vs. Archive Trade-offs

Dimension	CloudTier	CloudArchive
Data movement	Move (single copy)	Copy (two copies — local + cloud)
Local footprint	Reduced	Unchanged (full local + cloud)
Reversibility	Recall required	Local copy still authoritative
Typical destination class	S3 Standard-IA, Azure Cool	S3 Glacier, Azure Archive
Driver	Cluster running out of capacity	Compliance / LTR retention
Recall cost exposure	High — every restore pulls from cloud	Low — restores read local; cloud only on disaster

The two patterns are complementary, not mutually exclusive. A common enterprise design tiers cold backup blocks to S3 Standard-IA (CloudTier) and copies monthly fulls to Glacier Deep Archive (CloudArchive). The tier reduces local footprint; the archive provides the compliance-grade second copy [Source: https://www.cohesity.com/blogs/leverage-cloud-long-term-archival-with-cohesity/].

Key Takeaway: CloudTier moves cold blocks to free local capacity; CloudArchive copies snapshots out for retention and durability. If your driver is “cluster is full,” tier. If your driver is “regulator requires 7-year retention,” archive. If both — do both.

Section 10.3: CloudReplicate and CloudSpin — Cloud as a DR Plane

10.3.1 CloudReplicate to a Cohesity Cloud Edition

CloudReplicate is conceptually identical to cluster-to-cluster replication (Chapter 9), except the destination cluster is a Cohesity Cloud Edition running inside AWS, Azure, or GCP. The replicated data lands on a fully-functional Cohesity cluster — just like the source — so all DataPlatform features (instant mass restore, indexing, granular search) are available on the cloud side.

CloudReplicate is the right answer when:

The DR strategy requires a functioning Cohesity control plane in the cloud (e.g., to run Helios apps, recover into VMware Cloud on AWS, or serve restores into native services).
RTO requirements exclude a slow rehydration from object storage.
Compliance permits the cluster’s normal feature set in the cloud (DataLock, indexing, etc.).

The cost profile is meaningfully higher than CloudArchive — you are paying for cluster compute (EC2 instances or Azure VMs running Cohesity), local SSD/EBS, and replication network — but the recovery posture is dramatically better.

10.3.2 Converting Backups to Native Cloud VMs (CloudSpin)

CloudSpin converts an on-prem VM backup into a native cloud VM — an EC2 instance, an Azure VM, or a GCE instance — without requiring a Cohesity cluster on the destination side. The operator picks a VM backup (typically a VMware or Hyper-V VM), specifies the target cloud account, network, and instance type, and Cohesity:

Reads the VM backup from local cluster (or recalls from CloudArchive if needed).
Converts the disk format (VMDK or VHDX → AMI for AWS, managed disk for Azure).
Boots the VM into the target VPC/VNet with the chosen instance shape.

CloudSpin is the right answer for:

Test/dev cloud bursting — spin a copy of a production VM in the cloud for a stress test, then destroy it.
Forensic investigation — boot a known-clean snapshot in an isolated VPC for malware analysis.
Cloud migration trial runs — validate that a workload runs in EC2 before committing to lift-and-shift.

CloudSpin is not a continuous DR replication mechanism: each spin is a discrete conversion job. Contrast this with CloudReplicate, where the cloud cluster is continuously hydrated with the latest snapshots.

10.3.3 Network and IAM Prerequisites

For both CloudReplicate and CloudSpin, the Cohesity cluster needs:

Outbound HTTPS (port 443) to the cloud control plane endpoints.
IAM credentials with permission to create EC2/Azure VM resources, manage EBS/managed disks, and configure VPC/VNet network interfaces.
VPC/VNet design with appropriate subnets, security groups, and route tables for the spun VMs.
For CloudReplicate, a registered Cloud Edition cluster as the replication destination — it must be reachable from the source cluster over the network.

The IAM minimums for CloudSpin in AWS include ec2:RunInstances, ec2:CreateVolume, ec2:AttachVolume, ec2:CreateImage, ec2:RegisterImage, iam:PassRole, plus the S3 actions to read the backup objects if they were archived to S3 [Source: https://www.cohesity.com/partners/aws/].
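
The actions listed above can be assembled into an IAM policy document for review. The sketch below is a starting point under stated assumptions, not Cohesity's published policy: the `Resource: "*"` scope is a placeholder that should be narrowed, the S3 read actions (`s3:GetObject`, `s3:ListBucket`) are an assumed interpretation of "the S3 actions to read the backup objects", and the bucket name is hypothetical. The vendor's own template remains the authoritative source.

```python
import json
from typing import List, Optional

# Actions named in the text for CloudSpin in AWS.
CLOUDSPIN_ACTIONS = [
    "ec2:RunInstances",
    "ec2:CreateVolume",
    "ec2:AttachVolume",
    "ec2:CreateImage",
    "ec2:RegisterImage",
    "iam:PassRole",
]


def build_policy(actions: List[str], archive_bucket: Optional[str] = None) -> dict:
    """Build an IAM policy document as a plain dict."""
    statements = [{
        "Sid": "CloudSpinCompute",
        "Effect": "Allow",
        "Action": sorted(actions),
        "Resource": "*",  # placeholder: tighten to specific ARNs in practice
    }]
    if archive_bucket:
        # Assumed read permissions for backups that were archived to S3.
        statements.append({
            "Sid": "ReadArchivedBackups",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{archive_bucket}",
                f"arn:aws:s3:::{archive_bucket}/*",
            ],
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_policy(CLOUDSPIN_ACTIONS, archive_bucket="example-archive-bucket")
print(json.dumps(policy, indent=2))
```

Generating the document from a list keeps the reviewed action set in one place, which makes least-privilege audits easier.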

10.3.4 Test Recovery and Clean-Up

A discipline the CCAE exam emphasizes: every cloud DR mechanism must be tested without affecting production. SiteContinuity (Chapter 9) wraps CloudSpin and CloudReplicate operations in runbooks that allow:

Test failover — spin VMs in an isolated VPC for validation, then tear down without touching the production destination.
Planned failover — orchestrated cutover with re-IP and DNS updates.
Failback — reverse replication once the primary site is recovered.

Clean-up matters because spun cloud VMs accrue compute charges as long as they run. Always include a destroy step in your runbook.

Key Takeaway: CloudReplicate gives you a hot Cohesity cluster in the cloud — full feature parity, paid by the hour. CloudSpin gives you a one-shot native cloud VM, useful for bursting and testing but not for continuous DR. Pick the one whose recovery model matches your RTO and budget.

Section 10.4: The Decision Matrix — Which Cloud Integration to Use When

The single most exam-relevant artifact in this chapter is the decision matrix below. Memorize it.

Capability	CloudArchive	CloudArchive Direct	CloudTier	CloudReplicate	CloudSpin
Primary purpose	Long-term retention copy	Streaming archive (low local footprint)	Capacity extension	Cluster-to-cluster cloud DR	Convert backup to native cloud VM
Data movement	Copy (local + cloud)	Stream (metadata local, data in cloud)	Move (single copy in cloud)	Copy to remote Cohesity cluster	Convert and boot
Local copy retained?	Yes	No (metadata only)	No (stub remains)	Yes	Yes
Reversible?	N/A — both copies exist	Limited — no local copy	No — must recall	N/A — both clusters live	N/A — VM is independent after spin
Typical destination class	S3 Glacier / Azure Archive	S3 Glacier / Azure Archive	S3 Std-IA / Azure Cool	EC2/Azure VM (Cloud Edition)	Native EC2/Azure VM
Driver	Compliance / LTR	Cold-only retention	Cluster running full	Cloud DR with Cohesity features	Cloud burst / test / forensics
Recovery speed	Hours (Glacier rehydration)	Hours	Recall latency on first read of tiered blocks; local speed after	Seconds (warm cluster)	Minutes (boot the VM)
Cost profile	Lowest $/GB-month, plus retrieval fees	Even lower (no local copy)	Mid; egress on recall	Highest (running cluster)	Mid (per-spin, then VM hourly)
Configured at	Inventory > External Targets (Archival)	Same, with Direct flag	Inventory > External Targets (Tiering)	Replication settings + remote cluster	Recover > Cloud Spin

Exam tip: “Cluster at 85% capacity, customer wants cold backups online 90 more days, budget tight” → CloudTier (driver is capacity). “7-year compliance requirement, ransomware-resistant copies off-cluster” → CloudArchive with Object Lock. “Quarterly DR test in AWS without standing up another Cohesity cluster” → CloudSpin.

Figure 10.1: Cloud integration option decision tree

flowchart TD
    Start([What is the primary driver?]) --> Q1{Cluster running<br/>out of capacity?}
    Q1 -->|Yes| Tier[CloudTier<br/>Move cold blocks<br/>S3 Std-IA / Azure Cool]
    Q1 -->|No| Q2{Long-term<br/>retention copy<br/>required?}
    Q2 -->|Yes| Q3{Need local copy<br/>for fast restore?}
    Q3 -->|Yes| Archive[CloudArchive<br/>Copy to Glacier/Archive<br/>Local + Cloud copies]
    Q3 -->|No| Direct[CloudArchive Direct<br/>Stream to cloud<br/>Metadata only on cluster]
    Q2 -->|No| Q4{Need warm<br/>cluster in cloud<br/>for DR?}
    Q4 -->|Yes| Replicate[CloudReplicate<br/>To Cohesity Cloud Edition<br/>Full feature parity]
    Q4 -->|No| Q5{Need native<br/>cloud VM from<br/>backup?}
    Q5 -->|Yes| Spin[CloudSpin<br/>Convert to EC2/Azure VM<br/>Burst / forensics / migration]
    Q5 -->|No| Reassess([Reassess requirements])
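
The decision tree in Figure 10.1 can be sketched as a small function. This is only a study aid: the function and parameter names are hypothetical, and the question order simply mirrors the flowchart.

```python
def choose_cloud_pattern(*, capacity_pressure=False, long_term_retention=False,
                         keep_local_copy=False, warm_cloud_cluster=False,
                         native_cloud_vm=False):
    """Walk the Figure 10.1 questions in order and return the first match."""
    if capacity_pressure:
        return "CloudTier"
    if long_term_retention:
        # Same archive target; the only difference is whether a local copy stays.
        return "CloudArchive" if keep_local_copy else "CloudArchive Direct"
    if warm_cloud_cluster:
        return "CloudReplicate"
    if native_cloud_vm:
        return "CloudSpin"
    return "Reassess requirements"


# The three exam-tip scenarios from the text above.
print(choose_cloud_pattern(capacity_pressure=True))
print(choose_cloud_pattern(long_term_retention=True, keep_local_copy=True))
print(choose_cloud_pattern(native_cloud_vm=True))
```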

Section 10.5: Configuring CloudArchive to AWS S3 Glacier — The Five Steps

The AWS Glacier flow is the most exam-tested external target configuration. Cohesity documents a five-step pattern [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].

Figure 10.2: AWS S3 setup workflow — five-step configuration

flowchart LR
    A[Step 1<br/>Register<br/>External Target<br/>Inventory > Targets<br/>Purpose: Archival] --> B[Step 2<br/>IAM via CFT<br/>CloudFormation<br/>least-privilege role<br/>+ KMS policy]
    B --> C[Step 3<br/>Bucket Policy<br/>Allow Cohesity role<br/>PutObject/GetObject<br/>RestoreObject<br/>Object Lock enabled]
    C --> D[Step 4<br/>Lifecycle Rule<br/>Std → IA → Glacier<br/>→ Deep Archive<br/>Bucket-side, not<br/>Cohesity]
    D --> E[Step 5<br/>Bind to<br/>Protection Policy<br/>Add Archival action<br/>Set retention<br/>Validate + run]
    style A fill:#1f6feb,color:#fff
    style B fill:#1f6feb,color:#fff
    style C fill:#1f6feb,color:#fff
    style D fill:#1f6feb,color:#fff
    style E fill:#238636,color:#fff

10.5.1 Step 1 — Register the External Target

Navigate to Inventory > External Targets > Register External Target in Cohesity Dashboard or Helios. Configure:

Name: descriptive, e.g., S3-Glacier-Archive-Prod
Purpose: Archival, not Tiering. Picking the wrong value is a one-click mistake that defines the target’s whole behavior.
Provider: AWS > S3
Bucket name, region, AWS Access Key, Secret Key
Storage class: Glacier or Deep Archive (or Standard if relying on a bucket-side lifecycle rule)

Cohesity 7.1+ supports the full Glacier API family — Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive [Source: https://aws.amazon.com/blogs/storage/storing-data-with-aws-partner-solutions-and-amazon-s3-glacier-instant-retrieval/].

10.5.2 Step 2 — IAM via the Cohesity CloudFormation Template

Cohesity publishes a CloudFormation Template (CFT) that creates a least-privilege IAM role for the cluster. Run it from the AWS Console; it provisions a role with these actions [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm]:

s3:PutObject
s3:GetObject
s3:DeleteObjectVersion
s3:RestoreObject              ← required to recall from Glacier
s3:PutLifecycleConfiguration
s3:GetLifecycleConfiguration
s3:GetBucketObjectLockConfiguration
s3:PutObjectRetention         ← required for Object Lock / WORM
iam:SimulatePrincipalPolicy
kms:Encrypt
kms:Decrypt
kms:GenerateDataKey           ← required for SSE-KMS

The CFT-generated role is the right answer on the exam — never grant s3:* or ec2:* to the Cohesity principal. If a customer-managed KMS key encrypts the bucket, the Cohesity role ARN must also be added to the KMS key policy, not just the bucket policy.
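
A quick way to audit a policy document for the wildcard grants warned against above is to scan its Allow statements. The sketch below assumes the standard IAM policy JSON shape (`Statement`, `Effect`, `Action`); the function name and the sample policy are illustrative, not part of any Cohesity or AWS tool.

```python
def find_wildcard_actions(policy):
    """Return every action in an Allow statement that contains a wildcard."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):      # IAM allows a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):      # IAM allows a single action string
            actions = [actions]
        findings.extend(a for a in actions if "*" in a)
    return findings


# Hypothetical over-broad policy of the kind the exam expects you to reject.
too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": ["s3:*", "kms:Decrypt"], "Resource": "*"}],
}
print(find_wildcard_actions(too_broad))
```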

10.5.3 Step 3 — Bucket Policy

The CFT applies a bucket policy authorizing the Cohesity role explicitly:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCohesityArchive",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::<ACCOUNT>:role/<COHESITY-ROLE>"},
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:RestoreObject",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}

If immutability is required, enable S3 Object Lock when the bucket is created. Retrofitting it onto an existing bucket has historically required AWS Support intervention, so treat it as a creation-time decision. With Object Lock on, Cohesity sets per-object retention via s3:PutObjectRetention, producing WORM copies that resist ransomware and rogue-admin deletion [Source: https://aws.amazon.com/blogs/apn/how-to-turn-archive-data-into-actionable-insights-with-cohesity-and-aws/].

10.5.4 Step 4 — S3 Lifecycle Rule for Glacier Transition

The S3 lifecycle rule lives on the bucket, not in Cohesity. This separation is crucial: Cohesity manages retention (how long the logical object lives, and whether it is locked); the bucket lifecycle manages storage class (which physical tier it sits in while it lives).

A typical rule:

Rule:
  ID: ToDeepArchive
  Status: Enabled
  Filter:
    Prefix: cohesity/
  Transitions:
    - Days: 30
      StorageClass: GLACIER
    - Days: 180
      StorageClass: DEEP_ARCHIVE
  # Optional cleanup matching Cohesity retention horizon
  Expiration:
    Days: 2555  # 7 years

Architects must respect the minimum storage durations below. Deleting objects before the minimum triggers an early-deletion charge equal to the storage cost of the remaining minimum days [Source: https://docs.cohesity.com/baas/data-protect/protect-amazon-s3.htm].
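
The early-deletion charge described above can be estimated with simple proration. This is a rough sketch that assumes a 30-day billing month and the approximate list prices quoted in this chapter; the function name is hypothetical and actual invoices may differ.

```python
def early_deletion_charge(size_gb, price_per_gb_month, min_days, days_stored):
    """Storage cost of the remaining minimum days, prorated on a 30-day month."""
    remaining_days = max(min_days - days_stored, 0)
    return size_gb * price_per_gb_month * remaining_days / 30


# 10 TB (10,000 GB) deleted from Deep Archive after 60 of the 180 minimum days.
charge = early_deletion_charge(10_000, 0.00099, min_days=180, days_stored=60)
print(f"${charge:.2f}")
```

Objects kept past the minimum produce a charge of zero, which is why the helper clamps the remaining days at 0.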

10.5.5 Step 5 — Bind to a Protection Policy and Validate

In Cohesity, edit a Protection Policy (or create one), add an Archival action that targets the new external target, and define the retention. Attach the policy to a Protection Group, validate connectivity from the External Targets page, and trigger an on-demand archive job.

Figure 10.3: CloudArchive to S3 Glacier — end-to-end sequence

sequenceDiagram
    participant Cluster as Cohesity Cluster
    participant Policy as Protection Policy
    participant S3 as S3 Bucket (Standard)
    participant Lifecycle as S3 Lifecycle Rule
    participant Glacier as S3 Glacier / Deep Archive
    Policy->>Cluster: Trigger archive job<br/>(monthly fulls)
    Cluster->>Cluster: Dedupe + compress<br/>+ AES-256 encrypt
    Cluster->>S3: PutObject (TLS 1.2+)<br/>SSE-KMS server-side
    Cluster->>S3: PutObjectRetention<br/>(Object Lock WORM)
    S3-->>Cluster: 200 OK + ETag
    Note over S3,Lifecycle: Object lives in Standard<br/>per bucket-side rule
    Lifecycle->>S3: Day 30: transition GLACIER
    Lifecycle->>Glacier: Day 180: transition<br/>DEEP_ARCHIVE
    Note over Cluster,Glacier: Recall path on disaster
    Cluster->>Glacier: RestoreObject<br/>(Standard, 12h)
    Glacier-->>S3: Rehydrate to<br/>temporary copy
    Cluster->>S3: GetObject
    S3-->>Cluster: Restored chunks<br/>(egress fees apply)

10.5.6 Glacier Pricing and Minimum Retention

Storage Class	$/GB-month (us-east-1)	Min Retention	Retrieval Time (Standard)	Use Case
S3 Standard	~$0.023	none	n/a	Active backups, hot recovery
S3 Standard-IA	~$0.0125	30 days	n/a	CloudTier destination
S3 Glacier Instant Retrieval	~$0.004	90 days	milliseconds	Rare-but-fast archive recall
S3 Glacier Flexible Retrieval	~$0.0036	90 days	3–5 hours (Standard)	Default Glacier tier
S3 Glacier Deep Archive	~$0.00099	180 days	12 hours	Multi-year compliance

[Source: https://aws.amazon.com/blogs/storage/storing-data-with-aws-partner-solutions-and-amazon-s3-glacier-instant-retrieval/]

Cost-optimization heuristics:

Retention under 90 days: stay in Standard or Standard-IA; Glacier early-deletion fees will erase the savings.
Retention 90 days to 1 year: Glacier Flexible Retrieval or Glacier Instant Retrieval.
Retention beyond 1 year: Deep Archive — at $0.00099/GB-month, 100 TB costs ~$99/month versus ~$2,300/month at Standard.
Always model retrieval-request fees in addition to storage. A 100 TB recall from Deep Archive is roughly $0.02/GB plus per-request fees, easily $2,000+ per recall event.
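
The heuristics above translate directly into a lookup. The sketch below uses the approximate us-east-1 prices from the table in this section and decimal terabytes (1 TB = 1,000 GB); the function names and class keys are illustrative choices.

```python
# Approximate $/GB-month figures copied from the pricing table above.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def suggest_storage_classes(retention_days):
    """Map a retention horizon to the candidate classes named in the heuristics."""
    if retention_days < 90:
        return ("STANDARD", "STANDARD_IA")
    if retention_days <= 365:
        return ("GLACIER", "GLACIER_IR")
    return ("DEEP_ARCHIVE",)


def monthly_storage_cost(tb, storage_class):
    """Resting storage cost per month, ignoring request and retrieval fees."""
    return tb * 1000 * PRICE_PER_GB_MONTH[storage_class]


print(suggest_storage_classes(7 * 365))
print(round(monthly_storage_cost(100, "DEEP_ARCHIVE")))
print(round(monthly_storage_cost(100, "STANDARD")))
```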

Key Takeaway: The five-step AWS pattern — register target, run CFT for IAM, apply bucket policy, set lifecycle rule, bind to Protection Policy — is the most testable workflow in Chapter 10. Memorize who manages what: Cohesity owns retention; the bucket lifecycle owns storage class.

Section 10.6: Configuring CloudArchive to Azure Blob Storage

Azure’s permission model differs structurally from AWS. There is no JSON IAM policy; you assign RBAC roles to a service principal, managed identity, or user [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].

10.6.1 The Minimum Role: Storage Blob Data Contributor

The single role a CCAE candidate must remember is Storage Blob Data Contributor. It grants:

Read, write, and delete on blobs.
The setBlobTier/action data action — required for tier transitions to Archive.
Sufficient for backup, archive, list, and restore against the target container.

What it does not grant: control-plane operations like creating storage accounts. That is Storage Account Contributor, which is rarely needed for Cohesity (the customer typically pre-provisions the account).

A common exam trap: assigning Reader or Storage Account Contributor alone. These are control-plane roles — they let you see and configure the storage account but do not grant data-plane access to blobs. The cluster will fail to write objects with confusing 403 errors. Storage Blob Data Contributor is the data-plane role.

10.6.2 Authentication Model — Entra ID Service Principals Preferred

Cohesity supports three Azure auth methods, in this order of preference:

Microsoft Entra ID (formerly Azure AD) service principal with RBAC — the recommended pattern. Create a service principal in Entra, assign it Storage Blob Data Contributor scoped to the container or storage account, and provide Cohesity the tenant ID, client ID, and client secret. If the cluster is running in Azure (Cloud Edition), use a managed identity to avoid storing secrets at all.
Shared Access Signature (SAS) — time-limited, scoped tokens. Suitable for short-lived integrations or where RBAC is restricted, but rotation is operator-managed and silent expiry causes failed archives. Avoid for production.
Storage Account Access Keys — full account-level access. Easiest to configure, hardest to revoke, and the highest blast radius if leaked.

The Entra service principal pattern is the production answer. Always prefer it on the exam.

10.6.3 The setBlobTier Data Action and the Azure Archive Tier

To move a blob to Azure’s Archive access tier — the cheapest tier, comparable to S3 Deep Archive — the principal must have the data action:

Microsoft.Storage/storageAccounts/blobServices/containers/blobs/setBlobTier/action

This action is included in Storage Blob Data Contributor, which is why that role is sufficient. If you build a custom role for least privilege, do not forget this action, or tier transitions will fail silently.

Two mechanisms can drive the tier change to Archive:

Lifecycle Management Policy on the storage account — analogous to S3 lifecycle rules. Example: “move blobs older than 30 days to Cool, then 90 days to Archive.”
setBlobTier API call — direct per-blob tier setting, useful for Cohesity-driven scripted transitions.

Rehydration from the Archive tier takes up to 15 hours (Standard priority) or up to 1 hour (High priority, additional cost). Cohesity’s restore UI lets the operator pick the rehydration priority [Source: https://www.cohesity.com/glossary/cloud-tiering/].

10.6.4 Private Endpoints — Production Networking

For production deployments, lock down the Blob endpoint to private networking:

Provision an Azure Private Endpoint on the storage account’s blob sub-resource.
Approve the private endpoint connection in Networking > Private endpoint connections.
Ensure the Cohesity cluster is on a VNet that can resolve privatelink.blob.core.windows.net via Azure Private DNS or a custom DNS forwarder.
If public access is permitted at all, restrict the storage account firewall to the Cohesity public IPs (IP rules) or to the cluster’s VNet/subnet (virtual network rules backed by the Microsoft.Storage service endpoint).

A common production pitfall: private endpoint configured but DNS pointed at the public Blob endpoint. The cluster’s traffic falls back to the public IP, which is then blocked by the storage account firewall, and archives fail. Verify DNS resolution explicitly during cutover.
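
One way to verify DNS during cutover is to resolve the blob hostname from a host on the same network as the cluster and confirm every answer is a private address. The sketch below uses only the Python standard library; the helper names are hypothetical and the hostname in the usage note is a placeholder.

```python
import ipaddress
import socket


def resolve_ips(hostname, port=443):
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(ips):
    """True only if at least one address was returned and all are private."""
    return bool(ips) and all(ipaddress.ip_address(ip).is_private for ip in ips)


# Offline illustration with literal addresses (no DNS lookup performed here).
print(all_private(["10.20.0.5"]))    # private endpoint style answer
print(all_private(["20.60.1.1"]))    # public answer: DNS is not using the private zone
```

In practice you would call `all_private(resolve_ips("<account>.blob.core.windows.net"))` from inside the VNet and treat a False result as a cutover blocker.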

10.6.5 Custom Role JSON for Reference

For environments where Storage Blob Data Contributor is too broad, a custom role can be defined:

{
  "Name": "Cohesity Blob Archive",
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/containers/write"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/setBlobTier/action"
  ],
  "AssignableScopes": [
    "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}"
  ]
}

This is the least-privilege equivalent of Storage Blob Data Contributor scoped to a single account.
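
When hand-building a custom role, a small check can catch a forgotten data action before the first archive job fails. The sketch below compares a role definition against the four data actions listed in the JSON above; the action strings are copied from this section and have not been independently verified against Azure, and wildcard entries are deliberately not expanded.

```python
# Data actions as listed in the custom role JSON in this section.
BLOB = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs"
REQUIRED_DATA_ACTIONS = {
    f"{BLOB}/read",
    f"{BLOB}/write",
    f"{BLOB}/delete",
    f"{BLOB}/setBlobTier/action",
}


def missing_data_actions(role_definition):
    """Return the required data actions the role definition does not list."""
    granted = set(role_definition.get("DataActions", []))
    return sorted(REQUIRED_DATA_ACTIONS - granted)


# Hypothetical role that forgot the tiering action.
incomplete_role = {"Name": "Cohesity Blob Archive",
                   "DataActions": [f"{BLOB}/read", f"{BLOB}/write", f"{BLOB}/delete"]}
print(missing_data_actions(incomplete_role))
```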

Key Takeaway: For Azure, Storage Blob Data Contributor is the data-plane minimum role. Always prefer Entra ID service principals over SAS tokens or account keys. Always plan for private endpoints in production. The setBlobTier/action is what makes Archive transitions work.

Section 10.7: Storage Classes and Cost Modeling — The Egress Worked Example

10.7.1 S3 and Azure Class Mapping

Use Case	AWS S3 Class	Azure Blob Tier	Min Retention	Latency
Active backups, fast recovery	S3 Standard	Hot	none	ms
Cold backups (CloudTier target)	S3 Standard-IA	Cool	30 days	ms
Archive, fast occasional recall	S3 Glacier Instant Retrieval	Cold (online tier)	90 days	ms
Archive, multi-hour recall OK	S3 Glacier Flexible Retrieval	(no exact equivalent; Archive is the nearest offline tier)	90 days	3–5 h
Deepest archive, multi-year	S3 Glacier Deep Archive	Archive	180 days	12 h (AWS) / up to 15 h (Azure)

Figure 10.4: Storage class taxonomy — AWS S3 and Azure Blob tiers

graph TD
    Root[Cloud Object Storage Classes]
    Root --> AWS[AWS S3]
    Root --> AZ[Azure Blob]

    AWS --> AStd[S3 Standard<br/>Hot active backups<br/>~$0.023/GB-mo<br/>ms latency]
    AStd --> AIA[S3 Standard-IA<br/>CloudTier target<br/>~$0.0125/GB-mo<br/>30-day min]
    AIA --> AGIR[Glacier Instant Retrieval<br/>~$0.004/GB-mo<br/>90-day min, ms recall]
    AGIR --> AGFR[Glacier Flexible Retrieval<br/>~$0.0036/GB-mo<br/>90-day min, 3-5h recall]
    AGFR --> ADA[Glacier Deep Archive<br/>~$0.00099/GB-mo<br/>180-day min, 12h recall]

    AZ --> ZHot[Hot tier<br/>Active workloads<br/>ms latency]
    ZHot --> ZCool[Cool tier<br/>30-day min<br/>ms latency]
    ZCool --> ZCold[Cold tier<br/>90-day min<br/>ms latency]
    ZCold --> ZArch[Archive tier<br/>180-day min<br/>up to 15h rehydrate<br/>setBlobTier/action]

    style ADA fill:#0d4429,color:#fff
    style ZArch fill:#0d4429,color:#fff
    style AStd fill:#1f6feb,color:#fff
    style ZHot fill:#1f6feb,color:#fff

10.7.2 The Three Cost Components

Every cloud target has three cost layers that an architect must model separately:

Storage — $/GB-month for the data at rest. Cheapest in Deep Archive or Azure Archive.
Requests — per-API-call fees for PUT, GET, RESTORE, etc. Glacier and Archive request fees can dominate when many small objects are archived.
Egress — outbound data transfer when restoring data back on-prem (or to a different region). Often the largest single charge.

Egress is the silent budget killer. AWS charges roughly $0.09/GB egress to Internet (us-east-1, list price; first 100 GB/month free). Azure egress is similarly priced. Within-region transfer to another AWS service is typically free; cross-region transfer is roughly $0.02/GB.

10.7.3 Worked Example: 100 TB Disaster Recall

Scenario: a customer archived 500 TB of monthly fulls to S3 Glacier Deep Archive over four years. A ransomware event destroys the production environment. The customer must recall 100 TB to a new on-prem cluster within 48 hours.

Cost Component	Calculation	Cost
Storage at rest (500 TB held for 48 months at $0.00099/GB-month; an upper bound, since the archive actually grew over the four years)	$0.99/TB-month × 500 TB × 48 months	~$23,760
Restore (Standard retrieval, $0.02/GB × 100 TB)	$0.02 × 100,000 GB	~$2,000
Restore requests (~$0.025 per 1,000 requests, ~10M objects)	$0.025 × 10,000	~$250
Egress to Internet ($0.09/GB × 100 TB)	$0.09 × 100,000 GB	~$9,000
Total recall event		~$11,250 one-time
Plus 4 years of storage already paid		~$23,760
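
The table's arithmetic can be reproduced in a few lines, which makes it easy to rerun with a customer's own numbers. The rates below are the approximate figures used in the table, with 1 TB treated as 1,000 GB; the function names are illustrative.

```python
GB_PER_TB = 1000  # decimal terabytes, matching the table


def recall_event_cost(recall_tb, objects, retrieval_per_gb=0.02,
                      request_per_1000=0.025, egress_per_gb=0.09):
    """Break a recall into retrieval, request, and egress charges."""
    gb = recall_tb * GB_PER_TB
    parts = {
        "retrieval": gb * retrieval_per_gb,
        "requests": objects / 1000 * request_per_1000,
        "egress": gb * egress_per_gb,
    }
    total = sum(parts.values())
    parts["total"] = total
    return parts


def resting_storage_cost(tb, months, price_per_gb_month=0.00099):
    """Storage at rest, assuming the full volume is resident every month."""
    return tb * GB_PER_TB * price_per_gb_month * months


costs = recall_event_cost(recall_tb=100, objects=10_000_000)
print({name: round(value) for name, value in costs.items()})
print(round(resting_storage_cost(500, 48)))
```

Setting `egress_per_gb=0` models an in-region recall and shows how much of the event cost is egress.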

Architectural lessons:

Egress dominates the recall event. Recalling within AWS (e.g., into EC2 in the same region) makes egress effectively free.
Storage is cheap; recall is not. Always design recall destinations to keep traffic in-region when possible.
Bandwidth may be the real constraint. 100 TB over a 1 Gbps WAN link takes ~9 days; AWS Snowball Edge ships 80 TB devices for offline recall.
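
The bandwidth point is worth checking numerically, because the scenario's 48-hour window is far tighter than a 1 Gbps link allows. The sketch below assumes decimal units and a fully utilized link unless an efficiency factor is supplied; the function names are illustrative.

```python
def transfer_days(tb, link_gbps, efficiency=1.0):
    """Days needed to move tb terabytes over a link of link_gbps gigabits/s."""
    bits = tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


def required_gbps(tb, hours):
    """Sustained throughput needed to move tb terabytes within the given hours."""
    return tb * 1e12 * 8 / (hours * 3600) / 1e9


print(f"{transfer_days(100, 1):.1f} days at 1 Gbps")
print(f"{required_gbps(100, 48):.1f} Gbps needed for a 48-hour recall")
```

With real-world protocol overhead (for example `efficiency=0.8`) the 1 Gbps estimate stretches further, and it excludes the Deep Archive rehydration wait that precedes any transfer.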

10.7.4 Lifecycle Policies and Rehydration Windows

Best practices for lifecycle and retention alignment:

Match Cohesity’s retention to the bucket’s lifecycle expiration, with a small safety margin (e.g., expire bucket objects 30 days after Cohesity’s last expected retention day) so orphans are cleaned up but Cohesity-managed objects are never prematurely deleted.
Stage transitions: go Standard → Standard-IA at 30 days → Glacier at 90 days → Deep Archive at 180 days, rather than jumping straight to Deep Archive. Data that has not yet reached Deep Archive is not bound by its 180-day minimum, which limits early-deletion charges if a Protection Policy retention is shortened mid-flight.
Test rehydration windows quarterly. A 12-hour rehydration on a 50 TB recall is a real RTO cost that should appear in the DR runbook.
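
The first two practices can be checked mechanically. This sketch computes a bucket expiration with the 30-day margin suggested above and flags any stage that would move on before its class minimum has elapsed; the minimums come from this chapter's tables, and the function names and tuple layout are assumptions made for illustration.

```python
MIN_STORAGE_DAYS = {"STANDARD_IA": 30, "GLACIER": 90, "DEEP_ARCHIVE": 180}


def bucket_expiration_days(cohesity_retention_days, safety_margin_days=30):
    """Expire bucket objects only after Cohesity's retention plus a margin."""
    return cohesity_retention_days + safety_margin_days


def dwell_violations(transitions):
    """transitions: list of (day, storage_class) tuples sorted by day.

    Returns (storage_class, dwell_days) for each stage that is left
    before that class's minimum storage duration has passed.
    """
    problems = []
    for (day, storage_class), (next_day, _) in zip(transitions, transitions[1:]):
        dwell = next_day - day
        if dwell < MIN_STORAGE_DAYS.get(storage_class, 0):
            problems.append((storage_class, dwell))
    return problems


staged = [(30, "STANDARD_IA"), (90, "GLACIER"), (180, "DEEP_ARCHIVE")]
rushed = [(30, "GLACIER"), (60, "DEEP_ARCHIVE")]
print(bucket_expiration_days(2555))
print(dwell_violations(staged))
print(dwell_violations(rushed))
```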

Key Takeaway: Storage cost is the headline; egress and request fees are the surprise. Always model the recall scenario, not just the resting state, and prefer recovery destinations that keep traffic in-region.

Chapter Summary

Cohesity’s four cloud integration patterns solve four distinct architectural problems:

CloudArchive copies snapshots to cheap object storage for long-term retention; the cluster keeps the local copy and remains authoritative. CloudArchive Direct streams directly to the target without the local full copy.
CloudTier moves cold blocks out to free local capacity — no second copy, irreversible without recall. Driver is on-cluster footprint, not retention.
CloudReplicate replicates to a Cohesity Cloud Edition in AWS/Azure/GCP — best RTO, highest cost.
CloudSpin converts a backup to a native cloud VM on demand — for bursting, forensics, and migration trials, not continuous DR.

The five-step AWS pattern: register External Target (Archival), run the Cohesity CFT for IAM, apply the bucket policy, set the S3 lifecycle rule (respecting 90-day Glacier and 180-day Deep Archive minimums), and bind to a Protection Policy. For Azure, the minimum role is Storage Blob Data Contributor (which includes setBlobTier/action); always prefer Entra service principals and Private Endpoints in production.

Egress is the cost layer most likely to surprise — a 100 TB recall to on-prem costs about $9,000 in egress alone, often more than a year of resting storage. Design recall destinations to stay in-region, and rehearse rehydration windows in runbooks. When in doubt on the exam, ask: is the driver local capacity (tier), long-term retention (archive), continuous cloud DR (replicate), or one-shot cloud VM (spin)?

Key Terms

CloudArchive — Copies snapshots from the cluster to a registered external target (object storage, NFS, or tape) for LTR. Cluster keeps the local copy.
CloudArchive Direct — Streaming variant that omits the full local copy, keeping only metadata on cluster while streaming bulk data to the target.
CloudTier — Moves cold blocks from the cluster to cloud object storage to free local capacity. Single-copy move, not a copy; recall required to read.
CloudReplicate — Replication from on-prem Cohesity cluster to a Cloud Edition cluster running in a public cloud, providing full DataPlatform features.
CloudSpin — Converts a VM backup into a native cloud VM (EC2, Azure VM, GCE) on demand. Used for bursting, testing, and forensic isolation.
External Target — Registered storage destination (object store, NFS, tape) in Cohesity Inventory. Typed at registration as Archival or Tiering.
S3 Glacier — AWS archival class family: Glacier Instant Retrieval (90-day min, ms recall), Flexible Retrieval (90-day min, 3–5 h), Deep Archive (180-day min, 12 h).
Azure Archive — Azure Blob’s coldest tier, comparable to Glacier Deep Archive. Up to 15 h rehydration. Requires setBlobTier/action to enter.
Lifecycle policy — Bucket-side (S3) or storage-account-side (Azure) rule transitioning objects between classes by age. Distinct from Cohesity retention, which governs lifetime.
Storage Blob Data Contributor — Minimum Azure RBAC role for Cohesity to read/write/delete/tier blobs. The data-plane role; control-plane roles alone are insufficient.
CloudFormation Template (CFT) — Cohesity-published AWS automation that creates a least-privilege IAM role and bucket policy for S3 archival.
S3 Object Lock — AWS bucket feature for WORM immutability; must be enabled at bucket creation. Cohesity uses s3:PutObjectRetention to set per-object retention.
Egress — Outbound transfer from a cloud region to Internet or another region; typically $0.09/GB. Largest variable cost in recall scenarios.
Rehydration — Restoring an archived object to a readable class. Deep Archive: up to 12 h Standard; Azure Archive: up to 15 h Standard.

Chapter 11: Security, Encryption, and Ransomware Resilience

Backups used to be the last thing an attacker thought about. Today they are the first. Modern ransomware operators have learned that destroying recovery points is the fastest way to force payment, and they routinely spend days or weeks inside an environment specifically hunting for backup admin credentials before triggering encryption. For a Cohesity architect, this changes the design conversation completely. Security is no longer a hardening checklist applied after the cluster is built — it is the central design axis around which encryption, immutability, isolation, detection, and recovery are organized. This chapter walks through that axis end-to-end, from FIPS-validated cryptography on individual disks to multi-cloud cyber vaults that cannot be touched even by a fully compromised root account.

Learning Objectives

By the end of this chapter you will be able to:

Apply defense-in-depth across hardware, OS, software, and identity layers on a Cohesity cluster.
Configure FIPS-validated encryption at rest and in transit using software encryption, Self-Encrypting Drives (SEDs), and external KMIP/KMS providers.
Architect immutability with DataLock, WORM semantics, and legal hold workflows enforced by a Security Officer role.
Differentiate Cohesity FortKnox cyber vaulting from CloudArchive and explain when each is appropriate.
Design ransomware detection and clean-room recovery patterns using Cohesity DataHawk, including anomaly detection, threat intelligence, and BigID-powered classification.
Layer DataLock + FortKnox + DataHawk into a coherent threat-defense architecture for a regulated workload.

Figure 11.1: Defense-in-Depth Layers across the Cohesity Security Stack

flowchart LR
    HW[Hardware<br/>SED + FIPS modules] --> OS[OS<br/>Hardened Linux]
    OS --> SW[Software<br/>SpanFS + TLS]
    SW --> ID[Identity<br/>SSO + MFA + Quorum]
    ID --> DATA[Data Immutability<br/>DataLock + WORM]

Encryption at Rest and In Transit

Encryption is the foundation. If the disks walk out of the data center, if a replication packet is captured on the wire, or if a cloud archive bucket is misconfigured, the data must remain unreadable. Cohesity supports two parallel encryption approaches at rest — software encryption performed by SpanFS and hardware encryption performed by Self-Encrypting Drives — and TLS for every byte that leaves the cluster.

Software Encryption vs. Self-Encrypting Drives

Software encryption (sometimes called “AES at the SpanFS layer”) is performed by the Cohesity software itself before data is written to disk. Every chunk that lands in a chunk file is encrypted with AES-256 using a Data Encryption Key (DEK) that is itself wrapped by a Key Encryption Key (KEK). The advantage is portability: software encryption works on any node — physical, virtual, or cloud — regardless of the underlying drive technology. The cost is a small CPU overhead, typically absorbed by AES-NI hardware acceleration on modern Intel and AMD processors.

Self-Encrypting Drives (SEDs) push encryption down into the drive’s firmware. The drive itself holds the Media Encryption Key (MEK) and refuses to release plaintext without an authentication key. SEDs are appealing for compliance because the cryptographic boundary is the physical drive — pulling a drive out of a chassis and walking away with it yields ciphertext. Cohesity’s SED-equipped nodes ship as a hardware option on supported appliances and ReadyNodes.

In practice, architects choose between them based on three factors:

Factor	Software Encryption	Self-Encrypting Drives (SED)
Where it runs	SpanFS / Cohesity software	Drive firmware (FIPS 140-2/3 validated)
Form factor	All (physical, VE, Cloud Edition)	Physical appliances and ReadyNodes only
Performance impact	Small (AES-NI accelerated)	None on the host
Key management	KMIP, internal KMS, or AWS/Azure KMS	KMIP authentication key, drive holds MEK
Crypto-erase	Destroying the key makes the data unreadable (logical wipe)	PSID revert wipes drive instantly
Typical use case	Mixed environments, VE, cloud	High-compliance physical sites, fast decommission

A useful analogy: software encryption is like keeping your valuables in a locked safe inside your house — the safe goes with you wherever you live. SEDs are like buying a house where every room has its own combination lock built into the wall — only available in certain houses, but you do not need to bring the safe yourself.

Customer-Managed Keys with KMIP and KMS

Encryption is only as strong as the key custody model. Cohesity supports an internal key manager for small deployments, but at enterprise scale the assumption is that keys live in a customer-controlled Key Management System (KMS). KMIP-compliant key managers are reached over the Key Management Interoperability Protocol (KMIP); common examples include Thales CipherTrust, Entrust KeyControl, IBM Guardium / SKLM, and HashiCorp Vault (via its KMIP secrets engine). Cloud-native services such as AWS KMS and Azure Key Vault are also common integrations, but they are reached through the provider’s own API rather than KMIP.

The flow looks like this:

The cluster generates DEKs locally for each chunk file.
DEKs are wrapped using a KEK that lives only in the external KMS.
To read or write, the cluster calls the KMS over KMIP/TLS to wrap or unwrap the DEK.
If the KMS is unreachable or revokes the KEK, encrypted data on the cluster becomes inaccessible — a powerful kill-switch in a breach scenario.
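
The four steps above describe envelope encryption, and the key hierarchy can be modeled in a few lines. The sketch below is deliberately a toy: a SHA-256 keystream XOR stands in for AES-256, and an in-process class stands in for the external KMIP server. None of the names correspond to Cohesity or KMIP APIs, and the cipher must not be used for real data.

```python
import hashlib
import secrets


def _keystream_xor(key, data):
    """Toy symmetric cipher (NOT AES): XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToyKms:
    """Stand-in for the external KMS: the KEK never leaves this object."""

    def __init__(self):
        self._kek = secrets.token_bytes(32)
        self.revoked = False

    def _check(self):
        if self.revoked:
            raise PermissionError("KEK revoked")

    def wrap(self, dek):
        self._check()
        return _keystream_xor(self._kek, dek)

    def unwrap(self, wrapped_dek):
        self._check()
        return _keystream_xor(self._kek, wrapped_dek)


def write_chunk(kms, plaintext):
    dek = secrets.token_bytes(32)            # step 1: DEK generated locally
    wrapped_dek = kms.wrap(dek)              # step 2: DEK wrapped by the KEK
    return wrapped_dek, _keystream_xor(dek, plaintext)


def read_chunk(kms, wrapped_dek, ciphertext):
    dek = kms.unwrap(wrapped_dek)            # step 3: unwrap needs the KMS
    return _keystream_xor(dek, ciphertext)


kms = ToyKms()
wrapped, stored = write_chunk(kms, b"backup chunk")
print(read_chunk(kms, wrapped, stored))
kms.revoked = True                           # step 4: the kill-switch
```

After `kms.revoked = True`, any further `read_chunk` call raises `PermissionError`, which is the model's version of encrypted data becoming inaccessible.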

For CCAE design questions, watch for clues that point to customer-managed keys: regulated industries (healthcare, financial services, government), explicit mentions of “key escrow,” “BYOK,” “HYOK,” or “separation of duties between storage admin and security admin.” All of these are KMIP signals.

Figure 11.2: KMIP Key Management Flow between Cohesity and the External KMS

sequenceDiagram
    participant Cluster as Cohesity Cluster
    participant KMS as KMIP / KMS Server
    participant Disk as SpanFS Chunk File
    Cluster->>Cluster: Generate local DEK
    Cluster->>KMS: Request KEK wrap (KMIP/TLS)
    KMS-->>Cluster: Wrapped DEK released
    Cluster->>Disk: Encrypt chunk with DEK
    Note over Cluster,KMS: On read: cluster requests unwrap
    Cluster->>KMS: Unwrap DEK request
    KMS-->>Cluster: Plaintext DEK (in-memory)
    Cluster->>Disk: Decrypt chunk
    Note over KMS: Revoke KEK = global kill-switch

TLS for Management and Replication

In transit, every interface that leaves the node is TLS-protected. Management UI and REST API traffic uses TLS 1.2 or 1.3 with administrator-installed CA-signed certificates (avoid the self-signed defaults in production). Replication between source and target clusters is encrypted, compressed, and deduplicated on the wire. Cloud archive traffic to S3, Azure Blob, or GCS uses HTTPS with the cloud provider’s TLS endpoint.
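
On the client side, automation that calls the management REST API can enforce the same TLS floor. The sketch below uses Python's standard `ssl` module; the helper name and the optional CA bundle path are assumptions, and the context would be passed to whatever HTTPS client the script uses.

```python
import ssl


def strict_tls_context(ca_bundle=None):
    """Client context that verifies certificates and refuses TLS below 1.2."""
    ctx = ssl.create_default_context(cafile=ca_bundle)  # verification on by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = strict_tls_context()
print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

Pointing `ca_bundle` at the internal CA that signed the cluster certificate avoids the temptation to disable verification against self-signed defaults.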

For the exam, remember that protocols like SMB and NFS — used by SmartFiles — have their own encryption modes. SMB3 supports per-session encryption (SMB Encryption); NFSv4.1 with Kerberos krb5p provides privacy. These are configured per View, not globally, and are an architect’s lever when a tenant needs encrypted client traffic without VPN overhead.

FIPS 140-2 / 140-3 Mode

For US federal customers and many regulated industries, the cluster must operate in FIPS mode, which forces all cryptographic operations through FIPS 140-2 (or the newer 140-3) validated modules. Enabling FIPS mode on a Cohesity cluster:

Restricts cipher suites for TLS to FIPS-approved algorithms.
Disables non-compliant algorithms (e.g., MD5, SHA-1 for signatures).
Forces SEDs to operate in their FIPS-validated configuration.
Requires that the external KMS also be FIPS-validated for end-to-end compliance.

FIPS mode is a cluster-wide setting, not per-View, and is most easily enabled at deployment time. Toggling FIPS on a brownfield cluster is supported but requires a validated change window because some services restart.

Key Takeaway: Cohesity offers parallel encryption paths — software (universal, AES-NI accelerated) and SED (hardware-rooted, physical-only) — both manageable via KMIP-attached KMS. Combined with TLS for all in-transit paths and FIPS 140-2/140-3 mode for regulated workloads, encryption gives the architect complete cryptographic separation between data, keys, and operators. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]

Immutability and DataLock

Encryption protects confidentiality. Immutability protects existence. A ransomware actor who has stolen storage admin credentials can decrypt and re-encrypt as they like — what they cannot do, against an immutable backup, is delete it. Cohesity’s immutability story has two layers: the SpanFS baseline that applies to every snapshot ever taken, and the DataLock policy layer that turns selected snapshots into time-bound WORM objects no one can remove.

SpanFS Baseline Immutability

By default, every Cohesity backup snapshot is stored in SpanFS as a read-only, immutable object. The original “gold copy” cannot be mounted, modified, encrypted in place, or deleted by any external system, application, or process. When a workload needs read/write access to a backup — for instant recovery, dev/test refresh, or sandbox investigation — Cohesity creates a zero-cost, redirect-on-write clone. The clone is read/write; the gold copy is not. This single property defeats the most common ransomware pattern, which is to enumerate backup files and overwrite them in place. [Source: https://www.cohesity.com/blogs/how-backup-immutability-defends-against-ransomware-attacks/]

DataLock Policies and WORM Semantics

DataLock takes baseline immutability and adds a hardened, time-bound, role-bound enforcement layer. A DataLock policy is applied to a Protection Group (or to specific snapshots) by a designated Security Officer — a separate Cohesity role distinct from the regular cluster administrator. Once the policy is applied:

The snapshot enters Write Once, Read Many (WORM) state for the configured retention period.
It cannot be deleted or modified during the lock period by the cluster administrator or by any other account with full privileges. The lock is enforced by the platform itself.
In Compliance mode the lock cannot be shortened, removed, or “talked out of” before the timer expires, even by the Security Officer who applied it; it can only be extended. Governance mode leaves a Security Officer override, as described under Compliance Lock vs. Governance Lock.
DataLock applies equally to copies that are tiered or archived to cloud targets (CloudArchive, FortKnox), so the WORM property follows the data. [Source: https://www.cohesity.com/resources/solution-brief/counter-ransomware-attacks-with-cohesity/]

Think of DataLock as a safe deposit box with a time lock. The bank manager set the timer this morning and locked the door behind you. Even if the manager comes back at noon and wants to open it — even if they are the most senior person in the bank — they physically cannot. The vault opens when the timer says it opens, not before. That is exactly what DataLock does to a snapshot.

Figure 11.3: DataLock Lifecycle States

stateDiagram-v2
    [*] --> Created: Snapshot taken
    Created --> Locked: Security Officer applies DataLock
    Locked --> Locked: Extend retention (allowed)
    Locked --> Expired: Retention timer ends
    Locked --> LegalHold: Legal hold applied
    LegalHold --> Locked: Hold released
    Expired --> Released: Snapshot deletable
    Released --> [*]
    note right of Locked
        WORM enforced
        Cannot be shortened
        Cannot be deleted
    end note
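
The lifecycle in Figure 11.3 can be read as a small state machine. The sketch below is only a study aid: the states and edges are copied from the figure, while the event names (`apply_datalock`, `timer_ends`, and so on) are hypothetical labels and not Cohesity API calls.

```python
from enum import Enum


class State(Enum):
    CREATED = "Created"
    LOCKED = "Locked"
    LEGAL_HOLD = "LegalHold"
    EXPIRED = "Expired"
    RELEASED = "Released"


# Allowed edges, copied from Figure 11.3. Event names are hypothetical labels.
TRANSITIONS = {
    (State.CREATED, "apply_datalock"): State.LOCKED,
    (State.LOCKED, "extend_retention"): State.LOCKED,
    (State.LOCKED, "timer_ends"): State.EXPIRED,
    (State.LOCKED, "apply_legal_hold"): State.LEGAL_HOLD,
    (State.LEGAL_HOLD, "release_hold"): State.LOCKED,
    (State.EXPIRED, "release"): State.RELEASED,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the figure has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.value}") from None
```

Note what is absent from the table: there is no `delete` or `shorten_retention` edge out of `Locked`, so `step(State.LOCKED, "delete")` raises. That missing edge is the WORM property.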

Compliance Lock vs. Governance Lock

DataLock is offered in two flavors that map to two different regulatory postures:

Mode	Who can shorten retention	Typical use case	Regulatory mapping
Governance	Security Officer can shorten or remove the lock	Internal policy, operational immutability	Best-practice data-protection hygiene
Compliance	Nobody — not even the Security Officer	SEC 17a-4, FINRA, HIPAA retention floors	Strict regulatory immutability

Governance mode gives the security team an override path for legitimate exceptions; Compliance mode locks the door even on the Security Officer. For a CCAE scenario question, the giveaway phrase for Compliance mode is anything resembling “regulator requires that no individual, regardless of role, be able to delete records before retention expires.” For Governance mode, the phrase is typically “internal policy” or “data protection officer needs the ability to override in extraordinary circumstances.”
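
The difference between the two modes reduces to one decision rule. The function below is a minimal sketch of that rule as this chapter states it, not Cohesity's implementation. The mode and role strings are hypothetical, and the rule that only the Security Officer may extend a lock is taken from the role-separation guidance later in this chapter.

```python
def retention_change_allowed(mode: str, role: str, current_days: int, new_days: int) -> bool:
    """Decide whether a DataLock retention change is permitted under the chapter's rules."""
    if mode not in ("governance", "compliance"):
        raise ValueError(f"unknown DataLock mode: {mode}")
    if role != "security_officer":
        # Assumption from the chapter: only the Security Officer applies or extends locks.
        return False
    if new_days >= current_days:
        # Extending (or keeping) retention is allowed in both modes.
        return True
    # Shortening is the Governance-only override; Compliance refuses everyone.
    return mode == "governance"
```

For a scenario question, plug in the facts: a Security Officer asking to cut a 2,555-day lock to 365 days succeeds under `"governance"` and fails under `"compliance"`.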

Legal Hold and Snapshot Deletion Approval

Legal hold is a related but distinct workflow. While DataLock is a time-bound lock applied at policy creation time, legal hold is an indefinite freeze applied when litigation or investigation is anticipated. Legal hold extends retention until the hold is explicitly released, regardless of the DataLock timer.

For sensitive operations that fall outside the locked window — deleting a Protection Group, changing a retention policy, removing an external target, or releasing a legal hold — Cohesity supports quorum approval, a multi-person workflow in which the action stays pending until N of M approvers (typically 2 of 3 or 3 of 5) have signed off. Quorum approval is the architectural answer to “what if a single privileged account is compromised?” The compromised account can request the deletion; it cannot finish it without independent approval from accounts the attacker does not control. For DR and DataLock removal, quorum approval is typically combined with MFA-enforced logins to make credential theft alone insufficient.
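
The N-of-M workflow can be modeled in a few lines. This is an illustrative sketch with hypothetical class and account names. It also assumes the requester's own approval does not count, which is one reasonable reading of "independent approval" and not a documented product rule.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumRequest:
    action: str
    requester: str
    approvers: frozenset          # the M accounts allowed to approve
    required: int                 # the N approvals needed
    approvals: set = field(default_factory=set)

    def approve(self, user: str) -> None:
        if user not in self.approvers:
            raise PermissionError(f"{user} is not an approver")
        if user == self.requester:
            # Assumption: a requester cannot approve their own request.
            raise PermissionError("requester cannot approve their own request")
        self.approvals.add(user)  # a set, so repeat approvals count once

    @property
    def approved(self) -> bool:
        return len(self.approvals) >= self.required
```

A compromised `admin1` account can create the request, but `approved` stays `False` until two distinct approvers sign off.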

Key Takeaway: SpanFS makes every snapshot immutable; DataLock makes selected snapshots undeletable for a specified duration. Governance mode allows a controlled Security Officer override; Compliance mode binds everyone, including the Security Officer who applied the lock. Combined with legal hold and quorum approval, immutability transforms backups from a deletable target into a durable, time-locked recovery substrate. [Source: https://www.cohesity.com/blogs/guarding-against-ransomware-requires-more-than-just-detection/]

Cyber Vaulting with Cohesity FortKnox

DataLock prevents deletion. It does not, by itself, protect against scenarios in which an attacker has unfettered network access to the cluster and unlimited time. For that, the architect needs isolation — physical and operational separation between the production estate and a tertiary copy of the data. Cohesity FortKnox is the answer.

What FortKnox Is

FortKnox is a SaaS-delivered cyber vault — Cohesity calls it Data Isolation and Recovery as a Service, or DIRaaS — that stores an immutable, isolated tertiary copy of backup data in a Cohesity-managed cloud tenant on AWS, Azure, or Google Cloud. The customer does not deploy or maintain vault infrastructure; they subscribe, point source clusters at the service, and configure vaulting policies. [Source: https://www.cohesity.com/resources/datasheet/cohesity-fortknox/]

If FortKnox is the Swiss bank vault — a service operated for you by a separate institution, in a separate jurisdiction, with multi-person approval to open the door — then CloudArchive is the warehouse rented across town: cheap, capacious, and reachable any time you have the key.

The Virtual Air Gap

The defining architectural feature of FortKnox is the virtual air gap. The network connection between the source Cohesity cluster and the FortKnox vault is opened only during a configurable transfer window, just long enough to ingest a new vaulted copy, and is then severed. Outside the transfer window, the vault is operationally unreachable from the production cluster — there is no live network path an attacker can ride from a compromised cluster to the vaulted copies. This contrasts with CloudArchive’s persistent connection, which is appropriate for ongoing tiering and retention but offers no isolation guarantee. [Source: https://www.cohesity.com/blogs/going-beyond-the-air-gap-data-isolation-and-recovery-for-the-modern-era/]
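
The transfer-window logic is easy to reason about as a time-range check. The helper below is a hypothetical sketch, not FortKnox code. It treats the window as half-open and handles windows that cross midnight, which matters when a customer schedules vaulting for something like 23:00 to 01:00.

```python
from datetime import time


def window_open(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside [start, end); supports windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    # Wrapping window, e.g. 23:00 to 01:00.
    return now >= start or now < end
```

With a 02:00 to 04:00 window, `window_open(time(3, 0), time(2, 0), time(4, 0))` is `True`, and at 04:00 exactly the gap is closed again.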

Figure 11.4: FortKnox Cyber Vault Flow with Virtual Air Gap and Quorum Recovery

flowchart TD
    SRC[Source Cohesity Cluster] --> WIN{Transfer<br/>Window Open?}
    WIN -->|02:00-04:00| OPEN[Air Gap Opens]
    OPEN --> REPL[Replicate Snapshot<br/>to FortKnox SaaS Vault]
    REPL --> CLOSE[Air Gap Closes]
    CLOSE --> VAULT[(Isolated<br/>WORM Vault<br/>AWS / Azure / GCP)]
    VAULT --> RREQ[Recovery Request]
    RREQ --> QUORUM{Quorum<br/>2-of-3 Approved?}
    QUORUM -->|No| DENY[Operation Blocked]
    QUORUM -->|Yes| RECOVER[Recover to Source /<br/>Alternate Cluster /<br/>Cloud Target]

Mandatory Multi-Person Quorum

FortKnox enforces multi-person quorum approval for sensitive operations — recoveries, retention changes, vault configuration changes — at the vault level. Typically two or more authorized users must approve before the operation proceeds. The control is purpose-built for insider-threat and stolen-credential scenarios: a single privileged account is never enough to exfiltrate, destroy, or release vaulted data. [Source: https://aws.amazon.com/blogs/apn/defending-against-ransomware-with-aws-and-cohesity-fortknox/]

Defense Layers Inside the Vault

Once data lands in FortKnox, it inherits and extends Cohesity’s defense model:

Physical and tenant separation. The vault lives in a Cohesity-managed cloud tenant, in a separate trust domain from the customer’s production cluster.
Network and operational isolation. Virtual air gap, separate identity plane, separate management.
WORM immutability. Every vaulted snapshot is locked, with the same DataLock semantics that apply on-prem.
ML-based anomaly detection. Anomaly scoring runs on vaulted data, not just on production backups.
Flexible recovery targets. Recover to the original source cluster, an alternate cluster, or directly into a target cloud, supporting a wide range of disaster scenarios.

FortKnox vs. CloudArchive

This comparison is one of the most testable items in the chapter:

Dimension	FortKnox (Cyber Vault SaaS)	CloudArchive
Primary purpose	Isolated, immutable cyber-recovery vault	Long-term cloud tiering / archive
Connectivity	Virtual air gap; network open only during transfer windows	Persistent connection to cloud target
Approval model	Mandatory multi-person quorum for recoveries / critical actions	Standard MFA + RBAC, no vault-level quorum
Operating model	Cohesity-managed SaaS (DIRaaS), no customer infrastructure	Customer-configured external target (S3/Blob/GCS)
Cloud	AWS, Azure, GCP — Cohesity tenant	Customer’s own buckets in any supported cloud
Use case	Ransomware/cyber-recovery “third copy” in 3-2-1-1-0	Cost-optimized long-term retention and compliance archive
Cost profile	Subscription, premium for isolation + service	Storage + egress, customer-controlled

A useful CCAE heuristic: when the scenario uses words like “cyber recovery,” “isolated copy,” “air gap,” “ransomware blast radius,” or “regulatory mandate to keep an offline copy,” the answer is FortKnox. When the scenario uses words like “long-term retention,” “7-year archive,” “tape replacement,” or “cold storage,” the answer is CloudArchive. They are complementary; many enterprises use both.

Key Takeaway: FortKnox is a SaaS cyber vault (DIRaaS) that adds three controls CloudArchive does not have — virtual air gap, mandatory multi-person quorum, and Cohesity-managed isolation across AWS/Azure/GCP — making it the right answer for ransomware-resilient tertiary copies. CloudArchive remains the right answer for long-term retention and cost optimization. [Source: https://www.cohesity.com/blogs/cohesity-fortknox-is-now-available-on-google-cloud/]

Ransomware Detection and Recovery with DataHawk

Encryption protects confidentiality, immutability prevents deletion, and FortKnox provides isolation, but none of these tells you that an attack is happening. Detection and clean-room recovery are the job of Cohesity DataHawk.

What DataHawk Does

DataHawk is the AI/ML-driven security service inside the Cohesity Data Cloud. It packages three capabilities — ransomware anomaly detection, threat intelligence-based malware hunting, and BigID-powered data classification — into a single SaaS offering whose job is to answer the three questions that arise during any cyber incident:

Is there an attack in progress? Anomaly detection.
Where is the malware, and which recovery point is clean? Threat intelligence and YARA scanning.
What sensitive data was exposed? BigID classification.

[Source: https://www.cohesity.com/blogs/introducing-cohesity-datahawk/]

Figure 11.5: DataHawk Three-Pillar Architecture

graph TD
    DH[Cohesity DataHawk<br/>AI/ML Security SaaS]
    DH --> AD[Anomaly Detection]
    DH --> TI[Threat Intelligence]
    DH --> CL[BigID Classification]
    AD --> AD1[Entropy analysis]
    AD --> AD2[Change-rate baselines]
    AD --> AD3[Clean snapshot recommendation]
    TI --> TI1[100K+ IOCs daily]
    TI --> TI2[YARA + CrowdStrike feeds]
    TI --> TI3[Malware hash matching]
    CL --> CL1[200+ patterns]
    CL --> CL2[50+ compliance policies]
    CL --> CL3[PII / PHI / PCI / GDPR]

Anomaly Detection via Entropy and Change Rate

DataHawk continuously analyzes backup snapshots and produces an anomaly strength score for each one based on machine-learning models trained on per-workload baseline behavior. The features the models inspect include:

Data entropy. Encrypted-in-place data has a near-uniform byte distribution; normal application data does not. A sudden rise in average entropy across a snapshot is a strong signal of mass encryption.
File and object change rates. A workload that normally changes 2% of its files per night and suddenly changes 80% is likely under attack, not getting busier.
Write/modification patterns. Bulk file extension changes (e.g., everything ending .locked or .crypt) and sudden new file creation/deletion patterns are flagged.
Per-workload baselines. A SQL transaction log workload has very different normal behavior than a user fileshare; DataHawk learns each.

Anomalous snapshots are flagged in the anti-ransomware dashboard. The same scoring drives the clean snapshot recommendation that points administrators at the last-known-good recovery point, which is critical because a naive restore from “the most recent backup” will often restore the encryption itself. [Source: https://www.cohesity.com/blogs/cohesity-ransomware-detection-machine-learning-models/]
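
To see why entropy is such a useful feature, it helps to compute it. The sketch below calculates Shannon entropy in bits per byte and compares two signals against a baseline. It is not DataHawk's model: the function names and the thresholds (`entropy_jump`, `change_multiplier`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits per byte: 0.0 for constant data, 8.0 for a perfectly uniform byte distribution."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def anomaly_signals(entropy, baseline_entropy, changed, baseline_changed,
                    entropy_jump=1.0, change_multiplier=10.0):
    """Compare a snapshot to its workload baseline. Thresholds are hypothetical."""
    return {
        "entropy_spike": entropy - baseline_entropy >= entropy_jump,
        "change_rate_spike": changed >= change_multiplier * baseline_changed,
    }
```

Comparing against a per-workload baseline, not a fixed cutoff, is the important design choice: compressed imaging data is high-entropy on a normal day, so only the jump is suspicious.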

Threat Intelligence and YARA

Rather than asking customers to author and maintain their own YARA rules, DataHawk ships an automated, continuously updated threat-intelligence feed of more than 100,000 indicators of compromise (IOCs) refreshed daily from 160,000+ sources, including curated YARA rules, CrowdStrike Falcon Intelligence, and Cohesity-curated default libraries. When DataHawk scans backup data, it identifies:

Malware hashes present in the snapshot.
The specific files that contain them.
The variant or family involved.

This converts “we were hit, restore everything from yesterday” into “we were hit, here are the 312 infected files and the time window of compromise — restore those specifically.” [Source: https://www.cohesity.com/blogs/cohesity-datahawk-continuing-the-ai-ml-transformation-of-data-security-and-management/]
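
Hash-based IOC matching, the simplest of the techniques above, can be sketched with the standard library. This is a conceptual illustration only: the function name, the in-memory `files` mapping, and the IOC set are hypothetical, and a real scanner would stream file contents and also apply YARA rules.

```python
import hashlib


def find_ioc_matches(files: dict, ioc_hashes: set) -> list:
    """Return the paths whose SHA-256 digest appears in the IOC hash set.

    files maps a path to that file's bytes; ioc_hashes holds lowercase hex digests.
    """
    return sorted(
        path
        for path, content in files.items()
        if hashlib.sha256(content).hexdigest() in ioc_hashes
    )
```

The output is a list of specific paths, which is what makes a targeted restore possible.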

BigID-Powered Data Classification

DataHawk integrates the BigID classification engine to discover and classify sensitive data inside backup snapshots. The engine combines regular expressions, named-entity recognition, AI/ML classifiers, 200+ predefined patterns, and 50+ out-of-the-box compliance policies (PII, PHI/HIPAA, PCI payment data, GDPR, and more).

After an anomaly hits or malware is found, classification reports tell the responder exactly which categories of sensitive data lived in the affected files. This is essential for:

Breach notification timelines (HIPAA 60 days, GDPR 72 hours).
Regulatory reporting accuracy.
Incident scoping (“did the attacker reach PHI?”).

[Source: https://www.cohesity.com/platform/data-classification/]
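
Pattern-based classification, the regular-expression part of the engine described above, works roughly as follows. The two patterns here are deliberately simple, hypothetical examples and are not BigID's. A production classifier adds validation, context, and ML scoring to cut false positives.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify(text: str) -> set:
    """Return the names of every pattern that matches somewhere in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}
```

Running `classify` over the files flagged by an anomaly or malware scan answers the scoping question directly: which categories of sensitive data were in the blast radius.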

Clean-Room Recovery Pattern

Even with isolated, immutable, classified backups, restoring straight back to production is risky: the malware may still be active in the production environment or dormant inside the restored data. The clean-room recovery pattern is:

Identify the clean recovery point using DataHawk’s ML recommendation.
Provision an isolated environment — an alternate cluster, an alternate VLAN, or a recovery-only AWS/Azure VPC.
Restore from FortKnox or DataLock-protected snapshot into the clean room.
Run threat intelligence scans, AV scans, and integrity validation against the restored data.
Cut over only after validation passes; otherwise iterate further back in time.

The clean-room is the “operating room” of recovery: sterile, instrumented, and isolated until you are sure the patient is no longer contagious.
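
The "iterate further back in time" step in the pattern above is a simple loop. In this sketch the function name is hypothetical and `validate` stands in for whatever scans and integrity checks the clean room runs. It assumes the snapshot list is ordered newest first.

```python
def find_clean_restore_point(snapshots, validate):
    """Walk snapshots from newest to oldest and return the first one that passes validation.

    snapshots: snapshot identifiers, newest first.
    validate: callable returning True when the restored data passes all clean-room checks.
    Returns None if no snapshot validates.
    """
    for snapshot in snapshots:
        if validate(snapshot):
            return snapshot
    return None
```

Starting from the newest candidate keeps data loss as small as possible; a `None` result means the retention window does not reach back before the compromise.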

Key Takeaway: DataHawk turns backups into a security telemetry source with three layered capabilities — entropy/change-rate anomaly detection, daily-refreshed threat intelligence with 100K+ IOCs, and BigID classification using 200+ patterns and 50+ policies. Combined with clean-room recovery, this implements the NIST Detect/Respond/Recover functions across the entire backup estate. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]

Hardening, Compliance, and Defense-in-Depth

The cluster itself must be hardened before any of the above features matter. A weakly secured admin plane is the easiest way for an attacker to dismantle the rest.

Identity, MFA, and Quorum

Strong identity. Integrate with AD or LDAP; do not run on local accounts in production. SAML SSO with an enterprise IdP (Okta, Azure AD, Ping) is preferred for centralized lifecycle management.
MFA enforcement. Mandatory for all administrative roles, including the Security Officer. MFA defeats the most common credential-theft attack pattern.
Role separation. Cluster admin, Security Officer, audit reviewer, and tenant operators are distinct roles. The Security Officer is the only role that can apply or extend DataLock policies.
Quorum approval. Required for any operation that could destroy data — DataLock removal, retention shortening, vault configuration changes, Protection Group deletion.

Audit Logging and SIEM Integration

Every administrative action and every security-relevant event is logged. Architects should:

Forward audit logs to an enterprise SIEM (Splunk, Sentinel, Chronicle, QRadar) via Syslog or webhook.
Retain logs in the SIEM beyond the cluster’s retention window.
Build alerts on high-risk events: DataLock policy changes, quorum approval requests, failed MFA attempts, role assignment changes, KMIP fetch failures.
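
The alerting rule in the last bullet amounts to a filter over the audit stream. The event type strings and the dictionary shape below are hypothetical, since the real field names depend on how the SIEM parses Cohesity audit records.

```python
# Hypothetical event type names mirroring the high-risk events listed above.
HIGH_RISK_TYPES = {
    "datalock_policy_change",
    "quorum_approval_request",
    "mfa_failure",
    "role_assignment_change",
    "kmip_fetch_failure",
}


def high_risk_events(events):
    """Return only the audit events whose type is on the high-risk list."""
    return [event for event in events if event.get("type") in HIGH_RISK_TYPES]
```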

Compliance Frameworks

Cohesity is engineered to map cleanly to multiple frameworks. Key alignment points for CCAE:

Framework	Cohesity controls that support it
HIPAA	DataLock (retention floors), encryption at rest/in transit, audit logging, BigID PHI classification
PCI-DSS	FIPS-validated encryption, KMIP/KMS separation, MFA, role separation
FedRAMP	FIPS 140-2/3 mode, audit logging, identity integration with FedRAMP-authorized IdPs
SEC 17a-4 / FINRA	DataLock Compliance mode (WORM), legal hold, immutable audit trail
GDPR	BigID classification (PII), data subject access via search, retention controls

Worked Example: 500 TB Hospital Layered Defense

To make all this concrete, work through a realistic CCAE-style scenario. A regional hospital system has 500 TB of front-end data: 280 TB of imaging (PACS), 120 TB of EHR databases, 60 TB of clinician fileshares, and 40 TB of M365 (Exchange, OneDrive, SharePoint). Compliance requirements include HIPAA, a state-level seven-year retention floor for medical records, and a board-mandated ransomware recovery posture after a peer hospital was hit last year. RPO 4 hours, RTO 24 hours for clinical systems.

Step 1 — Cluster sizing and encryption. Provision a Cohesity all-flash cluster on FIPS-validated SED-equipped ReadyNodes at the primary site, sized for 500 TB of front-end data (FETB) with a 3% daily change rate and erasure coding (4:2). Enable FIPS 140-2 mode at deployment. Configure a KMIP integration with the customer’s existing Thales CipherTrust Manager so that all DEKs are wrapped by KEKs that live outside the cluster; pulling a node, or even pulling all the disks, yields only ciphertext.
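
Two quick figures fall out of the Step 1 inputs. The helpers below are a back-of-the-envelope sketch with hypothetical names. They deliberately ignore deduplication, compression, and retention, so they are not a substitute for a real sizing exercise.

```python
def daily_change_tb(front_end_tb: float, change_pct: float) -> float:
    """Front-end data that changes per day, before any data reduction."""
    return front_end_tb * change_pct / 100


def ec_raw_overhead(data_stripes: int, parity_stripes: int) -> float:
    """Raw capacity consumed per unit of stored data under erasure coding."""
    return (data_stripes + parity_stripes) / data_stripes
```

For the hospital, `daily_change_tb(500, 3)` gives 15 TB of changed data per day, and `ec_raw_overhead(4, 2)` gives 1.5, meaning each stored terabyte consumes 1.5 TB of raw capacity.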

Step 2 — Protection policies. Create three policies: Clinical-Tier1 (4-hour RPO, 14 daily / 4 weekly / 12 monthly / 7 yearly retention), Fileshares-Tier2 (24-hour RPO, 30 daily / 12 monthly / 7 yearly), and M365-Tier3 (daily, 7 yearly). All three carry DataLock in Compliance mode for the seven-year retention floor required by state medical-records law — neither the cluster admin nor the Security Officer can release records before the timer expires.

Step 3 — Replication and tertiary copy. Replicate to a secondary Cohesity cluster at a regional DR site for fast same-day recovery. In parallel, configure FortKnox vaulting on AWS for daily clinical and weekly fileshare snapshots. Configure the FortKnox transfer window for 02:00–04:00 local; outside that window, the virtual air gap is closed. Configure a quorum policy of 2-of-3 approvers (CISO, VP Infrastructure, Compliance Officer) for any FortKnox recovery, retention change, or vault configuration change.

Step 4 — Detection and classification. Subscribe to DataHawk. Anomaly detection runs against every backup; the anti-ransomware dashboard is reviewed daily by the SecOps team. Threat intelligence scans every snapshot for the 100,000+ IOCs in the daily-refreshed feed. BigID classification is configured with HIPAA, PII, and PCI policies and runs against PACS, EHR, and fileshare snapshots so that any incident can be scoped against actual PHI exposure within the 60-day breach notification clock.

Step 5 — Hardening. All admin accounts are sourced from Azure AD via SAML SSO with mandatory MFA. The Security Officer role is held by the CISO and the deputy CISO only. Audit logs stream to Microsoft Sentinel; alerts fire on DataLock changes, FortKnox quorum requests, and KMIP fetch failures. Quorum approval is enabled for Protection Group deletion, retention shortening, and external-target removal.

Step 6 — Recovery rehearsal. Quarterly, the team performs a clean-room recovery rehearsal: spin up an isolated VLAN, restore a representative EHR database from FortKnox using the DataHawk-recommended clean snapshot, run AV and integrity scans, validate database consistency, and confirm RTO. The runbook is owned by the SRE team and reviewed by the CISO.

The result is a layered defense — encryption at the disk and key-management layer, immutability at the snapshot layer, isolation at the FortKnox vault layer, detection at the DataHawk layer, and identity hardening across all admin paths — that survives a full compromise of the production cluster and can demonstrably recover within 24 hours.

Chapter Summary

Security is the central design axis of a modern Cohesity architecture, not an afterthought. The chapter built up a five-layer defense:

Encryption — software AES via SpanFS or hardware via SEDs, both keyed through customer-controlled KMIP/KMS, all over TLS, optionally in FIPS 140-2/140-3 mode.
Immutability — SpanFS baseline plus DataLock (Compliance or Governance), applied by the Security Officer role; in Compliance mode the lock cannot be removed or shortened during the lock window.
Isolation — FortKnox SaaS cyber vault with virtual air gap, mandatory multi-person quorum, and Cohesity-managed tenancy on AWS, Azure, or GCP.
Detection and classification — DataHawk’s anomaly detection (entropy, change rate), threat intelligence (100K+ IOCs daily-refreshed), and BigID classification (200+ patterns, 50+ policies).
Hardening — MFA, SSO, role separation, quorum approval, audit logging to SIEM, and clean-room recovery rehearsals.

For the CCAE exam, focus on the precise distinctions: software vs. SED encryption, Governance vs. Compliance DataLock, FortKnox vs. CloudArchive, and the order in which DataHawk’s three capabilities answer the three incident questions. Architecting these layers together — not just turning each on individually — is what defines an expert-level design.

Key Terms

DataLock — Cohesity’s time-bound WORM immutability policy, applied by a Security Officer to Protection Groups or snapshots. In Compliance mode it cannot be removed during the lock window, even by the Security Officer who applied it.
WORM (Write Once, Read Many) — Storage semantics in which data, once written, cannot be modified or deleted until a retention timer expires.
KMIP (Key Management Interoperability Protocol) — OASIS-standard protocol Cohesity uses to fetch and manage keys from external KMS providers (Thales, Entrust, HashiCorp, etc.).
FIPS — Federal Information Processing Standard; FIPS 140-2 and the newer 140-3 specify validated cryptographic modules required for US federal and many regulated workloads.
DataHawk — Cohesity’s AI/ML SaaS security service combining anomaly detection, threat intelligence (100K+ IOCs daily-refreshed), and BigID classification (200+ patterns, 50+ policies).
FortKnox — Cohesity’s SaaS cyber vault (Data Isolation and Recovery as a Service / DIRaaS) on AWS, Azure, or GCP, featuring a virtual air gap and mandatory multi-person quorum.
Anomaly detection — DataHawk’s ML-driven analysis of entropy, change rate, and write patterns to flag suspicious snapshots and recommend the last-known-good recovery point.
Cyber Vault — An isolated, immutable tertiary copy of backup data, designed to survive full compromise of the production environment and provide a clean recovery substrate.
Quorum approval — Multi-person approval workflow (e.g., 2-of-3) required for sensitive operations such as DataLock removal, FortKnox recoveries, or Protection Group deletion, ensuring a single compromised account cannot destroy data.

Chapter 12: SmartFiles: Files, Objects, and Unstructured Data Services

For most enterprises, the largest pool of “data sprawl” is unstructured: engineering home directories, render farm scratch, surveillance video, M&E project folders, genomics datasets, build artifacts, and a rapidly growing tide of S3 buckets that started as developer experiments and ended up in production. Cohesity SmartFiles is the product that turns the same DataPlatform you already use as a backup target into a primary, multi-protocol unstructured-data service. For the CCAE candidate, SmartFiles is not a separate appliance to learn; it is a consumption mode of the cluster you have already designed. Carry the SpanFS-and-View-Box mental model forward from Chapter 2, and SmartFiles becomes mostly a question of policy choices: which protocols, which QoS, which quotas, which tier-down rules.

This chapter walks the architecture from the bottom (SpanFS) up through the View, the View Box / Storage Domain, the protocol surface (SMB3, NFSv3/v4, S3), the governance layer (quotas, QoS, tiering), the data services (snapshots, replication, audit, ICAP AV), and finally the migration playbook for replacing or absorbing legacy NetApp and Dell/EMC Isilon estates.

Learning Objectives

By the end of this chapter you will be able to:

Architect SmartFiles for primary file and object workloads on top of an existing Cohesity cluster, choosing the right View Box and Storage Domain shape per workload class.
Compare SMB3, NFSv3/v4, and S3 access semantics on a Cohesity View, including cross-protocol identity mapping and locking.
Apply quotas, QoS policies, and hot/cold tiering policies to Views and explain why QoS must be selected up front.
Design data protection for SmartFiles, including snapshots, DR replication, file audit logging, and ICAP-based antivirus scanning.
Plan migrations from legacy NetApp or Isilon using a combination of cold-data tiering, the Cohesity NAS File Migration Service, and backup-driven cross-filer restores.

SmartFiles Architecture

From SpanFS to View: One File System, Many Faces

SmartFiles is not a separate product riding on top of Cohesity — it is a way of consuming SpanFS, the same distributed file system that holds backup snapshots, archived databases, and replicated VMs. SpanFS exposes a single global namespace across every node in the cluster with strict consistency, and it stripes data across distributed volumes so there is no single point of failure [Source: https://www.cohesity.com/platform/spanfs/]. That single-namespace property is what lets the same logical object be a file to an NFS client, a share to an SMB client, and an object to an S3 client at the same time, without copying data into protocol-specific silos [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].

The central SmartFiles construct is the View. A View is a logical container that lives inside a View Box (Storage Domain) and that can be exposed simultaneously as:

An NFS export, with mount-point / volume semantics (NFSv3 and NFSv4 are both supported).
An SMB share, with Windows file share semantics (SMB3 with signing and encryption).
An S3 bucket, where files become S3 objects keyed by their path in the View.

A single View, in other words, is a file share and a bucket and an NFS export — pointing at the same underlying SpanFS objects. A file written by an SMB client appears as an object in the S3 namespace of the same View, and as an NFSv4 file at the same logical path, with permissions translated across protocols using AD/LDAP identity mapping [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf].
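
The idea that an S3 object is "keyed by its path in the View" can be pictured as a path normalization. The function below is a conceptual sketch, not the documented Cohesity mapping: its name, its handling of backslashes, and its stripping of the leading slash are all assumptions.

```python
from pathlib import PurePosixPath


def s3_key_for(path_in_view: str) -> str:
    """Derive an S3-style object key from a file path relative to the View root."""
    normalized = PurePosixPath("/", path_in_view.replace("\\", "/"))
    # S3 keys conventionally have no leading slash.
    return str(normalized).lstrip("/")
```

Under these assumptions an SMB client writing `projects\q3\report.csv` and an NFS client writing `/projects/q3/report.csv` both land on the key `projects/q3/report.csv`.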

Figure 12.1: SmartFiles architecture from SpanFS up through the protocol surface.

flowchart LR
    SpanFS[SpanFS<br/>Distributed File System<br/>Single Global Namespace]
    VB[View Box /<br/>Storage Domain<br/>Policy Boundary]
    V[View<br/>Logical Container]
    SMB[SMB3<br/>Windows Shares]
    NFS[NFSv3 / NFSv4<br/>UNIX Mounts]
    S3[S3<br/>Object Buckets]

    SpanFS --> VB
    VB --> V
    V --> SMB
    V --> NFS
    V --> S3

    SMB -.same data.-> NFS
    NFS -.same data.-> S3

Analogy: The Multilingual Restaurant Menu. Think of a View as the menu in a multilingual restaurant. The food in the kitchen — the actual SpanFS data — is the same regardless of which language you order in. The English menu is SMB3, the French menu is NFSv4, and the Mandarin menu is S3. A vegetarian customer (an ACL) asking for “no meat” gets the same treatment whether they say it in English or French because the kitchen has one set of dietary rules, not three. Legacy NAS architectures, by contrast, run three separate kitchens and pretend the menus are translations of each other. SmartFiles runs one kitchen.

View Boxes / Storage Domains: The Policy Boundary

The View Box — which newer documentation calls a Storage Domain — is the policy container for the Views inside it. The View Box is where you define:

Storage efficiency: deduplication on/off, inline vs. post-process, compression algorithm.
Resiliency: Replication Factor 2/3 or erasure coding (e.g., 4:2, 6:2).
Encryption: software vs. SED, KMS provider.
Tiering policy: which cloud or remote tier cold blocks move to.
Default quotas and quota alert limits for child Views.

Views inherit these settings from their parent View Box. They are scoped to the View Box’s available capacity unless overridden with explicit per-View quotas [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. A common architectural pattern is to maintain at least two Storage Domains — one tuned for backup landing (high dedupe, erasure-coded, HDD-biased) and one tuned for primary file/object workloads (SSD-biased, lower dedupe ratio target) — so that backup ingest cannot starve a busy SmartFiles user share.

+--------------------------------------------------------------+
| Cohesity Cluster (SpanFS, single global namespace)           |
|                                                              |
|  +----------------------+    +-------------------------+     |
|  | Storage Domain:      |    | Storage Domain:         |     |
|  | "BackupTarget"       |    | "SmartFiles-Primary"    |     |
|  | RF2 + EC 4:2         |    | RF2, SSD-biased         |     |
|  | Inline dedupe        |    | Post-process dedupe     |     |
|  |                      |    |                         |     |
|  |  View: vmware-bk     |    |  View: media-projects   |     |
|  |  View: oracle-bk     |    |  View: build-artifacts  |     |
|  |  View: m365-bk       |    |  View: home-dirs        |     |
|  +----------------------+    +-------------------------+     |
+--------------------------------------------------------------+
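
The inheritance rule, where a View takes its Storage Domain's default unless it carries an explicit override, can be sketched with two small classes. The class and field names are hypothetical and only the quota setting is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageDomain:
    name: str
    default_quota_gib: Optional[int]  # None means limited only by domain capacity


@dataclass
class View:
    name: str
    domain: StorageDomain
    quota_gib: Optional[int] = None   # explicit per-View override, if any

    def effective_quota_gib(self) -> Optional[int]:
        """Per-View quota wins; otherwise fall back to the Storage Domain default."""
        if self.quota_gib is not None:
            return self.quota_gib
        return self.domain.default_quota_gib
```

A `home-dirs` View with no override inherits the `SmartFiles-Primary` default, while `media-projects` can carry its own larger quota without changing the domain.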

Protocol Surface: SMB3, NFSv3/v4, S3

SmartFiles supports several deliberate combinations of protocol exposure on a single View [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf]:

Mode	NFS	SMB	S3	Typical Use
Multi-protocol R/W	R/W	R/W	Read-only	Existing NAS workload whose data modern apps consume through the S3 API without write access
File-only	R/W	R/W	Off	Pure user/home-directory or build farm consolidation
S3-only	Off	Off	R/W	Cloud-native app, Splunk SmartStore target, container backups
Writable S3 clone	R/W (live View)	R/W (live View)	R/W (instant clone)	Analytics or ML pipelines that need a writable object copy without disturbing the live file workload

The “writable S3 clone” pattern is worth pausing on. Allowing parallel writes from S3 against a live NFS/SMB View creates locking and consistency problems that no amount of identity mapping can paper over (S3 has no real notion of a byte-range lock). Cohesity sidesteps this by spawning an instant clone — a zero-copy SnapTree clone of the View that is exposed as a writable S3 bucket. The original file workload stays clean, and the analytics pipeline gets its own object-writable namespace [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].

Cross-Protocol Identity and Locking

Multi-protocol access only works if identity translates cleanly between worlds:

NFS uses UID/GID numbers (and, in NFSv4, optionally name@domain principals).
SMB3 uses Windows Security Identifiers (SIDs) and Kerberos principals.
S3 uses bucket policies and IAM-style access keys.

Cohesity bridges these using AD/LDAP integration: an AD user maps to a SID for SMB3 access and to a UID/GID for NFS access on the same View, so a file created over SMB by CONTOSO\alice is owned by the corresponding POSIX UID when seen from NFS. S3 access uses access keys that can be tied to the same identity directory.
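
The SID-to-UID bridge amounts to a directory lookup that returns a different identifier per protocol. The sketch below uses a hard-coded dictionary in place of AD/LDAP. The SID, UID, and GID values are made up, and the function name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    sid: str   # Windows Security Identifier, used for SMB access
    uid: int   # POSIX user ID, used for NFS access
    gid: int   # POSIX group ID, used for NFS access


# Stand-in for an AD/LDAP lookup; all values are invented for illustration.
DIRECTORY = {
    "CONTOSO\\alice": DirectoryEntry(
        sid="S-1-5-21-1111111111-2222222222-3333333333-1104", uid=10104, gid=10000
    ),
}


def owner_for(principal: str, protocol: str):
    """Return the owner identifier a client of the given protocol would see."""
    entry = DIRECTORY[principal]
    if protocol == "smb":
        return entry.sid
    if protocol == "nfs":
        return (entry.uid, entry.gid)
    raise ValueError(f"unsupported protocol: {protocol}")
```

The point of the model is that there is one directory record per person, so ownership stays consistent no matter which protocol wrote the file.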

Cross-protocol locking is enforced inside SpanFS: if an SMB3 client holds an exclusive oplock on a file, an NFS client attempting a conflicting access is blocked or denied per the locking semantics SpanFS exposes. Architects designing mixed-Linux/Windows workloads should still test the specific lock contention patterns of their applications — multi-protocol locking removes the surprise but does not magically remove the contention.

Figure 12.2: Multi-protocol access — same data, three protocols, one View.

sequenceDiagram
    participant SMB as SMB3 Client<br/>(Windows)
    participant View as Cohesity View<br/>(SpanFS)
    participant NFS as NFSv4 Client<br/>(Linux)
    participant S3 as S3 Client<br/>(Cloud App)

    SMB->>View: Write file \\share\report.csv (CONTOSO\alice)
    View->>View: Map SID to UID/GID via AD/LDAP
    View->>View: Persist to SpanFS, dedupe + compress
    NFS->>View: Read /mnt/share/report.csv
    View-->>NFS: Same bytes, POSIX UID-owned
    S3->>View: GET bucket/report.csv
    View-->>S3: Same bytes via S3 API
    Note over View: One file. One copy on SpanFS.<br/>Three protocol faces.

Performance Characteristics

Because all three protocols land in the same View, every byte benefits from the same SpanFS data services [Source: https://futurumgroup.com/wp-content/uploads/documents/EGL2_Cohesity_SmartFiles-2.pdf]:

Global, variable-length deduplication across the entire cluster, not just within a share.
Compression (including Zstandard).
Unlimited zero-copy snapshots and clones via SnapTree.
Multi-tier placement across NVMe/SSD, HDD, and S3-compatible cloud, transparently to clients.

The architectural payoff is one of the biggest selling points against scale-out NAS competitors: a file that is also an S3 object is dedup’d, compressed, snapshotted, and tiered exactly once.

Key Takeaway: SmartFiles is built on SpanFS, with the View as the multi-protocol logical container and the View Box / Storage Domain as the policy boundary. The same View can expose SMB3, NFSv3/v4, and S3 against the same data, with AD/LDAP-mediated identity mapping and SpanFS-enforced cross-protocol locking. Data services (dedupe, compression, snapshots, tiering) apply once at the SpanFS layer regardless of which protocol the client used.

Quotas, QoS, and Tiering

Once a View exists, three governance knobs determine whether it stays a good neighbor on a multi-tenant cluster: quotas (capacity), QoS (performance), and tiering (placement). A fourth, Storage Domain defaults, sets the floor for all of them.

Quotas: Capacity Governance at View and User Scope

SmartFiles supports both per-View quotas and per-user quotas inside a View, with audit logs of usage and Helios REST endpoints (getViewUserQuotas, top quotas by usage) to drive reporting and chargeback [Source: https://developers.cohesity.com/v1-helios-latest/reference/getviewuserquotas-1].

Storage-Domain defaults are typically configured via the CLI parameters default-view-quota (in GiB) and default-view-quota-alert-limit. Newly created Views inherit these defaults unless an architect overrides them at the View level [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].

A subtlety the CCAE exam can test: Cohesity’s public documentation does not sharply distinguish “soft” from “hard” quotas in the NetApp sense (with grace periods, etc.). Instead, you should think of:

The quota itself as the enforced cap.
The alert-limit as the soft trigger — the value at which an operator gets a warning.

Set the alert below the quota by a comfortable margin (e.g., alert at 80%, quota at 100%) so that operators have time to react before writes start failing [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/academy/cohesity-smartfiles-administration-en.pdf]. Storage Domain capacity itself is the hard ceiling: per-View quotas are governance, not protection against a runaway domain.

Figure 12.3: Quota and policy hierarchy — Storage Domain default to View override to user quota.

graph TD
    SD[Storage Domain<br/>default-view-quota = 10 TiB<br/>alert-limit = 8 TiB]
    V1[View: home-dirs<br/>inherits domain default]
    V2[View: media-projects<br/>OVERRIDE: 50 TiB / alert 40 TiB]
    V3[View: render-scratch<br/>OVERRIDE: 200 TiB / alert 160 TiB]
    U1[User quota: alice<br/>500 GiB cap]
    U2[User quota: bob<br/>500 GiB cap]
    U3[User quota: build-svc<br/>2 TiB cap]

    SD --> V1
    SD --> V2
    SD --> V3
    V1 --> U1
    V1 --> U2
    V2 --> U3
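
The inheritance and alert behaviour in Figure 12.3 can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, and treating "used >= quota" as the point where writes fail is an assumption about enforcement, not documented product behaviour.

```python
TIB = 1024 ** 4  # bytes per TiB


def effective_quota(domain_default, view_override=None):
    """Return (quota, alert_limit) in bytes; a View override wins over the domain default."""
    return view_override if view_override is not None else domain_default


def quota_status(used, quota, alert_limit):
    """Classify usage against the enforced cap and the soft alert trigger."""
    if used >= quota:
        return "over-quota"  # assumed: writes start failing here
    if used >= alert_limit:
        return "alert"       # operator warning, writes still succeed
    return "ok"


# Values taken from Figure 12.3.
domain_default = (10 * TIB, 8 * TIB)
home_dirs = effective_quota(domain_default)                            # inherits
media_projects = effective_quota(domain_default, (50 * TIB, 40 * TIB))  # overrides

print(quota_status(9 * TIB, *home_dirs))       # same usage, different Views
print(quota_status(9 * TIB, *media_projects))
```

The same 9 TiB of usage is an alert on the inheriting View and healthy on the overriding one, which is the behaviour to keep in mind when changing a Storage Domain default.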

QoS Policies: Workload-Aware Placement and Throttling

QoS in SmartFiles is selected at View creation time and steers two things: which storage tier (SSD vs. HDD) the View prefers, and how aggressively foreground IO competes with background tasks like dedupe and garbage collection. The shipping predefined policies include [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]:

QoS Policy	Tier Bias	IO Profile	Designed For
Backup Target Low	HDD-heavy	Large sequential / mixed small-block	Backup landing zones, secondary storage
TestAndDev High	SSD-optimized	Transactional, low-latency	Active dev/test, VDI-style workloads
(Archive / general purpose)	HDD/cold	Capacity-oriented	Cold archive Views, file shares with relaxed latency

Two CCAE-flavored design rules around QoS:

Pick the QoS at View creation. Changing it on a busy View is non-trivial and may require data movement; design up front based on the workload class.
Match QoS to workload, not to who paid for it. Putting an active SMB user share on “Backup Target Low” tanks latency. Putting a backup target on “TestAndDev High” wastes SSD on workloads that are mostly write-then-archive.

Tiering: Hot/Cold Placement Across Local and Cloud

SmartFiles applies policy-driven tiering across SSD, HDD, and cloud object targets (AWS S3, Azure Blob, GCS, or any S3-compatible object store). Cold blocks move out without breaking the namespace: clients still see the file or object at the same path, and access triggers a transparent recall [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]. Tiering is configured at the Storage Domain or View level and applies to all protocols simultaneously: a file tiered to the cloud behaves identically whether it is then read via SMB, NFS, or the S3 API.

Architects should explicitly model:

Recall latency — first read after tier-down will hit cloud egress latency and may incur recall costs.
Working set sizing — local SSD/HDD must still hold the active working set; tiering is for cold data, not hot.
Egress cost — repeated recalls of the same dataset can dwarf the storage savings.
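
The egress point is easy to quantify with a back-of-the-envelope model. The sketch below is a simplification, and every price in it is a placeholder chosen for round arithmetic, not real vendor pricing; substitute the figures from your own contracts.

```python
def monthly_net_savings(cold_tb, local_cost_per_tb, cloud_cost_per_tb,
                        recalled_tb, egress_cost_per_tb):
    """Monthly storage savings from tiering minus the monthly recall (egress) cost."""
    storage_savings = cold_tb * (local_cost_per_tb - cloud_cost_per_tb)
    recall_cost = recalled_tb * egress_cost_per_tb
    return storage_savings - recall_cost


def break_even_recall_tb(cold_tb, local_cost_per_tb, cloud_cost_per_tb,
                         egress_cost_per_tb):
    """Monthly recall volume (TB) at which tiering stops saving money."""
    return cold_tb * (local_cost_per_tb - cloud_cost_per_tb) / egress_cost_per_tb


# Placeholder inputs: 400 TB cold, 20.0 vs 12.5 per TB-month, 90.0 per TB recalled.
quiet = monthly_net_savings(400, 20.0, 12.5, recalled_tb=10, egress_cost_per_tb=90.0)
busy = monthly_net_savings(400, 20.0, 12.5, recalled_tb=40, egress_cost_per_tb=90.0)
print(quiet, busy, round(break_even_recall_tb(400, 20.0, 12.5, 90.0), 1))
```

With these placeholder numbers, recalling 10 TB a month still leaves a net saving, while recalling 40 TB a month turns the tier into a net cost. The break-even sits at roughly 33 TB recalled per month.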

Worked Example: Media-and-Entertainment Workflow

Consider a post-production facility with two distinct workloads sharing one Cohesity cluster:

Editorial team — about 30 video editors using Avid/Premiere over SMB3 against an active project share. Latency-sensitive, small numbers of large files, frequent timeline scrubs that read the same media segments repeatedly. Working set is ~20 TB at any time; long tail of finished projects is ~400 TB.
Render farm — 200 Linux render nodes pulling assets over NFSv4 and writing rendered frames as objects via S3. Throughput-sensitive, highly parallel reads, append-heavy writes. Sustained ingest of ~5 GB/s during render windows.

A reasonable design:

Storage Domain: "Media-Primary"  (SSD-biased, RF2, post-process dedupe)
  +-- View: "edit-projects"
  |     Protocols: SMB3 R/W, NFSv4 R/W, S3 read-only (for archive readers)
  |     QoS: TestAndDev High (SSD priority, low latency)
  |     Quota: 50 TB, alert 40 TB
  |     Tiering: cold blocks > 90 days idle -> S3 (Standard-IA)
  |
  +-- View: "render-scratch"
        Protocols: NFSv4 R/W, S3 R/W
        QoS: General-purpose (HDD-biased, throughput-optimized)
        Quota: 200 TB, alert 160 TB
        Tiering: cold blocks > 30 days idle -> S3 (Glacier Instant Retrieval)

Editors get SSD-class latency on the active project share via “TestAndDev High”. The render farm gets capacity-class HDD throughput on a separate View whose IO profile cannot starve the editors. Both Views share the same SpanFS dedupe domain — so when the same source plate is referenced from both Views, it is stored once. Cold finished projects tier off to S3 transparently; an editor opening a 6-month-old project pays a one-time recall latency, but the file system path does not change.
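
One sanity check worth running on this design is how quickly the render farm can consume the render-scratch quota. The sketch below assumes decimal units (1 TB = 1000 GB), that the quota counts logical bytes written, and that no frames are deleted during the window; all three are simplifying assumptions, not stated product behaviour.

```python
def hours_to_reach(capacity_tb, ingest_gb_per_s):
    """Hours of sustained ingest needed to write capacity_tb (decimal units)."""
    seconds = (capacity_tb * 1000) / ingest_gb_per_s  # TB -> GB, then GB / (GB/s)
    return seconds / 3600


# Figures from the worked example: 5 GB/s ingest, 200 TB quota, 160 TB alert.
print(round(hours_to_reach(160, 5), 1))  # 8.9 hours to the alert limit
print(round(hours_to_reach(200, 5), 1))  # 11.1 hours to the quota
```

Under these assumptions the alert fires after roughly nine hours of sustained rendering and the quota is hit about two hours later, so the design also needs a scratch-cleanup step between render windows; the 30-day tiering rule alone will not free space fast enough.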

Cohesity Insight Reports

SmartFiles surfaces capacity, top quota consumers, file age distribution, and access-pattern analytics through Helios reporting and the Insight family of reports. Architects use these for chargeback, for justifying tier-down policy choices to data owners, and for sizing migrations off legacy NAS (see next section).

Key Takeaway: SmartFiles governance has three knobs — quotas, QoS, and tiering — anchored to Storage-Domain defaults. Quotas are enforced caps with a separate alert-limit acting as the soft warning. QoS is chosen at View creation and is hard to change later, so map it to the workload class (Backup Target Low for landing zones, TestAndDev High for active SSD workloads). Tiering moves cold blocks to cloud transparently across all protocols.

Data Protection for SmartFiles

A primary file/object service that loses data is not a service. SmartFiles inherits the full DataPlatform protection stack, with a few SmartFiles-specific data services on top.

Snapshots and Policies

Every View can be snapshotted on a schedule using the same Protection Policies covered in Chapter 7 — frequency, hierarchical retention (daily / weekly / monthly / yearly), and lock attributes. SnapTree gives near-zero overhead for snapshots, so retention windows can be aggressive without paying capacity penalties [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Snapshots are mountable as read-only Views, which makes “previous versions” workflows straightforward for SMB users.

For ransomware resilience, combine snapshot policies with DataLock (Chapter 11) so that a snapshot can be deleted only once its lock period has elapsed or with quorum approval. SmartFiles is a particularly attractive ransomware target precisely because it holds primary files; immutable snapshots are not optional for production deployments.

DR Replication for SmartFiles

Views replicate to a remote cluster using the same replication engine that DataProtect uses, with encryption, compression, and dedupe-aware transfer over the wire. Replication policies are attached to the View’s Protection Policy. For active-active patterns, the same View name on the remote side is presented as a read-only mirror that can be promoted on failover; for active-passive, snapshots and live data are pushed continuously and the remote View is brought online during a SiteContinuity-orchestrated failover.

Architectural notes for SmartFiles DR:

AD/LDAP must be reachable from the DR site for SMB and NFSv4 access to resolve identities post-failover.
DNS / VIP planning matters more than for backup workloads — clients are connecting on protocol VIPs, so the failover plan must include either VIP movement or DNS record updates.
S3 endpoint URLs must be planned end-to-end; cloud-native applications often hard-code the bucket endpoint and need a redirection strategy.

File Audit Logging

SmartFiles ships native file audit logging that records per-event activity (open, read, write, rename, delete, ACL change) on Views. Audit events are forwarded to syslog or to SIEM platforms, replacing the bolt-on third-party audit appliances that traditional NetApp/Isilon shops ran in front of their NAS. Audit logging is also a control for ransomware detection: an unusual rate of rename-then-delete on a user share is a classic encryption signature.
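
To make the rename-then-delete signature concrete, here is a minimal sliding-window sketch of the kind of rule a SIEM might apply to forwarded audit events. The event tuple layout, the window, and the threshold are all assumptions for illustration; this is not the audit log format or any built-in detection logic.

```python
from collections import defaultdict, deque


def flag_suspicious_users(events, window_s=60, threshold=100):
    """Flag users with >= threshold rename/delete events inside any window_s-second span.

    events: iterable of (ts_seconds, user, op, path), sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent rename/delete ops
    flagged = set()
    for ts, user, op, _path in events:
        if op not in ("rename", "delete"):
            continue
        window = recent[user]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] > window_s:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(user)
    return flagged


# Hypothetical events: alice renames and deletes ten files in ten seconds.
events = sorted(
    [(i, "alice", "rename" if i % 2 == 0 else "delete", f"/share/f{i}") for i in range(10)]
    + [(3, "bob", "read", "/share/a"), (4, "bob", "rename", "/share/b")]
)
print(flag_suspicious_users(events, window_s=60, threshold=10))
```

A production rule would tune the threshold per share, since bulk renames by a build service can look similar to encryption activity.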

Anti-Virus and ICAP Integration

SmartFiles integrates antivirus scanning natively via ICAP (the Internet Content Adaptation Protocol). When ICAP AV is enabled on a View, write paths fan out to one or more configured ICAP servers (Trellix, Symantec, Sophos, etc.) for scanning before the data is committed [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]. This replaces the architecture pattern where customers ran a separate ICAP-broker appliance fronting Isilon or NetApp: one fewer hop, one fewer thing to license.

For the CCAE exam, remember:

ICAP is a synchronous scan; very high-throughput workloads must size ICAP scanner pools accordingly or scope ICAP to specific Views (e.g., user shares but not render scratch).
ICAP scan results integrate with audit logs and DataHawk where deployed.
Cohesity’s ICAP support is a feature of SmartFiles itself; it is not an additional appliance.
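
Because the scan sits in the write path, scanner pool sizing is simple throughput arithmetic. The sketch below is a rough planning aid under stated assumptions: the per-scanner throughput, the headroom factor, and the spare count are hypothetical inputs that you would replace with measured values from your AV vendor.

```python
import math


def icap_scanners_needed(peak_write_mb_s, per_scanner_mb_s, headroom=0.25, spares=1):
    """Scanners required to keep up with peak writes, plus headroom and spare capacity."""
    required = peak_write_mb_s * (1 + headroom) / per_scanner_mb_s
    return math.ceil(required) + spares


# Hypothetical user-share View: 800 MB/s peak writes, 250 MB/s per scanner.
print(icap_scanners_needed(800, 250))  # 5
```

Scoping ICAP to user shares only, as suggested above, shrinks the peak write figure and therefore the pool.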

Key Takeaway: SmartFiles inherits SpanFS-level snapshots and replication and adds three native data services that have historically been bolt-ons in legacy NAS estates: file audit logging, ICAP-based AV, and DataLock-immutable snapshots. The replacement of bolted-on audit and AV products is one of the most common business cases for migrating from NetApp/Isilon to SmartFiles.

Migration and Modernization

Few customers buy SmartFiles for greenfield workloads. The CCAE-relevant scenarios are almost always migrations or modernizations: replacing a NetApp filer at end-of-life, absorbing an Isilon estate as part of a vendor consolidation, or onboarding cloud-native S3 workloads that started life on AWS.

NAS File Migration Service: The Packaged Path

Cohesity sells a packaged Professional Services engagement called the NAS File Migration Service for full NetApp/Isilon cutovers. The service covers cluster preparation, migration planning, the cutover itself, and end-state documentation, and is sized at approximately 30 TB per migration event [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-nas-file-migration-service-data-sheet-en.pdf]. For estates significantly larger than 30 TB, plan multiple cutover events — by share, by department, or by data classification.
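
Grouping shares into cutover events is a small bin-packing exercise. The following sketch uses a first-fit-decreasing heuristic and treats 30 TB as a hard per-event cap purely for illustration; the share names and sizes are hypothetical, and the service's actual scoping rules may differ.

```python
def plan_cutover_events(shares_tb, tb_per_event=30):
    """Group shares into cutover events using first-fit decreasing.

    shares_tb: dict of share name -> size in TB.
    Returns a list of events, each a list of (name, size_tb).
    """
    events = []
    for name, size in sorted(shares_tb.items(), key=lambda kv: kv[1], reverse=True):
        if size > tb_per_event:
            raise ValueError(f"{name} exceeds one event; split it by subdirectory first")
        for event in events:
            if sum(s for _, s in event) + size <= tb_per_event:
                event.append((name, size))
                break
        else:
            events.append([(name, size)])  # no existing event had room
    return events


shares = {"hr": 12, "eng": 18, "finance": 9, "legal": 20, "scans": 10}
plan = plan_cutover_events(shares)
for number, event in enumerate(plan, start=1):
    print(number, event)
```

For these 69 TB of shares the heuristic produces three events, which matches the lower bound of ceil(69 / 30). In practice the grouping would also respect department or data-classification boundaries.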

Transparent Cold-Data Tiering: The Coexistence Path

When the legacy NAS is not yet at end-of-life, SmartFiles can absorb just the cold data without disturbing hot workloads. SmartFiles scans the source NAS using its built-in file analytics, classifies data by access pattern, and policy-tiers cold blocks to the Cohesity cluster (or directly to cloud) — without rehydration on access [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf]. The legacy NAS keeps serving hot data; SmartFiles silently absorbs the cold tail and exposes it at the same logical path. This is often used to defer a NAS refresh by years and to make the eventual cutover smaller.

Backup-Driven Migration: The Unconventional Path

Because Cohesity is also a NAS backup target, an unconventional but supported path is to back up the source NAS to Cohesity and then restore cross-filer — for example, restoring an Isilon backup directly into a SmartFiles View, or even into a different NAS array entirely [Source: https://www.cohesity.com/blogs/how-to-conquer-nas-backup-and-recovery-challenges/]. This bypasses traditional robocopy / rsync / NetApp-native tooling for many use cases and is particularly useful when permission preservation across SMB/NFS is critical.

Cohesity SmartFiles vs. NetApp ONTAP and Dell PowerScale (Isilon)

Capability	Cohesity SmartFiles	NetApp ONTAP	Dell PowerScale (Isilon)
Primary architecture	Distributed SpanFS, hyperconverged	Dual-controller HA pairs (cluster of pairs)	Distributed OneFS scale-out NAS
Single namespace	Yes, cluster-wide	Per SVM (vserver)	Yes, cluster-wide
Multi-protocol on same data	NFSv3/v4 + SMB3 + S3 simultaneously	NFS + SMB + S3 (S3 is bolt-on, separate bucket per SVM)	NFS + SMB + (S3 via separate ECS or via OneFS S3)
Global dedupe	Yes, variable-length, cluster-wide	Volume/aggregate scoped	Limited; per-volume
Cold-tier to public cloud	Yes, transparent recall, all protocols	FabricPool (block-level, ONTAP-only mechanism)	CloudPools (Isilon-specific)
Native ICAP AV	Yes, built in	Yes (Vscan)	Yes (CAVA / ICAP)
Native file audit	Yes	FPolicy	Audit subsystem
Snapshot model	SnapTree, unlimited, zero-copy	WAFL snapshots, capacity-efficient	OneFS snapshots
Immutability	DataLock (WORM)	SnapLock	SmartLock
Backup target capability	Native (same product, same cluster)	Possible but not the design center	Possible but not the design center
Single platform for backup + primary	Yes — that is the design center	No	No

The critical row is the last one. NetApp and Isilon are excellent primary NAS platforms, but they are not designed to also be your backup target. SmartFiles’ architectural argument is consolidation: one platform for backup, files, objects, archive, and DR.

Application Refactoring with S3

A meaningful subset of SmartFiles deployments are not file-server replacements at all — they are S3 endpoints for cloud-native applications that need an on-prem object store. Splunk SmartStore, container image registries, ML training datasets, IoT telemetry sinks, and Veeam-style backup targets all routinely consume S3-only Views. For these, treat the View as a bucket and forget the file protocol surface entirely.

Lift-and-Shift Patterns

For lift-and-shift modernization, a common sequence looks like:

Discovery. Run SmartFiles analytics against the source filers; produce a hot/warm/cold map per share.
Decision tree. For each share, choose tier-off (keep source, move cold) vs. full cutover (NAS File Migration Service) vs. backup-driven restore.
Identity and AV. Stand up AD/LDAP integration and ICAP scanner pools before any cutover, not after.
QoS placement. Map each share to a QoS class — primary user share to TestAndDev High or general-purpose, archive to Backup Target Low, render scratch to general-purpose throughput.
Cutover windows. Plan cutover events at the 30-TB-per-event sizing of the packaged service; use replication seeding to minimize cutover time.
Decommission. Retire the source filer once the SmartFiles View is authoritative and snapshots have aged into the new retention windows.
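
Steps 2 and 4 of the sequence above can be sketched as a per-share decision function. The rule order, the 50% cold threshold, and the share classes are assumptions chosen for illustration; a real engagement would weigh more inputs. The user-share mapping picks TestAndDev High, one of the two options the list allows.

```python
# QoS class per share type, following step 4 of the sequence.
QOS_BY_SHARE_CLASS = {
    "user-share": "TestAndDev High",
    "archive": "Backup Target Low",
    "render-scratch": "general-purpose",
}


def choose_path(source_at_end_of_life, cold_fraction, acl_preservation_critical,
                cold_threshold=0.5):
    """Pick one of the three migration paths for a share (illustrative rules)."""
    if not source_at_end_of_life and cold_fraction >= cold_threshold:
        return "tier-off"               # legacy NAS keeps serving hot data
    if acl_preservation_critical:
        return "backup-driven restore"  # cross-filer restore of a NAS backup
    return "full cutover"               # packaged NAS File Migration Service


def plan_share(name, share_class, **traits):
    """Combine the migration path and QoS class into one plan entry."""
    return {"share": name, "path": choose_path(**traits),
            "qos": QOS_BY_SHARE_CLASS[share_class]}


print(plan_share("hr-archive", "archive", source_at_end_of_life=False,
                 cold_fraction=0.8, acl_preservation_critical=False))
```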

Figure 12.4: NAS migration workflow — three paths from legacy filer to SmartFiles.

flowchart TD
    Legacy[Legacy NAS<br/>NetApp ONTAP / Dell Isilon]
    Disc[Discovery & Analytics<br/>SmartFiles file scan<br/>hot/warm/cold map]
    Decision{Decision tree<br/>per share}
    Cutover[Full Cutover Path<br/>NAS File Migration Service<br/>~30 TB per event]
    Tier[Cold-Data Tier-Off Path<br/>policy-driven tiering<br/>legacy keeps hot data]
    Backup[Backup-Driven Path<br/>NAS backup + cross-filer<br/>restore preserves ACLs]
    Prep[Pre-cutover prep<br/>AD/LDAP integration<br/>ICAP scanner pools<br/>QoS classification]
    SF[SmartFiles View<br/>SMB3 + NFSv4 + S3<br/>on Cohesity DataPlatform]
    Decom[Decommission legacy<br/>after retention aged]

    Legacy --> Disc
    Disc --> Decision
    Decision --> Cutover
    Decision --> Tier
    Decision --> Backup
    Cutover --> Prep
    Backup --> Prep
    Tier --> SF
    Prep --> SF
    SF --> Decom

Key Takeaway: SmartFiles offers three migration paths — full cutover via the 30-TB-per-event NAS File Migration Service, transparent cold-data tier-off that lets the legacy NAS keep running, and backup-driven cross-filer restore. The strategic argument against NetApp/Isilon is consolidation: SmartFiles is the same platform that holds your backups, replicates to DR, and tiers to cloud, with native file audit and ICAP AV replacing bolt-on appliances.

Chapter Summary

SmartFiles turns the Cohesity DataPlatform into a multi-protocol primary unstructured-data service without adding any new hardware or any new file system. It is built on SpanFS, the same distributed file system that holds your backups, and its central construct is the View: a logical container exposed simultaneously as SMB3, NFSv3/v4, and S3 against the same data. Views live inside View Boxes / Storage Domains that set policy — efficiency, encryption, resiliency, default quotas, tiering — and that act as fault-isolation boundaries between workload classes.

Governance is three knobs: quotas (with separate alert-limits as the soft trigger), QoS (predefined policies like Backup Target Low and TestAndDev High, chosen at View creation and hard to change later), and tiering (transparent hot/cold placement across SSD, HDD, and S3-compatible cloud, applied uniformly across all protocols). Identity is mediated by AD/LDAP so a single AD user maps to SMB SIDs, NFS UID/GID, and S3 access keys against the same View. Cross-protocol locking is enforced inside SpanFS.

Data services that legacy NAS estates traditionally bolted on — file audit logging, ICAP antivirus, DataLock immutability — are native to SmartFiles. So is the protection stack: snapshots via SnapTree, replication via the same engine DataProtect uses, and DR orchestration via SiteContinuity.

For migrations, three patterns dominate: the packaged NAS File Migration Service (~30 TB per cutover event), transparent cold-data tier-off that defers a refresh without disturbing hot workloads, and backup-driven cross-filer restore when permission preservation matters. The architectural value proposition against NetApp ONTAP and Dell PowerScale (Isilon) is consolidation: one platform for backup, files, objects, archive, and DR, with global dedupe across what used to be siloed estates.

For the CCAE exam, anchor your answers in the View / View Box / SpanFS hierarchy, remember that QoS is sticky and must be chosen up front, and recognize the three-path migration playbook. Most scenario questions about SmartFiles reduce to: which protocol(s) on which View, which QoS, which Storage Domain, which tiering policy, and which protection policy — in that order.

Key Terms

SmartFiles — Cohesity’s primary file and object data service, built on SpanFS, that exposes Views as multi-protocol containers (SMB3 + NFSv3/v4 + S3) on the same DataPlatform cluster used for backup, archive, and DR.
View — The unified logical container in SpanFS that data lives in. A View can be exposed simultaneously as an NFS export, an SMB share, and an S3 bucket against the same underlying data.
View Box (also Storage Domain) — The policy container that holds Views and defines defaults for storage efficiency, encryption, resiliency (RF/EC), tiering, and default quotas. Acts as the workload-class fault-isolation boundary.
SMB3 — The SMB protocol level supported by SmartFiles, including signing and encryption. Used for Windows file share semantics.
NFSv4 — The newer NFS version supported alongside NFSv3 on SmartFiles Views; supports name@domain principals and stateful locks.
S3 — S3-compatible object API exposed on Views, where files become objects keyed by their path. Supports read-only, read-write, and writable-clone modes per View.
Quota — Capacity governance enforced per-View or per-user-within-a-View. The quota itself is the enforced cap; a separate alert-limit acts as the soft warning before hard enforcement.
QoS — Predefined policies (e.g., Backup Target Low, TestAndDev High) selected at View creation that steer tier preference (SSD vs. HDD) and IO prioritization. Non-trivial to change after the View is in production.
ICAP — Internet Content Adaptation Protocol. The mechanism SmartFiles uses for native AV scanning by integrating with external ICAP scanner pools (Trellix, Symantec, Sophos, etc.), replacing the bolt-on ICAP-broker appliances common in legacy NAS estates.

Chapter 13: Helios SaaS, Marketplace Apps, and Automation

If Chapters 1 through 12 taught you how to design, deploy, secure, and protect data on individual Cohesity clusters, this chapter zooms out to the operating model an architect actually inherits in production: a fleet. Real CCAE-scale customers run anywhere from three or four clusters to several hundred, scattered across data centers, public clouds, and edge sites. They cannot afford to log in to each cluster individually, and they certainly cannot afford for the same Gold protection policy to mean three different things in three different regions. The architectural answer is the Helios SaaS control plane plus the Marketplace and the automation stack — REST API v2, Terraform, Ansible, and PowerShell — that turn the fleet into a single managed estate.

Think of it this way: a single Cohesity cluster is a server. Helios is the cloud console for the whole fleet, and the automation tools are the deployment pipelines that keep that fleet in compliance with whatever the architecture document says it should be.

Learning Objectives

By the end of this chapter, you will be able to:

Architect global fleet management with Helios across multi-cluster, multi-region deployments, including dark-site (self-managed Helios) variants.
Use Helios reporting, dashboards, federated RBAC, and global search effectively for both operations and compliance use cases.
Position Cohesity DataProtect delivered as a Service (DMaaS) — including data residency, region selection, and subscription implications — relative to self-managed clusters.
Deploy and govern Marketplace apps using the Apollo container runtime, AppSpec YAML, and view-mediated data access; identify vetted apps such as SentinelOne, ClamAV, Splunk, and Imanis.
Automate operations with REST API v2, the cohesity/cohesity Terraform provider, the cohesity.dataprotect Ansible collection, and the cross-platform PowerShell module — and choose the right tool for each job.

Helios SaaS Control Plane

Cohesity Helios is delivered as a SaaS service that aggregates and centralizes management of every Cohesity cluster a customer owns, regardless of whether those clusters live on-premises, in a public cloud, or in a hybrid topology [Source: https://www.cohesity.com/products/helios/]. From an architect’s standpoint, Helios is the single pane of glass that turns a fleet of independent clusters into one logical, policy-governed estate.

Helios Architecture and Tenancy Model

Architecturally, Helios is a multi-tenant cloud service hosted by Cohesity. Each customer gets a Helios account (a tenant), and clusters are bound to that account during onboarding. Connectivity from cluster to Helios is outbound HTTPS over TCP/443 — Cohesity does not require any inbound openings on customer firewalls, which is exactly what enterprise security architects want to hear [Source: https://docs.cohesity.com/]. The cluster establishes a long-lived TLS session to Helios, sends telemetry and accepts pushed configuration, and exposes a relay channel that lets a Helios admin click “Launch Cluster UI” and proxy into the per-cluster console without a VPN.

Helios Layer	Responsibility	Lives Where
Helios UI / API	Single pane of glass; entry point for admins, APIs, dashboards	Cohesity SaaS
Helios services (telemetry, search index, reporting)	Aggregate cluster metadata, run anomaly detection	Cohesity SaaS
Cluster agent / Helios connector	Outbound HTTPS tunnel from each cluster	On each managed cluster
Data plane (SpanFS, Bridge, Magneto)	Stores and protects data; Helios does not see data, only metadata	On each managed cluster
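
Since the whole model depends on outbound TCP/443, a pre-onboarding reachability check is a cheap first step. The sketch below uses only the Python standard library; the hostname is deliberately left as a parameter because the correct Helios endpoint for an account is not stated here, and the injectable `connect` argument exists only so the function can be exercised without a network.

```python
import socket


def outbound_443_ok(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be opened from this machine."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Replace the placeholder with the endpoint documented for your Helios account.
    print(outbound_443_ok("helios.example.invalid"))
```

A successful TCP connect only proves the firewall path. TLS inspection proxies can still break the long-lived session, so this check complements the registration test rather than replacing it.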

Figure 13.1: Helios SaaS control plane topology — outbound HTTPS aggregation of multi-cluster, multi-cloud, and DMaaS estates

flowchart LR
    subgraph OnPrem["On-Premises Data Centers"]
        C1[Cluster A<br/>NYC]
        C2[Cluster B<br/>Frankfurt]
        C3[Cluster C<br/>Singapore]
    end
    subgraph Cloud["Public Cloud"]
        CE1[Cloud Edition<br/>AWS us-east-1]
        CE2[Cloud Edition<br/>Azure westeurope]
    end
    subgraph DMaaS["Cohesity-Operated"]
        DM[DMaaS Tenant<br/>region-pinned]
        FK[FortKnox<br/>Cyber Vault]
    end
    C1 -- "outbound HTTPS<br/>TCP/443" --> Helios
    C2 -- "outbound HTTPS<br/>TCP/443" --> Helios
    C3 -- "outbound HTTPS<br/>TCP/443" --> Helios
    CE1 -- "outbound HTTPS" --> Helios
    CE2 -- "outbound HTTPS" --> Helios
    DM --> Helios
    FK --> Helios
    Helios[("Helios SaaS<br/>Control Plane<br/>UI / API / Reporting<br/>DataHawk / Anomaly")]
    Helios -- "HTTPS" --> Browser[Admin Browser /<br/>API Client]

The crucial separation is that Helios is a control plane only. Backup data, deduplication chunks, and SpanFS metadata never leave the cluster. What flows to Helios is operational metadata — job status, capacity, alerts, source inventory, policy IDs — which is the basis for both the global dashboard and the AI/ML anomaly detection [Source: https://www.cohesity.com/products/helios/].

Onboarding Clusters to Helios

Onboarding is intentionally trivial: in the cluster UI under Settings → Helios Registration, the admin enters the Helios account credentials, Helios issues a registration token, and the cluster establishes the outbound tunnel. From that moment on, the cluster appears in the Helios dashboard. Re-registering or detaching a cluster is likewise a one-click operation from the Helios side.

For dark sites or air-gapped environments where outbound HTTPS is forbidden, Cohesity offers Helios Self-Managed, a customer-hosted variant of the Helios services that runs on customer infrastructure and provides equivalent fleet management without depending on Cohesity’s SaaS [Source: https://www.cohesity.com/products/helios/]. Helios Self-Managed is the standard answer for federal classified environments, certain regulated banks, and intelligence-sector deployments.

Global Dashboards and SLA Reporting

Helios presents a unified, real-time view that aggregates health, capacity, protection status, SLA compliance, and performance metrics from every managed cluster. Operators drill from a fleet-level summary down to a single job on a single node without switching tools. The reporting engine produces customizable reports covering backup success rates, RPO/RTO compliance, storage consumption, growth trends, chargeback by tenant or organization, and audit/compliance evidence; reports can be scheduled and exported, which is essential for ITGC and SOC 2 audit cycles [Source: https://www.cohesity.com/products/helios/].

Federated global search lets an admin search for a VM, file, mailbox object, or database backup by name across the entire fleet; Helios resolves which cluster(s) hold the data. This is foundational for incident response (you don’t have to know in advance which cluster has the backup) and for legal/eDiscovery workflows that span sites.

Federated RBAC layers on top: granular roles can be scoped to specific clusters, regions, organizations, or object types, and can be sourced from Okta, Azure AD/Entra ID, Active Directory, or any SAML 2.0 IdP. A regional admin who can only see and recover within EMEA clusters is a one-line RBAC scope, not a per-cluster configuration project.
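
The scoping idea can be modelled as a role that carries its own region and action sets. This is a conceptual sketch with hypothetical names; it does not reflect how Helios stores or evaluates roles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleScope:
    """A role restricted to certain regions and actions (hypothetical model)."""
    role: str
    regions: frozenset
    actions: frozenset


def is_allowed(scope, action, cluster_region):
    """Permit the action only if both the action and the cluster's region are in scope."""
    return action in scope.actions and cluster_region in scope.regions


emea_admin = RoleScope("regional-admin", frozenset({"EMEA"}), frozenset({"view", "recover"}))
print(is_allowed(emea_admin, "recover", "EMEA"))  # in scope
print(is_allowed(emea_admin, "recover", "APAC"))  # wrong region
```

Because the scope lives on the role rather than on each cluster, adding a new EMEA cluster needs no RBAC change.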

Helios-Only Features (DataHawk, FortKnox, Anomaly Detection)

Several Cohesity capabilities are Helios-resident by design — they have no on-cluster equivalent, because they need fleet-wide telemetry or cloud-side compute:

DataHawk — ML-based ransomware detection, threat intelligence (YARA), and data classification, scoring backup ingest patterns across the fleet to surface anomalies.
FortKnox — managed cyber vault that air-gaps an immutable copy of backups in a Cohesity-operated cloud tenant; it is provisioned, monitored, and recovered through Helios.
Cross-cluster anomaly detection — sudden change-rate spikes on one cluster correlated with similar patterns elsewhere, a signal you can only compute fleet-wide.
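
The value of fleet-wide correlation can be illustrated with a simple z-score rule. This is explicitly not DataHawk's algorithm, which is not described here. It is a toy model with hypothetical cluster names, change rates, and thresholds, showing why a signal computed across clusters is stronger than any single-cluster alert.

```python
from statistics import mean, stdev


def spiking_clusters(history, today, z_threshold=3.0):
    """Return clusters whose change rate today is >= z_threshold sigmas above their own baseline.

    history: {cluster: [daily change rate %, ...]}; today: {cluster: change rate %}.
    """
    spikes = []
    for cluster, rates in history.items():
        mu, sigma = mean(rates), stdev(rates)
        if sigma > 0 and (today[cluster] - mu) / sigma >= z_threshold:
            spikes.append(cluster)
    return sorted(spikes)


def fleet_alert(history, today, min_clusters=2):
    """Raise a fleet-level alert only when several clusters spike at the same time."""
    return len(spiking_clusters(history, today)) >= min_clusters


history = {
    "nyc": [2.0, 2.2, 1.9, 2.1, 2.0],
    "fra": [3.0, 3.1, 2.9, 3.0, 3.2],
    "sin": [1.0, 1.1, 0.9, 1.0, 1.0],
}
today = {"nyc": 9.5, "fra": 11.0, "sin": 1.05}
print(spiking_clusters(history, today), fleet_alert(history, today))
```

Here two of the three clusters spike together, which is the correlated pattern a per-cluster monitor would report only as two unrelated events.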

Key Takeaway: Helios is the SaaS control plane that converts N independent Cohesity clusters into one managed estate. Connectivity is outbound HTTPS only; backup data never leaves the cluster; and global dashboards, federated RBAC, global search, and Helios-only features (DataHawk, FortKnox) are the architectural payoff. For dark sites, Helios Self-Managed delivers the same model on customer infrastructure.

Helios as a Service (HaaS) and DMaaS

Helios is not just a console; it is also the entry point and management surface for Cohesity’s Data Management as a Service offerings. DMaaS shifts the operating model from “I run the cluster” to “I subscribe to backup outcomes” — Cohesity operates the underlying clusters in the cloud, and the customer consumes them through Helios.

DataProtect Delivered as a Service

DMaaS bundles DataProtect, replication, archive, and recovery as a fully managed SaaS service [Source: https://www.cohesity.com/products/data-management-as-a-service/]. The customer points sources (VMs, M365 tenants, databases, NAS) at the DMaaS endpoint; Cohesity provisions the underlying SpanFS capacity, runs the backup jobs, manages upgrades, and meters consumption. There is no cluster to bootstrap, no node to replace, and no version to upgrade — the SLA covers the platform.

Operating Model	What the Customer Owns	What Cohesity Owns
Self-managed DataProtect (on-prem cluster)	Hardware, OS, network, cluster software, policies, sources	Software releases, support, Helios SaaS
Self-managed DataProtect (Cloud Edition)	Cloud VM/IaaS bill, cluster software ops, policies	Software, Helios SaaS
DMaaS (DataProtect as a Service)	Sources, policies, RBAC, data residency choice	Cluster, capacity, upgrades, SLA, infrastructure
FortKnox cyber vault	Vault policies, recovery decisions	Vault infrastructure (immutable, air-gapped)

Region Selection and Data Residency

DMaaS is provisioned into specific cloud regions (AWS or Azure, depending on offering and customer choice). The architect’s job is to map regulatory boundaries — GDPR for EU data, data sovereignty laws in countries like Germany, Switzerland, India, Australia, and Canada — to a region selection. Because backup data carries the same residency obligations as primary data, picking a US region for an EU tenant’s M365 backups is not an option you can quietly hand-wave past an auditor. Helios surfaces region choice during DMaaS subscription and pins data residency for the lifetime of the tenant.

Subscription and Licensing Implications

DMaaS is sold on a subscription basis (typically per FETB-month or per workload tier), versus the perpetual or term license model common to self-managed DataProtect. This shifts the economics from CapEx + ongoing maintenance to pure OpEx. Architects sizing DMaaS apply the same FETB and change-rate inputs from Chapter 3 but must additionally model:

Egress — recall traffic leaving the cloud, billed by the cloud provider through Cohesity’s pass-through.
Long-term retention — DMaaS includes integration with cloud archive tiers (S3 Glacier, Azure Archive); cost optimization here is identical in spirit to the CloudArchive content from Chapter 10.
Replication scope — replication between DMaaS regions or between DMaaS and self-managed clusters is supported and factored into subscription tiers.
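
The modeling inputs above can be combined into a rough monthly estimate. The following is a minimal sketch, not a pricing tool: every unit price is a hypothetical placeholder (none are Cohesity or cloud-provider list prices), the model is linear, and replication is assumed to be covered by the subscription tier as described above.

```python
from dataclasses import dataclass


@dataclass
class DmaasInputs:
    fetb: float                   # front-end terabytes protected
    price_per_fetb_month: float   # hypothetical subscription rate (USD)
    archive_tb: float             # data held in a cloud archive tier
    archive_price_per_tb: float   # hypothetical archive rate (USD per TB-month)
    monthly_recall_tb: float      # data expected to be recalled out of the cloud
    egress_price_per_tb: float    # hypothetical pass-through egress rate (USD per TB)


def monthly_cost(inp: DmaasInputs) -> dict:
    """Return a per-component monthly cost breakdown in USD."""
    subscription = inp.fetb * inp.price_per_fetb_month
    archive = inp.archive_tb * inp.archive_price_per_tb
    egress = inp.monthly_recall_tb * inp.egress_price_per_tb
    return {
        "subscription": subscription,
        "archive": archive,
        "egress": egress,
        "total": subscription + archive + egress,
    }


if __name__ == "__main__":
    example = DmaasInputs(
        fetb=100, price_per_fetb_month=50.0,
        archive_tb=200, archive_price_per_tb=4.0,
        monthly_recall_tb=2, egress_price_per_tb=90.0,
    )
    for name, usd in monthly_cost(example).items():
        print(f"{name:>12}: ${usd:,.2f}")
```

Keeping the three components separate makes it easy to see which input dominates when the recall volume or retention period changes.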

On-Prem vs. SaaS Operating Models — Comparison

Concern	On-Prem DataProtect	DMaaS
Cluster ops (upgrades, hardware)	Customer	Cohesity
Capacity planning	Customer (sizing tool, refresh cycles)	Cohesity (elastic)
Backup network path	LAN-speed to cluster	WAN upload from site to cloud (mind change rate)
Data residency	Wherever you put the cluster	Region selection at subscription time
Cost model	CapEx + maintenance	OpEx subscription
Best fit	Large, dense, predictable workloads; latency-sensitive recoveries	M365, branch offices, cloud-native apps, fast time-to-value
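
The "mind change rate" caution in the table can be checked with simple arithmetic before committing a site to DMaaS. The sketch below estimates how long one day of changed data takes to cross the WAN. The efficiency factor and data-reduction ratio are assumptions to be replaced with measured values, and the function names are illustrative.

```python
def upload_hours(daily_change_gb: float, wan_mbps: float,
                 efficiency: float = 0.7, reduction_ratio: float = 1.0) -> float:
    """Hours needed to send one day's changed data over the WAN.

    daily_change_gb : changed data per day, in decimal gigabytes
    wan_mbps        : nominal WAN bandwidth, in megabits per second
    efficiency      : usable fraction of nominal bandwidth (assumed)
    reduction_ratio : dedup/compression ratio before transfer (1.0 = none)
    """
    if wan_mbps <= 0 or efficiency <= 0 or reduction_ratio <= 0:
        raise ValueError("bandwidth, efficiency, and reduction ratio must be positive")
    bits_to_send = daily_change_gb * 8e9 / reduction_ratio
    usable_bits_per_second = wan_mbps * 1e6 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


def fits_window(daily_change_gb: float, wan_mbps: float, window_hours: float,
                **kwargs) -> bool:
    """True if the daily change can be uploaded inside the backup window."""
    return upload_hours(daily_change_gb, wan_mbps, **kwargs) <= window_hours


if __name__ == "__main__":
    hours = upload_hours(daily_change_gb=500, wan_mbps=1000)
    print(f"500 GB/day over 1 Gb/s WAN: {hours:.2f} h")
    print("Fits an 8 h window on 100 Mb/s:", fits_window(500, 100, 8))
```

A site whose daily change cannot fit the window even with optimistic reduction ratios is a candidate for an on-prem cluster rather than DMaaS.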

Key Takeaway: DMaaS is DataProtect delivered as a managed SaaS service through Helios. Architects choose a region for residency, subscribe by FETB/workload, and consume backup as an outcome rather than an appliance. The same Helios console manages DMaaS tenants and self-managed clusters side by side, which is the architectural enabler for hybrid fleets.

Marketplace Apps

The Cohesity Marketplace is the storefront and delivery mechanism for first- and third-party applications that run directly on the Cohesity DataPlatform, leveraging the Cohesity App framework [Source: https://www.cohesity.com/marketplace/]. The architectural value proposition is compute at data: instead of egressing backup data to a separate analytics, AV, or compliance system, applications run in containers next to SpanFS where the data already lives, eliminating data movement, copy proliferation, and the egress cost/latency tax.

App Framework, Apollo, and Isolation

The Cohesity App Framework is built on containers. Apps are packaged as Docker images plus a Cohesity AppSpec — a Kubernetes-style YAML descriptor extended with Cohesity-specific fields [Source: https://developer.cohesity.com/docs/get-started-apps]. The cluster’s container runtime, Apollo (introduced in the Pegasus 6.3 release line), executes the image. The AppSpec declares resources, view mounts, network requirements, and lifecycle hooks. Each app runs in its own Docker container — a clean isolation boundary that prevents one Marketplace app from interfering with another or with cluster services.

A minimal AppSpec looks like this (excerpted from the Cohesity App SDK documentation):

apiVersion: cohesity.com/v1
kind: App
metadata:
  name: clamav-scanner
  version: 1.4.0
spec:
  image: cohesity-marketplace/clamav:1.4.0
  resources:
    cpu: "2"
    memory: "4Gi"
  views:
    - name: vm-backups
      mountPath: /mnt/backups
      mode: ReadOnly
  network:
    egress: false
  lifecycle:
    onStart: /opt/clamav/scan.sh

Figure 13.2: Marketplace app deployment sequence — Admin to Helios to Apollo runtime to container, with SDK callbacks

sequenceDiagram
    participant Admin as Admin
    participant Helios as Helios UI
    participant Cluster as Target Cluster
    participant Apollo as Apollo Docker Runtime
    participant Container as App Container
    participant Views as SpanFS Views

    Admin->>Helios: Browse Marketplace, select app
    Admin->>Helios: Accept EULA, choose target cluster(s)
    Helios->>Cluster: Push AppSpec YAML + image ref
    Cluster->>Apollo: Register app, enforce resources
    Apollo->>Container: docker run (image, CPU/mem quotas)
    Container->>Views: Mount authorized views (NFS/SMB, ReadOnly)
    Container->>Container: Run lifecycle.onStart hook
    Container->>Cluster: Management SDK call (REST API v2)
    Cluster-->>Container: Snapshot / metadata response
    Container->>Helios: Telemetry / status (egress=false respected)
    Helios-->>Admin: App status: Running

Three security properties are worth pulling out:

views — the only sanctioned data path. The app sees backup data through view mounts (NFS/SMB-backed), restricted to the views the admin authorizes. There is no direct SpanFS access.
network.egress: false — apps can be locked into air-gapped operation, supporting dark-site and high-security deployments. This is exactly how SentinelOne scanning works without phoning home.
resources — Apollo enforces CPU/memory quotas declared in the AppSpec, so a misbehaving app cannot starve the cluster.
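
These three properties lend themselves to an automated pre-deployment check. The sketch below audits an AppSpec that has already been parsed into a Python dictionary. The field names mirror the excerpt shown earlier, and the function name and finding messages are illustrative rather than part of any Cohesity tooling.

```python
from typing import List


def audit_appspec(spec: dict) -> List[str]:
    """Return human-readable findings; an empty list means all checks passed."""
    findings = []
    body = spec.get("spec", {})

    # Property 3: CPU and memory quotas must be declared so they can be enforced.
    resources = body.get("resources", {})
    for key in ("cpu", "memory"):
        if key not in resources:
            findings.append(f"resources.{key} is not declared")

    # Property 1: every view mount should be read-only.
    for view in body.get("views", []):
        if view.get("mode") != "ReadOnly":
            findings.append(f"view {view.get('name')!r} is not mounted ReadOnly")

    # Property 2: egress must be explicitly disabled; a missing key is treated as open.
    if body.get("network", {}).get("egress", True):
        findings.append("network.egress is not explicitly false")

    return findings


if __name__ == "__main__":
    clamav = {
        "spec": {
            "resources": {"cpu": "2", "memory": "4Gi"},
            "views": [{"name": "vm-backups", "mountPath": "/mnt/backups", "mode": "ReadOnly"}],
            "network": {"egress": False},
        }
    }
    print(audit_appspec(clamav) or "AppSpec passes all three checks")
```

Treating a missing egress key as a finding is a deliberately conservative choice: the check fails closed rather than assuming a safe default.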

Marketplace Access and Vetted Apps

Apps are browsed and installed through the Helios UI or the public storefront at ccs-integration-marketplace.cohesity.com [Source: https://ccs-integration-marketplace.cohesity.com]. The deployment workflow is straightforward: select the target cluster(s), review and accept the EULA, then deploy. Installation is fleet-aware — an architect can push the same app to a curated subset of clusters via Helios.

App	Purpose	Typical Use Case
SentinelOne	Endpoint/AV scanning of backup data, no internet egress required	Validate that backups are clean before relying on them for recovery
ClamAV	Open-source antivirus scanning of NAS and VM snapshots	Cost-effective AV for compliance check-the-box
Sophos	Commercial antivirus alternative to ClamAV	Enterprise environments standardized on Sophos
Splunk Enterprise	Log analytics and SIEM ingest running on the cluster	Search audit logs and backup metadata in place
Imanis Data	NoSQL/Hadoop backup integration (since acquired by Cohesity)	MongoDB, Cassandra, Hadoop, Couchbase backup

Custom App Development Overview

Cohesity ships two SDKs for developers building custom apps [Source: https://developer.cohesity.com/docs/get-started-apps]:

App SDK — provides primitives like cohesity_mount for mounting Cohesity Views into the container so the app can read/scan/analyze backup data directly.
Management SDK — exposes the Cohesity REST API surface from inside the container so apps can drive cluster operations (start scans, create snapshots, fetch metadata).

The publish flow is:

Build a Docker image bundling the app and the Cohesity SDKs.
Author an AppSpec YAML and validate with appspecvalidator.
Submit for vetting (developer@cohesity.com) — Cohesity reviews for security, stability, and resource behavior before listing.
On approval, the app appears in Marketplace for customer deployment.

The vetting gate is the first defense; container isolation is the second; view-scoped mounts are the third. Together they form a defense-in-depth posture that makes “third-party code on my backup cluster” an acceptable architectural decision, not a compliance red flag.

Key Takeaway: Marketplace apps run as Docker containers under Apollo, declared via AppSpec YAML, isolated from each other and from cluster services, and limited to admin-authorized view mounts as the only data path. This enables compute-at-data analytics, AV scanning, and SIEM ingest without egressing backups, with vetting + isolation + view scope providing defense in depth.

Automation Stack

Helios and Marketplace solve the interactive operating model. The programmatic operating model is REST API v2 plus the language-specific wrappers — Terraform, Ansible, and PowerShell. At CCAE scale, automating policy and protection-group management is the only scalable path; manually clicking through hundreds of policies across dozens of clusters is both error-prone and untraceable.

REST API v2

All higher-level tools sit on top of the cluster’s REST API v2, supported on cluster versions 6.3.1+ [Source: https://developer.cohesity.com/]. The API covers protection groups, policies, sources, storage domains, views, alerts, recoveries, and tenants. Direct REST is the right choice when no higher-level wrapper exists yet, or for purpose-built integrations with ITSM (ServiceNow), SIEM (Splunk, Sentinel), or custom self-service portals.

Figure 13.3: REST API v2 surface map — auth and the principal resource hierarchy under the cluster control plane

graph TD
    Auth["/v2/mcm/access-tokens<br/>(Bearer token / Helios API key)"] --> Root["REST API v2 Root<br/>cluster 6.3.1+"]
    Root --> Policies["/data-protect/policies<br/>(RPO, retention, archive, replication)"]
    Root --> Groups["/data-protect/protection-groups<br/>(jobs bound to policies)"]
    Root --> Sources["/data-protect/sources<br/>(vCenter, M365, NAS, DBs)"]
    Root --> Views["/file-services/views<br/>(NFS/SMB/S3 namespaces)"]
    Root --> Storage["/storage-domains<br/>(dedup/encryption domains)"]
    Root --> Recoveries["/data-protect/recoveries<br/>(restore tasks)"]
    Root --> Alerts["/monitoring/alerts<br/>(events, severities)"]
    Root --> Reports["/reports<br/>(SLA, capacity, audit)"]
    Root --> Tenants["/multi-tenancy/tenants<br/>(orgs, RBAC)"]
    Groups -. "policy_id" .-> Policies
    Groups -. "source_id" .-> Sources

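Authentication uses a session token obtained from the cluster’s /public/accessTokens login endpoint (served under /irisservices/api/v1), or a Helios API key for fleet-wide calls. The session token is then presented as a bearer token on v2 calls. A simple example fetching all protection groups:

# Log in and capture the session token. The password is read from the
# COHESITY_PASSWORD environment variable ($PWD is the shell's working directory).
# -k skips TLS verification and is suitable for lab use only.
TOKEN=$(curl -sk -X POST https://cluster.example.com/irisservices/api/v1/public/accessTokens \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"'"$COHESITY_PASSWORD"'","domain":"LOCAL"}' \
  | jq -r .accessToken)

# List protection groups using the bearer token.
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://cluster.example.com/v2/data-protect/protection-groups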
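
For integrations written in Python rather than shell, the same two calls can be prepared with the standard library. This is a minimal sketch that only builds the requests, so it can be unit-tested without a cluster. The helper names are illustrative, and the login path is deliberately left as a caller-supplied parameter to be taken from the API reference for the cluster version in use.

```python
import json
import urllib.request


def build_login_request(base_url: str, login_path: str, username: str,
                        password: str, domain: str = "LOCAL") -> urllib.request.Request:
    """Prepare the POST that exchanges credentials for a session token."""
    body = json.dumps(
        {"username": username, "password": password, "domain": domain}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + login_path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_list_groups_request(base_url: str, token: str) -> urllib.request.Request:
    """Prepare the GET that lists protection groups with a bearer token."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v2/data-protect/protection-groups",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Each prepared request can be sent with urllib.request.urlopen; the login response is JSON, and the token is read from its accessToken field, as in the curl example. Unlike curl -k, urlopen verifies TLS certificates by default, which is the safer behavior outside a lab.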

Terraform Provider (`cohesity/cohesity`)

Published on the HashiCorp Terraform Registry as cohesity/cohesity and listed on the Cohesity Marketplace [Source: https://registry.terraform.io/providers/cohesity/cohesity/latest/docs], the Terraform provider is the right tool for declarative cluster-side configuration: storage domains, views, policies, protection groups, RBAC, replication targets. The configuration written against it, checked into Git, becomes the source of truth.

provider "cohesity" {
  cluster_vip      = "10.0.0.10"
  cluster_username = "admin"
  cluster_password = var.cohesity_password
}

Architects declare the full protection topology as code, run terraform plan to preview drift, and apply changes uniformly across environments. This is foundational for CI/CD-driven backup management and for keeping non-prod and prod policies aligned.

Ansible Collection (`cohesity.dataprotect`)

Installed with ansible-galaxy collection install cohesity.dataprotect, the Ansible collection is the right tool for source-side mutation: rolling agent installs across fleets, registering sources, and triggering on-demand jobs [Source: https://github.com/cohesity/ansible-collection]. Notable modules:

cohesity_uda_protection_group — register, remove, start, and stop universal data adapter protection groups [Source: https://galaxy.ansible.com/ui/repo/published/cohesity/dataprotect/].
Source/agent modules — install and register Cohesity backup agents on Windows, Linux, and macOS targets.

Ansible is push-based and idempotent; it integrates cleanly with Tower / Ansible Automation Platform (AAP) for RBAC and audit.

PowerShell Module

Cohesity publishes a cross-platform PowerShell module (Windows, Linux, macOS via PowerShell 7+) that wraps REST API v2 in cmdlets [Source: https://cohesity.github.io/cohesity-powershell-module/]. It is the natural choice for Windows-centric shops automating protection groups, policies, recoveries, and reporting from existing PowerShell-based tooling and runbooks.

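$cred = Get-Credential   # prompt for cluster credentials
Connect-CohesityCluster -Server cluster.example.com -Credential $cred
$goldPolicy = Get-CohesityProtectionPolicy -Names 'Gold-1h-30d-1y'
Get-CohesityProtectionJob | Where-Object { $_.policyId -eq $goldPolicy.id }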

Choosing the Right Tool — Comparison Table

Dimension	Terraform	Ansible	PowerShell	Raw REST API v2
Paradigm	Declarative IaC	Imperative + idempotent push	Imperative scripting	Direct HTTP
State management	Yes (`.tfstate`)	None (each run re-evaluates)	None	None
Best for	Cluster config: policies, groups, views, RBAC, replication	Source-side: agent installs, source registration, ad-hoc jobs	Windows ops, ad-hoc reporting, integration with existing PS tooling	ITSM/SIEM webhooks, custom portals, gap-fill where wrappers lag
CI/CD fit	Excellent (`plan`/`apply`, drift detection)	Good (AAP / Tower)	Moderate (Jenkins/PS pipelines)	Excellent (any HTTP-aware tool)
Drift detection	First-class	Re-runs converge but no diff	Manual	Manual
Skill curve	Medium (HCL, state)	Low–medium (YAML)	Low for Windows admins	High for non-trivial flows
Typical owner	Platform / SRE team	Server / source team	Windows ops team	Integration / dev team

Figure 13.4: Automation tool selection decision tree — choosing among Terraform, Ansible, PowerShell, and raw REST

flowchart TD
    Start{What are you<br/>automating?} --> Q1{Cluster-side config?<br/>policies, views,<br/>protection groups}
    Q1 -- Yes --> Q2{Need declarative<br/>state + drift detection?}
    Q2 -- Yes --> TF["Terraform<br/>cohesity/cohesity provider<br/>(.tfstate, plan/apply, Git)"]
    Q2 -- No --> Q5
    Q1 -- No --> Q3{Source-side push?<br/>agent installs,<br/>source registration}
    Q3 -- Yes --> AN["Ansible<br/>cohesity.dataprotect collection<br/>(idempotent, AAP/Tower)"]
    Q3 -- No --> Q4{Windows-native shop<br/>or ad-hoc reporting?}
    Q4 -- Yes --> PS["PowerShell Module<br/>(PS 7+, cross-platform,<br/>Connect-CohesityCluster)"]
    Q4 -- No --> Q5{Webhook from<br/>ITSM/SIEM, or gap<br/>in wrappers?}
    Q5 -- Yes --> REST["Raw REST API v2<br/>(curl, any HTTP client,<br/>ServiceNow/Splunk)"]
    Q5 -- No --> PS

The architect-level rule of thumb: Terraform for cluster-side IaC, Ansible for source-side push, PowerShell for Windows-native ops, REST when nothing else fits. All four are thin layers over REST API v2, so no tool boxes you in — each is a different ergonomic surface on the same control plane.

Worked Example: Terraform Module — Gold Policy + Protection Group

Here is a concrete CCAE-style worked example: a Terraform module that creates a “Gold” protection policy (1-hour RPO, 30-day local retention, 1-year archive) and a protection group bound to it for a set of VMware VMs. This is the kind of artifact that lives in Git, is reviewed in PRs, and is applied via CI to every cluster in the fleet to guarantee identical Gold semantics everywhere.

# modules/gold-vmware/main.tf
terraform {
  required_providers {
    cohesity = {
      source  = "cohesity/cohesity"
      version = "~> 1.2"
    }
  }
}

variable "cluster_vip"      { type = string }
variable "cluster_username" { type = string }
variable "cluster_password" {
  type      = string
  sensitive = true
}
variable "vcenter_source_id" { type = number }
variable "vm_object_ids"     { type = list(number) }
variable "archive_target_id" { type = number }

provider "cohesity" {
  cluster_vip      = var.cluster_vip
  cluster_username = var.cluster_username
  cluster_password = var.cluster_password
}

# 1. Gold Policy: 1h RPO, 30d local, 1y archive
resource "cohesity_protection_policy" "gold" {
  name        = "Gold-1h-30d-1y"
  description = "Tier 1: 1h RPO, 30d local retention, 1y archive"

  backup_policy {
    regular {
      incremental {
        schedule {
          unit            = "Hours"
          hour_schedule {
            frequency = 1
          }
        }
      }
      retention {
        unit     = "Days"
        duration = 30
      }
    }
  }

  remote_target_policy {
    archival_targets {
      target_id = var.archive_target_id
      schedule {
        unit      = "Weeks"
        frequency = 1
      }
      retention {
        unit     = "Years"
        duration = 1
      }
    }
  }
}

# 2. Protection Group bound to the Gold Policy
resource "cohesity_protection_group" "tier1_vms" {
  name        = "Tier1-VMware-Gold"
  policy_id   = cohesity_protection_policy.gold.id
  environment = "kVMware"

  vmware_params {
    source_id  = var.vcenter_source_id
    object_ids = var.vm_object_ids
    app_consistent_snapshot = true
    indexing_policy {
      enable_indexing = true
    }
  }

  start_time {
    hour   = 22
    minute = 0
  }
}

output "policy_id" { value = cohesity_protection_policy.gold.id }
output "group_id"  { value = cohesity_protection_group.tier1_vms.id }

A consuming root module then wires it up per cluster:

module "frankfurt_gold" {
  source            = "./modules/gold-vmware"
  cluster_vip       = "10.20.0.10"
  cluster_username  = "admin"
  cluster_password  = var.frankfurt_password
  vcenter_source_id = data.cohesity_source.frankfurt_vcenter.id
  vm_object_ids     = data.cohesity_vmware_vms.frankfurt_tier1.ids
  archive_target_id = data.cohesity_external_target.frankfurt_archive.id
}

module "singapore_gold" {
  source = "./modules/gold-vmware"
  # ...same variables, different cluster...
}

The architectural payoff: every Cohesity cluster in the fleet has a Gold-1h-30d-1y policy that means exactly the same thing — 1-hour incremental, 30-day local retention, 1-year archive, app-consistent VMware snapshots, indexed for search, weekly archive cadence. A terraform plan after any drift instantly shows the deviation. A change request to extend retention to 45 days is a one-line PR that cascades to every cluster on merge.
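
Conceptually, drift detection is a comparison between the declared policy and what the cluster reports. The sketch below illustrates that idea with nested dictionaries. It is not how Terraform is implemented, and the policy shape is a simplified, hypothetical stand-in for the real schema.

```python
def diff_policy(desired: dict, actual: dict, path: str = "") -> list:
    """Return (path, desired_value, actual_value) for every differing leaf."""
    drift = []
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}.{key}" if path else key
        want, have = desired.get(key), actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested blocks such as retention or schedule.
            drift.extend(diff_policy(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift


if __name__ == "__main__":
    declared = {
        "name": "Gold-1h-30d-1y",
        "retention": {"unit": "Days", "duration": 30},
        "archive": {"unit": "Years", "duration": 1},
    }
    # Someone extended local retention by hand on one cluster.
    reported = {
        "name": "Gold-1h-30d-1y",
        "retention": {"unit": "Days", "duration": 45},
        "archive": {"unit": "Years", "duration": 1},
    }
    for where, want, have in diff_policy(declared, reported):
        print(f"DRIFT {where}: declared={want!r} actual={have!r}")
```

Reporting the dotted path to each differing leaf mirrors what a reviewer needs from a plan: which attribute changed, what Git says it should be, and what the cluster currently holds.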

Key Takeaway: Terraform owns declarative cluster-side IaC (policies, protection groups, views, RBAC); Ansible owns source-side push (agents, source registration); PowerShell handles Windows-native ops; raw REST fills gaps and powers ITSM/SIEM webhooks. All four are wrappers over REST API v2 — pick the right ergonomics for the job, and treat backup configuration as code in Git with CI/CD gates.

Chapter Summary

Helios is the architectural answer to fleet sprawl. As a SaaS control plane with outbound-HTTPS-only connectivity, it converts an arbitrary number of independent Cohesity clusters into one managed estate with shared dashboards, federated RBAC, global search, anomaly detection, and centralized policy authoring. For environments where SaaS is not acceptable, Helios Self-Managed delivers the same model on customer infrastructure for dark sites and air-gapped deployments. DMaaS extends the same Helios surface into a fully managed subscription model where Cohesity operates the cluster — region selection pins data residency, and the customer consumes backup as an outcome rather than an appliance.

The Marketplace brings third-party compute to where the data already lives: Apollo (the cluster’s Docker container runtime) executes apps declared by Kubernetes-style AppSpec YAML, isolated by container boundaries, scoped to admin-authorized view mounts, and optionally air-gapped from the internet. Vetted apps like SentinelOne, ClamAV, Splunk, and Imanis are installed fleet-wide through Helios; custom apps go through an SDK + AppSpec + Cohesity vetting pipeline.

The automation stack is the programmatic counterpart: REST API v2 is the bedrock, and Terraform (cohesity/cohesity), Ansible (cohesity.dataprotect), and the PowerShell module are the language-specific wrappers. Architects pair Terraform for declarative cluster-side IaC with Ansible for source-side push, use PowerShell for Windows-native operational work, and call REST directly when nothing else fits — all backed by Git, CI/CD, and policy review. The worked Terraform module for a Gold policy and protection group illustrates how a single, version-controlled artifact can guarantee identical SLA semantics across every cluster in a global fleet.

Carrying forward: in Chapter 14 we use Helios as the lens for monitoring, alerting, and triage — the operational consequence of having all this telemetry centralized.

Key Terms

Helios — Cohesity’s SaaS control plane that aggregates management of all clusters into a single pane of glass; communicates with clusters over outbound HTTPS only.
Helios Self-Managed — customer-hosted variant of Helios for dark sites and air-gapped environments, delivering equivalent fleet management without SaaS dependence.
DMaaS (Data Management as a Service) — Cohesity-operated, subscription-delivered DataProtect (and adjacencies like FortKnox) consumed through Helios with region-pinned data residency.
Marketplace — the storefront and delivery mechanism for first- and third-party apps (SentinelOne, ClamAV, Splunk, Sophos, Imanis) that run directly on Cohesity clusters via the App Framework.
App framework / Apollo / AppSpec — the container runtime (Apollo, Docker-based) and Kubernetes-style YAML descriptor (AppSpec) that together package, deploy, and isolate Marketplace apps with declared resources, view mounts, and network policy.
REST API v2 — the canonical Cohesity programmatic interface (cluster 6.3.1+) covering protection groups, policies, sources, views, alerts, and recoveries; the foundation under all higher-level wrappers.
Ansible (cohesity.dataprotect) — Cohesity’s official Ansible collection for push-based, idempotent source-side automation (agent installs, source registration, on-demand jobs).
Terraform (cohesity/cohesity) — Cohesity’s HashiCorp-registry Terraform provider for declarative, state-managed cluster-side IaC (policies, protection groups, views, RBAC, replication targets).
PowerShell module — cross-platform (PS 7+) Cohesity cmdlet library wrapping REST API v2; the natural fit for Windows-centric shops and ad-hoc operational scripting.
DataHawk / FortKnox — Helios-resident security capabilities: ML-based ransomware detection / classification (DataHawk) and managed cyber-vault immutable cloud copy (FortKnox), neither of which has an on-cluster equivalent.

Chapter 14: Performance, Monitoring, and Troubleshooting

If the previous chapters explained how to design and operate a Cohesity estate when everything is healthy, this chapter is about what to do when it is not. A backup that ran in 45 minutes last week is now taking six hours. A replication target is two days behind. A node has gone dark and the cluster is alerting about quorum risk. The CCAE-level architect is expected to do more than restart services and hope — they are expected to localize bottlenecks, generate the right diagnostic artifacts, integrate alerts into the customer’s operational tooling, and engage Cohesity Support effectively when escalation is warranted.

The discipline that ties all of this together is structured triage: walk the data path end to end, measure each segment, and find the lowest sustained throughput point. Whether the symptom is a slow backup, a lagging replica, or a degraded recovery, the investigation always reduces to the same question — where in the chain is the bottleneck, and what evidence proves it?

Learning Objectives

By the end of this chapter you will be able to:

Diagnose performance bottlenecks across the source -> network -> ingest -> NVRAM -> writer data path and identify which segment is the limiter.
Use Cohesity statistics, iris_cli, the Siren UI, and Helios alerts to triage incidents methodically.
Generate scoped log bundles and engage Cohesity Support with the artifacts they need on the first round-trip.
Configure Helios alerting across email/SMTP, SNMP, syslog, and webhook channels, including SMTP validation.
Recognize and respond to common failure modes — backup re-runs, replication lag, disk and node failures, and network partitions — and design resiliency for each.

Performance Bottleneck Analysis

A Cohesity backup is a multi-stage assembly line. Bytes are read from a source (a VM disk, a NAS share, a database), pushed across the network through a proxy or agent, ingested by a Cohesity node, journaled into NVRAM, and finally destaged onto SSD or HDD by the writer service. Each stage has a maximum sustainable throughput, and the slowest stage sets the throughput of the whole line. That is the definition of a bottleneck, and finding it is the architect’s primary skill in this chapter.

The Assembly-Line Analogy

Picture a five-station automotive assembly line. Station 1 attaches the chassis (source read), station 2 ships sub-assemblies between buildings (network), station 3 receives parts at the factory dock (cluster ingest), station 4 stages parts into a buffer area (NVRAM), and station 5 mounts them on the vehicle (writer/disk). If station 3 can only accept one chassis per minute while station 1 can produce three, you will see chassis piling up at the dock — but if you stand at station 5 and stare at the empty workstation, you will conclude (incorrectly) that the line is “slow.” The fix is not to add more workers at the end; it is to identify which station is the actual choke point and rebalance there.

Cohesity bottlenecks behave identically. A symptom at the writer (high write latency, growing queues) does not mean the writer is the problem — it may mean the source is feeding faster than the cluster can persist, or that NVRAM destage is back-pressuring upstream. Without measurements at each station, you are guessing.

The Five-Stage Data Path

[Source App/VM] -> [Network/Proxy] -> [Cluster Ingest] -> [NVRAM Journal] -> [Writer -> SSD/HDD]
     Stage 1            Stage 2            Stage 3              Stage 4            Stage 5

Each stage has characteristic symptoms when it is the limiter:

Stage	Symptom Pattern	Telltale Metric
1. Source	Low source read MB/s, idle network, idle writers	Source read latency high; CBT/RCT slow; storage array saturated
2. Network	Source ready, cluster idle, low end-to-end MB/s	iperf below link speed; retransmits; wrong VLAN/uplink
3. Ingest	Network saturated but cluster CPU/proxy queues high	Proxy concurrency exhausted; hypervisor NBD/HotAdd contention
4. NVRAM	Bursty throughput, periodic stalls, destage backpressure	NVRAM journal utilization high, destage queue growing
5. Writer	Network healthy, NVRAM filling, write latency rising	Writer latency, disk queue depth, SSD/HDD saturation

The most common Cohesity-side culprit is the target disk writer spending excessive time persisting data to the underlying tier — high writer latency or growing write queues even while network ingest looks healthy. NVRAM behavior is the leading indicator: incoming write batches land in NVRAM-backed journals before destaging, so saturating NVRAM or seeing destage backpressure is a strong signal that the cluster (not the source) is the limiter [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].

Figure 14.1: Bottleneck triage decision tree across the five-stage data path

flowchart TD
    Start([Slow backup symptom]) --> NetCheck{iperf near line rate?}
    NetCheck -->|No| Stage2[Stage 2: Network<br/>Fix VLAN / uplink / MTU]
    NetCheck -->|Yes| SrcCheck{Source MB/s and<br/>writer MB/s both low<br/>and balanced?}
    SrcCheck -->|Yes| Stage1[Stage 1: Source<br/>Investigate array,<br/>hypervisor, agent]
    SrcCheck -->|No| NvramCheck{Writer latency high<br/>or NVRAM saturated?}
    NvramCheck -->|Yes| Stage45[Stages 4-5: NVRAM / Writer<br/>Cluster-side limiter:<br/>capacity, sizing]
    NvramCheck -->|No| ConcCheck{Per-stream healthy<br/>but total throughput low?}
    ConcCheck -->|Yes| Stage3[Stage 3: Ingest / Concurrency<br/>Add proxies, raise concurrency,<br/>split groups]
    ConcCheck -->|No| Bundle[Generate Siren log bundle<br/>Engage Cohesity Support]
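
The decision tree in Figure 14.1 can be expressed as a small function for use in runbooks or alert enrichment. This is a sketch under stated assumptions: the thresholds (90% of line rate, 15% balance tolerance, 50% of expected throughput) are illustrative defaults, the field names are hypothetical, and the code adds one condition the figure leaves implicit, namely that a source-limited job shows unhealthy per-stream throughput, which distinguishes it from a concurrency-limited one.

```python
from dataclasses import dataclass


@dataclass
class JobMetrics:
    iperf_gbps: float          # measured iperf3 throughput on the backup path
    link_gbps: float           # nominal link speed
    source_mbps: float         # source read rate reported for the job
    writer_mbps: float         # cluster write rate reported for the job
    expected_mbps: float       # historical baseline for this job
    writer_latency_high: bool
    nvram_saturated: bool
    per_stream_healthy: bool   # each stream runs near its usual rate


def triage(m: JobMetrics, line_rate_fraction: float = 0.9,
           balance_tolerance: float = 0.15, low_fraction: float = 0.5) -> str:
    """Walk the Figure 14.1 checks in order and name the suspected stage."""
    if m.iperf_gbps < m.link_gbps * line_rate_fraction:
        return "Stage 2: Network"

    peak = max(m.source_mbps, m.writer_mbps, 1.0)
    balanced = abs(m.source_mbps - m.writer_mbps) <= balance_tolerance * peak
    total_low = max(m.source_mbps, m.writer_mbps) < low_fraction * m.expected_mbps

    if balanced and total_low and not m.per_stream_healthy:
        return "Stage 1: Source"
    if m.writer_latency_high or m.nvram_saturated:
        return "Stages 4-5: NVRAM / Writer"
    if m.per_stream_healthy and total_low:
        return "Stage 3: Ingest / Concurrency"
    return "Escalate: generate log bundle and engage Support"
```

The checks run in the same order as the figure, so the first matching stage wins; reordering them would change which evidence is considered decisive.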

Establishing a Network Baseline with iperf

Before blaming Cohesity, prove the substrate works. The standard baseline test is iperf3 between a source host (or backup proxy) and one or more Cohesity nodes. Run it bidirectionally over the same NIC and VLAN that backup traffic uses. If iperf reports 9.4 Gb/s on a nominal 10 GbE link, the network is healthy. If it reports 940 Mb/s, you have just found a 1 GbE path masquerading as 10 GbE, and no amount of backup tuning will fix it.

A real-world Cohesity case showed dramatically improved backup times after forcing all Cohesity backup traffic onto a dedicated 10 GbE VLAN and a dedicated vSphere VMkernel port, instead of letting it traverse multiple shared paths [Source: https://www.cohesity.com/blogs/optimizing-cohesity-and-vsphere-networking/]. The vSwitch teaming policy and VMkernel binding (NBD/HotAdd traffic) frequently route backups onto the wrong uplink without any UI indication. iperf surfaces this in 30 seconds.

A 1 GbE versus 10 GbE path makes a 10x difference in achievable ingest, which alone explains a large fraction of “slow job” tickets.
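
The baseline check is easy to automate because iperf3 can emit JSON (the -J flag). The sketch below reads the receiver-side throughput from a TCP test report and compares it with the nominal link speed. The 90% "healthy" threshold is an assumption, and the function names are illustrative.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput, in Gb/s, from `iperf3 -J` output (TCP test)."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def classify_link(measured_gbps: float, nominal_gbps: float,
                  healthy_fraction: float = 0.9) -> str:
    """Label the path healthy if it reaches the given fraction of nominal speed."""
    ratio = measured_gbps / nominal_gbps
    if ratio >= healthy_fraction:
        return "healthy"
    return f"degraded ({ratio:.0%} of nominal)"


if __name__ == "__main__":
    # Trimmed stand-in for a real report; only the fields read above are present.
    sample = json.dumps({"end": {"sum_received": {"bits_per_second": 940e6}}})
    rate = received_gbps(sample)
    print(f"{rate:.2f} Gb/s on a 10 GbE link: {classify_link(rate, 10)}")
```

Running this from the proxy host against several node IPs, in both directions, turns the 30-second manual check into a repeatable pre-triage step.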

The SOBR Performance Trap (140 vs 20-30 MB/s)

When third-party tools like Veeam present Cohesity as a Scale-Out Backup Repository (SOBR), throughput can collapse — from roughly 140 MB/s on a single-node Cohesity repository down to 20-30 MB/s once SOBR is enabled with the default placement policy. The remediation is to set the SOBR placement policy to Performance rather than Data Locality, restoring parallel ingest across nodes [Source: https://forums.veeam.com/veeam-backup-replication-f2/veeam-9-5-and-cohesity-t49454.html].

This is a textbook case where the cluster looks underused and the source looks healthy, yet end-to-end throughput is dismal. The bottleneck is in the upstream construct’s placement logic, not in Cohesity itself.

SOBR Placement Policy	Behavior	Throughput
Data Locality (default)	Pin a backup chain to one extent	~20-30 MB/s — single node serializes
Performance	Stripe across SOBR extents in parallel	~140 MB/s — nodes work in parallel

Other Common Source-Side and Path Limiters

Antivirus on gateway / proxy servers. Physical-agent proxies and NAS scan hosts inspecting in-flight backup streams will throttle throughput severely. Exclude Cohesity processes and the staging directories from on-access scanning.
NAS incremental performance. Historically slow on certain DataProtect versions; reportedly improved in 6.4.0. Always cross-check release notes during triage [Source: https://www.peerspot.com/questions/what-needs-improvement-with-cohesity-dataprotect].
S3/object target tuning. For cloud-tier targets, increasing the multipart chunk size (e.g., the AWS CLI multipart_chunksize setting or s3cmd’s --multipart-chunk-size-mb option) yields significant gains when the cloud-write pipeline is the choke point.
Hypervisor CBT/RCT regressions. When CBT is reset or corrupted, the next backup falls back to a full read of the disk, dropping effective throughput by an order of magnitude.

Worked Example: A Backup Running at 20 MB/s

A protection group that normally completes 4 TB in 6 hours (~190 MB/s aggregate) is now running at 20 MB/s and projecting a runtime of well over two days (roughly 55 to 60 hours). Walk the path:

Establish a baseline. Run iperf3 from the vSphere proxy host to three Cohesity node primary IPs. Result: 9.3 Gb/s sustained. Network substrate is healthy. Eliminate Stage 2 as the prime suspect.
Inspect job stats in Helios and iris_cli. Source read MB/s is 22, writer MB/s is 21, writer latency is normal, NVRAM is not saturated. Reads and writes are balanced and low — the cluster is not being fed. The bottleneck is upstream of ingest.
Check the source. Open vCenter; the protected VM disks live on a datastore whose backing array shows queue depth 95% and read latency 40 ms. Stage 1 is the limiter — the source array cannot deliver bytes faster.
Check for proxy/path issues. Confirm the backup proxy is using NBD over the dedicated 10 GbE VMkernel and not falling back to NBDSSL on the management network. It is correct.
Look for known-issue overlays. Review release notes for the running DataProtect build. No CBT regression listed.
Conclusion and remediation. The bottleneck is the source storage array under contention from a non-backup workload (a SQL re-index running concurrently). Reschedule the protection group outside the re-index window, or move the affected VMs to a less contended datastore. Do not collect a Cohesity log bundle — Cohesity is not the limiter and Support cannot accelerate someone else’s storage array.

The lesson: bottleneck = lowest sustained throughput point in the chain. The metric divergence (low read, low write, idle queues) localized the problem in two minutes, before any logs were collected.
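
The "lowest sustained throughput point" rule can be sketched in a few lines. In the example below, the source figure (22 MB/s) and the network figure (9.3 Gb/s) come from the worked example; the ingest, NVRAM, and writer capacities are invented placeholders for illustration only, and the function names are hypothetical.

```python
def find_bottleneck(stage_mb_per_s: dict) -> tuple:
    """Return (stage, MB/s) for the stage with the lowest sustainable throughput."""
    stage = min(stage_mb_per_s, key=stage_mb_per_s.get)
    return stage, stage_mb_per_s[stage]


def projected_hours(data_tb: float, mb_per_s: float) -> float:
    """Projected runtime in hours for data_tb terabytes (decimal) at mb_per_s."""
    return data_tb * 1_000_000 / mb_per_s / 3600


# What each stage could sustain during the incident, in MB/s.
stages = {
    "source": 22,                # measured source read rate
    "network": 9.3 * 1000 / 8,   # iperf3 result of 9.3 Gb/s converted to MB/s
    "ingest": 900,               # placeholder, not a measured value
    "nvram": 900,                # placeholder, not a measured value
    "writer": 900,               # placeholder, not a measured value
}

stage, rate = find_bottleneck(stages)
print(f"bottleneck: {stage} at {rate} MB/s")
print(f"4 TB at 20 MB/s: {projected_hours(4, 20):.1f} h")
```

The whole pipeline runs at the rate of its slowest stage, so raising any other stage's capacity would not shorten this job.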

Job Concurrency and Queue Analysis

Even when individual stages are healthy, concurrency can starve a job. A protection group with 200 VMs and only four backup proxies, each handling one stream at a time, will queue 196 VMs while four run. Adding proxies, increasing per-proxy concurrency settings, or splitting the protection group into parallel groups can lift end-to-end completion time dramatically without changing per-stream MB/s.
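
The queueing effect is easy to model. This is a rough sketch, assuming every VM takes the same time to back up and that slots are filled in waves; the 15-minute per-VM figure and the function name are illustrative assumptions.

```python
import math


def completion_hours(vm_count: int, proxies: int,
                     streams_per_proxy: int, hours_per_vm: float) -> float:
    """Estimate wall-clock time when VMs are processed in waves of parallel slots."""
    slots = proxies * streams_per_proxy
    waves = math.ceil(vm_count / slots)
    return waves * hours_per_vm


# 200 VMs, 15 minutes each, per-stream speed unchanged across the scenarios.
print(completion_hours(200, proxies=4, streams_per_proxy=1, hours_per_vm=0.25))
print(completion_hours(200, proxies=4, streams_per_proxy=4, hours_per_vm=0.25))
print(completion_hours(200, proxies=8, streams_per_proxy=4, hours_per_vm=0.25))
```

Under these assumptions the same 200 VMs finish in 12.5 hours with four single-stream proxies and in 3.25 hours with four streams per proxy, although no individual stream got faster.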

For SmartFiles workloads, I/O profiling shifts from MB/s to IOPS and latency. A SmartFiles SMB share serving small-file metadata-heavy workloads will saturate at low MB/s but high IOPS. The fix is rarely “more bandwidth”; it is more nodes, more flash, or a redesign of the access pattern.
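
The relationship between IOPS and MB/s explains why a small-file share looks idle on a bandwidth graph. A minimal sketch with illustrative numbers (the IOPS and I/O sizes below are assumptions, not measured SmartFiles figures):

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Throughput implied by an IOPS rate at a given I/O size (decimal units)."""
    return iops * io_size_kb / 1000


# Small-file, metadata-heavy workload: many operations, little bandwidth.
print(throughput_mb_per_s(iops=50_000, io_size_kb=4))     # 200.0
# Large sequential workload: few operations, a lot of bandwidth.
print(throughput_mb_per_s(iops=2_000, io_size_kb=1024))   # 2048.0
```

The first workload issues 25 times more operations than the second yet moves about a tenth of the data, so the limiter is operation rate and latency rather than link speed.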

Key Takeaway: Bottlenecks are localized by walking the source -> network -> ingest -> NVRAM -> writer path and finding the lowest sustained throughput point. Always establish a network baseline with iperf, watch for SOBR placement-policy traps (140 vs 20-30 MB/s), and let metric divergence — not assumptions — point to the limiter.

Monitoring and Alerting

Once you understand bottlenecks in principle, you need a continuous monitoring fabric so that you find the next incident before the customer does. Cohesity’s primary alerting plane is Helios, which aggregates events from every registered cluster and fans them out to email, SNMP, syslog, and webhook channels.

Alert Categories and Severities

Helios groups alerts by category (cluster health, protection, replication, capacity, security, hardware, etc.) and severity (Critical, Warning, Info). The severity dimension is the architect’s main lever against alert fatigue: route Critical events to the on-call channel, Warning to the operations queue, Info to a SIEM archive [Source: https://docs.cohesity.com/baas/data-protect/alerts/alerts.htm].

Severity	Typical Examples	Recommended Routing
Critical	Node down, quorum risk, replication failure, data unavailable	Email + SNMP + PagerDuty webhook
Warning	Disk predictive failure, capacity > 80%, job missed SLA	Email + Syslog (SIEM)
Info	Job completed, snapshot expired, configuration change	Syslog only (SIEM archive)
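
The routing table above can be captured as a small lookup, which is handy when reviewing a design for gaps. This is only a planning sketch: the channel labels are shorthand for the table entries, and the fallback for unknown severities is an assumed design choice, not documented Helios behavior.

```python
ROUTING = {
    "Critical": ["email", "snmp", "webhook"],
    "Warning": ["email", "syslog"],
    "Info": ["syslog"],
}


def channels_for(severity: str) -> list:
    """Return the delivery channels planned for a severity.

    Unknown severities fall back to the Critical route so that nothing is
    silently dropped (an assumption made for this sketch).
    """
    return ROUTING.get(severity, ROUTING["Critical"])


for sev in ("Critical", "Warning", "Info"):
    print(sev, "->", ", ".join(channels_for(sev)))
```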

Configuring Alert Notification Rules

In Helios (or DataProtect as a Service), the workflow is:

Navigate to Health > Notification.
Click Create > New Alert Notification Rule.
Set the rule Notification Name and Filters (category, severity, alert name, cluster scope).
Choose the delivery method: email, SNMP, syslog, or webhook.
For email, specify To, Cc, Subject.
Save. Matching alerts trigger automatically [Source: https://docs.cohesity.com/baas/data-protect/alerts/configure-alert-notification-rule.htm].

The same alert can fan out to multiple channels: create one rule per channel with the same filter set, or use different filters per channel for tiered routing. Programmatically, the endpoint is createAlertNotificationRule on the Helios API [Source: https://developers.cohesity.com/v1-helios-latest/reference/createalertnotificationrule-1].
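
The "one rule per channel" pattern can be reasoned about with a local model before touching the API. The sketch below does not call Helios and does not reproduce the createAlertNotificationRule schema; the class, its fields, and the alert dictionary keys are hypothetical names used only to show how filters decide the fan-out.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationRule:
    name: str
    channel: str
    severities: set = field(default_factory=set)   # empty set matches any severity
    categories: set = field(default_factory=set)   # empty set matches any category

    def matches(self, alert: dict) -> bool:
        sev_ok = not self.severities or alert["severity"] in self.severities
        cat_ok = not self.categories or alert["category"] in self.categories
        return sev_ok and cat_ok


def fan_out(alert: dict, rules: list) -> list:
    """Return the channels of every rule whose filters match the alert."""
    return [rule.channel for rule in rules if rule.matches(alert)]


rules = [
    NotificationRule("crit-email", "email", severities={"Critical"}),
    NotificationRule("crit-webhook", "webhook", severities={"Critical"}),
    NotificationRule("all-to-siem", "syslog"),
]

alert = {"severity": "Critical", "category": "replication"}
print(fan_out(alert, rules))
```

Two rules share the Critical filter and differ only in channel, while the syslog rule has no filter and therefore receives everything.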

Figure 14.2: Helios alerting fan-out across notification channels to recipients

sequenceDiagram
    participant Cluster as Cohesity Cluster
    participant Helios as Helios Aggregator
    participant Rule as Notification Rule<br/>(category + severity filters)
    participant SMTP as SMTP / Email
    participant SNMP as SNMP Trap (NMS)
    participant Syslog as Syslog / SIEM
    participant Webhook as Webhook (HTTPS JSON)
    participant Recipient as On-call / Ops / Security

    Cluster->>Helios: Cluster event<br/>(node down, replication fail, etc.)
    Helios->>Rule: Match against filters
    Rule-->>Helios: Severity = Critical
    par Fan-out to all matching channels
        Helios->>SMTP: Email (To/Cc/Subject)
        SMTP->>Recipient: Inbox notification
    and
        Helios->>SNMP: Trap with Cohesity MIB OIDs
        SNMP->>Recipient: NMS event
    and
        Helios->>Syslog: Forward event
        Syslog->>Recipient: SIEM correlation
    and
        Helios->>Webhook: POST JSON payload
        Webhook->>Recipient: PagerDuty / ServiceNow / SOAR
    end

Email/SMTP Configuration

Email delivery requires the cluster (or Helios) to know how to reach an SMTP relay. The relevant API surface is:

API	Purpose	Notes
`PUT /v2/clusters/smtp`	Update SMTP config	Server, port (465 SMTPS, 587 STARTTLS), auth credentials. Requires `CLUSTER_MODIFY` privilege.
`GET /v2/clusters/smtp`	Retrieve SMTP config	Audit current settings.
`POST /validate`	Test SMTP	Send a test message to a recipient address [Source: https://developers.cohesity.com/v1-helios-latest/reference/validatesmtpconfiguration].

Always run validate after changes. Silent SMTP relay breakage — expired credentials, blocked outbound 587, certificate chain failure — is one of the most common causes of “we never got the alert” tickets. The validate endpoint catches it before an incident does [Source: https://developers.cohesity.com/v1-cluster-7.3/reference/updatesmtpconfiguration].
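
Independently of the cluster-side validate call, the relay itself can be probed from an admin workstation with Python's standard smtplib. This is a hedged sketch, not the Cohesity validation logic: it assumes port 465 means implicit TLS and port 587 means STARTTLS (as in the table above), and the host and credentials are supplied by the caller.

```python
import smtplib
import ssl


def smtp_mode(port: int) -> str:
    """Map a relay port to the TLS mode this sketch assumes for it."""
    if port == 465:
        return "implicit-tls"
    if port == 587:
        return "starttls"
    return "plain"


def check_relay(host: str, port: int, username: str = None,
                password: str = None, timeout: float = 10.0) -> None:
    """Connect, negotiate TLS, optionally log in, and send NOOP.

    Raises an exception on failure: OSError for a blocked port or timeout,
    ssl.SSLError for a certificate problem, and
    smtplib.SMTPAuthenticationError for rejected credentials.
    """
    context = ssl.create_default_context()
    mode = smtp_mode(port)
    if mode == "implicit-tls":
        client = smtplib.SMTP_SSL(host, port, timeout=timeout, context=context)
    else:
        client = smtplib.SMTP(host, port, timeout=timeout)
    with client:
        if mode == "starttls":
            client.starttls(context=context)
        if username and password:
            client.login(username, password)
        client.noop()
```

Each of the three silent-breakage causes named above surfaces as a distinct exception type, which makes the probe useful in a scheduled health check.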

SNMP Trap Integration

For customers with mature NMS environments (SolarWinds, OpenNMS, IBM Netcool), SNMP is the preferred channel. Review the current trap configuration with getHeliosSnmpAlertsConfig, then tie SNMP delivery to alert filters through createAlertNotificationRule [Source: https://developers.cohesity.com/v1-helios-latest/reference/getheliossnmpalertsconfig-1]. When configuring trap targets, supply management station IPs, community/auth depending on SNMP version, and the trap filters. Validate by triggering a known alert and confirming the NMS decodes it correctly using the Cohesity MIB.

Syslog and SIEM

Syslog is the standard integration path for SIEM platforms — Splunk, QRadar, Sentinel, Chronicle. Configure through createAlertNotificationRule with the syslog target (server, port, optional facility). For full coverage, pair alert syslog forwarding with audit log forwarding so security teams can correlate operational events with administrative actions.
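
When writing SIEM parsing rules it helps to know what priority value to expect. In the syslog protocol the PRI field is facility times 8 plus severity. The sketch below applies that formula; the mapping from alert severities to syslog severities and the choice of the local0 facility are assumptions for illustration, since how a given cluster tags its messages depends on its configuration.

```python
# Standard syslog severity codes: 2 = critical, 4 = warning, 6 = informational.
SYSLOG_SEVERITY = {"Critical": 2, "Warning": 4, "Info": 6}
LOCAL0 = 16  # standard facility code for local0


def syslog_pri(alert_severity: str, facility: int = LOCAL0) -> int:
    """Compute the syslog PRI value as facility * 8 + severity."""
    return facility * 8 + SYSLOG_SEVERITY[alert_severity]


for sev in ("Critical", "Warning", "Info"):
    print(f"{sev}: <{syslog_pri(sev)}>")
```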

Webhook Integration

Webhook output ships a structured JSON payload — alertname, severity, cluster identifying info, and other context — to an HTTPS endpoint. This is the integration point for modern tooling: PagerDuty, ServiceNow, Slack/Teams via custom routers, or SOAR platforms like Cortex XSOAR.
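
A webhook receiver usually does little more than parse the body and decide whether to page. The sketch below shows that step in isolation. The JSON field names (alertName, severity, clusterName) are hypothetical stand-ins; check an actual payload from your environment before relying on them.

```python
import json
from typing import Optional


def page_from_webhook(raw_body: bytes) -> Optional[dict]:
    """Turn a webhook body into a paging request, or None if no page is needed.

    Only Critical alerts produce a page in this sketch.
    """
    alert = json.loads(raw_body)
    if alert.get("severity") != "Critical":
        return None
    name = alert.get("alertName", "unknown alert")
    cluster = alert.get("clusterName", "unknown cluster")
    return {"summary": f"{name} on {cluster}", "severity": "critical"}


body = json.dumps({
    "alertName": "NodeDown",          # hypothetical field and value
    "severity": "Critical",
    "clusterName": "dal-prod-01",     # hypothetical field and value
}).encode()
print(page_from_webhook(body))
```

Using .get with defaults keeps the receiver from crashing if a field is missing, which matters because a failed receiver is another way to never get the alert.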

Validation Pattern (Memorize This)

Configure SMTP/SNMP/syslog/webhook targets at the cluster or Helios level.
Create an alert notification rule with tight filters (start with one severity).
Trigger or simulate a matching event.
Confirm receipt at the email inbox / NMS / syslog server / webhook endpoint.
Run the explicit POST /validate endpoint for SMTP to catch silent relay breakage.

Capacity and SLA Reports

Beyond alerts, Helios provides SLA reports showing protection compliance per protection group, source, and cluster — what percentage of protected objects met their RPO over a reporting window. SLA reports are the artifact a CISO or IT director will demand quarterly; they are also the leading indicator of systemic problems that no single alert reveals. A protection group that drifts from 99.5% to 96% over six weeks is missing windows even though every individual job “succeeded” eventually after retries.
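
The drift described above is simple to detect once weekly compliance figures are exported. A minimal sketch follows: the weekly series is invented to match the 99.5% to 96% example, the 1-point threshold is an arbitrary assumption, and the function names are illustrative rather than part of any Helios report API.

```python
def compliance_pct(objects_meeting_rpo: int, objects_protected: int) -> float:
    """Percentage of protected objects that met their RPO in the window."""
    return 100.0 * objects_meeting_rpo / objects_protected


def is_drifting(weekly_pct: list, max_drop: float = 1.0) -> bool:
    """Flag a series whose latest value fell more than max_drop points below its first."""
    return weekly_pct[0] - weekly_pct[-1] > max_drop


weekly = [99.5, 99.1, 98.4, 97.6, 96.8, 96.0]  # six illustrative weekly readings
print(compliance_pct(1990, 2000))
print(is_drifting(weekly))
```

No single week in that series would trigger a Critical alert, yet the trend is unmistakable when the weeks are compared.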

Key Takeaway: Helios fans alerts to email, SNMP, syslog, and webhook channels via filterable notification rules. Configure SMTP with PUT /v2/clusters/smtp, always run the validate endpoint to catch silent relay failures, and tier severity routing to prevent alert fatigue. SLA reports surface chronic drift that individual alerts miss.

Logs and Diagnostics

Alerts tell you something is wrong; logs tell you what. Cohesity’s diagnostic story revolves around three artifacts: the Siren UI for log bundle generation, the iris_cli command-line for cluster-side queries and management, and the time capsule directory where bundles are staged.

iris_cli: The Supported CLI

iris_cli is the supported CLI surface for cluster management operations. Authenticate with:

iris_cli -server <cluster-IP> -username=admin -password=<pwd>

It is documented in the Cohesity CLI Reference Guide (e.g., the 7.3.2 release) and is the same admin context used for SSL certificate updates, protection management, and many support workflows [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf]. For the exam, iris_cli is the CLI to name when a question asks about cluster-side actions [Source: https://nshielddocs.entrust.com/interops-docs/cohesity-kc/cli.html].

Common command groups include cluster operations (cluster status, cluster nodes list), protection (protection-runs list, protection-jobs list), and stats queries useful during triage.

Service Logs

Cohesity is a microservices platform. Each major service writes its own logs:

Service	Responsibility	When to Inspect
iris	UI/control plane	UI errors, login failures, REST API issues
Bridge	I/O data path / SpanFS front-end	Read/write latency, NFS/SMB issues
Magneto	Backup orchestration	Job failures, scheduling, source registration
Apollo	Garbage collection and background maintenance	GC stalls, slow space reclamation
Madrox	Replication	Replication lag, replication failures
Stats	Metrics aggregation	Missing dashboards, metric gaps
Yoda	Search/index service	Search failures, indexing slowness
Gandalf	Configuration management	Cluster config issues
Nexus	Cluster networking control	Network path/route issues

Targeting the right service when you generate a bundle keeps the artifact small and Support’s analysis fast.
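
A symptom-to-service lookup makes that targeting repeatable. The sketch below encodes a subset of the table above; the symptom keys are made-up labels for this example, and the mapping should be adjusted to whatever the current release documents.

```python
SERVICES_BY_SYMPTOM = {
    "ui_or_api_errors": ["iris"],
    "io_latency": ["Bridge"],
    "backup_job_failures": ["Magneto"],
    "missing_metrics": ["Stats"],
    "search_failures": ["Yoda"],
    "cluster_config_issues": ["Gandalf"],
    "network_path_issues": ["Nexus"],
}


def services_for(symptoms: list) -> list:
    """Return the de-duplicated, sorted set of services to include in a bundle."""
    selected = set()
    for symptom in symptoms:
        selected.update(SERVICES_BY_SYMPTOM[symptom])
    return sorted(selected)


print(services_for(["backup_job_failures", "io_latency"]))
```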

Generating a Log Bundle via Siren

The primary on-cluster tool for log collection is the Siren log analysis page, reached at https://<cluster-VIP-or-node-IP>/siren. From the Siren landing page, click Cluster Support Bundle to start collection [Source: https://www.youtube.com/watch?v=b3sn69irplo]. The dialog lets you scope:

Nodes — uncheck “Select all” to scope the bundle to specific node IPs (useful when one node is misbehaving and you want a smaller bundle).
Services — pick only the services relevant to the symptom (iris for UI issues, Bridge for I/O path, Magneto for backup orchestration, etc.).
Log level — verbosity threshold for the captured logs.
Time range — defaults to the last 24 hours; widen for older incidents, narrow to the precise incident window when you can.
Include hardware logs — pulls syslog and hardware diagnostics (firmware, IPMI/BMC, SMART, chassis events).

The Time Capsule Path

Generated bundles land on the cluster as “time capsules” under:

/home/cohesity/data/timecapsules

Bundle size typically ranges from a few MB up to 2-3 GB, depending on services and time range. Large bundles usually mean too many services or too wide a window — re-scope and regenerate. After Siren produces the bundle, copy it from the timecapsules directory and upload it to the location Support specifies (typically a per-case secure upload URL). Automation can use the uploadFilePackage API on the Cohesity Developer Portal [Source: https://developers.cohesity.com/v1-cluster-7.3/reference/uploadfilepackage].
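
Finding the newest bundle and checking its size before upload can be scripted. This sketch assumes bundles appear as regular files directly inside the timecapsules directory, which may not match the actual on-disk layout; the function name is illustrative.

```python
from pathlib import Path

TIMECAPSULES = Path("/home/cohesity/data/timecapsules")


def newest_bundle(directory: Path = TIMECAPSULES):
    """Return (path, size in MB) of the most recently modified file, or None."""
    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return None
    newest = max(files, key=lambda p: p.stat().st_mtime)
    return newest, newest.stat().st_size / 1_000_000


# Only runs where the directory exists, for example on a cluster node.
if TIMECAPSULES.is_dir():
    result = newest_bundle()
    if result:
        path, size_mb = result
        print(f"{path.name}: {size_mb:.1f} MB")
```

A size well above the 2-3 GB range mentioned above is a cue to narrow the services or time window and regenerate rather than upload.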

Figure 14.3: Heartbeat telemetry and Siren log bundle lifecycle to Cohesity Support

flowchart LR
    subgraph Cluster[Cohesity Cluster]
        HB[Heartbeat<br/>continuous telemetry]
        Trigger([Siren UI Trigger<br/>Cluster Support Bundle])
        Scope[Scope inputs:<br/>nodes / services /<br/>log level / time range /<br/>hardware logs]
        TC[(Time capsule<br/>/home/cohesity/data/<br/>timecapsules)]
    end
    HBEP[Cohesity Heartbeat<br/>endpoint HTTPS/443]
    Upload[Secure upload URL<br/>uploadFilePackage API]
    Support[Cohesity Support<br/>Proactive + Reactive]

    HB -- continuous --> HBEP --> Support
    Trigger --> Scope --> TC
    TC -- copy / upload --> Upload --> Support

Heartbeat: The Continuous Diagnostic

Beyond on-demand bundles, Cohesity clusters emit a Heartbeat stream — a continuous, lightweight diagnostic feed reporting cluster health, version, configuration, and key metrics back to Cohesity. Heartbeat is what lets Cohesity Proactive Support spot brewing issues (failing disks, capacity creep, configuration drift) before they cause outages. Architects should ensure Heartbeat egress (HTTPS/443 to the Cohesity Heartbeat endpoint) is open, otherwise proactive support is blind.
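
A basic egress check can confirm that TCP/443 is open from the management network. The sketch below only tests TCP reachability; it does not validate TLS, proxies, or the Heartbeat protocol itself, and the endpoint hostname is deliberately left as a parameter because it is not stated here.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, with the endpoint taken from your own firewall documentation:
# print(can_reach("<heartbeat-endpoint-hostname>"))
```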

Practical Bundle Hygiene

Scope tightly. Pick the minimum services and the narrowest time window covering the incident. Smaller bundles upload faster over customer egress.
Capture both healthy and unhealthy nodes. For cluster-wide issues, include at least one good node alongside the bad one for comparison.
Record context separately. Note exact UTC timestamps, protection job names/IDs, and recent change events (upgrades, hardware swaps, firmware updates). Support correlates faster when this is in the case notes alongside the bundle [Source: https://www.cohesity.com/content/dam/cohesity/agreements-docs/cohesity-global-support-and-services-handbook-en.pdf].
Use iris_cli for scripted hygiene. For repeated troubleshooting, scripting iris_cli logins lets you cleanly capture cluster state alongside the Siren bundle.
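
Converting the incident time to UTC and padding the window is a small step that is easy to get wrong by hand. A minimal sketch, assuming the incident start is known as a timezone-aware local time; the 15-minute padding and the function name are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def incident_window(start: datetime, duration_minutes: int,
                    pad_minutes: int = 15) -> tuple:
    """Return (start, end) of the padded incident window as UTC ISO-8601 strings.

    start must be timezone-aware so the conversion to UTC is unambiguous.
    """
    start_utc = start.astimezone(timezone.utc)
    begin = start_utc - timedelta(minutes=pad_minutes)
    end = start_utc + timedelta(minutes=duration_minutes + pad_minutes)
    return begin.isoformat(timespec="seconds"), end.isoformat(timespec="seconds")


# Incident observed at 22:10 local time in a UTC-6 zone, lasting 45 minutes.
local = datetime(2024, 3, 5, 22, 10, tzinfo=timezone(timedelta(hours=-6)))
print(incident_window(local, duration_minutes=45))
```

The same two timestamps can be used for the bundle's time range and pasted into the case notes, so Support and the bundle agree on the window.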

For the exam, memorize the trio: Siren (UI generator) -> timecapsules (/home/cohesity/data/timecapsules) -> upload to Support. The bundle’s primary inputs are node set, services, log level, time range, and hardware logs toggle.

Audit Logs and Security Events

In addition to operational logs, Cohesity emits audit logs capturing administrative actions — who logged in, what they changed, who approved a snapshot deletion. Audit logs forward to syslog/SIEM for compliance and incident investigation [Source: https://docs.cohesity.com/baas/data-protect/audit-logs-dataprotect.htm]. They are the artifact a security auditor will demand during a HIPAA, PCI, or FedRAMP review.

Key Takeaway: Use Siren on the cluster to generate scoped log bundles (nodes, services, log level, time range, hardware logs); bundles land in /home/cohesity/data/timecapsules. iris_cli is the named CLI for cluster-side operations. Heartbeat provides continuous proactive telemetry. Always scope tightly and pair the bundle with precise UTC timestamps when engaging Support.

Common Failure Modes

The final section catalogs the failure modes a CCAE-level architect must recognize and respond to. For each, there is a characteristic signature, an immediate action, and a longer-term design lever.

Figure 14.4: Common failure modes taxonomy with signatures and architectural levers

graph TD
    Root[Common Cohesity Failure Modes]
    Root --> Backup[Backup Job Failures]
    Root --> Repl[Replication Lag]
    Root --> Disk[Disk Failure]
    Root --> Node[Node Failure]
    Root --> Part[Network Partition]

    Backup --> B1[Stale credentials /<br/>CBT reset / locked files]
    Backup --> B2[Lever: policy design,<br/>retry rules, SLA reports]

    Repl --> R1[WAN underprovisioned /<br/>throttle misalignment /<br/>target ingest saturated]
    Repl --> R2[Lever: bandwidth windows,<br/>change-rate sizing]

    Disk --> D1[SpanFS RF/EC rebuild<br/>Heartbeat opens case]
    Disk --> D2[Lever: schedule replacement,<br/>capacity headroom]

    Node --> N1[Reduced capacity / perf<br/>Quorum risk if multiple]
    Node --> N2[Lever: fault domain awareness<br/>chassis / rack / site]

    Part --> P1[Paxos quorum split<br/>NTP drift / latency spikes]
    Part --> P2[Lever: dual-homed nodes,<br/>redundant ToR, cluster VLAN]

Backup Job Failures and Re-runs

Backup jobs fail for many reasons — source unreachable, credentials expired, snapshot quota exceeded, hypervisor CBT reset, network blip during stream. Cohesity’s default behavior is to retry within the window: a transient failure that resolves before the next scheduled run is invisible to most stakeholders. Persistent failures escalate.

Failure Pattern	Likely Cause	Action
First-run failures only	Stale source credentials, recently-changed VM	Refresh source credentials; re-discover
Random failures across many sources	Network/proxy intermittency	Check proxy health, network path
Same source fails repeatedly	Source-specific issue (CBT, agent, locked file)	Reset CBT; reinstall agent; investigate locked file
Cluster-wide failures spike	Cluster service issue, upgrade in progress	Check cluster health; review change log

The architectural lever is policy design: tight RPOs combined with aggressive retry rules will mask intermittent failures, while loose policies surface them as SLA misses.

Replication Lag

Replication lag, where the newest snapshot successfully replicated to the target falls behind the source cluster’s latest local snapshot, is the canonical “silent” DR failure. The protection job is succeeding locally; replication is just not keeping up.

Causes:

WAN bandwidth insufficient for the daily change rate. The classic miscalculation: sized for steady-state daily change but not for full-resync scenarios, monthly fulls, or seasonal peaks.
Replication policy throttling windows misaligned with actual change rates.
Target cluster ingest saturated by other replication or local backup load.
Encrypted/compressed replication competing with backup CPU on undersized clusters.

Action: in Helios, the SLA report and replication dashboards expose lag per protection group. Recovery is rarely instant — if you are 48 hours behind, you need a window where ingest exceeds change rate to catch up. Architects design bandwidth throttle windows that yield to backup ingest during the active backup window and run replication at full bandwidth overnight.
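
The catch-up condition can be written as backlog divided by the surplus of link capacity over change rate. The sketch below assumes the link is fully available to replication and that all figures are post-deduplication; the 400 GB backlog (48 hours at 200 GB/day) and the link speeds are illustrative.

```python
def catch_up_hours(backlog_gb: float, link_mbps: float,
                   change_gb_per_day: float) -> float:
    """Hours to clear a replication backlog while new change keeps arriving.

    Returns infinity when the link cannot even keep pace with the change rate.
    """
    link_gb_per_hour = link_mbps / 8 * 3600 / 1000   # Mb/s -> MB/s -> MB/h -> GB/h
    change_gb_per_hour = change_gb_per_day / 24
    surplus = link_gb_per_hour - change_gb_per_hour
    if surplus <= 0:
        return float("inf")
    return backlog_gb / surplus


print(f"{catch_up_hours(400, link_mbps=100, change_gb_per_day=200):.1f} h")
print(catch_up_hours(400, link_mbps=10, change_gb_per_day=200))
```

On a 100 Mbps link the backlog clears in about 11 hours; on a 10 Mbps link it never clears, because change arrives faster than the link can carry it.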

Disk and Node Failures

A single disk failure on a Cohesity node is a non-event: SpanFS rebuilds the affected data from RF replicas or erasure-coding parity, and the architect’s only action is to schedule disk replacement. Heartbeat usually opens the support case automatically.

A node failure is more consequential. The cluster continues to operate (assuming RF and EC tolerances are not exceeded), but capacity, performance, and resiliency are all reduced until the node is recovered or replaced. Quorum loss — too many nodes down at once — halts the cluster. Architects design fault domain awareness (chassis, rack, site) into the cluster from day one to avoid correlated failures taking the cluster below quorum.

Failure	Cluster Impact	Time to Recover
Single disk	Background rebuild; no outage	Hours (background)
Single node (RF2)	Reduced redundancy; rebuild begins	Hours to days for rebuild
Multiple nodes (within tolerance)	Performance degraded; rebuild contention	Days
Quorum loss	Cluster halts; data unavailable	Recovery operation; potential restore

Cluster Network Partition Events

A network partition — a portion of the cluster losing connectivity to another portion — is the most dangerous failure mode. SpanFS uses Paxos-based metadata with strict consistency and quorum; the side without quorum cannot serve writes. If the partition persists, jobs targeting that side fail; replication across the partition lags; and management UI may show inconsistent state from different nodes.
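
The effect of a partition on quorum can be illustrated with the general majority rule used by Paxos-style protocols. This is a generic sketch that assumes a simple node majority; the exact quorum rules on a given Cohesity cluster are not described here and may differ.

```python
def majority(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if one side of a partition still holds a strict majority."""
    return reachable_nodes >= majority(total_nodes)


# A 5-node cluster split 3/2: only the 3-node side keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))
# A 4-node cluster split 2/2: neither side keeps quorum.
print(has_quorum(2, 4), has_quorum(2, 4))
```

The even split is the case to design against: spreading nodes across fault domains so that no single boundary divides the cluster in half keeps one side writable.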

Detection: Heartbeat alerts, node-up/node-down alerts, intra-cluster latency spikes, and NTP drift warnings (often the first symptom of an underlying network problem).

Action:

Identify the partition boundary using iris_cli cluster status from multiple nodes.
Check the physical network — switch, uplink, VLAN.
If the partition is healed quickly, the cluster auto-recovers; if not, generate a Siren bundle scoped to Bridge, Apollo, and Gandalf and engage Support before attempting any manual remediation.
Document the event and review fault domain design — was the partition along an unexpected boundary?

The architectural lever is network design: dual-homed nodes, redundant top-of-rack switches, dedicated cluster interconnect VLANs separate from client traffic, and BGP/LACP configurations that fail predictably.

Decision Tree: Bottleneck Classification

When the symptom is “slow,” use this decision tree to classify quickly:

Is the network healthy? (iperf3 near line rate, e.g. ~9.3 Gb/s on 10 GbE)
  No  -> Stage 2 (Network) — fix VLAN/uplink/MTU first
  Yes -> next
       |
Are source read MB/s and writer MB/s both low and balanced?
  Yes -> Stage 1 (Source) — investigate source array, hypervisor, agent
  No  -> next
       |
Is writer latency high or NVRAM saturated?
  Yes -> Stages 4-5 (NVRAM/Writer) — cluster-side limiter; check capacity, sizing
  No  -> next
       |
Is per-stream throughput healthy but total throughput low?
  Yes -> Concurrency — add proxies, raise concurrency, split groups
  No  -> Generate log bundle, engage Support

This tree does not replace measurement, but it sequences the questions in the order most likely to find the bottleneck fast.
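
The same tree can be expressed as a function, which is a convenient way to rehearse it. The parameter names are illustrative; each one corresponds to one question in the tree, asked in the same order.

```python
def classify_slow_backup(network_healthy: bool,
                         read_and_write_low_and_balanced: bool,
                         writer_latency_high_or_nvram_saturated: bool,
                         per_stream_healthy_but_total_low: bool) -> str:
    """Walk the decision tree top to bottom and return the first verdict that applies."""
    if not network_healthy:
        return "Stage 2 (Network)"
    if read_and_write_low_and_balanced:
        return "Stage 1 (Source)"
    if writer_latency_high_or_nvram_saturated:
        return "Stages 4-5 (NVRAM/Writer)"
    if per_stream_healthy_but_total_low:
        return "Concurrency"
    return "Generate log bundle, engage Support"


# The worked example: healthy network, reads and writes both low and balanced.
print(classify_slow_backup(True, True, False, False))
```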

Key Takeaway: Common failure modes — backup re-runs, replication lag, disk/node failure, network partition — each have a characteristic signature, an immediate action, and a longer-term design lever. Architects design fault domain awareness, throttle windows, and network redundancy before the failure, not after.

Chapter Summary

Performance and troubleshooting are the operational disciplines that separate a well-architected Cohesity deployment from a fragile one. The CCAE-level architect must think like a process engineer: measure each stage of the data path, identify the lowest-throughput point, and apply the lever that lifts it. The five-stage pipeline (source -> network -> ingest -> NVRAM -> writer) is the mental model for every “slow backup” ticket, and metric divergence — not assumptions — localizes the bottleneck.

Tooling is straightforward but must be rehearsed. iris_cli is the supported CLI; the Siren UI generates support log bundles into /home/cohesity/data/timecapsules; Heartbeat streams continuous diagnostics back to Cohesity for proactive support. Helios fans alerts to email, SNMP, syslog, and webhook channels via filterable notification rules — and SMTP changes must always be followed by a POST /validate to catch silent relay breakage.

Common failure modes — backup re-runs, replication lag, disk and node failures, network partitions — each have characteristic signatures and architectural levers. The architect’s job is to build the levers (fault domain awareness, throttle windows, redundant networking, tiered alerting) into the design before the incident, then triage with discipline when the incident arrives.

For the exam, internalize three drills:

The bottleneck triage drill: given a slow-backup symptom, walk source -> network -> ingest -> NVRAM -> writer and name the metric that proves your verdict.
The log bundle drill: name Siren, name the timecapsules path, name the five scoping inputs (nodes, services, log level, time range, hardware logs).
The Helios alerting drill: name the four channels (email, SNMP, syslog, webhook), the SMTP API (PUT /v2/clusters/smtp), the validation step (POST /validate), and the rule creator (createAlertNotificationRule).

Key Terms

Bottleneck — The lowest sustained throughput point in the source -> network -> ingest -> NVRAM -> writer chain. The whole pipeline’s throughput is set by its bottleneck; finding it is the architect’s primary triage skill.
Heartbeat — Continuous lightweight diagnostic stream emitted by Cohesity clusters back to Cohesity for proactive support. Requires HTTPS/443 egress to function.
iris_cli — The supported Cohesity command-line interface for cluster management operations. Authenticated with iris_cli -server <ip> -username=admin -password=<pwd> and documented in the Cohesity CLI Reference Guide.
Log bundle — Packaged collection of service logs, hardware data, and cluster state for a defined time window across selected nodes. Generated via the Siren UI and staged in /home/cohesity/data/timecapsules before upload to Cohesity Support.
Alert — A categorized, severity-tagged event emitted by a cluster (or aggregated through Helios) and routed to email, SNMP, syslog, or webhook subscribers via notification rules.
SLA report — Helios report that measures protection compliance per protection group, source, and cluster against the policy’s RPO over a reporting window. Surfaces chronic drift that individual alerts miss.
NVRAM — Non-volatile RAM journal in front of the SpanFS disk tier. Incoming writes land in NVRAM before destaging to SSD/HDD; saturation or destage backpressure is a leading signal of a cluster-side write bottleneck.

Chapter 15: End-to-End Architecture Scenarios and Exam Synthesis

The previous fourteen chapters built the vocabulary, mechanics, and design patterns of the Cohesity Data Cloud one layer at a time — SpanFS internals, sizing math, networking, identity, protection policies, replication, cloud integration, security, SmartFiles, Helios, and troubleshooting. The Cohesity Certified Architect Expert (CCAE) exam, however, almost never asks about a single layer in isolation. It asks you to assemble layers into a coherent design that satisfies a business problem under hard constraints. This final chapter does three things: it walks through three full reference architectures end-to-end, it decodes the exam blueprint and scenario question pattern in detail, and it gives you a 30-day plan plus a test-day playbook so you walk into the proctoring session knowing exactly how to spend your 90 minutes [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Learning Objectives

By the end of this chapter, you should be able to:

Synthesize multi-domain Cohesity architectures that combine DataProtect, SmartFiles, SiteContinuity, FortKnox, DataHawk, and Helios for enterprise-scale scenarios.
Trace explicit business requirements (RPO, RTO, retention, compliance, residency, budget) through to specific Cohesity design choices and defend each choice against plausible alternatives.
Apply scenario-based reasoning to CCAE-style architecture questions, recognizing the four-option pattern and identifying the constraint each distractor violates.
Build a final 30-day study plan keyed to the published domain weights (22 / 35 / 18 / 13 / 12 percent) and execute a disciplined test-day strategy.

Scenario 1: Global Enterprise with Multi-Region DR

The Business Problem

A multinational manufacturer operates three primary data centers (Dallas, Frankfurt, Singapore) plus 42 branch and plant sites distributed across the Americas, EMEA, and APAC. The CIO has set the following targets: Tier-0 ERP and MES workloads need an RPO of 15 minutes and an RTO of 60 minutes; Tier-1 file shares and VMs need RPO of 4 hours and RTO of 4 hours; Tier-2 archive and dev/test workloads need RPO of 24 hours and an RTO of 24 hours; all data must be retained for seven years for tax and audit purposes; cross-region replication must survive the loss of any one regional data center; the security team requires a SaaS control plane for fleet visibility but will not allow production data to be archived to a third-party SaaS vault.

Topology Choice: Hub-and-Spoke Per Region with One-to-Many Across Regions

The reference pattern that maps cleanly to this requirement set is hub-and-spoke within each region and one-to-many across regions [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. Each branch and plant site (the spokes) deploys a Cohesity Robo Edition or small Virtual Edition cluster sized for local backups; spokes replicate inbound to the regional hub (Dallas, Frankfurt, or Singapore). Each regional hub then replicates Tier-0 and Tier-1 protected data to one of the other two regions — Dallas replicates to Frankfurt, Frankfurt to Singapore, Singapore to Dallas — a triangular one-to-many mesh that survives any single regional loss [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/modern-data-security-management-topologies-guide-for-it-leaders-white-paper-en.pdf].

Topology Element	Decision	Rationale
Branch protection	Robo Edition / small VE cluster per site	Local backup for fast restore, reduces WAN backup traffic [Source: https://blogs.cisco.com/datacenter/disaster-recovery-solutions-for-the-edge-with-hyperflex-and-cohesity]
Branch-to-hub	Many-to-one inbound replication	Centralizes recovery, audit, retention; matches ROBO best practice [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]
Hub-to-hub	One-to-many triangular replication	Survives loss of any single region; geographic separation [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]
Long-term retention	CloudArchive to S3 with Glacier lifecycle	7-year retention without consuming hot capacity [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/actualtech-multicloud-data-protection-and-recovery-white-paper-en.pdf]
Control plane	Helios SaaS (multi-tenant)	Global SLA reporting, capacity prediction, fleet upgrades [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/]

Figure 15.1: Global enterprise hub-and-spoke topology with triangular cross-region replication and CloudArchive offsite

flowchart LR
    subgraph Americas
        S1[Branch Spoke A1]
        S2[Branch Spoke A2]
        DC1[(Dallas Hub)]
        S1 --> DC1
        S2 --> DC1
    end
    subgraph EMEA
        S3[Branch Spoke E1]
        S4[Branch Spoke E2]
        DC2[(Frankfurt Hub)]
        S3 --> DC2
        S4 --> DC2
    end
    subgraph APAC
        S5[Branch Spoke P1]
        S6[Branch Spoke P2]
        DC3[(Singapore Hub)]
        S5 --> DC3
        S6 --> DC3
    end
    DC1 <--> DC2
    DC2 <--> DC3
    DC3 <--> DC1
    DC1 -.7-yr archive.-> CA[(CloudArchive S3 Glacier)]
    DC2 -.7-yr archive.-> CA
    DC3 -.7-yr archive.-> CA
    HS{{Helios SaaS Control Plane}} -.fleet mgmt.-> DC1
    HS -.fleet mgmt.-> DC2
    HS -.fleet mgmt.-> DC3

Translating SLAs to Policies

The architect builds three policy templates in Helios — Tier-0, Tier-1, Tier-2 — and applies them through Protection Groups rather than per-job customization, which is the design discipline the exam expects [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

SLA Tier	Backup Frequency	Replication Frequency	Cloud Archive	Retention
Tier-0 (ERP/MES)	Every 15 min (CBT)	Continuous to nearest peer region	Monthly to S3 Glacier	7 years
Tier-1 (File / VM)	Every 4 hours	Every 4 hours, off-peak window	Quarterly to S3 Glacier	7 years
Tier-2 (Dev/Archive)	Daily	Daily	Monthly to S3 Deep Archive	7 years
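
One way to picture policy-driven design is to treat the three templates as data and select among them by RPO, instead of hand-tuning each job. The Python sketch below is an illustration only: the `PolicyTemplate` class, its field names, and `select_template` are hypothetical and are not Cohesity or Helios API objects. The values are copied from the tier table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTemplate:
    """Hypothetical in-memory model of one SLA tier from the table above."""
    name: str
    backup_interval_min: int
    archive_target: str
    retention_years: int


TEMPLATES = [
    PolicyTemplate("Tier-0", 15, "S3 Glacier", 7),
    PolicyTemplate("Tier-1", 4 * 60, "S3 Glacier", 7),
    PolicyTemplate("Tier-2", 24 * 60, "S3 Deep Archive", 7),
]


def select_template(required_rpo_min: int) -> PolicyTemplate:
    """Return the least aggressive template whose backup interval meets the RPO."""
    candidates = [t for t in TEMPLATES if t.backup_interval_min <= required_rpo_min]
    if not candidates:
        raise ValueError(f"no template meets an RPO of {required_rpo_min} minutes")
    # The longest interval that still satisfies the RPO avoids over-protecting.
    return max(candidates, key=lambda t: t.backup_interval_min)


print(select_template(60).name)
print(select_template(240).name)
```

In this model a workload with a 60-minute RPO lands on Tier-0, because the 4-hour Tier-1 interval would miss it, while a 4-hour RPO lands on Tier-1 rather than consuming Tier-0 resources.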

For Tier-0 RTO of 60 minutes the architect leverages Instant Mass Restore at the regional hub, which mounts protected VMs directly from SpanFS so applications come up in minutes rather than hours [Source: https://www.cohesity.com/solutions/instant-mass-restore/]. SiteContinuity runbooks orchestrate the failover sequence — power-on order, IP re-mapping, dependency groups (database before app, app before web), and post-failover validation — and a non-disruptive test failover runs quarterly into an isolated network bubble to keep audit evidence current [Source: https://www.cohesity.com/blogs/recovering-from-a-data-disaster-with-cohesity/].
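
The dependency-group ordering described above (database before app, app before web) is in effect a topological sort. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later) to derive a power-on order. The tier names and the `power_on_order` helper are illustrative assumptions; the sketch does not model how SiteContinuity itself stores or executes runbooks.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tiers that must be running before it starts.
DEPENDENCIES = {
    "database": set(),
    "app": {"database"},
    "web": {"app"},
}


def power_on_order(dependencies: dict) -> list:
    """Return a start order in which every tier follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = power_on_order(DEPENDENCIES)
print(order)
```

If someone accidentally declares a circular dependency, `static_order()` raises `graphlib.CycleError`, which is the kind of defect a quarterly test failover is meant to surface before a real event.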

Capacity, Bandwidth, and Cost Sanity Check

A useful analogy here: think of the regional hubs as regional water reservoirs. Each branch site is a small upstream tank; water flows downhill into the regional reservoir at low pressure (background backup with WAN-optimized, deduplicated streams). Reservoirs then exchange water across long pipes (cross-region replication) only for the most critical workloads, because long pipes are expensive. The cloud archive is the underground aquifer — slow to recall but cheap and effectively infinite. An architect who tries to send every drop directly to the aquifer overbuilds bandwidth; one who lets every reservoir drain only locally fails the multi-region survival requirement.

WAN bandwidth is dimensioned around the change rate of Tier-0 and Tier-1 data, not total front-end TB. If Tier-0 produces 200 GB of unique change daily after Cohesity’s deduplication and compression, the sustained average is about 18.5 Mbps (200 GB × 8 bits per byte ÷ 86,400 seconds), so a 25 Mbps committed cross-region link absorbs the load with roughly a quarter of the link left as headroom for catch-up; oversubscribing the link with Tier-2 traffic would be a typical exam distractor.
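
The bandwidth check can be reproduced with a few lines of arithmetic. The Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a replication window spread evenly over 24 hours; the function name and constants are illustrative, not part of any Cohesity sizing tool.

```python
def required_mbps(daily_change_gb: float, window_hours: float = 24.0) -> float:
    """Average link rate in Mbps needed to ship daily_change_gb within window_hours."""
    bits = daily_change_gb * 1e9 * 8       # decimal gigabytes to bits
    seconds = window_hours * 3600
    return bits / seconds / 1e6            # bits per second to Mbps


LINK_MBPS = 25            # committed cross-region link from the scenario
TIER0_CHANGE_GB = 200     # unique daily change after dedupe and compression

tier0_mbps = required_mbps(TIER0_CHANGE_GB)
utilization = tier0_mbps / LINK_MBPS
# Hours needed to ship one full day of change at full line rate.
hours_at_line_rate = TIER0_CHANGE_GB * 1e9 * 8 / (LINK_MBPS * 1e6) / 3600

print(f"{tier0_mbps:.1f} Mbps average, {utilization:.0%} of the link")
print(f"{hours_at_line_rate:.1f} h to move one day of change at line rate")
```

Under these assumptions the Tier-0 stream averages about 18.5 Mbps, or roughly 74 percent of the link, and a full day of change takes close to 18 hours at line rate. Shortening `window_hours` to model an off-peak-only window raises the required rate quickly, which is why adding Tier-2 traffic to the same link breaks the design.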

RBAC and Helios Fleet Management

A four-tier RBAC model is implemented globally: a Global Architect role with cross-cluster admin, a Regional Operator role limited to one regional hub plus its spokes, a Tenant Operator role for business units that self-serve restores, and a Read-Only Auditor role for compliance. Helios provides the global pane for capacity prediction, predictive disk-failure analytics, and fleet-wide policy compliance reporting [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/].

Key Takeaway: A global enterprise design layers hub-and-spoke (intra-region) with one-to-many (inter-region) replication, drives every workload through tiered Helios policies, uses CloudArchive for long-term retention without hot-capacity bloat, and orchestrates recovery through SiteContinuity runbooks rather than scripts.

Scenario 2: Healthcare with HIPAA and Ransomware Posture

The Business Problem

A 600-bed regional health system runs Epic EHR, PACS imaging, lab systems, and clinical research workloads across two on-premises data centers. The CISO has been briefed on three healthcare ransomware incidents in the past 18 months and now requires: PHI must never leave the covered entity’s controlled environment (no SaaS vaulting); WORM-immutable backups that cannot be deleted by a compromised admin; encryption with customer-managed keys via KMIP; quorum approval for any retention change; ransomware anomaly detection on backup ingestion; a clean-room recovery capability validated quarterly; and a Helios control plane that does not require outbound SaaS connectivity.

Stack Selection: DataLock + FortKnox Self-Managed + DataHawk + Helios Self-Managed

Each requirement maps to a specific Cohesity component, and the integrated stack is the canonical healthcare reference design [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/][Source: https://www.cohesity.com/blogs/new-self-managed-deployment-option-for-cohesity-fortknox/].

HIPAA / Security Requirement	Cohesity Component	How It Satisfies the Requirement
Immutability of backups	DataLock (WORM)	Object-level write-once retention, no admin override [Source: https://www.cohesity.com/glossary/cyber-vault/]
Quorum approval for retention changes	DataLock + RBAC + MFA	Four-eyes approval; defends against insider threat [Source: https://www.cohesity.com/trust/]
Customer-managed encryption	KMIP / external KMS integration	AES-256 CBC at rest, TLS in transit, customer key control [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-platform-security-white-paper-en.pdf]
No PHI in third-party SaaS	FortKnox Self-Managed	Air-gapped vault stays inside the covered entity’s perimeter [Source: https://www.cohesity.com/blogs/new-self-managed-deployment-option-for-cohesity-fortknox/]
Ransomware anomaly detection	DataHawk anomaly detection	Flags atypical encryption / change patterns on ingest [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/]
Threat scanning for IOCs	DataHawk threat scan	Curated IOCs from Tenable / Qualys before recovery [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/]
PHI/PII discovery for compliance	DataHawk classification	Identifies sensitive data, drives retention/access policy [Source: https://www.cohesity.com/platform/data-classification/]
No outbound SaaS for control plane	Helios Self-Managed	On-prem Helios for dark-site / regulated environments [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/]

Figure 15.2: Healthcare ransomware-resilient stack — DataLock + FortKnox Self-Managed + DataHawk + KMIP + Helios SM

flowchart TD
    PROD[Production Workloads<br/>Epic EHR / PACS / Lab] --> DP[Cohesity DataProtect Cluster<br/>on-premises]
    DP -->|WORM retention| DL[DataLock Immutability Layer]
    DP -->|encrypt at rest| KMIP[KMIP / External KMS<br/>Customer-Managed Keys]
    DP -->|ingest scan| DH[DataHawk<br/>Anomaly + Threat + Classification]
    DL -->|inbound-only<br/>transfer window| FK[(FortKnox Self-Managed<br/>Air-Gapped Vault)]
    DH -->|alerts| SOC[SOC / SIEM Splunk]
    HSM{{Helios Self-Managed<br/>Dark-Site Control Plane}} -.manages.-> DP
    HSM -.manages.-> FK
    HSM -.manages.-> DH
    QUORUM[/Quorum + MFA + RBAC/] -.governs.-> DL
    QUORUM -.governs.-> FK

The 3-2-1 Defense in Depth

The design layers three copies of every protected dataset on two media with one offsite/immutable copy — the classic 3-2-1 pattern hardened with Cohesity’s modern controls [Source: https://www.cohesity.com/resources/solution-brief/cohesity-fortknox-modern-cyber-vaulting-for-confident-recovery-en/].

Primary copy: production storage (Epic ODB, PACS arrays, file shares).
Secondary copy: Cohesity DataProtect cluster on-premises with DataLock WORM retention. Daily app-consistent backups, with the Tier-0 EHR protected every 15 minutes through database log backups to hold a 15-minute RPO. Encryption at rest with KMIP-managed keys.
Tertiary vault copy: Cohesity FortKnox Self-Managed in a logically and physically isolated network segment, behind a separate management VLAN, with an inbound-only transfer window that opens for replication and closes immediately after [Source: https://www.cohesity.com/resources/datasheet/cohesity-fortknox/]. Production admins do not have credentials for the vault; vault admins do not have credentials for production. This segregation of duties is what defeats the credential-compromise ransomware kill chain.

A useful analogy: think of FortKnox Self-Managed as the safe-deposit vault inside a bank inside a city. The DataProtect cluster is the bank — well-guarded, but you can walk in during business hours. The vault sits behind a second locked door whose key is held by a different person, and the door opens only on a published schedule. A burglar who compromises a teller has not compromised the vault.

Detection and Clean Recovery

DataHawk performs anomaly detection on every backup ingestion, comparing change rates and entropy against historical baselines; an unusual spike in encrypted blocks fires a Helios alert and pages the SOC [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/]. Threat scanning then uses curated IOCs to identify which restore points are clean. Quarterly clean-room recoveries orchestrated through DataHawk and SiteContinuity restore Epic and PACS into an isolated network bubble; the recovery is validated by application owners and the run is recorded as audit evidence for HIPAA’s contingency-plan requirement.
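
To make "comparing against historical baselines" concrete, the Python sketch below flags a backup whose change rate sits far above the recent mean, using a simple z-score. This is a generic statistical illustration and an assumption on our part; it is not DataHawk's actual detection algorithm, and the sample figures are invented for the example.

```python
from statistics import mean, stdev


def is_anomalous(history: list, latest: float, threshold: float = 3.0) -> bool:
    """Flag latest if it lies more than threshold standard deviations above the baseline."""
    if len(history) < 2:
        return False                 # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest > baseline     # any increase over a perfectly flat baseline
    return (latest - baseline) / spread > threshold


# Hypothetical daily change rates (percent of protected data) for one source.
daily_change_pct = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8, 2.1]

normal_day = is_anomalous(daily_change_pct, 2.4)
suspect_day = is_anomalous(daily_change_pct, 35.0)
print(normal_day, suspect_day)
```

The same function could be applied to a second series, such as average block entropy per backup, so that a spike in either metric raises an alert.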

Audit, Logging, and HIPAA Alignment

All administrative actions stream to a SIEM (Splunk in this design) via the Cohesity Data Security Alliance integration [Source: https://www.cohesity.com/company/data-security-alliance/]. RBAC is built around least privilege with custom roles: a Backup Operator can run restores but cannot change retention; a Compliance Officer can place legal holds but cannot delete data; a Cluster Admin can change configuration but cannot bypass DataLock. MFA is mandatory for all admin roles. Quorum approval is enabled for retention reduction and DataLock policy changes — two distinct admins must approve before the action commits.
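
The separation of duties and the two-admin quorum can be sketched as a small model. In the Python example below, the permission names, the `QuorumRequest` class, and the admin names are hypothetical and chosen only to mirror the roles described above; they do not correspond to Cohesity role definitions or API calls.

```python
from dataclasses import dataclass, field

# Permissions per role, following the role descriptions in the text above.
ROLE_PERMISSIONS = {
    "Backup Operator": {"run_restore"},
    "Compliance Officer": {"place_legal_hold"},
    "Cluster Admin": {"change_configuration"},
}


def is_allowed(role: str, action: str) -> bool:
    """Least privilege: an action is denied unless the role explicitly lists it."""
    return action in ROLE_PERMISSIONS.get(role, set())


@dataclass
class QuorumRequest:
    """A sensitive change that commits only after enough distinct approvals."""
    action: str
    required_approvals: int = 2
    approvers: set = field(default_factory=set)

    def approve(self, admin: str) -> None:
        self.approvers.add(admin)    # a set ignores repeat approvals by one admin

    def can_commit(self) -> bool:
        return len(self.approvers) >= self.required_approvals


request = QuorumRequest(action="reduce_retention")
request.approve("alice")
request.approve("alice")             # the same admin approving twice does not help
after_one_admin = request.can_commit()
request.approve("bob")
after_two_admins = request.can_commit()
print(after_one_admin, after_two_admins)
```

The point of the model is that a single compromised account, however privileged, cannot satisfy `can_commit()` on its own.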

Key Takeaway: Healthcare ransomware-resilient design is not a single product but a layered stack: DataLock provides immutability, FortKnox Self-Managed provides the air-gapped vault inside the covered entity, DataHawk provides detection and classification, KMIP provides customer-controlled keys, Helios Self-Managed provides the dark-site control plane, and quorum + MFA + RBAC provide segregation of duties. Removing any layer fails one of the requirements.

Scenario 3: Service Provider Multi-Tenant DMaaS

The Business Problem

A managed service provider (MSP) wants to launch a Backup-as-a-Service offering for mid-market customers. Requirements: 25 initial tenants growing to 200 within 18 months; per-tenant data isolation enforced cryptographically and operationally; tenants must self-serve protection policies, restores, and reports through a branded portal; consumption must be metered for monthly chargeback; tenant onboarding must complete in under one business day; offboarding must guarantee data destruction within 30 days; the MSP wants to deliver the service as Cohesity DMaaS (Data Management as a Service) rather than building its own infrastructure for the first wave.

Why DMaaS for the Initial Wave

Cohesity DMaaS delivers DataProtect as a SaaS offering with the MSP as a managed-service overlay [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/]. The MSP avoids the capital expense and operational burden of standing up DataProtect clusters per region and instead consumes Cohesity’s regional SaaS instances. The architect picks the region that satisfies tenant data-residency requirements (e.g., us-east-1 for US tenants, eu-west-1 for EU tenants).

Tenant Isolation via Organizations

The cornerstone of multi-tenant Cohesity is the Organization construct. Each tenant becomes an Organization with its own scoped View Boxes, RBAC, network segmentation, and policies; cross-tenant visibility is impossible by design [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].

Isolation Dimension	Mechanism	Tenant Impact
Storage	Per-tenant View Boxes with quotas	Tenant A cannot see Tenant B’s data
Network	VLAN / VRF per tenant; tenant-scoped VIPs	Tenant traffic is L2/L3-isolated
Identity	Per-tenant SAML IdP federation	Tenant uses its own Azure AD / Okta
Encryption	Per-tenant KMIP keys (optional)	Cryptographic separation for high-trust tenants
Roles	Tenant-scoped RBAC	Tenant Admin role limited to its own Organization
Reporting	Per-tenant SLA and capacity reports	Tenant sees only its consumption

The MSP retains an MSP-Admin role across all Organizations for operational management but cannot access tenant data without explicit, audited break-glass procedures. This is the architect’s answer to the inevitable exam question about insider risk in multi-tenant designs.

Figure 15.3: MSP DMaaS service flow — Tenant to Helios self-service to Organization to View Box to Cluster

flowchart LR
    T1[Tenant A<br/>SAML IdP] --> HS{{Helios Self-Service Portal}}
    T2[Tenant B<br/>SAML IdP] --> HS
    T3[Tenant N<br/>SAML IdP] --> HS
    HS -->|scoped session| ORG1[Organization A]
    HS -->|scoped session| ORG2[Organization B]
    HS -->|scoped session| ORG3[Organization N]
    ORG1 --> VB1[View Box A<br/>quota + VLAN]
    ORG2 --> VB2[View Box B<br/>quota + VLAN]
    ORG3 --> VB3[View Box N<br/>quota + VLAN]
    VB1 --> CL[(DMaaS Regional Cluster)]
    VB2 --> CL
    VB3 --> CL
    CL -.metering API.-> BILL[MSP Billing System]
    MSP[/MSP-Admin role<br/>cross-org ops/] -.audited break-glass.-> ORG1

Self-Service via Helios

Tenants log into Helios with their own SAML IdP and see only their Organization. They can create Protection Groups, attach pre-approved policies (the MSP publishes Bronze / Silver / Gold templates), trigger restores, view SLA dashboards, and download compliance reports. The MSP does not field tickets for routine operations — the platform’s self-service surface absorbs them, which is precisely how MSPs achieve the unit economics needed for mid-market BaaS [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].

Service Tier	RPO	Retention	Cloud Archive	Monthly Price (illustrative)
Bronze	24 h	30 days	None	$0.05 / GB
Silver	4 h	90 days	Quarterly to S3 IA	$0.10 / GB
Gold	15 min	7 years	Monthly to S3 Glacier	$0.18 / GB

Chargeback and Metering

Helios exposes consumption metrics — protected front-end TB, change rate, archived TB, restore activity — that the MSP pulls via REST API into its billing system [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/]. A monthly cron job calls the Helios API, joins the per-Organization metrics with the tier price, and emits invoices. The Cohesity Terraform provider lets the MSP version-control tenant configurations alongside the rest of its infrastructure-as-code [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
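
The join between per-Organization metrics and tier prices can be sketched as follows. This Python example deliberately leaves out the API call: the `usage` rows stand in for whatever the MSP retrieves from Helios, and their field names are assumptions. Prices are the illustrative per-GB figures from the service tier table, and billing on protected GB alone is also an assumption made for the example.

```python
from decimal import Decimal

# Illustrative monthly price per GB, from the service tier table above.
PRICE_PER_GB = {
    "Bronze": Decimal("0.05"),
    "Silver": Decimal("0.10"),
    "Gold": Decimal("0.18"),
}


def monthly_invoices(usage: list) -> dict:
    """Join per-Organization usage rows with tier prices; return amount per tenant."""
    invoices = {}
    for row in usage:
        price = PRICE_PER_GB[row["tier"]]
        amount = price * Decimal(str(row["protected_gb"]))
        invoices[row["organization"]] = amount.quantize(Decimal("0.01"))
    return invoices


# Hypothetical metrics; in practice these rows would come from the metering pull.
usage = [
    {"organization": "Tenant A", "tier": "Bronze", "protected_gb": 12_000},
    {"organization": "Tenant B", "tier": "Gold", "protected_gb": 4_500},
]

invoices = monthly_invoices(usage)
for organization, amount in invoices.items():
    print(f"{organization}: ${amount}")
```

`Decimal` is used instead of floating point so that invoice totals do not pick up binary rounding error.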

Onboarding and Offboarding Automation

A new-tenant onboarding workflow runs as a Terraform plan plus an Ansible playbook: Terraform creates the Organization, View Boxes, network bindings, default policies, and SAML federation; Ansible registers the tenant’s sources (vCenter, M365, NAS) and applies the chosen tier policy. Total elapsed time: approximately two hours of automation plus tenant-side IdP configuration. Offboarding inverts the workflow: the tenant’s View Boxes are placed into a 30-day grace period with restore-only access, then cryptographically erased by destroying the per-tenant key. Audit logs from the entire lifecycle stream to the MSP’s SIEM for compliance evidence.

Key Takeaway: A Cohesity DMaaS MSP design rests on three pillars: Organizations for tenant isolation, Helios for self-service, and APIs (REST / Terraform / Ansible) for metering and lifecycle automation. Service tiers, not custom configurations, are the unit of sale; the platform absorbs Day-2 operations so the MSP can scale tenant count without scaling headcount.

CCAE Exam Blueprint and Scenario Question Pattern

The Numbers You Must Internalize

Exam Element	Value	Source
Exam code	COH500	[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf]
Duration	90 minutes	Same
Cost	$200 USD	Same
Passing score	60% (approx. 36 of 60 correct)	Same
Question count	~60 scenario MCQs	Same
Question format	4-option scenario MCQ	Same
Validity	2 years	Same
Retake window	14 days	Same
Prerequisites	None formal; ~1 year hands-on recommended	Same

Ninety minutes for sixty scenario questions yields a 90-second per-question budget. That is enough time to read a paragraph carefully, reject two distractors, and pick between the remaining two — but it is not enough time to puzzle over feature trivia. Memorize the platform vocabulary so reading time is short and decision time is long.

Domain Weights and Per-Domain Question Counts

#	Domain	Weight	Questions (of 60)
1	Cohesity Data Cloud Data Management Platform Architecture	22%	~13
2	Cohesity Architecture Solution Discovery and Design	35%	~21
3	Design Security-focused Solutions	18%	~11
4	Integrate Third-party Solutions with Cohesity	13%	~8
5	Gap Analysis and Troubleshooting	12%	~7

Domain 2 alone is more than a third of the exam — it is the one place you cannot afford to be weak. Domains 1 and 3 together are 40 percent. Spending equal time on every domain is a strategic mistake; the 30-day plan below allocates study hours in proportion to the weights.
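
The per-domain question counts follow from the weights by simple apportionment. The Python sketch below uses the largest-remainder method, which is our choice for the illustration and not a published Cohesity scoring rule; it reproduces the approximate counts in the table above.

```python
DOMAIN_WEIGHTS = {1: 22, 2: 35, 3: 18, 4: 13, 5: 12}   # percent, from the table above


def apportion(weights: dict, total: int) -> dict:
    """Split total into whole units proportional to percentage weights."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    # Work in hundredths so the arithmetic stays in exact integers.
    quotas = {key: weight * total for key, weight in weights.items()}
    counts = {key: quota // 100 for key, quota in quotas.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining units to the largest fractional remainders.
    by_remainder = sorted(quotas, key=lambda key: quotas[key] % 100, reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


questions = apportion(DOMAIN_WEIGHTS, 60)
print(questions)
```

The same function can be reused with a different total, for example the 28 study days that remain after setting aside two days for practice exams, as a starting point for a study schedule.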

Figure 15.4: CCAE domain weight breakdown across the five exam domains

graph TD
    EXAM[CCAE COH500<br/>60 questions / 90 minutes]
    EXAM --> D1[Domain 1: Platform Architecture<br/>22% / ~13 questions]
    EXAM --> D2[Domain 2: Solution Discovery & Design<br/>35% / ~21 questions]
    EXAM --> D3[Domain 3: Security-Focused Solutions<br/>18% / ~11 questions]
    EXAM --> D4[Domain 4: 3rd-Party Integration<br/>13% / ~8 questions]
    EXAM --> D5[Domain 5: Gap Analysis & Troubleshooting<br/>12% / ~7 questions]
    style D2 fill:#1f6feb,stroke:#58a6ff,color:#fff
    style D1 fill:#2d5a8c,stroke:#58a6ff,color:#fff
    style D3 fill:#2d5a8c,stroke:#58a6ff,color:#fff
    style D4 fill:#1c3a5c,stroke:#58a6ff,color:#fff
    style D5 fill:#1c3a5c,stroke:#58a6ff,color:#fff

The Scenario Question Pattern

Every CCAE item follows a recognizable shape [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf]:

A scenario paragraph describing workload characteristics (volume, change rate, RTO/RPO), constraints (budget, bandwidth, existing infrastructure, compliance, residency), and a business challenge.
Four candidate designs, each plausible at first glance.
The correct answer satisfies all stated constraints.
The distractors each fail exactly one constraint — usually the cheapest option misses RTO, the most secure option exceeds budget or operational complexity, or the most performant option ignores compliance.

The decoding strategy is to underline the constraints in the scenario before reading the options, then test each option against the constraint list and eliminate any option that fails one. The remaining option, by construction, is the answer.

Distractor Archetype	What It Optimizes For	What It Sacrifices
The Cheap Option	Lowest CapEx / OpEx	RTO, RPO, or resilience
The Fortress	Maximum security	Operational simplicity, cost
The Performance Demon	Lowest RTO/RPO	Cost, retention, compliance
The Status Quo	Minimal change to existing estate	Future scale, modern features

The correct answer is almost always the option that balances competing objectives while applying the right Cohesity feature for the constraint set. Cohesity explicitly states the exam rewards judgment, not feature recall [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
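
The elimination strategy can be expressed as a filter over options. In the Python sketch below, the scenario, the three constraints, and the four options with their numbers are all invented for illustration; they only mirror the distractor archetypes in the table above.

```python
# Constraints "underlined" in a hypothetical scenario paragraph.
CONSTRAINTS = {
    "RTO <= 60 min": lambda o: o["rto_min"] <= 60,
    "Monthly cost <= $50k": lambda o: o["monthly_cost_usd"] <= 50_000,
    "Immutable copy required": lambda o: o["immutable"],
}

# Four candidate designs, one per archetype plus the balanced answer.
OPTIONS = {
    "A (Cheap Option)": {"rto_min": 480, "monthly_cost_usd": 20_000, "immutable": True},
    "B (Fortress)": {"rto_min": 30, "monthly_cost_usd": 90_000, "immutable": True},
    "C (Performance Demon)": {"rto_min": 5, "monthly_cost_usd": 45_000, "immutable": False},
    "D (Balanced)": {"rto_min": 45, "monthly_cost_usd": 48_000, "immutable": True},
}


def failed_constraints(option: dict) -> list:
    """Return the names of every constraint this option violates."""
    return [name for name, check in CONSTRAINTS.items() if not check(option)]


failures = {name: failed_constraints(attrs) for name, attrs in OPTIONS.items()}
survivors = [name for name, failed in failures.items() if not failed]

for name, failed in failures.items():
    print(name, "->", failed or "satisfies all constraints")
```

Each distractor fails exactly one constraint in this toy scenario, and only option D survives, which is the pattern to look for on the exam.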

Decision Criteria Checklist

For every scenario, run through this six-point checklist before picking an option:

Have I mapped every requirement to a measurable SLA (RPO, RTO, retention, residency, compliance)?
Does the protection technique match the workload class — VM snapshot, application-aware, agent, cloud-native, SaaS?
Is the hardware/edition appropriate for the performance, footprint, and refresh cycle?
Is the design policy-driven via Helios rather than hand-configured per job?
Are encryption, RBAC/MFA, immutability, and vaulting answered explicitly, not bolted on?
Is the design validated by a PoC artifact or an as-built vs. as-used review?

A 30-Day Study Plan Keyed to Domain Weights

The plan below allocates study days roughly in proportion to the published domain weights. Adjust ratios upward for any domain where your diagnostic-test score is below 70 percent.

Figure 15.5: 30-day CCAE study plan timeline mapped to domain weights

timeline
    title 30-Day CCAE Study Plan
    section Phase 1 Foundation
        Days 1-7  : SpanFS internals
                  : Hardware editions
                  : Networking & Helios
                  : Domain 1 (22%)
    section Phase 2 Design
        Days 8-18 : Sizing & workload patterns
                  : Hybrid / multi-cloud
                  : PoC architectures
                  : Domain 2 (35%)
    section Phase 3 Security
        Days 19-23 : DataLock & FortKnox
                   : DataHawk & DSA partners
                   : Domain 3 (18%)
    section Phase 4 Integration & Practice
        Days 24-26 : REST API & Terraform
                   : Organizations & multi-tenancy
                   : Domain 4 (13%)
        Days 27-28 : Gap analysis & Siren
                   : Capacity prediction
                   : Domain 5 (12%)
        Days 29-30 : Two timed practice exams
                   : Error analysis & remediation

Phase	Days	Focus	Domain	Activities
Foundations	1–7	SpanFS, hardware editions, networking, Helios	1 (22%)	Re-read Chapters 1–5; Cohesity SpanFS / SnapTree white paper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]; Optimal Network Designs reference architecture [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf]; build a sketch of a 6-node cluster with Bond Mode 4 and MLAG.
Solution Design	8–18	Sizing, workload patterns, hybrid/multi-cloud, PoC	2 (35%)	Re-read Chapters 3, 7, 8, 9, 10; sizing exercises with realistic dedupe ratios; review Suffolk County Council and Sky Lakes Medical case studies [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/case-study/suffolk-county-council-case-study-en.pdf][Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/case-study/sky-lakes-medical-case-study-en.pdf]; design three PoC architectures (one VM-heavy, one DB-heavy, one M365-heavy).
Security	19–23	DataLock, FortKnox, DataHawk, DSA integrations	3 (18%)	Re-read Chapter 11; threat-defense architecture white paper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]; map each Data Security Alliance partner to the Cohesity capability it extends [Source: https://www.cohesity.com/company/data-security-alliance/].
Integration	24–26	REST API, Terraform, Organizations, Hybrid Extender	4 (13%)	Re-read Chapters 6 and 13; hands-on with the Helios “Try this Operation” interactive API console [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/]; write one Terraform module for a Protection Group; design a multi-tenant Organization layout.
Gap Analysis	27–28	Helios capacity, Siren, pre-checks, as-built vs as-used	5 (12%)	Re-read Chapter 14; practice reading Helios capacity prediction graphs; walk through a Siren diagnostic flow [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/].
Synthesis	29–30	Full-length practice exams under timed conditions	All	Two 60-question timed practice exams; post-exam error analysis; targeted remediation on the lowest-scoring domain.

A useful analogy for the plan: it is a marathon training schedule, not a sprint. Domain 2 is the long-run portion of the week — you cannot skip the long run and expect to finish. Domain 5 is the cooldown — necessary, but short. The synthesis weekend is the taper; the exam is race day.

Test-Day Strategy

The Day Before

Confirm that the proctoring system check is green, your ID is ready, and your environment is private.
Re-read your one-page distilled summary of domain weights, RPO/RTO patterns, and the FortKnox + DataLock + DataHawk stack.
Sleep eight hours. Do not cram.

During the Exam

Spend the first 60 seconds on the question scanning for constraints: RPO, RTO, retention, compliance, residency, budget, existing infrastructure. Underline them mentally.
Spend the next 30 seconds eliminating distractors: the option that violates a stated constraint is wrong, regardless of how attractive it otherwise looks.
If two options remain, pick the one that balances rather than the one that maximizes a single axis. CCAE rewards balance.
Flag any question that takes more than two minutes and move on. Return on the second pass with a clearer head.
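Reserve the final 10 minutes for flagged questions and a pass through any blanks, which means holding your first pass to an average of about 80 seconds per question rather than the full 90. Never leave an answer blank; there is no penalty for guessing.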
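
The pacing numbers in this list interact: holding back a review reserve shrinks the first-pass budget. The short Python sketch below works through that arithmetic; the function name is illustrative.

```python
def first_pass_budget_seconds(total_minutes: int, questions: int,
                              reserve_minutes: int = 0) -> float:
    """Average seconds available per question after holding back a review reserve."""
    return (total_minutes - reserve_minutes) * 60 / questions


no_reserve = first_pass_budget_seconds(90, 60)
with_reserve = first_pass_budget_seconds(90, 60, reserve_minutes=10)
print(no_reserve, with_reserve)
```

With no reserve the budget is 90 seconds per question; with a 10-minute reserve it drops to 80 seconds, so any question that runs to the two-minute flag threshold has to be offset by faster answers elsewhere.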

After the Exam

Pass: claim your two-year credential and start logging continuing-education hours toward renewal.
Fail: use the 14-day retake window deliberately. Do not retake immediately. Run a domain-by-domain post-mortem against your score report and target the weakest domain for one focused week, then retake [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Key Takeaway: The exam is 90 minutes for 60 scenario MCQs at 60 percent passing — a budget of 90 seconds per question. Win by reading constraints first, eliminating distractors that violate any one constraint, and picking the balanced option. Allocate study time in proportion to domain weights (22 / 35 / 18 / 13 / 12), and prioritize Domain 2 because it is the largest single block of the exam.

Chapter Summary

This chapter pulled the entire CCAE curriculum into three end-to-end architectures and one exam strategy. The global enterprise scenario showed how hub-and-spoke replication inside each region combines with one-to-many replication across regions to deliver multi-region survival, with CloudArchive supplying long-term retention and Helios SaaS providing fleet-level visibility. The healthcare scenario showed how DataLock immutability, FortKnox Self-Managed cyber vaulting, DataHawk detection and classification, KMIP-managed encryption, and Helios Self-Managed combine into a HIPAA-aligned defense-in-depth stack where every layer answers a specific compliance or ransomware requirement. The MSP scenario showed how Cohesity Organizations, Helios self-service, and REST/Terraform/Ansible automation deliver a multi-tenant DMaaS offering with metered chargeback and lifecycle automation that scales tenants without scaling support headcount.

The exam blueprint section decoded COH500 — 90 minutes, 60 questions, $200, 60 percent passing — and made the domain weights actionable. The scenario question pattern is consistent: a paragraph of constraints, four plausible options, three distractors that each fail one constraint, and one balanced answer. The 30-day plan and test-day strategy translate the blueprint into daily activities and minute-by-minute exam discipline. If you can defend each design choice in the three scenarios above against alternatives, recognize the four distractor archetypes on sight, and finish a 60-question timed practice with at least 80 percent under exam conditions, you are ready to sit for the CCAE.

Key Terms

Hub-and-spoke — Replication topology in which spoke clusters back up locally and replicate inbound to a central hub for centralized DR, retention, and reporting; the canonical pattern for global enterprises with branch sites [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].
Cyber Vault — A logically and physically isolated tertiary copy of protected data with a virtual air gap and inbound-only transfer windows. Cohesity FortKnox is the platform’s cyber vault offering, available as Cohesity-managed SaaS or self-managed [Source: https://www.cohesity.com/glossary/cyber-vault/].
DMaaS (Data Management as a Service) — Cohesity’s SaaS delivery of DataProtect, where the customer or MSP consumes regional SaaS instances rather than operating clusters; the foundation of modern BaaS offerings [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
Compliance — The set of regulatory and contractual requirements (HIPAA, PCI, FedRAMP, GDPR) that constrain architecture choices around encryption, retention, residency, and access control. The exam treats compliance as a hard, not soft, constraint.
Chargeback — Per-tenant or per-business-unit usage metering that converts platform consumption into invoiceable units. Implemented via Helios consumption metrics retrieved through the REST API [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/].
Scenario design — The architectural discipline of mapping a paragraph of business constraints to a composed Cohesity solution that satisfies all constraints while balancing competing objectives. The unit of evaluation on the CCAE exam.
Domain weighting — The published percentage allocation of exam questions to each of the five CCAE domains (22 / 35 / 18 / 13 / 12). The basis for proportional study-time allocation [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].

Cohesity Certified Architect Expert (CCAE) Certification Exam Preparation Guide

Table of Contents

Chapter 1: CCAE Exam Overview and Cohesity Platform Architecture

Learning Objectives

CCAE Exam Blueprint and Study Strategy

Domain Weightings and Number of Questions

Recommended Hands-on Prerequisites and Lab Environments

Exam Delivery, Scoring, and Recertification Cycle

How CCAE Differs from CCSE and CCPE

Cohesity DataPlatform Architecture Pillars

SpanFS Distributed File System

Hyperconverged Scale-out Node Model

MapReduce-style Indexing and Global Deduplication

Strict Consistency and Quorum Semantics

Core Services and Software Stack

Bridge, Apollo, Iris, and Magneto Services

Yoda Search Service and Global Indexing

Helios SaaS Control Plane

Marketplace Apps and the Cohesity App Framework

Cohesity Product Portfolio: DataProtect, SmartFiles, SiteContinuity, Helios

DataProtect — Backup and Recovery

SmartFiles — Files, Objects, and Unstructured Data

SiteContinuity — Disaster Recovery Orchestration

Helios — SaaS Management and Insights

Side-by-Side Product Comparison

Hardware, Cloud, and Virtual Edition Form Factors

Cohesity-Branded Appliances vs. ReadyNodes vs. Certified Partners

Virtual Edition (VE) and Cloud Edition Deployment Models

Robo Edition for Remote and Branch Offices

Choosing the Right Form Factor

Chapter Summary

Key Terms

Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics

Learning Objectives

SpanFS Data Path

Chunk files, blob files, and chunk groups

NVRAM journaling and write coalescing

Read path and locality optimizations

Garbage collection and chunk reclamation

Resiliency: RF and Erasure Coding

Replication Factor 2 vs. RF3 trade-offs

Erasure coding schemes (2:1, 4:2, 6:2)

Inline vs. post-process erasure coding

Choosing resiliency policies per View Box

Deduplication and Compression

Variable-length and fixed-length sliding-window dedupe

Global vs. local dedupe domains

Inline vs. post-process compression

Estimating dedupe ratios for different workloads

Cluster Mechanics and Quorum

Strict consistency and Paxos-based metadata

Node, disk, and chassis fault domains

Quorum loss scenarios and recovery

Rolling upgrades and maintenance mode

Worked Example: Tracing a 1 MB Write Through SpanFS

Chapter Summary

Key Terms

Chapter 3: Cluster Design, Sizing, and Capacity Planning

Learning Objectives

Workload Profiling Inputs

Front-End TB (FETB), Change Rate, and Retention

Workload Categorization: VM, NAS, DB, Physical

Daily Ingest, Replication, and Cloud Archive Throughput

Performance vs. Capacity-Driven Sizing

Sizing Tools and Calculators

Cohesity Sizing Tool Inputs and Outputs

Effective Capacity, Dedupe Assumptions, and Overheads

Sizing for SmartFiles Primary Workloads

Cloud Edition and CloudArchive Sizing

Node Selection and Cluster Topology

All-Flash vs. Hybrid vs. Dense Storage Nodes

Minimum Cluster Sizes and Scaling Increments

Mixed-Node Clusters and Constraints

Brick Mode vs. Node Mode Considerations

Capacity Planning Over Time

Modeling Growth and Tech-Refresh Cycles

Tiering Strategy Across Local, Cloud, and Tape

Reserve Capacity for Failures and Rebuilds

Reporting and Forecasting via Helios

Worked Example: 500 TB FETB Sizing Walkthrough