Cohesity Certified Architect Expert (CCAE) Certification Exam Preparation Guide
A comprehensive advanced study guide for architects designing, deploying, and operating Cohesity DataProtect, SmartFiles, and Helios environments at enterprise scale.
Table of Contents
- Chapter 1: CCAE Exam Overview and Cohesity Platform Architecture
- Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics
- Chapter 3: Cluster Design, Sizing, and Capacity Planning
- Chapter 4: Networking, DNS, and Cluster Connectivity
- Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations
- Chapter 6: Identity, Access Management, and Multi-Tenancy
- Chapter 7: Data Protection: Sources, Policies, and Protection Groups
- Chapter 8: Application-Aware Backup and Recovery Patterns
- Chapter 9: Replication, Disaster Recovery, and SiteContinuity
- Chapter 10: Cloud Integration: Archive, Tier, Replicate, and Spin
- Chapter 11: Security, Encryption, and Ransomware Resilience
- Chapter 12: SmartFiles: Files, Objects, and Unstructured Data Services
- Chapter 13: Helios SaaS, Marketplace Apps, and Automation
- Chapter 14: Performance, Monitoring, and Troubleshooting
- Chapter 15: End-to-End Architecture Scenarios and Exam Synthesis
Chapter 1: CCAE Exam Overview and Cohesity Platform Architecture
Learning Objectives
- Describe the CCAE exam blueprint, domains, weightings, and prerequisites.
- Explain the high-level architecture of the Cohesity DataPlatform and its core services.
- Differentiate between Cohesity DataProtect, SmartFiles, SiteContinuity, and Helios.
- Identify how the CCAE role fits within the Cohesity certification track (CCSE, CCPE, CCAE).
- Map physical, virtual, and cloud form factors to specific architectural use cases.
CCAE Exam Blueprint and Study Strategy
Architects who pursue the Cohesity Certified Architect Expert (CCAE) credential are signaling that they can do more than operate a backup platform — they can size, design, and defend a Cohesity Data Cloud deployment in front of customers, security teams, and CIOs. This first section unpacks what the exam actually tests, who it is for, and how to study for it efficiently.
Domain Weightings and Number of Questions
The CCAE blueprint is divided into four weighted domains. The Solution Discovery and Design domain dominates at 35%, which tells you immediately that this is an architecture exam — not a memorize-the-CLI exam [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
| Domain | Weight | Focus |
|---|---|---|
| 1. Cohesity Data Cloud Data Management Platform Architecture | 22% | Products, technology use cases and limitations, designing a DataProtect platform |
| 2. Cohesity Architecture Solution Discovery and Design | 35% | Sizing, workload-appropriate protection, hybrid/multi-cloud, Helios Self-Managed for dark sites, business alignment |
| 3. Design Security-Focused Solutions | 18% | Cyber-resiliency design, immutability, encryption, ransomware patterns |
| 4. Integrate Third-party Solutions with Cohesity | 13% | Integration patterns and Cohesity APIs |
The remaining 12% spans cross-cutting topics such as licensing, support, and lifecycle. Treat the percentages as time-budget guidance: if you have 30 study days, allocate roughly 10-11 days to Domain 2, 6-7 days to Domain 1, 5-6 days to Domain 3, about 4 days to Domain 4, and the remaining 3-4 days to the cross-cutting topics [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
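The proportional split can be checked with a few lines of Python. This is only a sketch: the function name `allocate_study_days` and the dictionary labels are illustrative, and the weights are the ones quoted in the table above plus the 12% remainder.

```python
# Blueprint weights in percent, taken from the table above; the final entry
# is the 12% remainder attributed to cross-cutting topics.
WEIGHTS = {
    "Domain 1: Platform Architecture": 22,
    "Domain 2: Solution Discovery and Design": 35,
    "Domain 3: Security-Focused Solutions": 18,
    "Domain 4: Third-party Integration": 13,
    "Cross-cutting topics": 12,
}


def allocate_study_days(total_days: int, weights: dict) -> dict:
    """Split total_days across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        name: round(total_days * weight / total_weight, 1)
        for name, weight in weights.items()
    }


plan = allocate_study_days(30, WEIGHTS)
for name, days in plan.items():
    print(f"{name}: {days} days")
```

Treat the output as a starting point; the worked example that follows shows when it makes sense to deviate from a strictly proportional plan.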
Worked example: building a study plan. An architect with strong DataProtect operational experience but limited security background should deviate from the strictly proportional split and over-invest in Domain 3. Why? Because security questions are often scenario-based (“the customer has FedRAMP requirements and a 3-2-1-1-0 mandate: which combination of features applies?”), and confidence in DataLock, FortKnox, and DataHawk pays disproportionate returns versus rote memorization of, say, a backup policy slider [Source: https://www.cohesity.com/academy/certification/].
Recommended Hands-on Prerequisites and Lab Environments
CCAE has no formal prerequisite certification, but the Cohesity Academy explicitly recommends prior hands-on experience with DataProtect, SmartFiles, SiteContinuity, and Helios, plus working knowledge of:
- VMware vSphere and Microsoft Hyper-V virtualization
- Database protection patterns (Oracle, SQL Server, SAP HANA)
- Object storage and S3 semantics
- Backup, DR, and hybrid-cloud architecture
- Cyber-resiliency and ransomware concepts [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/academy/cohesity-academy-customer-course-catalog.pdf]
The most reliable lab is a Helios sandbox tenant connected to either a Cohesity Virtual Edition cluster running on a workstation-class hypervisor or a Cloud Edition trial in AWS or Azure. The Cohesity Academy course catalog maps specific instructor-led and on-demand modules to each exam section, which is the highest-signal study artifact you can use [Source: https://www.cohesity.com/academy/].
Exam Delivery, Scoring, and Recertification Cycle
| Attribute | Value |
|---|---|
| Duration | 90 minutes |
| Cost | $200 USD |
| Passing score | 60% |
| Validity | 2 years |
| Retake policy | 14-day waiting period between attempts |
| Delivery | Online proctored |
Source data is consistent across the official preparation guide and independent exam-tracking sites [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf] [Source: https://www.nwexam.com/cohesity]. The two-year validity window means you should plan recertification roughly six months before expiry, especially because Cohesity Data Cloud feature velocity is high and the blueprint occasionally adds topics such as DirectIO, FortKnox, and DataHawk [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/netbackup-directio-powering-the-next-era-of-cyber-resilience.pdf].
How CCAE Differs from CCSE and CCPE
The Cohesity certification track is layered. CCAE sits at the top, in the architect/expert tier, and the foundational, associate, and professional credentials (CCA, CCSE, CCPE) feed candidates into it [Source: https://www.cohesity.com/academy/certification/].
| Certification | Audience | Focus | Depth |
|---|---|---|---|
| CCA (Cohesity Certified Associate) | Operators, junior admins | Cluster basics, daily operations | Foundational |
| CCSE (Cohesity Certified Sales Engineer) | Pre-sales SEs | Product positioning, demo, light design | Associate |
| CCPE (Cohesity Certified Professional Engineer) | Senior admins, consultants | Implementation, deployment, Day-2 ops | Professional |
| CCAE (Cohesity Certified Architect Expert) | Solutions architects, senior consultants | End-to-end design, sizing, security, multi-cloud | Expert |
Analogy. Think of the certification path the way an aviation track works: a CCSE is a flight instructor who can demonstrate the cockpit, a CCPE is a line pilot who can fly the aircraft daily, and a CCAE is the aerospace architect who designs the airframe and mission profile. The CCAE exam consequently leans heavily on why you would choose a topology rather than how you click through a wizard [Source: https://www.cohesity.com/academy/].
Figure 1.1: Cohesity certification track progression from foundational to expert tier
flowchart TD
CCA["CCA - Cohesity Certified Associate<br/>Operators and junior admins<br/>Cluster basics, daily operations"]
CCSE["CCSE - Cohesity Certified Sales Engineer<br/>Pre-sales engineers<br/>Product positioning, demo, light design"]
CCPE["CCPE - Cohesity Certified Professional Engineer<br/>Senior admins and consultants<br/>Implementation, deployment, Day-2 ops"]
CCAE["CCAE - Cohesity Certified Architect Expert<br/>Solutions architects, senior consultants<br/>End-to-end design, sizing, security, multi-cloud"]
CCA --> CCSE
CCA --> CCPE
CCSE --> CCAE
CCPE --> CCAE
style CCA fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style CCSE fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style CCPE fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style CCAE fill:#238636,stroke:#58a6ff,color:#ffffff
Key Takeaway: CCAE is a 90-minute, four-domain, design-oriented exam dominated by the 35%-weighted Solution Discovery and Design domain. Plan 30 days of preparation, weight your time against the blueprint, and treat hands-on Helios + Virtual Edition labs as non-negotiable. The certification is the architect-tier capstone above CCA, CCSE, and CCPE.
Cohesity DataPlatform Architecture Pillars
Every CCAE exam scenario eventually traces back to four architectural pillars: a single distributed file system, a hyperconverged scale-out node model, MapReduce-style background services, and strict consistency. If you internalize these pillars, you can derive most design answers from first principles.
SpanFS Distributed File System
SpanFS is Cohesity’s distributed, web-scale file system that consolidates backups, files, objects, dev/test copies, and analytics onto a single tier of storage [Source: https://www.cohesity.com/platform/spanfs/]. Architecturally, SpanFS layers four major subsystems:
- Access Layer — exposes industry-standard NFS, SMB, and S3 protocols (with OST and DirectIO for NetBackup integration) on the same volumes via virtual IPs, with no master node and no protocol-specific choke point [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
- I/O Engine — chunks data, performs variable-length global deduplication (inline or post-process), compresses, encrypts, indexes, and tiers blocks across SSD, HDD, and cloud based on access pattern [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
- Metadata Management — uses a distributed key-value store built on a patented B+ tree, replicated and sharded consistently across every node. SnapTree (paired with SpanFS) implements Distributed Redirect-on-Write (D-ROW) for unlimited snapshots and clones with effectively zero performance penalty [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
- Storage and Distribution — fully distributed across hyperconverged x86 nodes, dynamically rebalanced, and protected by erasure coding or replication [Source: https://blogs.vmware.com/affiliates/cohesity-spanfs-the-difference-maker-in-the-enterprise-and-secondary-storage-architectures].
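The redirect-on-write idea mentioned under Metadata Management can be illustrated with a toy model. The sketch below is not Cohesity's implementation: the `View` and `Leaf` classes, the two-level tree, and the `FANOUT` value are all assumptions chosen to keep the example short. It shows only the core property: taking a snapshot copies pointers rather than data, and a later write redirects to a fresh leaf so the snapshot keeps seeing the old contents.

```python
from __future__ import annotations

from dataclasses import dataclass, field

FANOUT = 4  # hypothetical number of leaf buckets under each root


@dataclass
class Leaf:
    blocks: dict[int, bytes] = field(default_factory=dict)


@dataclass
class View:
    """A root holding pointers to leaves; snapshots share leaves until written."""

    leaves: list[Leaf]

    @classmethod
    def empty(cls) -> View:
        return cls([Leaf() for _ in range(FANOUT)])

    def snapshot(self) -> View:
        # Copy only the root's pointer list, so the cost does not depend
        # on how much data the leaves hold.
        return View(list(self.leaves))

    def read(self, block_no: int) -> bytes | None:
        return self.leaves[block_no % FANOUT].blocks.get(block_no)

    def write(self, block_no: int, data: bytes) -> None:
        # Redirect the write to a new leaf; other views keep the old leaf.
        # (Simplification: the leaf is copied even if no one else shares it.)
        idx = block_no % FANOUT
        new_leaf = Leaf(dict(self.leaves[idx].blocks))
        new_leaf.blocks[block_no] = data
        self.leaves[idx] = new_leaf


live = View.empty()
live.write(1, b"v1")
snap = live.snapshot()
live.write(1, b"v2")

print(snap.read(1), live.read(1))           # snapshot keeps the old block
print(snap.leaves[0] is live.leaves[0])     # untouched leaves stay shared
```

A production design would use a deeper tree and reference counting to decide when a leaf can be modified in place; the toy version trades that away for readability.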
Note: A 2015 USENIX paper titled “SpanFS” describes an unrelated academic project. The authoritative source for the Cohesity implementation is Cohesity’s own SpanFS and SnapTree whitepaper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf] [Source: https://www.usenix.org/system/files/conference/atc15/atc15-paper-kang.pdf].
Analogy. SpanFS is to a Cohesity cluster what HDFS is to a Hadoop cluster — except SpanFS adds strict consistency, multi-protocol access, snapshots, and tenant isolation, all of which HDFS lacks natively.
Hyperconverged Scale-out Node Model
A Cohesity cluster is a shared-nothing collection of x86 nodes; each node contributes CPU, memory, NVMe/SSD (used for metadata and write caching), and capacity HDDs to a single SpanFS namespace. There is no separate metadata controller, and no node is more privileged than another. This means:
- Throughput scales roughly linearly with node count, since every added node contributes its own CPU, memory, and disks to the data path.
- Metadata service capacity grows with each added node.
- Failure of any single node degrades but does not stop the cluster (assuming RF or EC policy permits) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-platform-security-white-paper-en.pdf].
| Property | Traditional Two-Tier Backup | Cohesity Hyperconverged |
|---|---|---|
| Compute / storage scaling | Independent, often imbalanced | Coupled, balanced per node |
| Metadata controller | Dedicated server, bottleneck risk | Distributed across all nodes |
| Add-capacity workflow | Re-rack, re-license, migrate | Add ReadyNode, auto-rebalance |
| Failure blast radius | Often whole-array | Single node, EC-bounded |
MapReduce-style Indexing and Global Deduplication
Apollo, the cluster-wide background services engine, runs MapReduce-style jobs across every node to perform garbage collection, post-process deduplication, indexing, file analytics, and integrity scrubbing [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. The deduplication itself is variable-length sliding-window, applied globally across the entire cluster — meaning two backup jobs that ingest the same OS image from different sources still collapse to a single physical chunk.
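To make "variable-length, global" deduplication concrete, here is a minimal sketch. It swaps in a simple gear-style rolling hash for boundary detection, because the source does not specify Cohesity's actual chunking algorithm. The chunk-size constants, the `chunks` function, and the `ChunkStore` class are all hypothetical.

```python
import hashlib
import random

_rng = random.Random(0)                        # fixed seed: deterministic table
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK = (1 << 11) - 1                           # assumed: ~2 KiB average chunk
MIN_CHUNK, MAX_CHUNK = 256, 8192               # assumed size bounds


def chunks(data: bytes):
    """Yield variable-length chunks whose boundaries depend on content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


class ChunkStore:
    """One store for the whole cluster: one copy per unique fingerprint."""

    def __init__(self):
        self.by_hash = {}

    def ingest(self, data: bytes) -> list:
        recipe = []
        for chunk in chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.by_hash.setdefault(digest, chunk)   # store only if new
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list) -> bytes:
        return b"".join(self.by_hash[d] for d in recipe)


rng = random.Random(1)
os_image = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

store = ChunkStore()
recipe_a = store.ingest(os_image)              # backup job from source A
unique_after_a = len(store.by_hash)
recipe_b = store.ingest(os_image)              # same image arriving from source B
print(unique_after_a, len(store.by_hash))      # second ingest adds no new chunks
```

Because boundaries are derived from the bytes themselves rather than from fixed offsets, data that is shifted by an insertion can realign on later chunks. That is the usual argument for variable-length over fixed-block deduplication.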
Worked example: dedupe ratio for a mixed VM estate. A customer protects 500 Windows VMs averaging 80 GB each, with significant OS overlap. Front-end TB (FETB) is roughly 40 TB. Variable-length global dedupe typically delivers 4-6x on this profile, and inline compression layers another 1.5-2x on top, for a combined reduction of 6-12x. Effective stored capacity therefore lands near 3.3-6.7 TB before applying RF/EC, which is why Cohesity sizing tools can return surprisingly low usable-capacity requirements. We unpack the math in Chapter 3, but the architectural point is that Apollo, not Bridge, is what makes those ratios stick over time through post-process re-deduplication and garbage collection [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
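The arithmetic in this example is easy to reproduce. The sketch below assumes decimal terabytes and simply multiplies the quoted dedupe and compression ranges; the helper name `stored_capacity_tb` is illustrative and is not part of any Cohesity sizing tool.

```python
def stored_capacity_tb(front_end_tb: float, dedupe: float, compression: float) -> float:
    """Physical TB after data reduction, before RF/EC overhead."""
    return front_end_tb / (dedupe * compression)


vm_count, avg_vm_gb = 500, 80
fetb = vm_count * avg_vm_gb / 1000             # decimal TB, as in the text

best = stored_capacity_tb(fetb, dedupe=6, compression=2.0)    # 12x combined
worst = stored_capacity_tb(fetb, dedupe=4, compression=1.5)   # 6x combined

print(f"FETB: {fetb:.0f} TB")
print(f"Stored (before RF/EC): {best:.1f} to {worst:.1f} TB")
```

Real sizing adds resiliency overhead, retention, and growth on top of these figures, which is the subject of Chapter 3.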
Strict Consistency and Quorum Semantics
SpanFS is strictly consistent, meaning any node can serve any I/O for any object and clients always see the latest committed state [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. The cluster uses a Paxos-style quorum to maintain metadata agreement; a cluster requires a majority of nodes to remain healthy to continue accepting writes. This is the architectural reason that the minimum supported cluster size is three or four nodes depending on form factor: smaller clusters cannot maintain quorum during a single-node failure [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
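The quorum argument reduces to simple majority arithmetic. The sketch below is generic majority-quorum math, not a description of Cohesity's actual membership protocol, and the function names are illustrative.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority remains."""
    return nodes - majority(nodes)


print("nodes  majority  tolerated_failures")
for n in range(1, 6):
    print(f"{n:>5}  {majority(n):>8}  {tolerated_failures(n):>18}")
```

One- and two-node clusters tolerate zero failures under this rule, while three and four nodes each tolerate one. That is consistent with the three- or four-node minimums described above.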
Key Takeaway: SpanFS combines a distributed-metadata architecture, hyperconverged x86 nodes, MapReduce-style background services in Apollo, and strict consistency to deliver linear scale without master/slave bottlenecks. Quorum requirements drive the three- or four-node minimums you will see in every sizing question.
Core Services and Software Stack
Beneath the CCAE exam’s scenario language are a handful of cooperating services. Memorizing what they do — and what they don’t do — is one of the highest-yield activities for Domain 1.
Bridge, Apollo, Iris, and Magneto Services
Every node runs the same software stack. Four services do most of the architecturally interesting work [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]:
| Service | Role | Responsibilities |
|---|---|---|
| Bridge | SpanFS data path | Chunking, dedupe, compression, encryption, erasure coding, tiering, NFS/SMB/S3 protocol stacks |
| Apollo | Background analytics & MapReduce | Garbage collection, post-process dedupe, indexing, scrubbing, file analytics |
| Magneto | Data protection orchestration | Backups, snapshots, replication, archive, recovery workflows; talks to vCenter, DBs, NAS, cloud |
| Iris | Management UI/control plane | Web UI, REST API, CLI dispatch, RBAC enforcement |
A second tier of services rounds out the platform. ScribeStore is the underlying distributed key-value metadata store that holds inodes, chunk locations, and snapshot trees [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Yoda is the global indexing and search service that powers cross-cluster file and object search.
Analogy. Picture a hospital. Bridge is the operating theatre — it is where actual patient I/O happens. Magneto is the intake and discharge coordinator — it decides when patients (workloads) arrive, how often they get checked, and where they go. Apollo is the night cleaning crew — it sweeps, sorts, and reorganizes the building so the next day’s operations stay efficient. Iris is the front desk — every external interaction goes through it. Yoda is the medical records librarian.
Figure 1.2: Layered Cohesity DataPlatform architecture from hardware to control plane
flowchart LR
subgraph L1["Hardware Layer"]
HW["x86 Nodes<br/>CPU + Memory<br/>NVMe/SSD + HDD"]
end
subgraph L2["Storage Foundation"]
SPANFS["SpanFS<br/>Distributed File System<br/>NFS / SMB / S3 / OST"]
SCRIBE["ScribeStore<br/>Distributed KV Metadata"]
end
subgraph L3["Core Services"]
BRIDGE["Bridge<br/>Data Path"]
APOLLO["Apollo<br/>MapReduce Background Jobs"]
MAGNETO["Magneto<br/>Protection Orchestration"]
IRIS["Iris<br/>UI / API / RBAC"]
YODA["Yoda<br/>Global Search"]
end
subgraph L4["Workload Products"]
DP["DataProtect<br/>Backup and Recovery"]
SF["SmartFiles<br/>Files and Objects"]
SC["SiteContinuity<br/>DR Orchestration"]
end
subgraph L5["Control Plane"]
HELIOS["Helios<br/>SaaS Multicloud Management<br/>and AI Insights"]
end
L1 --> L2
L2 --> L3
L3 --> L4
L4 --> L5
style L1 fill:#0d1117,stroke:#58a6ff,color:#ffffff
style L2 fill:#0d1117,stroke:#58a6ff,color:#ffffff
style L3 fill:#0d1117,stroke:#58a6ff,color:#ffffff
style L4 fill:#0d1117,stroke:#58a6ff,color:#ffffff
style L5 fill:#0d1117,stroke:#58a6ff,color:#ffffff
Yoda Search Service and Global Indexing
Yoda makes global search possible across an entire fleet of clusters. When a backup completes, Magneto signals Apollo to index file paths and metadata; that index is then surfaced through Yoda for queries originating in either the local Iris UI or the Helios global console. This is what allows a CCAE-exam scenario such as “find every PDF named contract-2024.pdf across 14 clusters in 8 regions” to return in seconds [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
Helios SaaS Control Plane
Helios is Cohesity’s multicloud SaaS management plane that consolidates DataProtect, SmartFiles, and SiteContinuity into a single operations and reporting surface [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/]. It does not host customer backup data; it hosts control and insight:
- Global dashboards, SLA reporting, and capacity forecasting
- Cross-cluster search via Yoda
- AI-driven anomaly detection and ransomware threat indicators
- Policy-based fleet management
- Entry point for SaaS-only services such as DataProtect-as-a-Service, FortKnox cyber vault, and DataHawk
For air-gapped or sovereign environments, Cohesity offers Helios Self-Managed, a customer-hosted variant called out explicitly in the CCAE Solution Discovery and Design domain [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
Marketplace Apps and the Cohesity App Framework
The Cohesity App Framework allows third parties (and customers) to deploy containerized applications directly onto cluster nodes, sandboxed away from the data path. Typical app categories include compliance scanners, anti-virus engines (often via ICAP), eDiscovery tools, and analytics workloads. The architectural advantage is data gravity — the apps run where the data already lives rather than pulling petabytes across the network [Source: https://www.cohesity.com/blogs/analyst-reports-highlight-the-strength-of-cohesitys-multi-use-platform/].
Key Takeaway: Bridge, Apollo, Magneto, and Iris form the four-service backbone; ScribeStore and Yoda support metadata and global search; Helios delivers SaaS-tier visibility; and the Marketplace plus App Framework let analytics run on top of the data without copying it elsewhere. Most exam scenarios reduce to identifying which of these services owns a given behavior.
Cohesity Product Portfolio: DataProtect, SmartFiles, SiteContinuity, Helios
Domain 1 of the exam expects you to differentiate the four headline products and recognize when each is the right answer. They are best understood as personalities sharing the same DataPlatform body.
DataProtect — Backup and Recovery
DataProtect is Cohesity’s flagship enterprise backup product. It protects:
- VMware vSphere, Microsoft Hyper-V, and Nutanix AHV virtual machines
- Physical Linux and Windows servers
- Network-attached storage via SMB, NFS, and NDMP
- Databases including Oracle, SQL Server, SAP HANA, and MongoDB
- Microsoft 365 (Exchange, OneDrive, SharePoint, Teams)
- Kubernetes workloads
- Public cloud workloads (AWS EC2/EBS/RDS, Azure, GCP) [Source: https://www.cohesity.com/cohesity-vs-commvault/]
Internally, DataProtect uses Magneto-driven incremental-forever pipelines that persist recovery points as deduplicated SnapTree snapshots inside SpanFS. This is what enables Instant Mass Restore of thousands of VMs from any historical point in time: the data is already in a queryable, mountable filesystem rather than locked in a proprietary backup or tape format [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-guides/the-essential-guide-to-modern-data-management-en.pdf].
SmartFiles — Files, Objects, and Unstructured Data
SmartFiles is Cohesity’s unified file and object service for primary and secondary unstructured data. Because it shares SpanFS with DataProtect, secondary copies of files can be created without provisioning a separate storage silo. Distinguishing characteristics:
- Multi-protocol access: SMB3, NFSv3/v4, S3 — including cross-protocol reads (a file written via SMB can be read via S3)
- Immutability for ransomware-resistant archive tiers
- Encryption and access-pattern analytics
- Application use cases ranging from file-share consolidation to S3 object storage for cloud-native apps and AI/ML pipelines [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/]
SiteContinuity — Disaster Recovery Orchestration
SiteContinuity converges backup and DR onto one platform. Where competitors require separate products (Veeam plus Zerto plus VMware SRM, for instance), SiteContinuity orchestrates DR runbooks directly on top of DataProtect snapshots and replication targets [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf]. Capabilities include:
- Boot-order and dependency mapping
- Network re-IP and VLAN mapping during failover
- Non-disruptive DR test failovers
- Failback orchestration
- Near-zero RTO with Instant Mass Restore-backed runbooks
Helios — SaaS Management and Insights
Helios stitches everything together. It is the single control plane through which you observe and govern any number of DataProtect, SmartFiles, and SiteContinuity instances, on-premises or in cloud, with optional Helios Self-Managed for air-gapped sites [Source: https://futurumgroup.com/document/cohesity-helios-mcdm-product-brief/].
Side-by-Side Product Comparison
| Dimension | DataProtect | SmartFiles | SiteContinuity | Helios |
|---|---|---|---|---|
| Primary purpose | Backup and recovery | Primary/secondary file & object storage | DR orchestration & failover | SaaS management & AI/insights |
| Core workloads | VMs, DBs, NAS, M365, cloud | Unstructured data, archives, S3 apps | Mission-critical apps, RTO/RPO tiers | All clusters & services globally |
| Underlying tech | Magneto + SpanFS | SpanFS multi-protocol Views | DataProtect snapshots + replication + runbooks | Cloud control plane over all clusters |
| Replaces | Veeam, Veritas NetBackup, Commvault | Isilon, NetApp, ECS | VMware SRM, Zerto | Per-cluster UIs, separate analytics tools |
| Delivery | Software on cluster / SaaS | Software on cluster | Software + Helios | SaaS (or Self-Managed) |
[Source: https://www.cohesity.com/blogs/analyst-reports-highlight-the-strength-of-cohesitys-multi-use-platform/] [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf] [Source: https://www.peerspot.com/products/comparisons/cohesity-data-cloud_vs_opentext-data-protector]
Worked example: picking the right product mix. A retailer wants to consolidate a Veeam-backed VMware estate, retire an aging Isilon cluster, and add a documented DR plan for its top-20 store-operations apps. The architect-grade answer is: deploy DataProtect for the VMware workloads, repurpose the same cluster’s SpanFS as SmartFiles to absorb the Isilon shares, and add SiteContinuity to author and test failover runbooks for the top-20 apps, all observed through one Helios tenant. Three products, one platform, one license envelope.
Figure 1.3: Cohesity Data Cloud product family taxonomy
graph TD
DC["Cohesity Data Cloud"]
DP_PLATFORM["DataPlatform<br/>Workload-facing products<br/>sharing SpanFS"]
HELIOS["Helios<br/>SaaS Control Plane<br/>(or Helios Self-Managed)"]
DC --> DP_PLATFORM
DC --> HELIOS
DP["DataProtect<br/>Backup and Recovery"]
SF["SmartFiles<br/>Files and Objects"]
SCONT["SiteContinuity<br/>DR Orchestration"]
FK["FortKnox<br/>SaaS Cyber Vault"]
DH["DataHawk<br/>Threat Detection<br/>and Classification"]
DP_PLATFORM --> DP
DP_PLATFORM --> SF
DP_PLATFORM --> SCONT
DP_PLATFORM --> FK
DP_PLATFORM --> DH
HELIOS -.observes.-> DP
HELIOS -.observes.-> SF
HELIOS -.observes.-> SCONT
HELIOS -.observes.-> FK
HELIOS -.observes.-> DH
style DC fill:#238636,stroke:#58a6ff,color:#ffffff
style DP_PLATFORM fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style HELIOS fill:#a371f7,stroke:#58a6ff,color:#ffffff
style DP fill:#0d1117,stroke:#58a6ff,color:#ffffff
style SF fill:#0d1117,stroke:#58a6ff,color:#ffffff
style SCONT fill:#0d1117,stroke:#58a6ff,color:#ffffff
style FK fill:#0d1117,stroke:#58a6ff,color:#ffffff
style DH fill:#0d1117,stroke:#58a6ff,color:#ffffff
Key Takeaway: DataProtect, SmartFiles, and SiteContinuity are the three workload-facing products that share a single SpanFS tier and Magneto orchestration. Helios is the SaaS control plane that observes and governs them all. The exam frequently asks which product replaces a legacy point tool — memorize the replacement column above.
Hardware, Cloud, and Virtual Edition Form Factors
Domain 2 (Solution Discovery and Design) routinely asks you to choose a form factor. The wrong hardware decision can sink an otherwise-correct architecture, so understand the trade-offs.
Cohesity-Branded Appliances vs. ReadyNodes vs. Certified Partners
ReadyNodes are pre-validated 1U or 2U x86 reference appliances sold under a fixed bill of materials. Each ReadyNode ships with a balanced ratio of CPU, memory, NVMe/SSD (used for metadata and write caching), and high-density HDDs [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. ReadyNodes are typically branded by partners — Cisco, HPE, Dell, Lenovo — and sold as turnkey hardware that runs Cohesity software.
| Form factor | Typical use | Procurement | Support model |
|---|---|---|---|
| Cohesity-branded appliance | Customers wanting a single throat to choke | Direct from Cohesity | Cohesity TAC end-to-end |
| ReadyNode (Cisco UCS, HPE Apollo, etc.) | Customers with hardware-vendor preferences or existing partnerships | OEM partner | OEM hardware + Cohesity software TAC |
| Certified third-party server | Custom sizing or specialized hardware | Customer-procured under HCL | Customer-led integration, Cohesity software TAC |
Virtual Edition (VE) and Cloud Edition Deployment Models
Cohesity software also runs as virtual machines:
- Virtual Edition (VE) runs on VMware vSphere or Microsoft Hyper-V. It is ideal for ROBO sites, dark-site management clusters, or small environments where deploying a physical appliance is uneconomical [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
- Cloud Edition runs as native VMs in AWS, Azure, or GCP and is the foundation of CloudReplicate and CloudSpin recovery patterns [Source: https://blogs.vmware.com/affiliates/cohesity-spanfs-the-difference-maker-in-the-enterprise-and-secondary-storage-architectures].
VE and Cloud Edition share the same Bridge/Apollo/Magneto stack as physical clusters; the architectural difference is that storage is supplied by the underlying hypervisor or cloud-provider block service rather than directly attached HDDs.
Robo Edition for Remote and Branch Offices
Robo Edition is a small-footprint variant tuned for branch offices. It typically runs a one- or three-node cluster on lighter hardware, replicates back to a primary regional cluster, and is centrally managed via Helios so that local IT staff aren’t required.
Choosing the Right Form Factor
| Scenario | Recommended form factor | Why |
|---|---|---|
| 200 TB enterprise data center, mixed VM/DB workloads | ReadyNode (Cisco/HPE) or branded appliance | Predictable performance, dense capacity, partner support |
| 50 TB single dark-site classified environment | Physical cluster + Helios Self-Managed | No SaaS dependency, on-prem control |
| 30 retail branch offices, 2 TB each | Robo Edition replicating to regional hub | Small footprint, central Helios management |
| AWS-resident DR target for on-prem clusters | Cloud Edition | Native AWS, supports CloudReplicate and CloudSpin |
| Lab / proof-of-concept | Virtual Edition on existing vSphere | Zero hardware cost, fast spin-up |
| Production on customer-standard Cisco UCS shop | Cisco-branded ReadyNode | Fits procurement and operations model |
Worked example: a hybrid topology. An insurance company runs two primary data centers (East and Central US) and 18 branch offices. The architect should place a pair of large physical ReadyNode clusters across the two primary sites for production protection, deploy Robo Edition at each branch with replication back to the nearest primary cluster, and stand up a Cloud Edition cluster in AWS us-west-2 as a third-site DR target. Helios SaaS unifies fleet management; the local DC clusters can additionally archive to S3 Glacier for long-term retention. Notice how the form-factor decision is downstream of a deeper requirement (RPO, RTO, branch IT capability, sovereignty), which is exactly how the exam frames its scenarios.
Figure 1.4: Form factor decision tree for Cohesity deployment placement
flowchart TD
START["New Cohesity Workload<br/>Identify Deployment Location"]
Q1{"Where will the<br/>cluster physically run?"}
Q_DC{"Data center<br/>requirements?"}
Q_BRANCH{"Branch office<br/>footprint?"}
Q_CLOUD{"Cloud-resident<br/>workload?"}
Q_LAB{"Lab, POC, or<br/>dark site?"}
APPLIANCE["Cohesity-Branded Appliance<br/>Single throat to choke<br/>Direct from Cohesity TAC"]
READYNODE["Partner ReadyNode<br/>(Cisco, HPE, Dell, Lenovo)<br/>OEM hardware + Cohesity software"]
CERTIFIED["Certified Third-Party Server<br/>Custom sizing under HCL"]
ROBO["Robo Edition<br/>1- or 3-node small footprint<br/>Replicates to regional hub"]
CLOUDED["Cloud Edition<br/>Native VMs in AWS / Azure / GCP<br/>CloudReplicate and CloudSpin"]
VE["Virtual Edition<br/>VMware vSphere or Hyper-V<br/>Zero hardware cost"]
START --> Q1
Q1 -->|On-premises DC| Q_DC
Q1 -->|Remote branch| Q_BRANCH
Q1 -->|Public cloud| Q_CLOUD
Q1 -->|Test or air-gapped| Q_LAB
Q_DC -->|Vendor-agnostic, turnkey| APPLIANCE
Q_DC -->|Existing OEM partnership| READYNODE
Q_DC -->|Specialized hardware needed| CERTIFIED
Q_BRANCH -->|Small site, 1-5 TB| ROBO
Q_CLOUD -->|DR target or cloud workload| CLOUDED
Q_LAB -->|Hypervisor available| VE
Q_LAB -->|Sovereign or air-gapped| READYNODE
style START fill:#238636,stroke:#58a6ff,color:#ffffff
style Q1 fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style Q_DC fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style Q_BRANCH fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style Q_CLOUD fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style Q_LAB fill:#1f6feb,stroke:#58a6ff,color:#ffffff
style APPLIANCE fill:#0d1117,stroke:#58a6ff,color:#ffffff
style READYNODE fill:#0d1117,stroke:#58a6ff,color:#ffffff
style CERTIFIED fill:#0d1117,stroke:#58a6ff,color:#ffffff
style ROBO fill:#0d1117,stroke:#58a6ff,color:#ffffff
style CLOUDED fill:#0d1117,stroke:#58a6ff,color:#ffffff
style VE fill:#0d1117,stroke:#58a6ff,color:#ffffff
Key Takeaway: ReadyNodes are the workhorse physical form factor; Virtual Edition addresses ROBO, dark-site, and lab scenarios; Cloud Edition powers cloud-resident DR; and Robo Edition addresses branch deployments. Form-factor choice is downstream of business and security requirements — not the other way around.
Chapter Summary
The Cohesity Certified Architect Expert (CCAE) exam tests an architect’s ability to design, size, secure, and integrate Cohesity Data Cloud deployments. The blueprint is dominated by the 35%-weighted Solution Discovery and Design domain, with substantial weight given to Platform Architecture (22%), Security-focused Solutions (18%), and Third-party Integration (13%). The exam runs 90 minutes, costs $200, requires 60% to pass, and is valid for two years — and it sits at the top of a certification track that includes the CCA, CCSE, and CCPE credentials [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
Architecturally, every Cohesity scenario reduces to four pillars. First, SpanFS — the distributed file system that consolidates backup, file, object, and analytics data on one tier with strict consistency, multi-protocol access, and SnapTree-based unlimited snapshots [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Second, the hyperconverged scale-out node model where Bridge, Apollo, Magneto, Iris, ScribeStore, and Yoda services run on every x86 node. Third, MapReduce-style background services — Apollo’s role — that maintain global variable-length deduplication, garbage collection, and indexing without disrupting the data path. Fourth, quorum-driven consistency that imposes the familiar three- or four-node minimums [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/].
The four headline products — DataProtect (backup and recovery), SmartFiles (files and objects), SiteContinuity (DR orchestration), and Helios (SaaS management) — are different personalities sharing the same SpanFS body, and the form factors (Cohesity-branded appliances, partner ReadyNodes, Virtual Edition, Cloud Edition, Robo Edition) let architects place that body anywhere from a sovereign air-gapped vault to a public-cloud DR target. Learn which service owns which behavior, which product replaces which legacy tool, and which form factor matches which business constraint, and you will recognize the right answer in roughly 70% of CCAE scenario questions [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-guides/the-essential-guide-to-modern-data-management-en.pdf].
Key Terms
| Term | Definition |
|---|---|
| SpanFS | Cohesity’s distributed, web-scale file system that consolidates backups, files, objects, dev/test, and analytics on a single tier; exposes NFS, SMB, S3, OST, and DirectIO via virtual IPs with no master node. |
| DataPlatform | The composite Cohesity software stack — SpanFS plus the Bridge, Apollo, Magneto, Iris, ScribeStore, and Yoda services — that runs on every cluster node. |
| DataProtect | Cohesity’s flagship enterprise backup and recovery product covering VMs, physical servers, databases, NAS, M365, Kubernetes, and public-cloud workloads via Magneto-driven pipelines. |
| Helios | Cohesity’s SaaS multicloud control plane (with a Helios Self-Managed option for dark sites) that unifies monitoring, policy, global search, and AI-driven anomaly detection across clusters. |
| SmartFiles | Cohesity’s unified file and object service offering NFS, SMB, and S3 access on the same data, with immutability and analytics, sharing SpanFS with DataProtect. |
| Bridge | The distributed scale-out file system data-path service that performs chunking, dedupe, compression, encryption, erasure coding, tiering, and protocol serving for SpanFS. |
| Apollo | The cluster-wide MapReduce-style background services engine that performs garbage collection, post-process dedupe, indexing, file analytics, and integrity scrubbing. |
| Magneto | The data protection orchestration service that drives backups, snapshots, replication, archive, and recovery workflows on top of SpanFS. |
| ReadyNode | A pre-validated 1U or 2U x86 reference appliance (Cisco, HPE, Dell, Lenovo, etc.) with balanced CPU, memory, NVMe/SSD, and HDD that is the standard physical form factor for Cohesity clusters. |
| Iris | The management UI and control-plane service that serves the Cohesity web interface, REST APIs, and CLI; enforces RBAC. |
| Yoda | The global indexing and search service that powers cross-cluster file and object search via Helios. |
| ScribeStore | The underlying distributed key-value metadata store that holds inodes, chunk locations, snapshot trees, and indexes; replicated and consistently sharded across nodes. |
| SnapTree | Cohesity’s snapshot and clone technology built on Distributed Redirect-on-Write (D-ROW), enabling unlimited snaps and clones with no performance impact. |
| SiteContinuity | Cohesity’s converged backup-plus-DR product providing automated failover, failback, and non-disruptive DR test runbooks on top of DataProtect snapshots. |
| Virtual Edition (VE) | Cohesity software delivered as VMware or Hyper-V virtual machines for ROBO, dark-site, or lab use. |
| Cloud Edition | Cohesity software delivered as native cloud VMs in AWS, Azure, or GCP, enabling CloudReplicate and CloudSpin DR patterns. |
| CCAE | Cohesity Certified Architect Expert — the architect-tier certification covering platform architecture, design, security, and third-party integration. |
Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics
If Chapter 1 introduced the cast of services that make up the Cohesity DataPlatform, this chapter pulls back the floor and exposes the plumbing. Beneath every backup job, every SmartFiles share, every Helios SLA report sits a single distributed file system: SpanFS. SpanFS is responsible for accepting writes from many protocols, deduplicating and compressing them, replicating or erasure-coding them across nodes, and presenting a strictly consistent global namespace — all without a single master node. For a CCAE candidate, the most common architectural mistakes (under-sized clusters, unrecoverable failure domains, dedup ratios that never materialize) trace back to misunderstandings of how SpanFS actually behaves on the wire and on disk. This chapter dissects that behavior.
Learning Objectives
- Explain how SpanFS chunks, fingerprints, and stores data across nodes using chunk files, blob files, and the SnapTree metadata index.
- Describe Replication Factor (RF) and Erasure Coding (EC) trade-offs, including stripe placement, minimum cluster sizes, and when each scheme is appropriate.
- Trace a write I/O end-to-end from client through the Bridge service, NVRAM journal, IO Engine, and into chunk files on persistent media.
- Predict how a cluster behaves under disk failure, node failure, chassis failure, and quorum loss, including rebuild times and management-plane availability.
- Choose deduplication and compression policies (inline vs. post-process, variable-length scope) appropriate for VM, NAS, and database workloads.
SpanFS Data Path
The SpanFS data path is the sequence of services and structures that turn a client write into durable, deduplicated, replicated bytes on persistent media. Every Cohesity feature — backup, SmartFiles share, replication target — is ultimately a consumer of this data path.
Chunk files, blob files, and chunk groups
SpanFS stores user data as chunks, not as files in the traditional Linux sense. When data enters the cluster, the IO Engine runs a Rabin rolling hash across the byte stream and slices it at content-defined boundaries. Each resulting chunk is fingerprinted with SHA-1 and looked up in the cluster-wide deduplication hash table [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
Three on-disk constructs hold these chunks together:
- Chunk file: the smallest unit of user data persisted to disk. After dedup and compression, each unique chunk lives in a chunk file, typically tens of kilobytes in size.
- Blob file: a container that aggregates many chunk files belonging to a logical object (a VM disk, a SmartFiles file, a database datafile). A blob file exposes a virtual byte range and points into the chunks that back it.
- Chunk group: the resiliency unit. Multiple chunk files are bundled into a chunk group that is then replicated (RF) or erasure-coded (EC) as a whole stripe across fault domains. Chunk groups let the cluster amortize the overhead of EC encoding across many small chunks.
The relationship between these three objects is captured in SnapTree, a distributed B+ tree built atop the cluster-wide key-value store. SnapTree’s root nodes represent Views or files; intermediate and leaf nodes ultimately resolve a logical offset within an object to the specific chunk file holding that data [Source: https://www.cohesity.com/blogs/cohesity-spanfs-snaptree/]. Because SnapTree uses redirect-on-write semantics (the D-ROW design listed in the Chapter 1 key terms), a snapshot or clone is just a new root pointer, adding effectively zero capacity until divergence occurs.
Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy
flowchart LR
Client[Client object<br/>VM disk / file] --> Blob[Blob file<br/>logical container]
Blob --> ST[SnapTree root<br/>B+ tree index]
ST --> CF1[Chunk file<br/>deduped + compressed]
ST --> CF2[Chunk file<br/>deduped + compressed]
ST --> CF3[Chunk file<br/>deduped + compressed]
CF1 --> CG[Chunk group<br/>resiliency unit]
CF2 --> CG
CF3 --> CG
CG --> RF[RF2/RF3 copies]
CG --> EC[EC stripe<br/>across fault domains]
NVRAM journaling and write coalescing
Hyperconverged backup workloads are punishing on disk: thousands of streams arrive concurrently, each with random small writes. To absorb this without driving spinning HDDs into seek-thrashing, every Cohesity node includes an NVRAM region — implemented as a battery- or flash-backed slice of SSD that survives power loss [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
The mechanism follows a classic journal-then-checkpoint pattern that database engineers will recognize:
- The IO Engine appends incoming writes to the NVRAM log sequentially.
- The log entry is mirrored to one or two peer nodes (matching the configured Replication Factor) before the client receives the ACK.
- Once durable in NVRAM on the required number of nodes, the client gets a low-latency acknowledgement.
- A background destage task coalesces many small NVRAM entries into large, sequential disk writes, producing chunk files that play nicely with HDDs.
Analogy: the bank teller’s deposit slip. Imagine a busy bank where customers arrive every few seconds with small deposits. If the teller had to walk to the vault for every transaction, throughput would collapse. Instead the teller scribbles each deposit on a slip (the NVRAM journal), drops the slip into a duplicate carbon-copy file (the mirrored journal on a peer node), and hands the customer a receipt (ACK to the client). At the end of the hour, the back-office staff batches all of that hour’s slips and updates the master ledger in one sweep (destage to chunk files on HDD). If the bank loses power, the carbon copies and slip stack survive, so every transaction can be replayed.
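The journal-then-checkpoint steps above can be sketched as a toy model. This is a minimal illustration, not Cohesity's implementation: the `Node` class, the `write` and `destage` functions, and the in-memory lists standing in for NVRAM and disk are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    journal: list = field(default_factory=list)  # stands in for the NVRAM log
    disk: list = field(default_factory=list)     # stands in for chunk files


def write(local: Node, peers: list, offset: int, payload: bytes) -> bool:
    """Append to the local journal, mirror to every peer, and only then acknowledge."""
    entry = (offset, payload)
    local.journal.append(entry)      # sequential append, no seek
    for peer in peers:
        peer.journal.append(entry)   # synchronous mirror
    return True                      # ACK: the entry is in every required journal


def destage(node: Node) -> None:
    """Coalesce all journal entries into one offset-ordered sequential write."""
    if not node.journal:
        return
    ordered = sorted(node.journal, key=lambda entry: entry[0])
    node.disk.append(b"".join(payload for _, payload in ordered))
    node.journal.clear()


if __name__ == "__main__":
    n3, n5 = Node("node3"), Node("node5")
    for off, data in [(200, b"ccc"), (0, b"aaa"), (100, b"bbb")]:
        write(n3, [n5], off, data)
    destage(n3)
    print(n3.disk, n5.journal)
```

The point of the design is that the client waits only for two cheap appends, while the expensive reordering and large write happen later with no client waiting.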
Figure 2.2: SpanFS write I/O path from client to durable disk
sequenceDiagram
participant C as Client
participant V as Protection VIP
participant B as Bridge Service
participant IO as IO Engine
participant N1 as NVRAM (Local)
participant N2 as NVRAM (Peer)
participant KV as Distributed KV
participant D as Disk (HDD/SSD)
C->>V: NFS/SMB/S3 WRITE
V->>B: Route to node
B->>IO: Hand off byte stream
IO->>IO: Rabin chunk + SHA-1 fingerprint
IO->>KV: Dedup hash lookup
KV-->>IO: Hit/Miss per chunk
IO->>N1: Append journal entry
IO->>N2: Mirror journal (RF2)
N2-->>IO: Mirror ack
N1-->>IO: Local ack
IO-->>B: Durable in NVRAM
B-->>C: ACK to client
Note over IO,D: Background destage (asynchronous)
IO->>D: Coalesce + write chunk files
IO->>KV: Commit SnapTree update (quorum)
Read path and locality optimizations
Reads in SpanFS are deliberately asymmetric to writes. A read request enters via any node’s protocol head (NFS, SMB, S3, or the Bridge service for backup data). That node:
- Resolves the logical offset against the SnapTree to obtain the chunk fingerprint.
- Looks up the chunk’s location in the distributed key-value store.
- Fetches the chunk from local disk if present, or from the peer node holding the replica/EC fragment.
- Decompresses and returns the data to the client.
SpanFS exploits locality in two ways. First, the cluster prefers to ingest, dedup, and persist data on the same node that received it, so that subsequent reads of recent data are local. Second, the read cache (in DRAM and SSD) is per-node and warmed by access patterns; hot chunks stay close to the protocol head serving them. For sequential restore workloads (e.g., Instant Mass Restore of a 1 TB VM), SpanFS issues parallel reads to multiple nodes simultaneously, exploiting the fact that the chunk group’s fragments span the cluster.
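The read steps above can be sketched as an offset lookup followed by a locality-aware source choice. This is an illustrative sketch only: `BlobIndex`, `pick_source`, the extent tuples, and the node names are assumptions made for the example, and a sorted list stands in for the SnapTree and key-value lookups.

```python
import bisect


class BlobIndex:
    """Maps a logical byte offset to (chunk fingerprint, offset inside the chunk)."""

    def __init__(self, extents):
        # extents: sorted, non-overlapping (start_offset, length, fingerprint) tuples
        self.extents = list(extents)
        self.starts = [start for start, _, _ in self.extents]

    def resolve(self, offset: int):
        # Find the last extent that starts at or before the requested offset.
        i = bisect.bisect_right(self.starts, offset) - 1
        if i < 0:
            raise KeyError(offset)
        start, length, fingerprint = self.extents[i]
        if offset >= start + length:
            raise KeyError(offset)
        return fingerprint, offset - start


def pick_source(fingerprint, locations, local_node):
    """Prefer the local copy; otherwise fall back to the first listed holder."""
    holders = locations[fingerprint]
    return local_node if local_node in holders else holders[0]


if __name__ == "__main__":
    index = BlobIndex([(0, 100, "fp-a"), (100, 50, "fp-b")])
    fp, inner = index.resolve(120)
    print(fp, inner, pick_source(fp, {"fp-b": ["node2", "node5"]}, "node5"))
```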
Garbage collection and chunk reclamation
Because chunks are deduplicated, deleting a backup or expiring a snapshot does not immediately free space. A chunk is reclaimable only when its reference count in the distributed KV store drops to zero. SpanFS runs a background garbage collection (GC) process that:
- Walks the SnapTree to identify orphaned chunk references.
- Decrements reference counts for chunks released by deleted snapshots, expired backups, or overwritten regions.
- Compacts blob files by rewriting still-live chunks contiguously and freeing the original blob extents.
- Re-runs erasure coding on cold blob files that initially landed in RF2 (described in the next section).
GC is throttled to avoid contending with foreground I/O, which is why architects often see capacity reclaim lagging deletion events by hours or days. For sizing, always model worst-case GC lag (commonly 24–72 hours) when planning capacity headroom.
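Why deletion does not free space immediately is easiest to see with reference counts. The sketch below is a simplified model under stated assumptions: `ChunkStore` and its methods are hypothetical, snapshots are plain lists of fingerprints, and the blob compaction and re-encoding steps are left out.

```python
from collections import Counter


class ChunkStore:
    def __init__(self):
        self.refcount = Counter()   # fingerprint -> number of references
        self.snapshots = {}         # snapshot name -> list of fingerprints

    def add_snapshot(self, name, fingerprints):
        self.snapshots[name] = list(fingerprints)
        self.refcount.update(self.snapshots[name])

    def delete_snapshot(self, name):
        # Deleting only drops references; no capacity is returned yet.
        for fp in self.snapshots.pop(name):
            self.refcount[fp] -= 1

    def collect(self):
        """Background pass: reclaim chunks whose reference count reached zero."""
        dead = sorted(fp for fp, count in self.refcount.items() if count <= 0)
        for fp in dead:
            del self.refcount[fp]
        return dead


if __name__ == "__main__":
    store = ChunkStore()
    store.add_snapshot("mon", ["a", "b", "c"])
    store.add_snapshot("tue", ["b", "c", "d"])
    store.delete_snapshot("mon")
    print(store.collect())
```

Only chunk `a` is reclaimable after Monday's snapshot expires, because Tuesday's snapshot still references `b` and `c`. This is why expiring one backup in a deduplicated chain often returns far less capacity than its logical size.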
Key Takeaway: SpanFS implements a journal-then-checkpoint write path: NVRAM absorbs random writes for low-latency ACKs, the IO Engine deduplicates and compresses chunks, chunk groups apply RF or EC, and SnapTree indexes everything. Reads exploit locality and parallel fetch; garbage collection runs asynchronously, so deleted capacity returns over hours, not seconds.
Resiliency: RF and Erasure Coding
A SpanFS cluster keeps user data safe through two complementary mechanisms: Replication Factor (RF) and Erasure Coding (EC). Both can coexist within a single cluster — even within a single View Box — and the architect’s job is to choose the right policy for each workload.
Replication Factor 2 vs. RF3 trade-offs
Replication is the simplest scheme: store N identical copies of every chunk group on N different fault domains. Cohesity supports RF2 (two copies) and RF3 (three copies) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf].
- RF2 survives a single failure (node, disk, or chassis, depending on the configured fault domain). Recovery is a fast block-level copy from the surviving replica. Capacity overhead is 100%, so usable capacity is roughly 50% of raw.
- RF3 survives two simultaneous failures and is required for the strictest SLAs or small clusters where a second concurrent failure during rebuild is plausible. Overhead is 200%, leaving roughly 33% of raw as usable capacity.
RF is preferred for hot data: NVRAM journaling, freshly destaged chunks, and any workload with low-latency write SLAs. The reason is simple: writing a mirror is cheaper than encoding a stripe, and reading is just a local fetch from any surviving copy.
Erasure coding schemes (2:1, 4:2, 6:2)
Erasure coding stripes data and parity fragments across many fault domains using Reed-Solomon-style mathematics. Cohesity supports several schemes; the most commonly cited are EC 2:1, EC 4:2, and EC 6:2 [Source: http://vastitservices.com/wp-content/uploads/2020/03/VAST-Flyer-Cohesity-DataPlatform.pdf].
| Scheme | Data Frags | Parity Frags | Min. Fault Domains | Failures Tolerated | Usable Capacity | Overhead |
|---|---|---|---|---|---|---|
| RF2 | 1 | 1 (mirror) | 3 nodes | 1 | ~50% | 100% |
| RF3 | 1 | 2 (mirrors) | 4 nodes | 2 | ~33% | 200% |
| EC 2:1 | 2 | 1 | 4 nodes | 1 | ~67% | 50% |
| EC 4:2 | 4 | 2 | 6 nodes | 2 | ~67% | 50% |
| EC 6:2 | 6 | 2 | 8 nodes | 2 | ~75% | 33% |
The headline observation: EC 4:2 and RF3 tolerate the same number of failures, but EC 4:2 needs half the raw capacity to do it: 50% overhead versus 200%, or 1.5x raw per unit of data versus 3x [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]. In a 100 TB raw cluster, RF3 yields ~33 TB usable while EC 4:2 yields ~67 TB usable, and both survive two simultaneous failures. The cost is encoding CPU and a more complex rebuild path.
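The usable-capacity and overhead columns in the table follow directly from the fragment counts. A short sketch of that arithmetic, where the `SCHEMES` dictionary and function names are illustrative and the fragment counts are taken from the table:

```python
from fractions import Fraction

# (data fragments, mirror or parity fragments), as listed in the table above
SCHEMES = {
    "RF2": (1, 1),
    "RF3": (1, 2),
    "EC 2:1": (2, 1),
    "EC 4:2": (4, 2),
    "EC 6:2": (6, 2),
}


def usable_fraction(data: int, redundancy: int) -> Fraction:
    """Share of raw capacity that holds user data."""
    return Fraction(data, data + redundancy)


def overhead(data: int, redundancy: int) -> Fraction:
    """Extra capacity consumed per unit of user data."""
    return Fraction(redundancy, data)


def usable_tb(raw_tb: float, scheme: str) -> float:
    data, redundancy = SCHEMES[scheme]
    return raw_tb * float(usable_fraction(data, redundancy))


if __name__ == "__main__":
    for name, (d, r) in SCHEMES.items():
        print(f"{name}: usable {float(usable_fraction(d, r)):.0%}, "
              f"overhead {float(overhead(d, r)):.0%}, "
              f"100 TB raw gives {usable_tb(100, name):.1f} TB usable")
```

These figures ignore metadata, spare capacity, and data reduction, so treat them as the resiliency component of a sizing exercise rather than a full estimate.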
Inline vs. post-process erasure coding
Cohesity rarely writes EC-protected chunks directly on the hot ingest path. Instead, the platform uses a two-stage resiliency pipeline:
- Stage 1 — Inline RF2: Hot, freshly ingested chunks land in NVRAM and SSD with RF2 protection. This minimizes write latency because mirroring is cheap.
- Stage 2 — Post-process EC: A background task identifies cold chunk groups, re-encodes them under the View Box’s EC policy (e.g., 4:2), persists the new EC stripe, and frees the RF2 copies.
This staging is the reason Cohesity can advertise both low write latency and high storage efficiency — it pays the latency cost for hot data with mirroring and pays the encoding cost for cold data when no client is waiting. Architects must factor in the transient RF2 overhead when sizing: you cannot assume EC 4:2 efficiency on day-one capacity, because freshly ingested data lives at RF2 for some period.
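One way to account for the transient RF2 overhead is to split post-reduction data into a hot share still mirrored and a cold share already erasure-coded. The function below is a planning sketch only; `hot_fraction` is an assumed input that depends on ingest rate and how quickly background encoding catches up, and it is not a value the platform reports under that name.

```python
def raw_needed_tb(stored_tb: float, hot_fraction: float,
                  ec_data: int = 4, ec_parity: int = 2) -> float:
    """Raw capacity required when part of the stored data is still at RF2.

    stored_tb    -- data after dedup and compression, before resiliency
    hot_fraction -- share of stored_tb not yet re-encoded (assumed, 0.0 to 1.0)
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    hot = stored_tb * hot_fraction * 2                       # two full copies
    cold = stored_tb * (1 - hot_fraction) * (ec_data + ec_parity) / ec_data
    return hot + cold


if __name__ == "__main__":
    for share in (0.0, 0.2, 1.0):
        print(f"hot share {share:.0%}: {raw_needed_tb(100, share):.0f} TB raw")
```

With 100 TB stored, the raw requirement ranges from 150 TB (everything encoded at EC 4:2) to 200 TB (everything still at RF2), which is the gap an architect has to cover on day one.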
Figure 2.3: Erasure coding stripe placement decision flow
flowchart TD
Ingest[New write arrives at node] --> RF2Land[Stage 1: Land at RF2<br/>NVRAM mirror + SSD]
RF2Land --> Ack[Client ACK<br/>low latency]
Ack --> Cold{Chunk group<br/>cold?}
Cold -- No --> Stay[Remain at RF2<br/>hot tier]
Stay --> Cold
Cold -- Yes --> ECCheck{Enough fault<br/>domains for EC?}
ECCheck -- No --> StayRF[Keep at RF2/RF3<br/>per View Box policy]
ECCheck -- Yes --> Encode[Stage 2: Reed-Solomon encode<br/>EC 2:1 / 4:2 / 6:2]
Encode --> Place[Distribute fragments<br/>across fault domains]
Place --> Free[Release original<br/>RF2 copies]
Free --> Reclaim[GC reclaims capacity]
Choosing resiliency policies per View Box
Resiliency policy is configured at the View Box (storage domain) level, which means different workloads can have different policies on the same cluster. Practical guidelines:
| Workload | Recommended Policy | Rationale |
|---|---|---|
| Production backups, long retention | EC 4:2 (post-process) | Cold data dominates; capacity efficiency matters |
| Hot SmartFiles primary share | RF2 (or RF3) | Latency-sensitive; minimal encoding overhead |
| Small cluster (<6 nodes) | RF2 or RF3 | EC 4:2 requires 6 fault domains |
| Compliance / WORM | RF3 or EC 4:2 | Two-failure tolerance |
| CloudArchive tier | Cloud provider’s own | Native S3/Azure durability replaces local RF/EC |
Key Takeaway: RF2 is fast and cheap to rebuild but consumes 2x raw capacity; RF3 adds tolerance for a second failure at 3x. EC 4:2 matches RF3’s fault tolerance at half the raw capacity (1.5x) but requires at least 6 fault domains and more CPU. Cohesity blends both (RF2 inline, EC in the background), so plan capacity assuming RF2 for hot data and the View Box’s EC policy for cold.
Deduplication and Compression
Storage efficiency is the difference between selling a customer 100 TB of usable capacity and 500 TB of effective capacity. Cohesity’s headline efficiency multipliers come from global, variable-length deduplication combined with inline compression.
Variable-length and fixed-length sliding-window dedupe
Most storage systems either store data in fixed 4 KiB or 8 KiB blocks (fixed-length dedup) or in chunks aligned to filesystem boundaries. Cohesity uses variable-length chunking driven by Rabin fingerprinting [Source: https://www.cohesity.com/blogs/global-deduplication-matters/]:
- The IO Engine slides a rolling hash window across the byte stream.
- When the rolling hash matches a content-defined breakpoint pattern, a chunk boundary is declared.
- Each resulting chunk gets a SHA-1 fingerprint.
- The fingerprint is queried against the global hash table; if found, only a metadata reference is stored.
The advantage is insertion-resilience. If a 1 KB header is prepended to a previously-backed-up file, fixed-block dedup will produce all new fingerprints because every block has shifted. Variable-length dedup will produce one new chunk (containing the header) and reuse every chunk after the next content-defined boundary. For incremental and synthetic-full backup workloads, this difference can be 5–10x in dedup ratio.
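Insertion resilience can be demonstrated with a small content-defined chunker. This is a sketch under explicit assumptions: it uses a simple polynomial rolling hash as a stand-in for true Rabin fingerprinting, the window size, base, and mask are arbitrary demo values, and it omits the minimum and maximum chunk sizes a production chunker would enforce.

```python
import hashlib
import random

WINDOW = 48                      # bytes covered by the rolling hash (assumed)
BASE = 257                       # polynomial base (assumed)
MOD = 1 << 32
MASK = (1 << 10) - 1             # a boundary roughly every 1 KiB on random data
POW_W = pow(BASE, WINDOW, MOD)   # weight of the byte that leaves the window


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut after each position whose trailing WINDOW bytes hash to the breakpoint pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_W) % MOD   # drop the oldest byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks) -> set:
    return {hashlib.sha1(chunk).hexdigest() for chunk in chunks}


def reused(chunker, data: bytes, header: bytes) -> tuple:
    """Return (fingerprints shared after prepending header, original fingerprint count)."""
    before = fingerprints(chunker(data))
    after = fingerprints(chunker(header + data))
    return len(before & after), len(before)


if __name__ == "__main__":
    rng = random.Random(42)
    payload = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
    header = b"H" * 1000
    print("content-defined:", reused(cdc_chunks, payload, header))
    print("fixed-block:    ", reused(fixed_chunks, payload, header))
```

Because each boundary depends only on the preceding 48 bytes, every original chunk after the first reappears unchanged once the header has been prepended, whereas the fixed-block boundaries all shift by the header length.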
Global vs. local dedupe domains
The dedup hash table lives in the distributed key-value store and is replicated across nodes. The scope, however, is the View Box. Two View Boxes on the same cluster maintain separate hash tables, which means a chunk written into View Box A is not deduplicated against an identical chunk written into View Box B [Source: https://www.cohesity.com/glossary/data-deduplication/].
This has architectural consequences:
- Multi-tenancy: Per-tenant View Boxes provide cryptographic and capacity isolation, at the cost of dedup blindness across tenants.
- Workload grouping: Place similar workloads (e.g., all VMware backups) in the same View Box to maximize dedup. Mixing fundamentally different data (encrypted databases plus office documents) in one View Box dilutes ratios.
- Encryption: Per-View-Box keys are fine; per-tenant keys break dedup across tenants regardless of View Box settings, because identical plaintext encrypts to different ciphertext.
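The View Box scoping described above amounts to keeping one fingerprint table per View Box. A minimal sketch, in which the `Cluster` class, its method, and the View Box names are hypothetical and a Python set stands in for the distributed hash table:

```python
import hashlib
from collections import defaultdict


class Cluster:
    def __init__(self):
        self.tables = defaultdict(set)   # View Box name -> set of fingerprints
        self.stored_bytes = 0

    def write_chunk(self, view_box: str, chunk: bytes) -> bool:
        """Return True when the chunk had to be stored, False when it was deduplicated."""
        fingerprint = hashlib.sha1(chunk).digest()
        table = self.tables[view_box]    # lookup is confined to this View Box
        if fingerprint in table:
            return False
        table.add(fingerprint)
        self.stored_bytes += len(chunk)
        return True


if __name__ == "__main__":
    cluster = Cluster()
    chunk = b"x" * 100
    print(cluster.write_chunk("tenant-a", chunk))   # first copy is stored
    print(cluster.write_chunk("tenant-a", chunk))   # same View Box: deduplicated
    print(cluster.write_chunk("tenant-b", chunk))   # other View Box: stored again
    print(cluster.stored_bytes)
```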
Inline vs. post-process compression
Compression in SpanFS is applied after dedup and before persistence to chunk files. Compression mode is configurable per View Box:
- Inline compression: Each unique chunk is compressed (typically LZ4 or zstd-class algorithm) before it is written to disk. Lower disk usage on first write; small CPU cost on the ingest path.
- Post-process compression: Chunks are written uncompressed first (faster ACK), then compressed by a background task. Better latency at the cost of transient capacity.
In practice, inline compression is almost universally enabled for backup workloads — the data is already at RF2 in NVRAM before the client sees the ACK, so the marginal CPU cost is negligible.
Estimating dedupe ratios for different workloads
CCAE candidates are routinely asked to size effective capacity. Use these planning ratios as starting points (always validate with the Cohesity sizing tool):
| Workload | Typical Dedup Ratio | Compression Ratio | Combined |
|---|---|---|---|
| VMware VM backups (mixed Windows) | 4–6x | 1.5–2x | 6–12x |
| Oracle/SQL database fulls | 3–5x | 1.5–2x | 4–10x |
| Microsoft 365 mailbox / SharePoint | 3–5x | 1.3–1.5x | 4–8x |
| File shares / SmartFiles | 1.5–3x | 1.3–2x | 2–6x |
| Already-compressed media (video, ZIP) | ~1x | ~1x | ~1x |
| Encrypted source data | ~1x | ~1x | ~1x |
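Because the ratios differ so widely by workload, a blended multiplier should be computed from stored capacity rather than by averaging the ratios. The sketch below shows the calculation; the workload sizes are made-up inputs, and the ratios are low-end values picked from the table for illustration.

```python
def stored_tb(workloads: dict) -> float:
    """workloads maps a name to (logical_tb, combined_reduction_ratio)."""
    return sum(logical / ratio for logical, ratio in workloads.values())


def blended_ratio(workloads: dict) -> float:
    """Total logical capacity divided by total stored capacity."""
    logical = sum(tb for tb, _ in workloads.values())
    return logical / stored_tb(workloads)


if __name__ == "__main__":
    estate = {
        "vmware": (300, 6.0),        # hypothetical 300 TB logical at 6x
        "file_shares": (100, 2.0),   # hypothetical 100 TB logical at 2x
        "video": (50, 1.0),          # already compressed, no reduction
    }
    print(f"stored: {stored_tb(estate):.0f} TB, blended: {blended_ratio(estate):.1f}x")
```

A naive average of the three ratios would suggest 3x as well in this case only by coincidence; change the workload sizes and the two methods diverge, which is why the weighted form is the safer habit.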
Key Takeaway: SpanFS dedup is variable-length, content-defined, SHA-1 fingerprinted, and global within a View Box. The dedup scope boundary is the View Box, so multi-tenant designs trade dedup efficiency for isolation. Always plan effective capacity using validated workload ratios — never assume the same multiplier across mixed workloads.
Cluster Mechanics and Quorum
A SpanFS cluster is a peer-to-peer system: every node runs the same services and any node can serve any client request. Coordinating peers without a master requires explicit consensus, fault domain awareness, and disciplined upgrade procedures.
Strict consistency and Paxos-based metadata
Unlike eventually-consistent object stores, SpanFS guarantees strict consistency for both data and metadata operations [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]. Reads always observe the most recent durably committed write. Two clients writing to the same offset never see split-brain results.
Strict consistency is implemented via a Paxos-style consensus protocol layered over the distributed key-value store. Metadata writes — directory updates, SnapTree modifications, dedup hash registrations — are committed only when a quorum of replicas acknowledge the change. With three KV replicas per record, a write requires at least two acks.
Node, disk, and chassis fault domains
Cohesity lets architects configure the fault domain at which RF copies and EC fragments must be distributed:
| Fault Domain | Description | Required Cluster Shape |
|---|---|---|
| Disk | Two copies/fragments may share a node | Default; smallest clusters |
| Node | Each copy/fragment lands on a different node | Recommended baseline |
| Chassis | Each copy/fragment lands on a different chassis (block) | Multi-node-per-chassis configurations |
| Rack | Each copy/fragment lands on a different rack | Stretched / large clusters |
For example, a 4-node single-chassis ReadyNode platform cannot meaningfully use a chassis-level fault domain because there is only one chassis. Promoting the fault domain to chassis on such a cluster effectively leaves it unable to place EC stripes. Architects must size for the fault domain count, not just node count.
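The fault-domain check can be expressed as a count of distinct domains compared with the number of copies or fragments to place. This sketch uses the fragment count as a lower bound only; the table earlier in the chapter lists higher cluster minimums for some schemes, and the node inventory format here is an assumption made for the example.

```python
def fault_domain_count(nodes: list, level: str) -> int:
    """Number of distinct fault domains at the given level (node, chassis, or rack)."""
    return len({node[level] for node in nodes})


def can_place(fragments: int, nodes: list, level: str) -> bool:
    """Each copy or fragment needs its own fault domain."""
    return fault_domain_count(nodes, level) >= fragments


if __name__ == "__main__":
    # Four nodes sharing one chassis, as in the example above.
    single_chassis = [
        {"node": f"n{i}", "chassis": "c1", "rack": "r1"} for i in range(1, 5)
    ]
    print("RF2 at node level:   ", can_place(2, single_chassis, "node"))
    print("RF2 at chassis level:", can_place(2, single_chassis, "chassis"))
    print("EC 4:2 at node level:", can_place(6, single_chassis, "node"))
```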
Quorum loss scenarios and recovery
A SpanFS cluster requires a majority quorum of nodes for the management plane (Apollo / Iris) and for metadata writes. The classic scenarios:
- 3-node cluster, 1 node down: 2 of 3 nodes online → quorum maintained. Cluster operates with reduced redundancy; rebuilds proceed if RF2.
- 3-node cluster, 2 nodes down: 1 of 3 online → quorum lost. Cluster goes read-only or offline; manual intervention required.
- 6-node cluster, 2 nodes down: 4 of 6 online → quorum maintained; EC 4:2 tolerates both losses, even when the second occurs before the rebuild from the first completes.
- Network partition splitting cluster 3/3: neither side has majority → both sides go offline to prevent split-brain. Operator must manually bring the larger or designated side back.
This is why even-numbered cluster sizes are gently discouraged: a 4-node cluster splitting 2/2 has no automatic majority. Most production deployments use a 4-node minimum, but architects should plan failure scenarios assuming an odd number of effective fault domains.
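The quorum scenarios above all reduce to a strict-majority test. A small sketch of that rule, with function names chosen for illustration:

```python
def has_quorum(total_nodes: int, online_nodes: int) -> bool:
    """A strict majority of the cluster must be reachable."""
    return online_nodes > total_nodes // 2


def partition_survivor(total_nodes: int, side_a: int) -> str:
    """Which side of a two-way partition keeps quorum, if any."""
    side_b = total_nodes - side_a
    if has_quorum(total_nodes, side_a):
        return "A"
    if has_quorum(total_nodes, side_b):
        return "B"
    return "none"


if __name__ == "__main__":
    for total, online in [(3, 2), (3, 1), (6, 4), (6, 3), (4, 2)]:
        print(f"{online} of {total} online: quorum={has_quorum(total, online)}")
    print("6 nodes split 3/3:", partition_survivor(6, 3))
    print("5 nodes split 3/2:", partition_survivor(5, 3))
```

An even split leaves neither side with a majority, while any split of an odd-sized cluster always leaves exactly one side able to continue.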
Figure 2.4: Cluster consistency states and quorum transitions
stateDiagram-v2
[*] --> Healthy
Healthy: Healthy<br/>All nodes online<br/>full quorum + RF/EC
Degraded: Degraded<br/>Node/disk down<br/>quorum maintained
Rebuilding: Rebuilding<br/>Background reconstruction<br/>fragments restored
QuorumLoss: Quorum Loss<br/>Majority unreachable<br/>read-only / offline
Recovery: Recovery<br/>Manual intervention<br/>partition resolved
Healthy --> Degraded: failure within tolerance
Degraded --> Rebuilding: replacement / spare available
Rebuilding --> Healthy: rebuild completes
Degraded --> QuorumLoss: additional failure exceeds tolerance
Healthy --> QuorumLoss: network partition splits cluster
QuorumLoss --> Recovery: operator action
Recovery --> Degraded: quorum restored
Recovery --> Healthy: all nodes rejoined
Rolling upgrades and maintenance mode
SpanFS upgrades are rolling: one node at a time is placed in maintenance mode, drained of leadership and active VIPs, upgraded, rebooted, and brought back into the cluster before the next node is touched. During the window when a node is in maintenance:
- Quorum tightens (must be maintained by the remaining nodes).
- Effective resiliency drops by one (a 2-failure scheme tolerates one further failure during the upgrade, not two).
- Inflight client connections are migrated to peer nodes via VIP failover.
For RF2 clusters, this means an upgrade temporarily reduces fault tolerance to zero. For mission-critical systems, RF3 or EC 4:2 is preferred so that maintenance plus an unplanned failure does not equal data loss.
Key Takeaway: SpanFS uses Paxos-based strict consistency, requires majority quorum for the management plane, and lets you configure fault domains (disk/node/chassis/rack). Plan for fault-domain count, not just node count; rolling upgrades temporarily reduce fault tolerance by one, so RF2 clusters become single-failure-vulnerable during upgrade windows.
Worked Example: Tracing a 1 MB Write Through SpanFS
Let’s tie everything together by following a single 1 MB write from a backup proxy into a 6-node Cohesity cluster configured with View Box policy EC 4:2 (post-process), inline dedup, inline compression, fault domain = node.
Step 1 — Client connection. The backup proxy opens an NFS mount to the cluster’s protection VIP. SmartDNS resolves the VIP to Node 3. The proxy issues an NFSv3 WRITE for offset 0, length 1,048,576 of /backup/vm-disk-001.vmdk.
Step 2 — Bridge service receives. Node 3’s Bridge service authenticates the request, identifies the target View Box, and hands the byte stream to the IO Engine.
Step 3 — IO Engine chunks the data. The IO Engine runs Rabin fingerprinting across the 1 MB. Suppose this produces 16 variable-length chunks averaging ~64 KiB each.
Step 4 — Dedup lookup. For each chunk, the IO Engine computes a SHA-1 fingerprint and queries the global hash table in the distributed KV store. Suppose 12 of the 16 chunks already exist (this VM has been backed up before); only 4 are unique.
Step 5 — Compress unique chunks. The 4 unique chunks are compressed with LZ4. Assume an average 1.6x compression ratio, yielding ~160 KiB of compressed unique data.
Step 6 — NVRAM journal (RF2). Node 3 appends the 4 compressed chunks plus their metadata pointers to its NVRAM log. The log entry is mirrored synchronously to Node 5’s NVRAM. Once both NVRAMs ack, Node 3 sends the NFS WRITE reply back to the proxy. End of synchronous path — total latency on the order of single-digit milliseconds.
Step 7 — SnapTree update. The IO Engine commits a SnapTree transaction adding 16 chunk references to the blob file representing vm-disk-001.vmdk. The transaction is quorum-committed across three KV replicas (Nodes 1, 3, and 5).
Step 8 — Background destage. Within seconds, the destage task flushes the NVRAM journal entries into chunk files on the SSD/HDD tier of Nodes 3 and 5. The chunks are now persistent on rotational media at RF2.
Step 9 — Reference count update. For the 12 deduplicated chunks, the dedup hash table increments their reference counts, recording that this new blob also references them.
Step 10 — Post-process EC. Hours later, garbage collection identifies the chunk group containing these chunks as “cold.” It re-encodes them under EC 4:2: 4 data fragments and 2 parity fragments are computed and distributed across Nodes 1, 2, 3, 4, 5, and 6 — one fragment per node, satisfying the node-level fault domain. The original RF2 copies are released, and capacity is reclaimed.
Step 11 — Steady state. The 1 MB write, which would have consumed 2 MB if mirrored at RF2 without any data reduction, landed as roughly 320 KiB at RF2 (160 KiB of unique compressed data plus one mirror copy) and now consumes roughly 240 KiB under EC 4:2 (160 KiB plus 50% parity overhead; the 12 deduplicated chunks consumed nothing additional). That is the compounding effect of dedup, compression, and EC across a backup retention window.
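The capacity arithmetic in the trace can be reproduced in a few lines. The sketch assumes, as the example does, that all chunks are the same average size; the function and key names are illustrative.

```python
from fractions import Fraction


def footprint_kib(write_kib, chunks, unique_chunks, compression, ec_data, ec_parity):
    """On-disk footprint in KiB at each stage of the worked example."""
    avg_chunk = Fraction(write_kib, chunks)
    unique_raw = avg_chunk * unique_chunks            # data that missed dedup
    compressed = unique_raw / Fraction(compression)   # after inline compression
    return {
        "unreduced_rf2": Fraction(write_kib) * 2,     # no dedup, no compression, mirrored
        "compressed_unique": compressed,
        "reduced_rf2": compressed * 2,                # landing footprint at RF2
        "reduced_ec": compressed * Fraction(ec_data + ec_parity, ec_data),
    }


if __name__ == "__main__":
    result = footprint_kib(1024, 16, 4, "1.6", 4, 2)
    for stage, kib in result.items():
        print(f"{stage}: {float(kib):.0f} KiB")
```

Exact fractions are used so the 1.6x compression ratio does not introduce floating-point rounding into the comparison.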
Chapter Summary
SpanFS is the foundation that makes every other Cohesity feature possible. Its write path is a textbook journal-then-checkpoint design: NVRAM absorbs random writes with mirrored, low-latency durability; the IO Engine performs variable-length Rabin chunking, SHA-1 dedup, and inline compression; SnapTree indexes everything in a copy-on-write B+ tree atop a strictly consistent distributed key-value store. Because every node is a peer, the cluster has no single point of failure — but it depends on Paxos-style consensus and configured fault domains to maintain that property under stress.
Resiliency is layered. RF2 provides single-failure tolerance with low latency at high capacity overhead; RF3 doubles fault tolerance but stores three full copies (200% overhead); EC schemes (2:1, 4:2, 6:2) trade encoding CPU for dramatically lower overhead. Cohesity’s two-stage pipeline (RF2 inline for hot data, EC in the background for cold data) captures both benefits, but architects must size capacity assuming RF2 for recently ingested workloads and account for fault-domain count rather than raw node count. View Box-scoped global deduplication, inline compression, and per-workload policy choice shape the effective capacity multiplier the customer ultimately sees.
For the CCAE exam, the recurring traps are: assuming dedup spans tenants when it does not; sizing EC efficiency without accounting for transient RF2 overhead; ignoring quorum math on small or even-numbered clusters; promoting fault domain to a level the cluster shape cannot satisfy; and forgetting that rolling upgrades temporarily reduce effective fault tolerance by one. Internalize the write-path trace and the RF/EC trade-off table, and most cluster-mechanics questions become straightforward applications of those fundamentals.
Key Terms
| Term | Definition |
|---|---|
| Chunk file | Smallest persisted unit of user data in SpanFS, holding a single deduplicated, compressed chunk. |
| Blob file | Logical container that aggregates chunk references for a single object (VM disk, file, datafile). |
| NVRAM | Battery- or flash-backed SSD region used as a mirrored journal for low-latency, durable write ACKs. |
| View Box | Logical storage domain that defines deduplication scope, resiliency policy, encryption, QoS, and tiering. |
| Replication Factor (RF) | Number of full copies of each chunk group SpanFS stores; RF2 tolerates 1 failure, RF3 tolerates 2. |
| Erasure Coding (EC) | Reed-Solomon-style scheme storing data + parity fragments across fault domains for higher efficiency. |
| Quorum | Majority of cluster nodes/replicas required for metadata writes and management-plane operations. |
| Fault domain | Configured failure boundary (disk, node, chassis, rack) across which RF copies and EC fragments must spread. |
| Strict consistency | Guarantee that reads always observe the most recent durably committed write; enforced via Paxos-style consensus. |
Chapter 3: Cluster Design, Sizing, and Capacity Planning
If Chapters 1 and 2 explained what Cohesity is and how SpanFS holds data together, this chapter answers the question every architect actually has to answer in a customer meeting: “How big should the cluster be, and what nodes do I buy?” Sizing a Cohesity cluster is part arithmetic, part workload psychology, and part risk management. Get it right and the cluster runs comfortably for three to five years with linear expansion. Get it wrong and you either over-spend by 40% on day one, or you starve the cluster of headroom and watch backup windows blow out within eighteen months.
This chapter walks through the full sizing pipeline: profiling workloads, feeding inputs to the Cohesity Sizer, choosing among the C4000, C5000, C5200, and C6000 ReadyNode families, and planning capacity over a multi-year horizon. We’ll work through concrete examples (including the 500 TB FETB scenario architects encounter constantly on the CCAE exam) and surface the gotchas that separate a passing answer from a great one.
Learning Objectives
By the end of this chapter you will be able to:
- Design Cohesity clusters that meet specific RPO, RTO, and retention SLAs derived from a customer’s workload profile.
- Apply Cohesity sizing tools and capacity formulas to translate Front-End TB (FETB), change rate, and retention into raw and usable cluster capacity.
- Plan node count, fault tolerance margin (RF vs. EC), and growth headroom (N+1, year-3 horizon) for production deployments.
- Justify hardware family selection (C4000 hybrid, C5000/C5200 performance, C6000 dense, all-flash variants) based on the workload mix and SLA priority.
- Build a capacity plan that survives technology refresh cycles, tiering decisions, and growth shocks.
Workload Profiling Inputs
Sizing always starts with profiling — building a quantitative picture of the data you have to protect. A Cohesity cluster doesn’t care whether the source is “important”; it cares about how many bytes arrive, how often they change, how long they live, and how fast they have to come back. Skip this step and every later number is a guess.
Front-End TB (FETB), Change Rate, and Retention
Three numbers dominate any backup sizing conversation. Memorize them in this order, because every other metric flows downstream:
- Front-End TB (FETB) — the total source-side data footprint of the workloads you intend to protect, measured before Cohesity touches it. FETB is the licensing unit and the headline number on every quote. Think of FETB like the headline number on a car spec sheet — the “0 to 60” of backup sizing. It tells you the order of magnitude of the deal but it doesn’t tell you how the cluster will actually behave under load. [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]
- Change Rate — the percentage of FETB that mutates between successive backups. Typical values: 1–3% daily for VM and NAS workloads; 5–15% daily for transactional databases; up to 30% daily for log-heavy systems if you protect transaction logs as part of the daily ingest. Change rate drives the incremental ingest size and therefore the back-end TB consumed per retention day.
- Retention — how long each recovery point must remain on the cluster. A typical Grandfather-Father-Son (GFS) policy might be 30 daily / 12 weekly / 12 monthly / 7 yearly. Retention transforms a single-day ingest number into a multi-month cumulative consumption number, and retention is where small assumptions blow up into large capacities.
The fundamental back-end math:
BETB ≈ FETB × full_compression_factor + (FETB × change_rate × retention_days × incremental_factor)
Effective_BETB = BETB / (dedupe_ratio × compression_ratio)
In practice the Sizer computes this per-workload and aggregates, because dedupe and compression ratios differ by source type. A 1% change rate sounds small until you multiply by 365 days of retention and realize you’ve stored 3.65 full copies’ worth of incremental change, or 4.65 copies once the initial full is counted. [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]
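A minimal sketch of the BETB formula follows, under the assumption that both the full and incremental factors default to 1.0 (no per-stream adjustment). The function names are hypothetical and this is not the Sizer's implementation; it only makes the retention multiplication explicit.

```python
def incremental_full_equivalents(change_rate: float, retention_days: int) -> float:
    """Retained incremental change, expressed as multiples of FETB."""
    return change_rate * retention_days


def raw_betb(fetb_tb: float, change_rate: float, retention_days: int,
             full_factor: float = 1.0, incremental_factor: float = 1.0) -> float:
    """Back-end TB before dedupe/compression, following the BETB formula above."""
    full = fetb_tb * full_factor
    incrementals = fetb_tb * change_rate * retention_days * incremental_factor
    return full + incrementals


copies = incremental_full_equivalents(0.01, 365)
betb_100 = raw_betb(100, 0.01, 365)
print(f"{copies:.2f} FETB-equivalents of retained incrementals")
print(f"{betb_100:.0f} TB back-end for 100 TB FETB before reduction")
```

With a 1% change rate and 365 days of retention, the incremental term alone is 3.65 FETB-equivalents; adding the initial full brings the pre-reduction total to 4.65 times FETB (465 TB for a 100 TB source).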
Figure 3.1: Sizing input-to-output flow (FETB, change rate, retention, dedup → usable capacity → node count)
flowchart TD
A[FETB by Workload] --> E[Initial Full Backend]
B[Daily Change Rate] --> F[Incremental Volume]
C[Retention Policy<br/>D/W/M/Y] --> G[Cumulative Retention Multiplier]
D[Dedupe + Compression Ratios] --> H[Reduction Factor]
E --> I[Effective BETB Required]
F --> I
G --> I
H --> I
I --> J[Apply Growth + N+1 Headroom]
J --> K[Target Usable Capacity]
K --> L{Choose EC Scheme}
L --> M[Compute Raw Capacity]
M --> N[Select ReadyNode Family]
N --> O[Final Node Count]
Key Takeaway: FETB is the licensing input, change rate is the daily multiplier, and retention is the time multiplier. All three must be known with reasonable accuracy before any capacity number can be trusted. Treat customer-provided change rates with healthy skepticism — measure when possible, or pad by 20%.
Workload Categorization: VM, NAS, DB, Physical
Not all FETB is equal. The Sizer expects you to break the total down by workload type because each category has different reduction ratios, different ingest patterns, and different performance demands.
| Workload Type | Typical Daily Change | Typical Reduction Ratio (dedupe + compression) | Sizing Notes |
|---|---|---|---|
| VMware / Hyper-V / AHV VMs | 1–3% | 6:1 to 10:1 | CBT/RCT incremental forever; OS dedupe is excellent across similar VMs |
| Physical servers (Linux/Windows agent) | 1–5% | 3:1 to 5:1 | Lower dedupe than VMs because OS images are not as templated |
| NAS (SMB/NFS/NDMP) | 1–3% | 2:1 to 3:1 | Mixed media (images, video, archives) compress poorly |
| Databases (Oracle/SQL/HANA) | 5–15% (data) + log | 3:1 to 5:1 | Frequent log backups for low RPO; engine-native compression reduces additional dedupe gains |
| Microsoft 365 | 1–2% | 2:1 to 3:1 | API-rate-limited rather than throughput-limited |
| Object/S3 sources | varies | 1.5:1 to 2:1 | Often pre-compressed; treat as low-reduction |
[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf] [Source: https://www.cohesity.com/glossary/data-deduplication/]
A 100 TB FETB cluster of homogeneous Windows VMs might land on disk as 12 TB after dedupe and compression. The same 100 TB from a video-production NAS might consume 45 TB. Mixing workloads dilutes the average reduction ratio, so when a customer adds a new workload class to an existing cluster, re-run the Sizer rather than extrapolating.
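The dilution effect is easy to check by hand. The sketch below derives per-workload ratios from the two examples in the paragraph above (100 TB landing as 12 TB and as 45 TB); the helper name is hypothetical.

```python
def blended_reduction(workloads):
    """workloads: iterable of (fetb_tb, reduction_ratio) pairs.

    Returns (back_end_tb, blended_ratio) for the combined mix.
    """
    workloads = list(workloads)
    front_end = sum(fetb for fetb, _ in workloads)
    back_end = sum(fetb / ratio for fetb, ratio in workloads)
    return back_end, front_end / back_end


vm_ratio = 100 / 12     # homogeneous Windows VMs: 100 TB lands as 12 TB
nas_ratio = 100 / 45    # video-production NAS: 100 TB lands as 45 TB

vm_only = blended_reduction([(100, vm_ratio)])
mixed = blended_reduction([(100, vm_ratio), (100, nas_ratio)])
print(f"VM only : {vm_only[0]:.0f} TB back-end, {vm_only[1]:.1f}:1")
print(f"VM + NAS: {mixed[0]:.0f} TB back-end, {mixed[1]:.1f}:1")
```

Adding the NAS workload pulls the blended ratio from roughly 8.3:1 down to about 3.5:1, which is why extrapolating from the original ratio understates the capacity the new mix needs.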
Figure 3.2: Workload categorization and typical dedupe + compression ratios
graph TD
W[Workload Categorization] --> VM[VMware / Hyper-V / AHV VMs]
W --> PHY[Physical Servers]
W --> NAS[NAS SMB/NFS/NDMP]
W --> DB[Databases<br/>Oracle/SQL/HANA]
W --> M365[Microsoft 365]
W --> OBJ[Object / S3 Sources]
VM --> VMR[6:1 to 10:1<br/>High OS template dedupe]
PHY --> PHYR[3:1 to 5:1<br/>Lower OS reuse]
NAS --> NASR[2:1 to 3:1<br/>Mixed media]
DB --> DBR[3:1 to 5:1<br/>Engine-compressed]
M365 --> M365R[2:1 to 3:1<br/>API rate-limited]
OBJ --> OBJR[1.5:1 to 2:1<br/>Pre-compressed]
Daily Ingest, Replication, and Cloud Archive Throughput
Capacity is only half the sizing problem. The other half is flow: how many bytes per hour must traverse network, NVRAM, and SpanFS during the backup window?
The headline formula:
Required_Ingest_Throughput (TB/hr) = (FETB × change_rate) / backup_window_hours
Worked example. A customer has 500 TB of FETB, 2% daily change rate, an 8-hour overnight window. Daily incremental volume = 10 TB. Required ingest = 10 / 8 = 1.25 TB/hr, which is comfortably within a 4-node C5000 cluster’s headroom. Now suppose the customer wants the first full backup completed in the same 8-hour window: 500 / 8 = 62.5 TB/hr, which would require the seed to be staged differently (parallelized first-fulls, longer initial seed window, or a multi-night soft-rollout). Architects routinely miss the “first-full vs. steady-state” distinction in sizing.
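The first-full versus steady-state gap can be made concrete with the headline formula. This is a sketch using the worked example's inputs; the function name is an assumption for illustration.

```python
def required_ingest_tb_per_hr(data_tb: float, window_hours: float) -> float:
    """TB/hr needed to move data_tb within the backup window."""
    return data_tb / window_hours


fetb_tb = 500
daily_change_rate = 0.02
window_hours = 8

steady_tb_per_hr = required_ingest_tb_per_hr(fetb_tb * daily_change_rate, window_hours)
first_full_tb_per_hr = required_ingest_tb_per_hr(fetb_tb, window_hours)
seed_multiple = first_full_tb_per_hr / steady_tb_per_hr

print(f"steady state: {steady_tb_per_hr:.2f} TB/hr")
print(f"first full  : {first_full_tb_per_hr:.1f} TB/hr")
print(f"first full needs {seed_multiple:.0f}x the steady-state rate")
```

The seed requires 50 times the steady-state rate in the same window, which is why the initial full is normally staged over a longer period rather than sized into the cluster.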
Replication throughput must be sized separately. If the customer replicates daily incrementals (10 TB) over a 24-hour window to a DR cluster, required egress = ~0.42 TB/hr ≈ 930 Mbps sustained before any on-the-wire dedupe or compression is credited. WAN sizing in Chapter 9 builds on this same arithmetic. CloudArchive throughput is bounded by the object-storage endpoint and the cluster’s egress NIC; typical sizings assume 200–800 MB/s sustained per node depending on object size and parallelism.
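Converting TB/hr to link speed is a common place for unit slips. The sketch below assumes decimal units (1 TB = 10^12 bytes); `wire_reduction` is a hypothetical parameter for any dedupe or compression applied to the replication stream, and defaults to none.

```python
def tb_per_hr_to_mbps(tb_per_hr: float, wire_reduction: float = 1.0) -> float:
    """Sustained megabits per second needed to move tb_per_hr (decimal TB)."""
    bits_per_hour = tb_per_hr * 1e12 * 8
    return bits_per_hour / 3600 / 1e6 / wire_reduction


daily_incremental_tb = 10
replication_window_hours = 24

egress_tb_per_hr = daily_incremental_tb / replication_window_hours
mbps_raw = tb_per_hr_to_mbps(egress_tb_per_hr)
mbps_reduced = tb_per_hr_to_mbps(egress_tb_per_hr, wire_reduction=2.0)

print(f"{egress_tb_per_hr:.2f} TB/hr is about {mbps_raw:.0f} Mbps with no wire reduction")
print(f"about {mbps_reduced:.0f} Mbps if the stream shrinks 2:1 on the wire")
```

Without any reduction, 10 TB over 24 hours needs roughly 926 Mbps sustained, so a 1 Gbps link leaves almost no margin; the 2:1 case is an illustrative assumption, not a measured Cohesity figure.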
Performance vs. Capacity-Driven Sizing
Cohesity sizings collapse into one of two regimes:
- Capacity-driven sizing — when retention × FETB dwarfs ingest throughput, the cluster’s physical disk count is the constraint. Choose dense nodes (C6000), wide EC stripes (6:2), and grow the cluster by adding capacity. Typical fingerprint: long retention (≥1 year), stable workload, modest RPO, modest RTO.
- Performance-driven sizing — when ingest throughput, restore SLA, or instant-mount concurrency dominates, the cluster’s CPU/NVRAM/flash count is the constraint. Choose performance or all-flash nodes (C5000/C5200/C6200), narrower EC or RF=2, and grow by adding nodes. Typical fingerprint: short RTO (instant VM mount), database log-shipping at high cadence, large numbers of concurrent restores.
The Sizer detects which regime applies and outputs node count to satisfy both. When in doubt, build for the more demanding axis and let the less-demanding axis benefit. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]
Key Takeaway: Profile workloads by category, change rate, retention, and SLA. Decide whether the design is capacity- or performance-driven before choosing nodes. The same FETB can require very different cluster shapes depending on which axis dominates.
Sizing Tools and Calculators
Cohesity provides a Sizer (the partner/SE-facing planning tool) and a published set of formulas that let you sanity-check or hand-build a sizing in a whiteboard session. Both are valid; the CCAE expects you to understand the formulas and to know when the Sizer’s defaults need to be overridden.
Cohesity Sizing Tool Inputs and Outputs
The Sizer is a structured calculator that takes a workload profile and returns a recommended ReadyNode model, node count, and capacity projection over a multi-year horizon. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]
| Sizer Input | Purpose | Typical Default |
|---|---|---|
| Total FETB by workload type | Capacity baseline | Customer-provided |
| Annual growth rate | Multi-year extrapolation | 10–25% |
| Daily change rate by workload | Incremental ingest math | 2% (VM), 5% (DB), 1.5% (NAS) |
| Retention policy (D/W/M/Y) | Cumulative storage demand | 30 / 12 / 12 / 7 GFS |
| RPO (backup frequency) | Backup window concurrency | 24h |
| RTO (recovery target) | Performance node selection | Hours (hybrid) or minutes (all-flash) |
| Dedupe ratio assumption | Effective capacity | 2:1–4:1 by workload |
| Compression ratio assumption | Effective capacity | 1.5:1–2:1 |
| Replication target | Secondary cluster sizing | Optional |
| Archive target | CloudArchive sizing | Optional |
| EC scheme or RF | Usable capacity efficiency | EC 4:2 or 6:2 |
The Sizer outputs a recommended ReadyNode (e.g., C5066 hybrid), a minimum node count satisfying both EC and N+1 constraints, year-1 / year-3 / year-5 capacity projections, and the SKU-level licensing recommendation (DataProtect, SmartFiles, FETB Plus tier). [Source: https://www.cohesity.com/blogs/fetb-vs—betb—thinking-beyond-the-invoice/]
Effective Capacity, Dedupe Assumptions, and Overheads
The chain of capacity transformations from “raw disk on a pallet” to “FETB I can sell” looks like this:
Raw Capacity (sum of all HDD/NVMe across all nodes)
↓ × (D / (D+P)) for EC, or × (1/RF) for replication
Usable Capacity (after resiliency overhead)
↓ − ~5–10% reserved
Available Capacity (after metadata, snapshots, rebuild reserve)
↓ × dedupe × compression
Effective Capacity (the number on the data sheet)
↓ ÷ retention multiplier
Protectable FETB (the number on the sales quote)
A worked example. A 10-node cluster of dense nodes at 192 TB raw per node (the top of the C6000 range) = 1,920 TB raw. EC 6:2 yields 1,920 × (6/8) = 1,440 TB usable. Subtract 7% for metadata/rebuild reserve = ~1,340 TB available. Apply a typical 4.5x combined reduction for VM workloads = ~6,030 TB effective. Divide by an effective retention multiplier of, say, ~12 (which captures 30 daily incrementals + 12 weekly + 12 monthly + 7 yearly with realistic incremental sizes) = ~500 TB protectable FETB. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/] [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]
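The same chain can be expressed as a small function, which makes it easy to vary one assumption at a time. This is a sketch of the arithmetic only; the function and key names are hypothetical.

```python
def capacity_stack(nodes, raw_per_node_tb, data_frags, parity_frags,
                   reserve_fraction, reduction_ratio, retention_multiplier):
    """Walk raw capacity down to protectable FETB, one stage per line."""
    raw = nodes * raw_per_node_tb
    usable = raw * data_frags / (data_frags + parity_frags)   # EC overhead
    available = usable * (1 - reserve_fraction)               # metadata/rebuild reserve
    effective = available * reduction_ratio                   # dedupe x compression
    protectable = effective / retention_multiplier            # retention copies
    return {"raw": raw, "usable": usable, "available": available,
            "effective": effective, "protectable_fetb": protectable}


stack = capacity_stack(nodes=10, raw_per_node_tb=192, data_frags=6, parity_frags=2,
                       reserve_fraction=0.07, reduction_ratio=4.5,
                       retention_multiplier=12)
for stage, tb in stack.items():
    print(f"{stage:>16}: {tb:,.1f} TB")
```

The unrounded results (1,339.2 TB available, 6,026.4 TB effective, 502.2 TB protectable) match the rounded figures in the worked example.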
Figure 3.3: Capacity transformation stack (Raw → Usable → Available → Effective → Practical Ceiling → Protectable FETB)
flowchart LR
R[Raw Capacity<br/>Sum of all HDD/NVMe] -->|"× D/(D+P) for EC<br/>or × 1/RF"| U[Usable Capacity<br/>After resiliency]
U -->|"− 5–10%<br/>metadata + rebuild"| A[Available Capacity]
A -->|"× dedupe<br/>× compression"| E[Effective Capacity]
E -->|"− N+1 reserve<br/>− 20% headroom"| RES[Practical Ceiling]
RES -->|"÷ retention multiplier"| F[Protectable FETB]
The compounding here is unforgiving. A 20% optimistic dedupe assumption multiplied by a 20% optimistic retention multiplier produces a 44% capacity miss. Validate dedupe with a pilot whenever the deal is large enough to justify the calendar time.
RF vs. EC capacity comparison table — useful for back-of-envelope checks:
| Scheme | Min Nodes | Failures Tolerated | Storage Overhead | Usable / Raw |
|---|---|---|---|---|
| RF 2 | 3 | 1 | 100% | 50.0% |
| RF 3 | 4 | 2 | 200% | 33.3% |
| EC 2:1 | 3 | 1 | 50% | 66.7% |
| EC 4:1 | 5 | 1 | 25% | 80.0% |
| EC 4:2 | 6 | 2 | 50% | 66.7% |
| EC 5:2 | 7 | 2 | 40% | 71.4% |
| EC 6:2 | 8 | 2 | 33% | 75.0% |
Note that EC 6:2 delivers the same fault tolerance as RF 3 (2 failures) at less than half the storage cost — but you need at least 8 nodes to use it. Cluster size unlocks EC efficiency, which is one of the most important architectural levers in Cohesity sizing. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]
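The table's overhead and usable/raw columns follow directly from the fragment counts. The sketch below recomputes them and checks the EC 6:2 versus RF 3 comparison; the helper names are hypothetical, and minimum node counts for RF are taken from the table rather than derived.

```python
def ec_profile(data_frags: int, parity_frags: int) -> dict:
    """Efficiency figures for an erasure-coding scheme D:P with node-level fault domains."""
    total = data_frags + parity_frags
    return {"min_nodes": total,
            "failures_tolerated": parity_frags,
            "overhead": parity_frags / data_frags,
            "usable_fraction": data_frags / total}


def rf_profile(copies: int) -> dict:
    """Efficiency figures for replication with the given number of full copies."""
    return {"failures_tolerated": copies - 1,
            "overhead": float(copies - 1),
            "usable_fraction": 1 / copies}


ec62 = ec_profile(6, 2)
rf3 = rf_profile(3)
# Raw TB needed per usable TB is the inverse of the usable fraction.
raw_cost_ratio = (1 / ec62["usable_fraction"]) / (1 / rf3["usable_fraction"])

print(f"EC 6:2: {ec62['usable_fraction']:.1%} usable, {ec62['overhead']:.0%} overhead")
print(f"RF 3  : {rf3['usable_fraction']:.1%} usable, {rf3['overhead']:.0%} overhead")
print(f"EC 6:2 needs {raw_cost_ratio:.0%} of the raw capacity RF 3 needs")
```

EC 6:2 needs about 44% of the raw capacity that RF 3 needs for the same usable space and the same two-failure tolerance, which is the "less than half" in the note above.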
Sizing for SmartFiles Primary Workloads
SmartFiles changes the sizing calculation in two ways. First, the workload is primary rather than backup, so reduction ratios are typically lower (mixed media, no dedupe-friendly OS templates). Second, performance characteristics matter more — clients hit views directly via SMB/NFS/S3, so latency and IOPS feed into node selection.
For SmartFiles you size based on:
- Hot vs. cold data ratio (drives tiering)
- Concurrent user/share count (drives node count)
- Object vs. file vs. mixed access (drives protocol choice)
- QoS requirements per view (drives all-flash vs. hybrid)
A typical SmartFiles sizing assumes 2:1 to 3:1 reduction and a usable-capacity target of ~70% of available (leaving 30% headroom because primary workloads are less tolerant of full-cluster behavior than backup workloads). [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]
Cloud Edition and CloudArchive Sizing
Cloud Edition (CE) deploys SpanFS on cloud VMs (AWS EC2, Azure VMs, GCP), and sizing must account for the constraints of the underlying cloud:
- Instance types determine NVMe and bandwidth caps; sizings cite specific instance SKUs (e.g., AWS i3en.6xlarge for performance tier)
- Object-storage backing for capacity tier (S3, Azure Blob, GCS) adds latency and per-request cost
- Egress fees are nonlinear; sizing must include estimated monthly egress for replication or recovery
- Minimum cluster size (typically 3 nodes) imposes a fixed monthly floor cost
CloudArchive sizing is comparatively simple — it’s an external object-storage target receiving compressed/encrypted snapshots. Size for: cumulative archived data × storage class price + estimated monthly recall + ingestion bandwidth. Glacier and Azure Archive offer 80–90% cost reduction over hot tiers but introduce 4–12 hour rehydration latency that must align with the customer’s restore SLA. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]
Key Takeaway: The Sizer is a structured wrapper around the same arithmetic you can do by hand. Validate dedupe assumptions, always include retention multipliers, and remember that EC efficiency improves dramatically with cluster size — small clusters pay an inevitable resiliency tax.
Node Selection and Cluster Topology
With a target effective capacity and ingest throughput in hand, the next decision is which physical nodes to deploy. Cohesity ships three main ReadyNode families plus all-flash variants, and each is optimized for a different point on the capacity-vs-performance curve.
All-Flash vs. Hybrid vs. Dense Storage Nodes
| Family | Form Factor | CPU | Storage | Power (avg/peak) | Optimized For |
|---|---|---|---|---|---|
| C4000 | 2U single | 1× Xeon 8-core 2.1 GHz | 8× HDD slots + NVMe metadata tier | Modest | Entry/edge, branch |
| C5066 (hybrid) | 2U | 1× Xeon 16-core 2.4 GHz, 128 GB RAM | 54 TB HDD + 3.2 TB NVMe per node | 605–895W avg / 1690W peak | Performance backup |
| C5200 (perf/all-flash) | 2U, 4-node block | 5th-Gen Xeon 16-core 2.0 GHz, PCIe Gen5 | 216 TB HDD + 12.8 TB flash per block (or all-NVMe) | 605–895W avg / 1690W peak | Density + performance |
| C6000 (dense) | Dense 2U | Sized for streaming I/O | Up to 168–192 TB raw HDD/node + flash | ~605W avg / 650W peak | Long-term retention, archive |
| C6200 (all-flash dense) | 2U | Newer all-flash variant | All NVMe | ~440W avg | High-perf retention with low power |
[Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/Cohesity-C4000-Datasheet.pdf] [Source: https://www.cohesity.com/resources/datasheet/c5000-data-sheet/] [Source: https://www.cohesity.com/resources/datasheet/c5200-data-sheet/] [Source: https://www.cohesity.com/resources/datasheet/cohesity-c6000-series-high-density-converged-nodes/] [Source: https://www.networld.co.jp/product_file/file/Cohesity_C6000_.pdf]
Selection rules of thumb:
- C4000 when: edge/ROBO site, ≤4 nodes, mostly capacity, low concurrency. Lowest TCO entry point but limited CPU headroom. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/Cohesity-C4000-Datasheet.pdf]
- C5000/C5066 when: mainstream datacenter backup, 6+ nodes, mixed workloads, balanced capacity and ingest. The default choice for ~70% of CCAE-style scenarios.
- C5200 when: performance density per RU matters, customer wants 4-node-per-2U efficiency, or the design requires PCIe Gen5 NVMe throughput. Common in modernization/refresh deals.
- C6000 when: retention dominates (≥1 year archive on cluster), TB-per-watt and TB-per-RU are deciding factors, restore SLA is hours rather than minutes. The classic “deep retention” node.
- All-flash (C5200 / C6200) when: instant mass restore at scale, CloneRefresh pipelines, sub-minute RTO, analytics on backup data, MongoDB/Cassandra restore SLAs. Highest $/TB but the only option that satisfies aggressive RTO across many simultaneous mounts. [Source: https://www.cohesity.com/platform/c6000/]
Figure 3.4: ReadyNode selection decision tree
flowchart TD
S[Start: Workload + SLA Profile] --> Q1{Edge / ROBO site<br/>≤ 4 nodes?}
Q1 -->|Yes| C4[C4000<br/>Entry / Edge Hybrid]
Q1 -->|No| Q2{Sub-minute RTO<br/>or instant mass restore?}
Q2 -->|Yes| Q3{Density per RU<br/>also matters?}
Q3 -->|Yes| C6200[C6200<br/>All-Flash Dense]
Q3 -->|No| C5200AF[C5200 All-Flash<br/>PCIe Gen5 NVMe]
Q2 -->|No| Q4{Retention dominates<br/>≥ 1 year on cluster?}
Q4 -->|Yes| C6000[C6000<br/>Dense Hybrid]
Q4 -->|No| Q5{4-node-per-2U<br/>density required?}
Q5 -->|Yes| C5200H[C5200 Hybrid<br/>Performance Density]
Q5 -->|No| C5066[C5066<br/>Mainstream Hybrid<br/>Default for ~70% of designs]
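For readers who prefer code to diagrams, the decision tree in Figure 3.4 can be written as a chain of checks evaluated in the same order. The function and parameter names below are hypothetical; the branch order and outcomes are taken from the figure.

```python
def select_readynode(*, edge_site=False, node_count=6, sub_minute_rto=False,
                     density_per_ru_matters=False, retention_dominates=False,
                     four_node_block_required=False) -> str:
    """Return the ReadyNode family suggested by the Figure 3.4 decision tree."""
    if edge_site and node_count <= 4:
        return "C4000"                      # entry / edge hybrid
    if sub_minute_rto:                      # includes instant mass restore
        return "C6200" if density_per_ru_matters else "C5200 all-flash"
    if retention_dominates:                 # >= 1 year retained on cluster
        return "C6000"
    if four_node_block_required:            # 4-node-per-2U density
        return "C5200 hybrid"
    return "C5066"                          # mainstream hybrid default


print(select_readynode(edge_site=True, node_count=3))
print(select_readynode(sub_minute_rto=True, density_per_ru_matters=True))
print(select_readynode(retention_dominates=True))
print(select_readynode())
```

Because the checks are ordered, an RTO-driven requirement wins over a retention-driven one when both apply, mirroring the position of Q2 ahead of Q4 in the figure.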
Minimum Cluster Sizes and Scaling Increments
The minimum supported cluster size is 3 nodes for production deployments (ROBO Edition can run with fewer at smaller scale, with reduced resiliency). The minimum is set by quorum (Chapter 2) and by RF=2 placement rules.
EC schemes constrain minimums further: D + P ≤ node count. The fault-tolerance white paper plots usable capacity by cluster size and EC scheme; the practical implication is:
| Cluster Size | Widest Usable EC | Recommended Scheme |
|---|---|---|
| 3 nodes | EC 2:1 (no spare node for rebuild) | RF 2 |
| 4–5 nodes | EC 2:1 (EC 4:1 at 5 nodes) | EC 2:1 or RF 2 |
| 6–7 nodes | EC 4:2 | EC 4:2 or EC 5:2 |
| 8+ nodes | EC 6:2 | EC 6:2 (best efficiency at 2-failure tolerance) |
| 12+ nodes | EC 6:2 | EC 6:2 (more parallel rebuild domains) |
Scaling beyond the initial cluster is node-by-node, with SpanFS rebalancing chunk files as new nodes come online. Best practice is to add nodes in increments that preserve the EC scheme (e.g., add nodes in pairs if EC 4:2 placement benefits from even balance). Brick-blocks (C5200’s 4-nodes-per-2U) scale in 4-node increments naturally.
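The D + P ≤ node count rule can be applied mechanically when a cluster grows. The sketch below considers only the two-parity schemes from the RF/EC comparison table and returns the widest one that fits; the function name is hypothetical, and it assumes node-level fault domains.

```python
# Two-failure-tolerant schemes listed in the RF/EC comparison table, as (D, P).
TWO_PARITY_SCHEMES = [(4, 2), (5, 2), (6, 2)]


def widest_two_failure_ec(node_count: int):
    """Widest D:2 scheme whose D + P fits in node_count, or None if none fits."""
    fitting = [(d, p) for d, p in TWO_PARITY_SCHEMES if d + p <= node_count]
    if not fitting:
        return None
    return max(fitting, key=lambda scheme: scheme[0])


for nodes in (5, 6, 7, 8, 12):
    scheme = widest_two_failure_ec(nodes)
    label = f"EC {scheme[0]}:{scheme[1]}" if scheme else "none (use RF or a one-parity scheme)"
    print(f"{nodes:>2} nodes -> {label}")
```

Growing from 7 to 8 nodes is the step that unlocks EC 6:2, which is why expansion plans often target that node count explicitly.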
Mixed-Node Clusters and Constraints
Cohesity supports heterogeneous clusters — different ReadyNode models (different CPU, capacity, or even families) coexisting in a single cluster, with SpanFS auto-rebalancing data across them. This is the architectural escape hatch for “we need more capacity at year 3 without forklift”: add C6000 dense nodes to an existing C5066 cluster, and SpanFS migrates cold data to the denser nodes while keeping hot data on the performance nodes. [Source: https://www.cohesity.com/platform/c6000/]
Constraints to watch:
- All nodes in a cluster must be on a compatible Cohesity software version (rolling upgrade, see Chapter 5)
- Mixing all-flash and hybrid nodes produces an effectively tiered cluster — SpanFS routes hot data to flash and cold data to HDD
- Per-node storage imbalance (e.g., 192 TB nodes mixed with 36 TB nodes) can lead to capacity-skewed placement; the larger nodes will hit failure-domain constraints first when the smaller nodes fill
- Performance is gated by the slowest node class for any given chunk’s I/O: an 8-core C4000 in a cluster of 16-core C5200s will throttle the chunks placed on it
Brick Mode vs. Node Mode Considerations
In standard configuration, each physical node is a single fault domain — a node failure removes one EC stripe member or one RF replica. Brick mode subdivides a single dense node into multiple independent fault domains (“bricks”), each with its own subset of disks and metadata. This is useful when:
- A C6000 dense node holds enough TB that losing the entire node would exceed the cluster’s rebuild capacity
- The architect wants finer-grained EC stripe placement than per-node placement allows
- The cluster has only a few large nodes and would otherwise be unable to satisfy EC node-count requirements
Brick mode trades simplicity for placement flexibility. In production it is uncommon outside of dense-node configurations; most architects accept the per-node fault domain default unless the math forces them off it. [Source: https://www.cohesity.com/resources/datasheet/cohesity-c6000-series-high-density-converged-nodes/]
Key Takeaway: Choose nodes by SLA, not by price. Match C4000 to entry/edge, C5000/C5200 to mainstream datacenter, C6000 to retention-heavy, and all-flash to RTO-driven scenarios. Heterogeneous clusters let you grow capacity-only at year 3 without replacing performance nodes.
Capacity Planning Over Time
A cluster sized for day-one is a cluster that runs out of room in eighteen months. Capacity planning must explicitly model growth, technology refresh, tiering, and reserve capacity, then track those assumptions against reality through Helios reporting.
Modeling Growth and Tech-Refresh Cycles
Standard practice: size for year-3 protected FETB at minimum, ideally year-5. Apply an annual data growth rate (Cohesity sizings typically use 10–25% depending on the customer; financial services and healthcare trend higher, manufacturing lower).
A 5-year compounding model:
| Year | FETB (15% YoY growth) | Effective BETB (4.5x reduction, 12x retention factor) |
|---|---|---|
| 0 | 500 TB | 1,333 TB |
| 1 | 575 TB | 1,533 TB |
| 2 | 661 TB | 1,763 TB |
| 3 | 760 TB | 2,028 TB |
| 4 | 874 TB | 2,332 TB |
| 5 | 1,005 TB | 2,681 TB |
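The table can be regenerated from its three assumptions (15% annual growth, 4.5x reduction, 12x retention factor). The sketch below returns unrounded values; the table shows whole TB, so the last digit may differ by one. The function name is hypothetical.

```python
def growth_table(fetb_year0_tb, annual_growth, years, reduction_ratio, retention_factor):
    """Yield (year, fetb_tb, effective_betb_tb) rows under compound growth."""
    rows = []
    for year in range(years + 1):
        fetb = fetb_year0_tb * (1 + annual_growth) ** year
        betb = fetb * retention_factor / reduction_ratio
        rows.append((year, fetb, betb))
    return rows


rows = growth_table(500, 0.15, 5, reduction_ratio=4.5, retention_factor=12)
for year, fetb, betb in rows:
    print(f"year {year}: FETB {fetb:8.1f} TB   effective BETB {betb:8.1f} TB")
```

Changing the growth rate to 25% in the same call shows how quickly the year-3 and year-5 targets diverge, which is the main sensitivity worth testing with a customer.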
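A cluster whose 80% utilization ceiling sits at the day-one 1,333 TB BETB (about 1,667 TB available) crosses that ceiling as soon as growth begins, reaches roughly 92% by the end of year 1, and runs out of space during year 2 unless the architect plans the expansion path explicitly. The two viable strategies: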
- Buy ahead — deploy enough nodes on day one to satisfy year-3 demand. Higher capex but simpler operations.
- Add as needed — deploy a smaller initial cluster and expand annually with additional nodes. Lower capex day one but requires accurate forecasting and predictable supply chain.
Most enterprises blend the two: deploy for year-3 capacity but configure EC and node count such that adding a single node (or a single 4-node block in C5200) at any point shifts headroom by a known amount.
Tech-refresh cycle. Cohesity hardware typically operates on a 5-year refresh, often aligned with the customer’s broader datacenter lifecycle. Refresh strategies include:
- Forklift — replace the entire cluster with new nodes, migrate data via replication. Disruptive but clean.
- Rolling refresh — add new-generation nodes alongside old, drain old nodes, remove. SpanFS supports this without downtime. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]
Tiering Strategy Across Local, Cloud, and Tape
A well-designed Cohesity cluster doesn’t keep every byte on local NVMe forever. Tiering policies move cold data off expensive media into cheaper storage classes, freeing local capacity for hot recovery points and reducing TCO over multi-year retention windows. Tiering options:
| Tier | Media | Latency | Cost | Use Case |
|---|---|---|---|---|
| Hot | Local NVMe / flash | <1 ms | Highest | Last 7–30 days, instant recovery |
| Warm | Local HDD | 5–10 ms | Medium | 30–180 days, routine recovery |
| Cold (CloudTier) | S3 Standard / Azure Cool | 50–200 ms | Low | 6–24 months |
| Archive (CloudArchive) | S3 Glacier / Azure Archive | 4–12 hr rehydrate | Lowest | 1+ year compliance retention |
| Tape (rare) | LTO via gateway | Hours | Very low | Air-gap, regulatory archive |
The CCAE expects you to know that CloudTier is a transparent capacity extension (cluster manages it as additional storage) while CloudArchive is a logical destination for retention/policy-driven movement. Chapter 10 covers both in depth.
Reserve Capacity for Failures and Rebuilds
Two reserves must always be sized into a cluster:
- N+1 capacity reserve — at least one full node’s worth of free space, so SpanFS can redistribute chunks after a node failure without exceeding 100% utilization. For a 10-node cluster of 192 TB nodes, that’s ~192 TB held in reserve. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-fault-tolerance-data-integrity-for-modern-web-scale-environments-white-paper-en.pdf]
- 80% utilization ceiling — best practice is to keep cluster utilization below 80% steady-state. Above that threshold, performance degrades, garbage collection contention rises, and rebuild margins shrink. Helios alerts trigger by default at 80% capacity used.
Combined, this means a “1,920 TB raw” cluster realistically targets ~80% × usable capacity − N+1 reserve as its planning ceiling. For a 10-node, 192 TB-per-node, EC 6:2 cluster: usable = 1,440 TB; 80% = 1,152 TB; minus 192 TB N+1 = 960 TB practical ceiling, measured before dedupe and compression. Architects who size to the full 1,440 TB usable number are setting up the customer for failure.
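The planning-ceiling rule reduces to one expression. The sketch below follows the text in subtracting one node's raw capacity as the N+1 reserve; the function name is hypothetical.

```python
def planning_ceiling_tb(nodes, raw_per_node_tb, data_frags, parity_frags,
                        utilization_ceiling=0.80):
    """80% of usable capacity minus one node's raw capacity held as N+1 reserve."""
    usable = nodes * raw_per_node_tb * data_frags / (data_frags + parity_frags)
    return usable * utilization_ceiling - raw_per_node_tb


ceiling = planning_ceiling_tb(nodes=10, raw_per_node_tb=192, data_frags=6, parity_frags=2)
usable = 10 * 192 * 6 / 8
print(f"usable {usable:,.0f} TB -> planning ceiling {ceiling:,.0f} TB "
      f"({ceiling / usable:.0%} of usable)")
```

For this cluster the ceiling is 960 TB, or two thirds of the 1,440 TB usable figure, so a third of the post-EC capacity is deliberately left unallocated.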
Reporting and Forecasting via Helios
Helios provides multi-cluster capacity reporting with trend lines, forecasting, and SLA dashboards. The CCAE-relevant capabilities:
- Capacity trend reports — historical growth per cluster, projected exhaustion date based on linear or seasonal trend
- Reduction ratio trending — actual dedupe and compression realized per workload, contrasted against sizing assumptions (a sharp drop in dedupe ratio is a leading indicator of workload composition change)
- Per-View / Per-Protection-Group consumption — find the workload responsible for unexpected capacity growth
- SLA reporting — RPO/RTO compliance per protection group, useful for audit and contract review
- Multi-cluster fleet view — fleet-wide capacity for service providers and large enterprises
Helios forecasting converts theoretical capacity planning into operational practice. The architect’s job at deployment is to pick the right starting point; Helios’s job over time is to flag deviations early enough that another node can be ordered before the cluster hits the ceiling. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-data-cloud-a-unified-platform-en.pdf]
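To illustrate what a linear exhaustion forecast does, the sketch below fits a least-squares line to capacity samples and projects when it reaches a ceiling. This is not Helios code or its forecasting algorithm; the sample data and function name are invented for illustration.

```python
def days_until_ceiling(samples, ceiling_tb):
    """samples: list of (day, used_tb), oldest first.

    Fits used_tb = slope * day + intercept by least squares and returns the
    number of days after the last sample at which the fit reaches ceiling_tb,
    or None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    sxx = sum((day - mean_day) ** 2 for day, _ in samples)
    sxy = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_used - slope * mean_day
    last_day = samples[-1][0]
    return (ceiling_tb - intercept) / slope - last_day


# Hypothetical monthly samples: 600 TB used, growing 30 TB every 30 days.
history = [(0, 600), (30, 630), (60, 660), (90, 690)]
remaining_days = days_until_ceiling(history, ceiling_tb=960)
print(f"about {remaining_days:.0f} days until the 960 TB planning ceiling")
```

With this invented history the fit grows at 1 TB per day and reaches 960 TB about 270 days after the last sample, which is the kind of lead time needed to order and install another node.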
Key Takeaway: Plan capacity for year-3 growth at minimum, reserve N+1 plus 20% headroom, tier cold data to cloud, and use Helios trend reports to catch deviations from your sizing assumptions before they become outages.
Worked Example: 500 TB FETB Sizing Walkthrough
Putting it all together with a CCAE-style scenario: a customer has 500 TB FETB (mixed: 350 TB VMware, 100 TB NAS, 50 TB SQL Server), 3% blended daily change rate, and a 30-day retention policy. Design a cluster.
Step 1: Aggregate ingest math.
- Daily incremental = 500 TB × 3% = 15 TB/day
- 30-day cumulative incremental = ~450 TB (before reduction)
- One full FETB sweep at protection-group registration time = 500 TB initial seed
Step 2: Apply blended reduction.
- VMware (350 TB) at 7:1 → 50 TB
- NAS (100 TB) at 2.5:1 → 40 TB
- SQL (50 TB) at 4:1 → 12.5 TB
- Initial-full back-end ≈ 102.5 TB
- 30 days of incrementals at blended ~5:1 → 450 / 5 = 90 TB
- Steady-state back-end ≈ ~193 TB
Step 3: Add growth and N+1 headroom.
- Year-3 with 15% YoY growth: 193 × 1.52 ≈ 293 TB BETB
- Apply the 80% utilization ceiling (20% headroom): 293 / 0.80 = 293 × 1.25 ≈ 366 TB target available capacity
- Plus N+1: must equal at least one node’s raw capacity above target
Step 4: Choose EC and compute raw.
- EC 6:2 (≥8 nodes) → 75% efficiency → raw = 366 / 0.75 = 488 TB raw minimum
- Add N+1 reserve of one node, so we want raw ≈ 488 TB + one node
Step 5: Pick nodes.
- C5066 hybrid at 54 TB HDD per node: 8 nodes × 54 TB = 432 TB raw — slightly under target
- 9 nodes × 54 TB = 486 TB, within about 2 TB of the 488 TB raw minimum; this is also the 8 nodes EC 6:2 requires plus 1 for N+1
- Pragmatic answer: 8× C5066 nodes for EC 6:2 plus 1× C5066 for N+1 = 9-node cluster, 486 TB raw, leaving ~70 TB year-5 headroom before adding nodes
- Alternatively for retention-heavy customer: 6× C6000 dense nodes at 96 TB each = 576 TB raw, EC 4:2 (since 6 nodes), with one-node N+1 → adequate but locks the cluster into a less efficient EC scheme
The “9-node C5066” answer is the typical exam-correct response: it satisfies EC 6:2 minimum (8) + N+1 (1), uses the mainstream performance node, and leaves growth headroom. The C6000 alternative is acceptable when retention dominates and the customer prefers density. [Source: https://www.cohesity.com/blogs/erasure-coding-increase-fault-resilience-capacity/]
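Steps 1 to 4 of the walkthrough can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are invented here, the reduction ratios are the scenario's assumptions, and because it does not round at each step its results differ slightly from the hand-rounded 193 and 293 TB figures.

```python
# (front-end TB, assumed reduction ratio) per workload, taken from the scenario
WORKLOADS = {"vmware": (350, 7.0), "nas": (100, 2.5), "sql": (50, 4.0)}


def steady_state_betb(workloads, daily_change, retention_days, incr_ratio):
    """Reduced initial full plus reduced incrementals over the retention window."""
    fetb = sum(tb for tb, _ in workloads.values())
    initial_full = sum(tb / ratio for tb, ratio in workloads.values())
    incrementals = fetb * daily_change * retention_days / incr_ratio
    return initial_full + incrementals


def raw_minimum_tb(betb, yoy_growth, years, ceiling, ec_data, ec_parity):
    """Grow the back-end footprint, apply the utilization ceiling, then EC overhead."""
    grown = betb * (1 + yoy_growth) ** years
    available = grown / ceiling
    return available / (ec_data / (ec_data + ec_parity))


betb = steady_state_betb(WORKLOADS, daily_change=0.03, retention_days=30, incr_ratio=5.0)
raw = raw_minimum_tb(betb, yoy_growth=0.15, years=3, ceiling=0.80, ec_data=6, ec_parity=2)
print(f"steady-state BETB ~{betb:.1f} TB, raw minimum ~{raw:.0f} TB")
```

Comparing the raw minimum against candidate node counts (for example 8 × 54 TB and 9 × 54 TB) is then a one-line check, which is useful when an exam scenario changes a single input such as the growth rate.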
Chapter Summary
Cluster design and sizing is the most arithmetic-heavy domain on the CCAE exam, but the arithmetic is simple once the conceptual model is clear. Start by profiling the workload (FETB, change rate, retention, RPO, RTO, workload mix), feed those inputs into the Cohesity Sizer or do the math by hand using the formulas in this chapter, and translate the answer into a node count and ReadyNode family. EC scheme selection is the largest lever for capacity efficiency on clusters of 8+ nodes; node-family selection is the largest lever for performance.
Always reserve N+1 capacity, target 80% utilization as a ceiling, and size for year-3 protected FETB at minimum. Helios reporting will tell you if your assumptions held up — and when they don’t, heterogeneous clusters let you add capacity-only nodes without forklift.
The architect who memorizes the RF-vs-EC table, the four ReadyNode families, the FETB-to-BETB transformation chain, and the N+1 / 80% rules will pass every sizing question on the CCAE. The architect who understands why each rule exists will design clusters that customers thank them for five years later.
Key Terms
| Term | Definition |
|---|---|
| FETB | Front-End TB — the source-side data footprint of protected workloads, measured pre-deduplication and pre-compression. The licensing unit and primary Sizer input. |
| Change rate | The percentage of FETB that mutates between successive backups, driving incremental ingest size and back-end capacity per retention day. Typically 1–3% for VMs/NAS, 5–15% for transactional databases. |
| Effective capacity | Cluster capacity after applying the full transformation chain: raw → usable (resiliency) → available (overhead) → effective (dedupe × compression). The number used to compute protectable FETB. |
| All-flash node | ReadyNode variant (C5200, C6200) using NVMe exclusively, optimized for low-latency / high-IOPS workloads such as instant mass restore and analytics on backup data. |
| Hybrid node | ReadyNode (C4000, C5066, C5200 hybrid) combining HDD bulk storage with NVMe flash tier for metadata, dedupe index, and write coalescing. The mainstream choice for backup. |
| Brick mode | Per-node configuration that subdivides a single dense node into multiple independent fault domains (“bricks”), giving SpanFS finer-grained EC stripe placement on dense C6000 hardware. |
| ReadyNode | Cohesity-validated, partner-supplied hardware SKU (Cisco, HPE, Dell) certified for the SpanFS stack. Distinct from Cohesity-branded appliances and from generic third-party hardware. |
| Sizing tool | Cohesity Sizer — partner/SE-facing calculator that converts workload profile (FETB, change rate, retention, growth, EC) into a recommended ReadyNode model, node count, and multi-year capacity projection. |
Chapter 4: Networking, DNS, and Cluster Connectivity
Networking is the discipline that makes or breaks a Cohesity deployment. A perfectly sized cluster with a beautifully tuned protection policy will still miss its backup window if a switch is misconfigured, an MTU mismatch is silently dropping jumbo frames, or DNS is handing clients the IP address of a node that has been down for three hours. Architects sitting the CCAE exam are expected to design the physical, logical, and service-layer network of a Cohesity cluster end-to-end — bonded NICs, VLANs, VIPs, partitions, BGP, SmartDNS, NTP, certificates, and the firewall matrix that ties it all together.
This chapter walks through that stack from the wire up. We start with physical bonding and VLAN tagging on the node, climb into VIPs and the SmartDNS load-balancing service, work outward to external dependencies (DNS, NTP, AD, CAs, SMTP, Syslog), and close with the firewall port matrix every architect must memorize.
Learning Objectives
By the end of this chapter you will be able to:
- Design Cohesity network topologies including bonded interfaces, VLAN tagging, and 10/25/40/100 GbE selection.
- Configure VIPs, cluster partitions, and BGP/static routing for partition-aware client access.
- Plan DNS, NTP, certificate, and identity prerequisites that must be in place before the cluster is bootstrapped.
- Build a firewall port matrix that covers inter-node, source-to-cluster, replication, Helios, and IPMI traffic.
- Troubleshoot common failure modes: LACP half-bonded, MTU mismatch, SmartDNS delegation broken, partition split-brain.
Physical and Logical Network Layout
Figure 4.1: End-to-end network topology from source clients through the bonded cluster interface to a partitioned VIP pool.
flowchart LR
Source[Source Clients<br/>Backup Agents / NAS / VMware]
ToR[ToR Switch Pair<br/>MLAG / vPC]
Bond[bond0<br/>LACP mode 4]
Node1[Node 01]
Node2[Node 02]
Node3[Node 03]
Node4[Node 04]
VIP[VIP Pool<br/>1 VIP per node]
Part[Cluster Partition<br/>mgmt / smb / s3]
Source --> ToR
ToR --> Bond
Bond --> Node1
Bond --> Node2
Bond --> Node3
Bond --> Node4
Node1 --> VIP
Node2 --> VIP
Node3 --> VIP
Node4 --> VIP
VIP --> Part
Every Cohesity node ships with at least two 10GbE+ NICs that are intended to be bonded into a single logical interface — bond0 — that becomes the cluster’s primary network. The primary network is the surface that carries node-to-node traffic, VIPs (and therefore client backup/restore traffic), management UI/API, and replication. A separate IPMI interface handles out-of-band hardware management. Some designs add a secondary network — usually a tagged VLAN on the same bond, occasionally a physically separate bond — to isolate replication or NAS protocol traffic from backup ingest.
Bond Modes: Active-Backup vs. LACP
Cohesity supports exactly two Linux bonding modes on the primary interface — there is no balance-tlb, no balance-alb, and no Cisco PAgP. The choice is binary, and architects should be able to defend it on exam day.
Figure 4.2: Bond mode selection decision tree — Active-Backup (mode 1) versus LACP (mode 4).
flowchart TD
Start[Choose Bond Mode for bond0]
Q1{Switches support<br/>LACP / 802.3ad?}
Q2{Dual-switch<br/>resiliency required?}
Q3{MLAG / vPC<br/>configured?}
Mode1[Mode 1: Active-Backup<br/>One NIC active<br/>Sub-second failover<br/>Branch / ROBO / Lab]
Mode4Single[Mode 4: LACP<br/>Single switch port-channel<br/>Aggregate throughput]
Mode4MLAG[Mode 4: LACP<br/>MLAG / vPC pair<br/>Production default]
Fallback[Fall back to Mode 1<br/>or fix switch config]
Start --> Q1
Q1 -->|No| Mode1
Q1 -->|Yes| Q2
Q2 -->|No| Mode4Single
Q2 -->|Yes| Q3
Q3 -->|Yes| Mode4MLAG
Q3 -->|No| Fallback
| Bond mode | Linux name | Switch requirement | Throughput | Failover | Typical use |
|---|---|---|---|---|---|
| Mode 1 | Active-Backup | None — any switch | One NIC at a time | Sub-second on link loss | Branch / ROBO, lab clusters, switches without LACP/MLAG |
| Mode 4 | 802.3ad LACP | Port-channel/LAG with matching LACP timers; MLAG/vPC for dual-switch | Aggregate (L3+L4 hash) | Sub-second on link loss; about 3 s on LACPDU timeout with fast timers | Production default [Source: https://www.cohesity.com/blogs/optimizing-cohesity-and-vsphere-networking/] |
Mode 4 is the production-recommended default because it gives both link redundancy and active load distribution across NICs. The trade-off is that the upstream switches must support LACP and be configured as a port-channel (Cisco), LAG (Arista/Juniper), or — for cross-switch resiliency — an MLAG/vPC pair so the two NICs in the bond can land on two physically separate switches and still appear to LACP as a single peer [Source: https://iworknthecloud.wordpress.com/2018/04/22/how-to-configure-lacp-vlans-and-jumbo-frames-on-cohesity-and-cisco-nexus/].
Cohesity’s terminology can confuse people: when a Cohesity engineer says “active-active,” they specifically mean LACP mode 4. There is no separate “active-active without LACP” option for the primary bond [Source: https://cypresscollege.a2hosted.com/files/ISER-Evidence/IIC-Student-Support/IIC8/IIC8-15_ISDataProtectionCohesityPlatform.pdf].
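The decision tree in Figure 4.2 can be written down as a small function. This is an illustrative sketch only; the function name and the returned strings are made up here to mirror the figure and are not part of any Cohesity tooling.

```python
def choose_bond_mode(lacp_supported: bool,
                     dual_switch_required: bool,
                     mlag_configured: bool) -> str:
    """Walk the Figure 4.2 decision tree and return the recommended bond mode."""
    if not lacp_supported:
        return "mode 1 (active-backup)"
    if not dual_switch_required:
        return "mode 4 (LACP, single-switch port-channel)"
    if mlag_configured:
        return "mode 4 (LACP over MLAG/vPC)"
    # LACP across two switches without MLAG/vPC cannot form one aggregate.
    return "fall back to mode 1 or fix switch config"


print(choose_bond_mode(lacp_supported=True, dual_switch_required=True, mlag_configured=True))
```

Writing the tree as code makes the one non-obvious branch explicit: dual-switch resiliency without MLAG/vPC is a configuration problem, not a third bond mode.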
The bond mode is set in two places, and forgetting either is a classic deployment bug:
- Per-node, in /etc/sysconfig/network-scripts/ifcfg-bond0, set BONDING_OPTS="mode=4 miimon=100".
- Cluster-wide, via the iris CLI: iris_cli cluster -username=<user> -password=<pass> edit-bm bonding-mode=4 [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].
After applying the change, restart networking and the Nexus services (sudo systemctl stop nexus nexus_proxy && sudo systemctl start nexus nexus_proxy). Verify with cat /sys/class/net/bond0/bonding/mode — you should see 802.3ad 4.
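The verification step lends itself to a small scripted check across nodes. The sketch below parses the text that the Linux bonding driver exposes in the sysfs file named above; the helper names are hypothetical, and how the file contents are collected from each node (SSH, configuration management, and so on) is left out.

```python
from pathlib import Path

# sysfs file quoted in the text; readable only on a node that has bond0
BOND_MODE_PATH = Path("/sys/class/net/bond0/bonding/mode")


def parse_bond_mode(text: str) -> tuple[str, int]:
    """Split sysfs content such as '802.3ad 4' into (mode name, mode number)."""
    name, number = text.split()
    return name, int(number)


def bond_is_lacp(text: str) -> bool:
    """True when the bond reports 802.3ad (mode 4)."""
    return parse_bond_mode(text) == ("802.3ad", 4)


# On a node: bond_is_lacp(BOND_MODE_PATH.read_text())
print(bond_is_lacp("802.3ad 4\n"), bond_is_lacp("active-backup 1\n"))
```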
Primary, Secondary, and IPMI Interfaces
Three logical interfaces matter to the architect:
- Primary network: the bonded interface (bond0) carrying node management IPs, all VIPs, and (via tagged VLANs) backup/NFS/SMB/replication traffic. 10GbE is the floor; 25GbE and 100GbE are increasingly common on dense or all-flash nodes [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].
- Secondary network: optional. Often a tagged VLAN on bond0 used to segregate replication traffic to a DR cluster, or a physically separate bond on dense nodes that need WAN replication kept off the ingest interface.
- IPMI: out-of-band, 1GbE, on a dedicated management VLAN. Uses TCP/UDP 623 (IPMI) plus 80/443 for the Redfish/web UI per node. Keep IPMI off the primary network: it has its own lifecycle, its own credentials, and should not share a broadcast domain with backup traffic.
IP allocation rule of thumb per node:
- Primary network: 1 node management IP + 1 VIP (recommended one VIP per node) = 2 IPs.
- IPMI: 1 IP per node on the OOB network.
A 4-node cluster therefore needs 8 primary-network IPs (4 node + 4 VIP) plus 4 IPMI IPs, all carved out of pools your network team should reserve before bring-up.
VLAN Tagging and Untagged Native VLANs
The recommended segmentation pattern is untagged management on the native VLAN of bond0, with tagged VLANs on top of bond0 for protocol traffic. So:
- bond0 native (untagged) → node management IPs, default gateway, primary VIPs.
- bond0.100 (tagged VLAN 100) → SMB/NFS client traffic for SmartFiles.
- bond0.200 (tagged VLAN 200) → replication to DR cluster.
- bond0.300 (tagged VLAN 300) → tenant-A NAS access.
Each tagged VLAN can carry its own VIP pool, its own SmartDNS subdomain, and its own gateway, allowing per-protocol or per-tenant isolation without buying additional NICs. The switch ports are configured as trunk ports with the appropriate native VLAN and allowed VLAN list.
Speed Selection: 10/25/40/100 GbE
NIC speed should follow the workload, not fashion:
| Speed | Use case |
|---|---|
| 1 GbE | IPMI only. Never the primary bond in production. |
| 10 GbE | Hybrid nodes, mid-size clusters, replication-heavy ROBO. Production minimum. |
| 25 GbE | Dense storage and modern hybrid nodes; common production choice in 2024+. |
| 40/100 GbE | All-flash nodes, large SmartFiles/primary workloads, 10+ node clusters. |
Whatever speed you pick, all NICs in a single bond must run at the same speed — mixing 10 and 25 GbE in bond0 is unsupported. And whatever speed you pick, enable jumbo frames (MTU 9000) end-to-end: Cohesity nodes, switches, uplinks, and any L3 hops. An MTU mismatch with an intermediate L3 hop set at 1500 will silently drop large backup writes and produce bizarre, intermittent slowness that’s miserable to debug [Source: https://iworknthecloud.wordpress.com/2018/04/22/how-to-configure-lacp-vlans-and-jumbo-frames-on-cohesity-and-cisco-nexus/].
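A common way to prove jumbo frames end-to-end is a ping with the don't-fragment bit set and a payload sized to fill the MTU exactly. The sketch below only builds that command; it assumes IPv4 without options (20-byte header), an 8-byte ICMP header, and the Linux iputils ping flags. The target address is a documentation-range placeholder, not a real cluster VIP.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def df_ping_command(host: str, mtu: int = 9000, count: int = 3) -> list[str]:
    """Build a Linux ping command that forbids fragmentation (-M do)."""
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)),
            "-c", str(count), host]


# 192.0.2.10 is a placeholder address used purely for illustration.
print(" ".join(df_ping_command("192.0.2.10")))
```

If the full-size probe fails while a 1472-byte payload (MTU 1500) succeeds, some hop in the path is not passing jumbo frames.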
Key Takeaway: Bond two same-speed 10/25 GbE NICs into bond0 using LACP mode 4 with MLAG/vPC to dual switches, set the cluster-wide bonding mode through iris_cli, run jumbo frames end-to-end, and reserve 2 primary-network IPs plus 1 IPMI IP per node before you ever rack the gear.
Cluster Partitions and VIPs
Once bond0 is up and IPs are reserved, the next layer is the cluster’s VIP pool and how clients reach it. This is where Cohesity’s most distinctive networking feature — SmartDNS — and its concept of cluster partitions come into play.
The Cluster Partition Concept
A cluster partition in Cohesity is a logical grouping of nodes within a single physical cluster. Partitions are commonly used in three scenarios:
- Stretch / multi-rack designs — nodes in rack A in one partition, nodes in rack B in another, so SmartDNS can return only locally-reachable VIPs to clients that are network-close.
- Multi-tenant traffic separation — different tenants are pinned to different partitions, with each partition advertising its own VIPs and FQDN.
- Mixed-protocol routing — SmartFiles SMB clients steered to one partition, S3 clients to another, so a hot SMB workload can’t starve S3 throughput.
A partition is not a separate cluster — quorum, metadata, and SpanFS still span the whole cluster. The partition only governs which VIPs a given client sees and which nodes can serve a given DNS-delegated FQDN.
Virtual IPs (VIPs)
A VIP is an IP address that clients connect to but which is not pinned to a specific node’s hardware. In Cohesity, the recommended pattern is one VIP per node, all VIPs in the same subnet/VLAN as the node management IPs and the gateway, statically assigned (no DHCP) [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].
VIPs serve every client-facing protocol the cluster speaks:
- SMB (TCP 445) — Windows backup clients, SmartFiles SMB shares.
- NFS (TCP/UDP 2049 + 111) — Linux clients, ESXi NFS datastores for instant recovery.
- S3 (TCP 443) — object access to SmartFiles views.
- Management UI/API (TCP 80/443) — operators, automation, Helios proxying.
- Backup/agent traffic (TCP 50051, 11111, etc.) — physical agents and inter-cluster replication land on VIPs.
Because all VIPs live in the same L2 subnet, when a node fails its VIP can be re-homed to another node within the partition without changing routing — clients just see a brief reset and reconnect.
DNS Round-Robin vs. SmartDNS
There are two patterns for steering clients across the VIP pool. The exam will test both.
| Aspect | Classic DNS Round-Robin | SmartDNS (preferred) |
|---|---|---|
| Where DNS records live | Corporate DNS server | Cohesity cluster (delegated subdomain) |
| A-record management | Static, manually maintained | Dynamic, cluster-managed |
| Health awareness | None — failed nodes still get returned | Health-checked — failed VIPs auto-removed |
| TTL | Whatever corporate DNS sets (often 1 hour) | Short (seconds), driving fast client failover |
| Partition awareness | None | Returns only VIPs from healthy nodes in the relevant partition |
| Failover | Manual (admin removes A record) | Automatic (next query gets healthy set) |
| Use case | Lab, ROBO, environments where DNS team won’t delegate | Production, multi-VIP, partitioned, multi-VLAN |
In classic DNS round-robin, you pre-create A records for each VIP under a single cluster hostname (for example cohesity.acme.com) on the corporate authoritative DNS, and the DNS server rotates A-record order on each query. It works, but it’s static — when a node dies the corporate DNS keeps cheerfully handing out its VIP until a human intervenes [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].
Figure 4.3: SmartDNS resolution flow — client query traverses corporate DNS, the delegated Cohesity authoritative DNS, and returns a healthy VIP.
sequenceDiagram
participant Client as Backup Client
participant SiteDNS as Corporate DNS
participant ClusterDNS as Cohesity SmartDNS<br/>(Authoritative for cohesity.acme.com)
participant Health as Cluster Health Check
participant VIP as Healthy VIP
Client->>SiteDNS: Query cohesity.acme.com
SiteDNS->>SiteDNS: Lookup NS records
SiteDNS->>ClusterDNS: Forward via NS delegation
ClusterDNS->>Health: Get currently healthy nodes
Health-->>ClusterDNS: Healthy VIP set (excludes failed)
ClusterDNS-->>SiteDNS: A records (rotated, short TTL)
SiteDNS-->>Client: Healthy VIP address
Client->>VIP: Connect (SMB / NFS / S3 / API)
VIP-->>Client: Service response
In SmartDNS, a subdomain is delegated from corporate DNS to the Cohesity cluster, and the cluster runs an authoritative DNS service for that zone. On each query the cluster returns the currently healthy VIP set, rotated round-robin. Dead nodes are pulled from rotation by internal health checks; clients get a short TTL so retries quickly land on healthy nodes during failover [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].
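The behavioral difference between the two patterns can be shown with a toy model. This is a conceptual sketch, not Cohesity's implementation: the class names are invented, health state is set by hand rather than by real health checks, and TTLs are ignored.

```python
class RoundRobinDNS:
    """Static A-record rotation with no health awareness."""

    def __init__(self, vips):
        self._vips = list(vips)
        self._offset = 0

    def resolve(self):
        # Rotate the record order by one position on every query.
        rotated = self._vips[self._offset:] + self._vips[:self._offset]
        self._offset = (self._offset + 1) % len(self._vips)
        return rotated


class HealthAwareDNS(RoundRobinDNS):
    """Same rotation, but VIPs marked as failed are left out of the answer."""

    def __init__(self, vips):
        super().__init__(vips)
        self.failed = set()

    def resolve(self):
        return [vip for vip in super().resolve() if vip not in self.failed]


vips = ["10.10.0.21", "10.10.0.22", "10.10.0.23", "10.10.0.24"]
classic, smart = RoundRobinDNS(vips), HealthAwareDNS(vips)
smart.failed.add("10.10.0.23")  # simulate a failed node
print("classic:", classic.resolve())
print("smart:  ", smart.resolve())
```

The classic resolver keeps returning the failed VIP until someone edits the record set, while the health-aware resolver drops it on the next query.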
Analogy — SmartDNS as Maitre d’. Picture a busy restaurant with eight tables. Classic DNS round-robin is a printed seating chart taped to the door: it cycles through tables 1 through 8 regardless of whether table 5 has a broken leg or table 7 is on fire. SmartDNS is a maitre d’ who knows the live state of every table — when you walk in they glance across the room, see which tables are clean, lit, and ready for service, and they seat you there. When table 5’s leg breaks, the maitre d’ simply stops sending guests to it until the busser fixes it; the printed chart-holder, by contrast, would happily seat you at the broken table until someone updates the laminated sheet.
Configuring SmartDNS — Architect Steps
- Pick a delegated subdomain: cohesity.<company>.com is conventional.
- On the corporate authoritative DNS, create NS records delegating that subdomain to two or more Cohesity VIPs.
- Create matching A/glue records for the NS hostnames so corporate DNS can reach those VIPs on UDP/TCP 53.
- In the Cohesity UI under Cluster → Network → VIPs / Hostname, set the Cluster Hostname to the delegated FQDN and assign one VIP per node on the primary bond.
- (Optional) Configure additional VLAN-tagged SmartDNS subdomains for SmartFiles, per-tenant traffic, or replication. Each VLAN gets its own VIP pool and FQDN.
- Validate with dig cohesity.acme.com @<cluster-VIP>: you should see the A-record set rotating across queries. Take a node offline (maintenance mode) and re-run; the dead VIP should disappear within seconds.
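Three network rules accompany SmartDNS:
- VIPs on the primary (untagged) network must live in the same subnet/VLAN as the node management IPs and the gateway; VIPs in a tagged-VLAN pool must sit in that VLAN’s subnet alongside its gateway.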
- VIPs must be statically assigned — DHCP is unsupported.
- Multicast must be enabled on the primary network for cluster auto-discovery during initial bring-up. From release 6.6+, unicast modes are available for environments that prohibit multicast [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].
BGP for Multi-Subnet VIP Failover
Layer-2 round-robin VIPs are fine when every client can ARP for them in the same broadcast domain. But for multi-subnet primary networks (supported from 6.6+) and stretch-cluster designs that cross a router, BGP route advertisement lets the cluster announce VIP /32 routes from individual nodes upstream, so that when a VIP moves between partitions or subnets the routing table converges to the new owner.
BGP design points the architect needs:
- The Cohesity cluster speaks BGP to one or more upstream ToR switches or route reflectors, advertising each VIP as a /32.
- When a node fails, its /32 is withdrawn; when SmartDNS re-homes the VIP to a healthy node, that node advertises the /32.
- BGP communities can be used to steer traffic across partitions: for example, a partition tagged “rack-A” advertises with a community that the upstream prefers when the source IP is in rack-A’s subnet.
- BGP and SmartDNS are complementary — SmartDNS picks the VIP, BGP makes the VIP routable across the L3 fabric.
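The withdraw-and-readvertise behavior described in these design points can be pictured with a toy route table. The sketch below models only the bookkeeping (which node currently advertises which /32); it does not speak BGP, and the class and method names are hypothetical.

```python
import ipaddress


class VipRoutes:
    """Toy routing table mapping each VIP /32 to the node advertising it."""

    def __init__(self):
        self.routes = {}

    def announce(self, vip: str, node: str) -> None:
        # ip_network validates the address and normalizes it to 'a.b.c.d/32'.
        prefix = str(ipaddress.ip_network(f"{vip}/32"))
        self.routes[prefix] = node

    def withdraw_node(self, node: str) -> list[str]:
        """Remove every /32 advertised by a failed node and return the prefixes."""
        withdrawn = sorted(p for p, n in self.routes.items() if n == node)
        for prefix in withdrawn:
            del self.routes[prefix]
        return withdrawn


table = VipRoutes()
table.announce("10.10.0.23", "node03")
for prefix in table.withdraw_node("node03"):   # node03 fails
    table.announce(prefix.split("/")[0], "node01")  # VIP re-homed to node01
print(table.routes)
```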
Worked Example — VIP Plan for a 4-Node Cluster with 3 Partitions
Suppose you’re designing a Cohesity cluster for a customer with the following requirements:
- 4 physical nodes, each with bond0 = 2 × 25 GbE LACP to a Cisco Nexus vPC pair.
- Three logical partitions: mgmt (UI, API, replication), smartfiles-smb (SMB views), and smartfiles-s3 (S3 views).
- Production VLANs: VLAN 10 for mgmt (untagged on bond0), VLAN 100 for SMB, VLAN 200 for S3.
- Subnets: 10.10.0.0/24 for mgmt, 10.100.0.0/24 for SMB, 10.200.0.0/24 for S3.
- Corporate DNS administered by the network team, willing to delegate cohesity.acme.com.
Step 1 — Node management IPs (mgmt VLAN, untagged on bond0).
| Node | Management IP | IPMI IP |
|---|---|---|
| node01 | 10.10.0.11 | 10.99.0.11 |
| node02 | 10.10.0.12 | 10.99.0.12 |
| node03 | 10.10.0.13 | 10.99.0.13 |
| node04 | 10.10.0.14 | 10.99.0.14 |
Step 2 — VIP pools, one per node per partition.
| Partition | VLAN | Subnet | node01 VIP | node02 VIP | node03 VIP | node04 VIP |
|---|---|---|---|---|---|---|
| mgmt | 10 (untagged) | 10.10.0.0/24 | 10.10.0.21 | 10.10.0.22 | 10.10.0.23 | 10.10.0.24 |
| smartfiles-smb | 100 | 10.100.0.0/24 | 10.100.0.21 | 10.100.0.22 | 10.100.0.23 | 10.100.0.24 |
| smartfiles-s3 | 200 | 10.200.0.0/24 | 10.200.0.21 | 10.200.0.22 | 10.200.0.23 | 10.200.0.24 |
That’s 4 VIPs per partition × 3 partitions = 12 VIPs total, plus 4 node IPs and 4 IPMI IPs = 20 IPs the network team must reserve before bring-up.
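The address plan in Steps 1 and 2 is regular enough to generate rather than type. The sketch below uses Python's standard ipaddress module; the function name and the choice to start VIPs at host offset 21 simply mirror the tables above and are assumptions of this example.

```python
import ipaddress

PARTITIONS = {
    "mgmt": "10.10.0.0/24",
    "smartfiles-smb": "10.100.0.0/24",
    "smartfiles-s3": "10.200.0.0/24",
}


def vip_plan(partitions: dict, node_count: int, first_host: int = 21) -> dict:
    """Assign one VIP per node per partition, starting at .21 in each subnet."""
    plan = {}
    for name, cidr in partitions.items():
        network = ipaddress.ip_network(cidr)
        plan[name] = [str(network.network_address + first_host + i)
                      for i in range(node_count)]
    return plan


nodes = 4
plan = vip_plan(PARTITIONS, nodes)
vip_total = sum(len(vips) for vips in plan.values())
total_ips = vip_total + nodes + nodes  # VIPs + node management IPs + IPMI IPs
print(plan["smartfiles-smb"], vip_total, total_ips)
```

Handing the network team a generated list like this, rather than a hand-typed one, avoids the off-by-one reservations that tend to surface during bring-up.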
Step 3 — SmartDNS subdomains.
- cohesity.acme.com → delegated to VIPs 10.10.0.21 and 10.10.0.22 (mgmt partition). UI and API live here.
- smb.cohesity.acme.com → delegated to VIPs 10.100.0.21 and 10.100.0.22 (smartfiles-smb partition). SMB clients use \\smb.cohesity.acme.com\share.
- s3.cohesity.acme.com → delegated to VIPs 10.200.0.21 and 10.200.0.22 (smartfiles-s3 partition). S3 SDKs use https://s3.cohesity.acme.com.
Step 4 — NS delegation in corporate DNS.
cohesity.acme.com. IN NS ns1.cohesity.acme.com.
cohesity.acme.com. IN NS ns2.cohesity.acme.com.
ns1.cohesity.acme.com. IN A 10.10.0.21
ns2.cohesity.acme.com. IN A 10.10.0.22
smb.cohesity.acme.com. IN NS ns1-smb.cohesity.acme.com.
smb.cohesity.acme.com. IN NS ns2-smb.cohesity.acme.com.
ns1-smb.cohesity.acme.com. IN A 10.100.0.21
ns2-smb.cohesity.acme.com. IN A 10.100.0.22
... (same pattern for s3)
Step 5 — Validation.
dig cohesity.acme.com # expect 4 A records, rotating
dig smb.cohesity.acme.com # expect 4 A records on VLAN 100
dig s3.cohesity.acme.com # expect 4 A records on VLAN 200
Take node03 into maintenance mode and re-run — its VIP should disappear from all three responses within seconds.
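The validation step can be scripted so it is repeatable after every network change. In the sketch below, dig_short wraps the standard dig +short query and needs dig installed plus network access; check_answers is a pure function and can be exercised offline. Both names are hypothetical helpers written for this example.

```python
import subprocess


def dig_short(name, server=None):
    """Return the answer lines from 'dig +short <name> [@server]'."""
    cmd = ["dig", "+short", name]
    if server:
        cmd.append(f"@{server}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def check_answers(expected_vips, answers, offline=()):
    """Compare observed A records with the VIP plan.

    missing: healthy VIPs that never appeared in the answers
    stale:   VIPs of offline nodes that are still being returned
    """
    seen = set(answers)
    return {
        "missing": sorted(set(expected_vips) - set(offline) - seen),
        "stale": sorted(seen & set(offline)),
    }


mgmt_vips = ["10.10.0.21", "10.10.0.22", "10.10.0.23", "10.10.0.24"]
# Simulated answers after node03 enters maintenance mode:
observed = ["10.10.0.22", "10.10.0.24", "10.10.0.21"]
print(check_answers(mgmt_vips, observed, offline=["10.10.0.23"]))
```

In a live check, answers would be the union of several dig_short("cohesity.acme.com") calls, since a single response may not be enough to judge rotation.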
Key Takeaway: Treat VIPs, partitions, and SmartDNS as one design problem. Reserve one VIP per node per partition in the same subnet/VLAN as the gateway, delegate a per-partition subdomain from corporate DNS, and let SmartDNS health-check VIPs in/out of rotation so client failover happens in seconds without operator intervention.
External Service Dependencies
A Cohesity cluster does not live in isolation. Before the bootstrap wizard ever runs, the architect must align with the network, identity, security, and operations teams on a shopping list of external services. Get any of these wrong and bring-up will stall — or worse, succeed in a way that leaves a security gap you’ll only discover during the first audit.
DNS, NTP, and Reverse DNS
- DNS (TCP 53 / UDP 53) — at least two upstream resolvers. The cluster needs forward and reverse DNS for itself, every node, every VIP, every external target (object store, replication peer, AD domain), and every Helios endpoint. Reverse DNS (PTR records) is mandatory for AD/Kerberos to work; many bootstrap failures trace back to missing PTRs.
- NTP (UDP 123) — at least two NTP sources, ideally three for quorum. Cohesity’s strict-consistency metadata layer is intolerant of clock skew; nodes more than ~30 seconds out can be ejected from quorum. Use the same NTP source as your AD domain controllers — Kerberos has a 5-minute clock skew tolerance and AD itself drifts if NTP is wrong.
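The two skew thresholds mentioned above can be turned into a simple pre-flight check. This sketch assumes the clock offsets have already been measured (for example from an NTP client's report) and supplied in seconds; the 30-second figure is the approximate value quoted in this chapter, not a documented constant, and the labels are invented for illustration.

```python
KERBEROS_MAX_SKEW_S = 300  # default 5-minute Kerberos tolerance
CLUSTER_MAX_SKEW_S = 30    # approximate quorum threshold quoted in the text


def skew_findings(offsets_s: dict) -> dict:
    """Classify hosts whose absolute clock offset exceeds either threshold."""
    findings = {}
    for host, offset in offsets_s.items():
        skew = abs(offset)
        if skew > KERBEROS_MAX_SKEW_S:
            findings[host] = "kerberos-and-quorum-risk"
        elif skew > CLUSTER_MAX_SKEW_S:
            findings[host] = "quorum-risk"
    return findings


# Hypothetical measured offsets in seconds
print(skew_findings({"node01": 0.4, "node02": -45.0, "dc01": 310.0}))
```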
AD/LDAP, Kerberos, and SSO Endpoints
- LDAP (TCP 389) / LDAPS (TCP 636) — Active Directory bind for user lookup. LDAPS is preferred; if you use LDAP+StartTLS or plain LDAP, expect security review pushback.
- Kerberos (TCP 88, UDP 88) — required for AD-joined SMB shares and for the cluster’s own machine account.
- SAML / OIDC endpoints (TCP 443 outbound) — for SSO with Okta, Azure AD/Entra ID, Ping, etc. The cluster needs outbound HTTPS to the IdP’s metadata URL and SAML assertion endpoint.
- DNS prerequisite: the AD domain’s _ldap._tcp.<domain> SRV records must resolve. Architects often forget that SRV-record lookups go through DNS, not LDAP.
The cluster joins AD as a machine account; the service account used to perform the join needs Domain Join rights and the ability to create computer objects in the target OU.
Certificate Authority and TLS Chains
Out of the box the cluster ships with a self-signed cert. For production you must replace it with a CA-signed cert covering:
- The cluster hostname (cohesity.acme.com).
- Every SmartDNS-delegated subdomain in active use (smb.cohesity.acme.com, s3.cohesity.acme.com).
- Optionally, individual node FQDNs.
Use a SAN certificate (multiple subjectAltNames) so a single cert covers all FQDNs. Internal CAs are fine; the cluster trusts certs whose chain it can verify, so the issuing CA’s root and intermediates must be uploaded to the cluster.
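When preparing the certificate signing request, the SAN list is usually supplied as a subjectAltName line in an OpenSSL configuration or extension file. The helper below builds that line from a list of FQDNs; the function name is hypothetical, and the surrounding request configuration (key, subject, extension section) is not shown.

```python
def san_extension(fqdns) -> str:
    """Build an OpenSSL-style subjectAltName line, dropping duplicate names."""
    unique = list(dict.fromkeys(fqdns))  # de-duplicate while keeping order
    return "subjectAltName = " + ", ".join(f"DNS:{name}" for name in unique)


names = [
    "cohesity.acme.com",
    "smb.cohesity.acme.com",
    "s3.cohesity.acme.com",
    "cohesity.acme.com",  # accidental duplicate, removed by the helper
]
print(san_extension(names))
```

Generating the line from the same list that drives the SmartDNS subdomains keeps the certificate in step with the delegated FQDNs as partitions are added.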
For replication and Helios connectivity, the cluster needs to trust the peer cluster’s CA (for cluster-to-cluster replication) and trust public CAs (for Helios over TCP 443). Most enterprises route outbound HTTPS through a forward proxy that does TLS interception — the proxy’s root CA must be uploaded to the cluster or Helios calls will fail with cert validation errors.
SMTP, SNMP, and Syslog Targets
- SMTP (TCP 25 / 465 / 587 outbound) — for email alerts. Most enterprises use an authenticated relay on 587 with STARTTLS.
- SNMP (UDP 161 inbound for polling, UDP 162 outbound for traps) — for monitoring system integration.
- Syslog (UDP 514 / TCP 6514 outbound) — for SIEM integration. TCP 6514 with TLS is preferred for compliance environments.
These don’t block bring-up, but they are usually in scope for the security review and absolutely should be configured before the cluster takes production traffic.
Key Takeaway: Treat DNS forward+reverse records, redundant NTP sources, AD/SSO endpoints, a SAN TLS cert from a known CA, and SMTP/SNMP/Syslog targets as Day-0 prerequisites — not Day-2 polish. A cluster that boots without them is a cluster that will fail its first audit.
Firewalls and Port Requirements
The CCAE exam loves port-matrix questions. Memorize the table below; an architect who can rattle off “TCP 50051 is the physical-agent channel, TCP 11114 is replication-inbound, UDP 123 is NTP” will save themselves several scenario questions.
Figure 4.4: Cohesity port matrix as a hierarchy — source-to-cluster, inter-node, inter-cluster, external services, and management.
graph TD
Root[Cohesity Port Matrix]
Source[Source to Cluster]
Inter[Inter-Node / Inter-Cluster]
External[External Services]
Mgmt[Management / OOB]
Root --> Source
Root --> Inter
Root --> External
Root --> Mgmt
Source --> Agent[TCP 50051<br/>Physical Agent<br/>Win/Linux/SQL/Oracle]
Source --> VMware[TCP 443<br/>vCenter / ESXi]
Source --> SMB[TCP 445<br/>NAS SMB]
Source --> NFS[TCP/UDP 2049 + 111<br/>NAS NFS + RPC]
Source --> WinRM[TCP 5986<br/>Hyper-V WinRM-HTTPS]
Inter --> IO[TCP 11111<br/>I/O Operations]
Inter --> Repl[TCP 11114<br/>Replication]
Inter --> Mgmt24[TCP 24444<br/>Cluster Mgmt]
Inter --> API[TCP 443<br/>Mgmt API]
External --> DNS[TCP/UDP 53<br/>DNS / SmartDNS]
External --> NTP[UDP 123<br/>NTP]
External --> LDAP[TCP 389/636<br/>LDAP / LDAPS]
External --> Krb[TCP/UDP 88<br/>Kerberos]
External --> Helios[TCP 443<br/>Helios outbound]
External --> SMTP[TCP 25/465/587<br/>SMTP]
External --> Syslog[UDP 514 / TCP 6514<br/>Syslog]
Mgmt --> SSH[TCP 22<br/>SSH / iris_cli]
Mgmt --> UI[TCP 80/443<br/>UI / API]
Mgmt --> IPMI[TCP/UDP 623 + 80/443<br/>IPMI / Redfish]
The Cohesity Port Matrix
Inter-Node and Intra-Cluster Ports
Within a single cluster, all node-to-node traffic flows on bond0 and is generally not firewalled — the assumption is that the cluster’s primary network is a flat, trusted L2 segment. If your security policy puts an internal firewall between nodes (rare but seen in some service-provider topologies), the same inter-cluster ports — 11111, 11114, 24444, 443 — must be opened, plus 53 and 123 for DNS/NTP if those services are partition-local.
Crucially, multicast must be enabled on the primary network for the bring-up auto-discovery phase. If the network team can’t allow multicast on the primary VLAN, plan to use the unicast bring-up path on releases 6.6+ [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].
Source-to-Cluster Ports
The single most-tested fact in this section: all physical agent traffic (Linux, Windows, Hyper-V, SQL VSS, Oracle RMAN coordination) shares TCP 50051. If you remember nothing else, remember that one [Source: https://docs.cohesity.com/baas/data-protect/firewall-ports.htm].
After that:
- VMware flows over TCP 443 from the cluster to vCenter and ESXi hosts.
- NAS uses 445 (SMB), 2049 (NFS), 111 (RPC), occasionally 8080.
- Linux NFS-based restores also need TCP/UDP 111 (rpcbind) for mount.
- Hyper-V / SCVMM granular file recovery uses TCP 5986 (WinRM-HTTPS) into the guest VM.
- Oracle Hybrid Extender to cluster uses TCP 29991 for the NFS mount of Cohesity views during instant recover and clone workflows.
Helios Outbound Connectivity
Helios is a SaaS control plane. The cluster initiates an outbound HTTPS (TCP 443) tunnel to Helios — Helios never reaches in. This means:
- The cluster needs outbound TCP 443 to helios.cohesity.com and the regional Helios endpoints.
- If a forward proxy intercepts TLS, its CA must be trusted by the cluster.
- If proxy authentication is required, configure proxy credentials in the cluster.
In air-gapped environments (FedRAMP, classified networks), Helios is not available and operations rely on the cluster’s local UI/API only.
Common Firewall Misconfigurations
- TCP 50051 not opened from cluster VIPs to physical agents — backups fail with “agent unreachable.” Open from each VIP, not just node IPs, because the connection can come from any VIP in the partition.
- Asymmetric replication ports — opening 11111 outbound on the source side but not inbound on the DR side. Replication needs bidirectional flows; map the table direction carefully.
- MTU 1500 on a transit switch between Cohesity and a NAS source while both endpoints are MTU 9000 — silent packet drops, intermittent slowness.
- Missing reverse DNS — Kerberos breaks, AD-joined SMB protections fail with cryptic auth errors.
- Helios proxy CA not trusted — cluster shows “disconnected from Helios” while local UI looks fine. Always test outbound TLS to the Helios endpoint with the cluster’s trust store, not a generic curl from a node.
- IPMI on the primary VLAN — security finding waiting to happen; IPMI must be on a separate OOB management VLAN.
- SmartDNS NS delegation pointed at node IPs instead of VIPs — when the named node fails, the entire delegated zone goes dark. Always delegate to two or more VIPs, never node IPs.
Key Takeaway: Memorize the port matrix — TCP 50051 for agents, 443 for VMware and Helios, 11111/11114/24444 for replication, 445/2049/111 for NAS — and treat reverse DNS, MTU 9000 end-to-end, and IPMI VLAN isolation as non-negotiable prerequisites.
Chapter Summary
Cohesity networking is a layered design problem. At the bottom is the physical bond — two same-speed 10/25 GbE NICs in LACP mode 4 to an MLAG/vPC switch pair, jumbo frames everywhere, configured both in ifcfg-bond0 per node and cluster-wide via iris_cli edit-bm bonding-mode=4. On top of bond0 ride untagged management traffic on the native VLAN and tagged VLANs for protocols, replication, and tenant separation. Each node carries one node IP and one VIP per partition on the primary network, plus one IPMI IP on a separate OOB VLAN.
Above the IPs is the SmartDNS service. By delegating a subdomain from corporate DNS to the cluster, the cluster becomes an authoritative DNS server that returns currently healthy VIPs in round-robin order, with partition awareness and second-scale TTLs that give clients near-instant failover when a node dies — the maitre d’ replacing the laminated seating chart. For multi-subnet primaries and stretch designs, BGP /32 advertisement makes VIP failover work across L3 boundaries.
Cluster bring-up depends on a roster of external services: redundant DNS with forward and reverse records, redundant NTP within seconds of AD time, AD/LDAPS+Kerberos endpoints, a SAN TLS cert from a trusted CA, SMTP/SNMP/Syslog targets, and outbound HTTPS to Helios. And the whole thing rides on a port matrix the architect must memorize: TCP 50051 for physical agents, 443 for VMware and Helios outbound, 11111/11114/24444 for inter-cluster replication, 445/2049/111 for NAS, 88/389/636 for AD, 53/123 for DNS/NTP, and 623 + 80/443 for IPMI on the OOB network.
Get the bond mode wrong and you cap throughput at one NIC. Get the MTU wrong and large writes silently drop. Get reverse DNS wrong and Kerberos fails. Get NS delegation pointed at node IPs and SmartDNS dies with the first failure. None of these are subtle — they are the architect’s checklist on Day 0, every time.
Key Terms
- VIP (Virtual IP) — A statically assigned IP on the cluster’s primary network, used by clients to reach SMB, NFS, S3, or management endpoints. Recommended one VIP per node, all in the same subnet/VLAN as the gateway, no DHCP.
- Cluster partition — A logical grouping of nodes within a single cluster, used for stretch/multi-rack designs, multi-tenant isolation, or per-protocol traffic separation. SmartDNS returns VIPs only from healthy nodes within the relevant partition pool.
- LACP (Link Aggregation Control Protocol, IEEE 802.3ad / Linux bonding mode 4) — Active-active bonding mode where all NICs in the bond carry traffic; LACPDUs negotiate the aggregate with the upstream switch. Requires switch port-channel/LAG support and (for dual-switch resiliency) MLAG/vPC. The production-recommended bond mode for Cohesity.
- SmartDNS — Cohesity’s authoritative DNS service that runs on the cluster itself for a delegated subdomain. Returns currently healthy VIPs in round-robin order with partition awareness, automatically removing failed VIPs from rotation. The preferred load-distribution mechanism for production clusters.
- BGP (Border Gateway Protocol) — Used by Cohesity (6.6+) to advertise VIP /32 routes to upstream switches/route reflectors so VIP failover works across L3 boundaries. Complements SmartDNS for multi-subnet and stretch designs.
- Bond (bond0) — The Linux logical interface formed by aggregating two or more physical NICs. Cohesity supports mode 1 (Active-Backup) and mode 4 (LACP). The bond is the cluster’s primary network and carries node IPs, VIPs, and tagged VLANs.
- Primary network — The bonded interface (bond0) carrying node management IPs, VIPs, and (via tagged VLANs) backup, NFS, SMB, S3, and replication traffic. 10 GbE is the production minimum.
- IPMI (Intelligent Platform Management Interface) — The out-of-band hardware management interface on each Cohesity node. 1 GbE, on a dedicated OOB management VLAN, using TCP/UDP 623 plus 80/443 for Redfish/web UI. Must be kept off the primary network.
Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations
A Cohesity cluster spends roughly 30 minutes being born and the rest of its life in maintenance mode. The CCAE exam reflects that ratio: it expects you to bootstrap a cluster cleanly the first time, then operate it for years without unplanned downtime. This chapter walks through the full lifecycle: imaging and bootstrapping a physical cluster with iris_cli, deploying Virtual Edition (VE) on VMware, deploying Cloud Edition (CE) on AWS and Azure, and then performing the Day-2 work — rolling upgrades, node additions, disk replacements, and automation via REST, PowerShell, Ansible, and Terraform.
Learning Objectives
By the end of this chapter, you will be able to:
- Bootstrap a new physical, virtual, or cloud Cohesity cluster end-to-end using both the web wizard and iris_cli.
- Apply best-practice configuration (DNS, NTP, AD/SSO, licensing, VIPs) to a brownfield enterprise environment.
- Perform Day-2 operations including rolling upgrades, node additions/removals, and disk/node replacement procedures.
- Use the Cohesity CLI (iris_cli), REST API v1/v2, Helios API, the PowerShell module, the Ansible collection, and the Terraform provider to automate cluster lifecycle.
- Choose appropriately between physical, Virtual Edition, and Cloud Edition deployment models for a given architecture requirement.
Bootstrapping a New Cluster
A Cohesity cluster begins life as a stack of imaged but unconfigured nodes. Bootstrap is the act of giving the first node an IP, telling it about its peers, and forming a quorum. Until quorum forms, there is no SpanFS, no Bridge service accepting backups, and no Helios footprint — just a handful of Linux boxes waiting for instructions.
Out-of-the-Box Experience and First-Time Wizard
When a Cohesity-branded appliance, ReadyNode, or certified partner platform arrives from the factory, each node ships pre-imaged with the Cohesity OS but without IP addresses. The bootstrap entry point is IPMI (Intelligent Platform Management Interface), the out-of-band management controller every server provides. The architect or field engineer connects to each node’s IPMI IP — typically supplied on a sticker by the hardware vendor — powers the node on, and verifies the Cohesity imaging splash screen [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].
Each node carries a unique node-ID (e.g., 181140266786854) printed on the IPMI console or visible at the boot screen. Architects must record these IDs because they are required arguments to iris_cli cluster create when nodes do not yet have IPs assigned [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].
Analogy: Think of the node-ID like a baby’s hospital wristband. Before the cluster gives the node a name and an IP “address” to live at, the node-ID is the only handle that uniquely identifies which physical box you mean.
IP Allocation and Partition Creation
The first node is the bootstrap target. Console into it (via IPMI or directly) and run the network configuration script:
cd /home/cohesity/bin/network
./configure_network.sh
The script prompts for:
- Bond selection: bond0 is the default node-to-node bond (used by bridge0); bond1 is reserved for redundancy or a separate management/storage plane [Source: https://www.youtube.com/watch?v=oXFVOG6AO0o].
- IP, prefix, gateway: a routable address on the management network whose prefix contains the gateway (e.g., 10.1.4.16/20, gateway 10.1.0.1, matching the 255.255.240.0 mask used in the cluster create example later in this section).
- MTU: 1500 by default; raise to 9000 for jumbo frames if the switch fabric supports them end-to-end.
- DNS servers: required for cluster name resolution and AD integration.
Activating the changes resets interfaces and incurs roughly 30–60 seconds of downtime on the bootstrap node — fine, since the cluster is not yet serving traffic. As an alternative to the script, architects can use iris_cli directly after logging in (default credentials: admin/admin):
iris_cli interface list
iris_cli node status
These commands enumerate physical interfaces and bond memberships so the architect can confirm that bond0 is up before pushing the configuration [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].
A key exam-relevant fact: partitions are created automatically during cluster formation. There is no manual partitioning step before iris_cli cluster create. Cluster partitions logically group nodes for VIP segregation and tenancy, but the initial partition is allocated as part of the create call [Source: http://pdamien58.blogspot.com/2016/01/cohesity-initial-cluster-setup.html].
Joining Nodes and Forming Quorum
Once the first node has an IP, it can see its peers on the same L2 domain via Cohesity’s discovery protocol. Run:
iris_cli discover free-nodes
This lists all imaged-but-uncommitted nodes the bootstrap node can reach. Architects review the list and select which nodes to enroll — important when, for example, a 4-node chassis is being split into two 2-node clusters [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].
With the free-node list in hand, the cluster is created in a single iris_cli invocation:
iris_cli cluster create \
domain-names=eng.cohesity.com \
ntp-servers=pool.ntp.org \
name="haswell2" \
hostname=haswell2.eng.cohesity.com \
subnet-gateway=10.1.0.1 \
subnet-mask=255.255.240.0 \
dns-server-ips=10.2.0.1 \
node-ips=10.1.4.16,10.1.4.17,10.1.4.18 \
node-ids=181140266786854,181140264583348,181140264822986 \
vips=10.1.4.20,10.1.4.21,10.1.4.22 \
enable-encryption=true
The command bundles every prerequisite into one atomic operation: domain, NTP, DNS, gateway, mask, node IPs, node IDs, VIPs, and encryption posture [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf]. If nodes have already been assigned IPs (a common brownfield pattern when the network team pre-stages addresses), the node-ids= parameter can be omitted; Cohesity discovers the nodes by IP instead [Source: https://www.youtube.com/watch?v=vAsrKn14jgY].
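Because the create call is all-or-nothing, it can help to sanity-check the address plan before running it. The following Python sketch uses only the standard `ipaddress` module; `validate_create_args` is a hypothetical helper, and the rule that node-ids and node-ips must have equal counts is an assumption of this sketch rather than documented iris_cli behavior.

```python
import ipaddress


def validate_create_args(gateway, netmask, node_ips, node_ids, vips):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    # Derive the subnet from the gateway address and the dotted-quad mask.
    subnet = ipaddress.ip_network(f"{gateway}/{netmask}", strict=False)
    # Assumption: when node IDs are supplied, there is one per node IP.
    if node_ids and len(node_ids) != len(node_ips):
        problems.append("node-ids and node-ips must have the same count")
    # Every node IP and VIP should sit in the same subnet as the gateway.
    for label, addrs in (("node-ip", node_ips), ("vip", vips)):
        for addr in addrs:
            if ipaddress.ip_address(addr) not in subnet:
                problems.append(f"{label} {addr} is outside {subnet}")
    # An address cannot serve as both a node IP and a VIP.
    overlap = set(node_ips) & set(vips)
    if overlap:
        problems.append(f"addresses used as both node IP and VIP: {sorted(overlap)}")
    return problems
```

Run against the values in the example command (gateway 10.1.0.1, mask 255.255.240.0), the function returns an empty list; with a 255.255.255.0 mask it flags every 10.1.4.x address as outside the gateway's subnet.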
Quorum requires a majority of nodes, but initial formation is stricter. For a 3-node cluster, all three nodes must be present when the create runs; the cluster can tolerate a missing node only after quorum is established and SpanFS metadata replicas are placed.
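The majority arithmetic behind quorum can be sketched in a few lines of Python. This is generic majority-quorum math, not a Cohesity-specific formula: `quorum_size` and `failures_tolerated` are hypothetical names, and real fault tolerance also depends on the RF/EC settings of the cluster.

```python
def quorum_size(node_count):
    """Smallest number of nodes that forms a strict majority."""
    return node_count // 2 + 1


def failures_tolerated(node_count):
    """Nodes that can be lost while a majority still remains (after formation)."""
    return node_count - quorum_size(node_count)
```

For a 3-node cluster this gives a quorum of 2 and a tolerance of one lost node, which applies only once the cluster has formed.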
Exam tip: Multicast is frequently disabled in enterprise networks. Cohesity supports a unicast-discovery path; the architect must ensure the same node count, VIP count, and IPMI count are supplied and that ARP-based peer discovery works on the bootstrap subnet [Source: https://www.ibm.com/docs/en/storage-defender/base?topic=suc-configuring-primary-secondary-network-in-cluster-multicast-disabled].
Initial AD/SSO and Licensing Setup
With quorum formed, the cluster boots its services and presents a web UI on each node IP and on each VIP. The architect’s first post-create tasks are:
- Verify health: iris_cli cluster status should show all nodes Online, services Up, and SpanFS mounted.
- DNS registration: add A records for the cluster FQDN that round-robin across all VIPs (e.g., haswell2.eng.cohesity.com -> 10.1.4.20, 10.1.4.21, 10.1.4.22) [Source: https://cypresscollege.a2hosted.com/files/ISER-Evidence/IIC-Student-Support/IIC8/IIC8-15_ISDataProtectionCohesityPlatform.pdf].
- License application: upload the entitlement file (capacity-based or subscription) under Settings > Licensing.
- Active Directory join: bind the cluster to the AD forest for user authentication; map AD groups to built-in roles (Admin, Operator, Viewer).
- SSO/SAML: optionally federate with Okta, Azure AD, or Ping for browser-based login.
- Helios registration: link the cluster to Helios to enable global dashboards, multi-cluster reports, and SaaS-only features (DataHawk, FortKnox, SiteContinuity orchestration).
A classic brownfield pitfall is forgetting reverse DNS. Cohesity uses PTR records during AD join and Kerberos ticket validation; missing PTRs cause AD join failures that look mysterious until the architect runs dig -x <cluster IP>.
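To avoid the missing-PTR pitfall, the forward and reverse records can be generated together from the VIP list and handed to the DNS team as one request. This is a minimal Python sketch using the standard `ipaddress` module; `dns_records` is a hypothetical helper, and the record tuples are an illustrative format rather than any DNS server's import syntax.

```python
import ipaddress


def dns_records(fqdn, vips):
    """Return (name, type, value) tuples: one A and one PTR record per VIP."""
    records = []
    for vip in vips:
        addr = ipaddress.ip_address(vip)
        # Forward record: cluster FQDN -> VIP (round-robin across all VIPs).
        records.append((fqdn, "A", str(addr)))
        # Reverse record: in-addr.arpa name -> cluster FQDN.
        records.append((addr.reverse_pointer, "PTR", fqdn))
    return records
```

For the example cluster, `dns_records("haswell2.eng.cohesity.com", ["10.1.4.20"])` yields an A record for 10.1.4.20 and a PTR record named 20.4.1.10.in-addr.arpa.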
Figure 5.1: Bootstrap workflow from IPMI power-on to AD/SSO integration
flowchart TD
A[Power on imaged nodes via IPMI] --> B[Console to first node]
B --> C[Run configure_network.sh<br/>set bond0 IP, mask, gateway, DNS, MTU]
C --> D[iris_cli discover free-nodes]
D --> E{Select nodes<br/>to enroll}
E --> F[iris_cli cluster create<br/>domain, NTP, DNS, node-ips, node-ids, VIPs]
F --> G[Quorum forms<br/>initial partition auto-created]
G --> H[iris_cli cluster status<br/>verify health]
H --> I[DNS A/PTR records for VIPs]
I --> J[Apply license]
J --> K[Active Directory join<br/>map AD groups to roles]
K --> L[Optional SAML/SSO federation]
L --> M[Helios registration]
Key Takeaway: Bootstrap is a one-shot, atomic operation: imaged nodes get IPs, the first node discovers peers via iris_cli discover free-nodes, and a single iris_cli cluster create call passes domain, NTP, DNS, gateway, mask, node IPs/IDs, and VIPs to form quorum and auto-create the initial partition.
Virtual and Cloud Edition Deployment
Not every Cohesity cluster lives on Cohesity-branded steel. Virtual Edition (VE) packages the same SpanFS stack as a virtual appliance for VMware ESXi, Hyper-V, Nutanix AHV, KVM, and even Raspberry-Pi-class edge devices. Cloud Edition (CE) packages it for AWS, Azure, and GCP. The deployment mechanics differ, but the post-deploy operational model — Helios registration, policies, protection groups — is identical.
VE Prerequisites on VMware and Hyper-V
Virtual Edition ships as an OVA (VMware) or VHDX (Hyper-V) and runs as a single-node or multi-node cluster. A VE node consumes considerably less than its physical counterpart but still has hard floors: typically 8+ vCPU, 64+ GB RAM, dedicated performance-tier disk on flash, and a separate capacity-tier disk on HDD or capacity SSD [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/].
Critical sizing constraints unique to VE:
- Single-node VE expands by disk resize, not node addition. To add capacity, the architect shuts down the VM, extends the capacity-tier disk in the hypervisor (never the boot disk), powers on, and runs:
    iris_cli cluster stop
    iris_cli node list disk extend
    iris_cli disk list
    iris_cli cluster start
  The sequence completes in minutes and adds the new capacity to SpanFS [Source: https://www.youtube.com/watch?v=BYM4u4NfvaI] [Source: http://demitasse.co.nz/2018/12/expanding-storage-on-a-cohesity-virtual-edition-appliance/].
- Multi-node VE clusters (3+ nodes) behave like physical clusters: iris_cli discover free-nodes and iris_cli cluster create are used identically.
- Anti-affinity rules must be configured in vSphere DRS so VE nodes never co-reside on the same ESXi host; otherwise a single host failure takes the cluster down.
- VMDK provisioning must be Eager Zeroed Thick on the performance tier to guarantee NVRAM-equivalent commit latency. Thin-provisioned performance disks introduce write-stall behavior under load.
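The anti-affinity requirement is easy to audit from a VM-to-host inventory export. The Python sketch below assumes the placement is already available as a plain dictionary; `anti_affinity_violations` is a hypothetical helper and does not query vSphere.

```python
from collections import defaultdict


def anti_affinity_violations(placement):
    """placement maps VE node VM name -> ESXi host name.

    Returns {host: [vm, ...]} for every host carrying more than one VE node.
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
```

An empty result means no two VE nodes share a host; any entry identifies a host whose failure would take down more than one cluster node.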
Cohesity Cloud Edition on AWS, Azure, GCP
Cloud Edition is the same software re-packaged as cloud VMs. It runs as a minimum 3-node production cluster (single-node CE is supported only for lab use) and registers to Helios for management.
Azure deployment uses the Cohesity Marketplace image with a documented VM size of Standard_DS5_v2 (16 vCPU per node). A 3-node minimum cluster therefore consumes 48 vCPU cores; new Azure subscriptions often default to a 10-core regional quota, so the architect’s first call is to request a quota increase before deploying [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/]. Azure resource-group caps allow up to 64 VMs per resource group, which sets an effective ceiling on a single CE cluster’s node count [Source: https://docs.cohesity.com/baas/data-protect/azure-vm/azure-prereq.htm].
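The quota arithmetic above generalizes to any node count. A minimal Python sketch follows, with the 16 vCPU per node figure taken from the paragraph above; `azure_cores_needed` and `quota_shortfall` are hypothetical names.

```python
def azure_cores_needed(node_count, vcpus_per_node=16):
    """Total vCPU cores a CE cluster of node_count nodes consumes."""
    return node_count * vcpus_per_node


def quota_shortfall(node_count, regional_quota, vcpus_per_node=16):
    """Cores to request beyond the current regional quota (0 if none needed)."""
    needed = azure_cores_needed(node_count, vcpus_per_node)
    return max(0, needed - regional_quota)
```

A 3-node cluster needs 48 cores, so a subscription with a 10-core quota is 38 cores short.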
Azure Managed Disks must be sized in multiples of 1 MB. The capacity tier is resizable in place (e.g., 594 GB → 700 GB, yielding ~690 GiB usable after formatting), while the performance tier is provisioned on premium SSD and is not normally resized [Source: https://chriscolotti.us/vmware/how-to-deploy-the-cohesity-azure-cloud-edition/].
AWS deployment uses the Cohesity DataPlatform Cloud Edition AMI from the AWS Marketplace [Source: https://aws.amazon.com/marketplace/pp/prodview-m3tozzczpsmqe]. Sizing approximates the m5/m6i large-instance family; specific recommendations appear in the Cohesity & AWS Solution Brief [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-and-AWS-Solution-Brief.pdf]. AWS CE supports S3 tiering for overflow capacity, allowing the cluster to spill cold blocks to lower-cost object storage [Source: https://www.cohesity.com/blogs/deploying-cohesity-dataplatform-cloud-edition-aws/].
Networking prerequisites (both clouds): a dedicated subnet, security-group rules permitting inter-node traffic, a VPN or Direct Connect/ExpressRoute back to on-premises for replication, and IAM roles granting the cluster permission to read/write S3 (AWS) or Blob Storage (Azure) for CloudArchive and CloudTier targets.
Robo Edition and Edge Deployments
Robo Edition is a thin VE variant tuned for remote/branch offices (ROBO). A typical Robo cluster is a single-node VE on existing branch-office hypervisor capacity, configured to replicate inbound to a central CE or physical hub. Robo Edition trades clustered resiliency (no SpanFS quorum across nodes) for a tiny footprint and manages risk by replicating critical backups upstream within minutes.
Comparing Deployment Models
| Attribute | Physical (Branded/ReadyNode) | Virtual Edition (VE) | Cloud Edition (CE) |
|---|---|---|---|
| Form factor | 1U/2U appliance | OVA/VHDX virtual appliance | Cloud Marketplace image (AMI/Azure) |
| Min nodes (production) | 3 | 3 (single-node = lab/Robo only) | 3 |
| Bootstrap entry point | IPMI + console | Hypervisor console | Cloud console + SSH |
| Node-ID source | Hardware sticker / IPMI | OVF deployment | VM metadata |
| Capacity expansion | Add nodes / disks | Resize VMDK, then disk extend | Add VMs or resize managed disks |
| Networking | Bonded NICs, VLAN trunks | vSwitch port group, MTU | VPC/VNet subnet, security groups |
| Storage backing | NVMe (perf) + HDD/SSD (cap) | VMDK on flash + capacity datastore | Premium SSD + standard SSD/HDD |
| External overflow | Local only | Local only | S3 / Blob tier supported |
| Helios | Optional | Optional | Strongly recommended |
| Typical use case | Primary on-prem fabric | Lab, ROBO, dev/test | Cloud DR, cloud-native protection |
Figure 5.2: Deployment form factor matrix across Physical, VE, Cloud Edition, and Robo Edition
flowchart LR
subgraph Physical[Physical / ReadyNode]
P1[1U or 2U appliance]
P2[IPMI bootstrap]
P3[NVMe perf + HDD/SSD cap]
end
subgraph VE[Virtual Edition]
V1[OVA / VHDX]
V2[VMware, Hyper-V, AHV, KVM]
V3[Eager Zeroed Thick VMDKs<br/>DRS anti-affinity]
end
subgraph CE[Cloud Edition]
C1[AWS AMI / Azure Marketplace]
C2[3-node minimum production]
C3[S3 / Blob tiering]
end
subgraph Robo[Robo Edition]
R1[Single-node VE variant]
R2[Branch office hypervisor]
R3[Replicates to hub cluster]
end
Physical -->|Primary on-prem fabric| Hub[Helios Fleet Management]
VE -->|Lab, dev/test| Hub
CE -->|Cloud DR + cloud-native| Hub
Robo -->|Edge protection| Hub
Key Takeaway: Physical, VE, and CE share the same iris_cli bootstrap mechanics but differ in their resiliency model: physical and CE use clustered SpanFS; single-node VE expands via in-place disk resize and depends on hypervisor-level HA for availability.
Day-2 Operations
Once the cluster is alive and protecting data, the architect’s job pivots from creation to stewardship. Day-2 operations cover the recurring activities: software upgrades, capacity expansions, hardware replacement, and proactive health management.
Cluster Upgrades and Rolling Reboots
Cohesity ships a one-click rolling upgrade. The cluster pulls candidate releases automatically from Cohesity’s public release service — there is no manual download step — and the architect chooses a target version from a UI-filtered list of compatible packages [Source: https://www.cohesity.com/blogs/cohesity-cluster-upgrades/].
The mechanics are elegant. A distributed lock manager hands a single token from node to node:
- The token-holder pauses local services and migrates its VIPs to peers.
- Active client connections continue against the surviving VIPs (UI sessions, SMB mounts, NFS exports, ongoing backups).
- The node atomically swaps its active and passive root partitions and reboots into the new image.
- Once healthy, it releases the token to the next node.
Backups, replication, and indexing keep running throughout, and RPO/RTO/security posture is preserved [Source: https://www.cohesity.com/blogs/cohesity-cluster-upgrades/]. The atomic root-partition swap also enables fast rollback at the boot level: if a node fails to come up on the new image, it boots back into the previous root.
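The token-passing sequence can be modeled as a small simulation, which makes the "one node at a time" invariant explicit. This Python sketch is a toy model under stated assumptions, not Cohesity's implementation: `rolling_upgrade` is a hypothetical function, VIPs are spread round-robin across peers, and VIPs are assumed to return home once the node is healthy.

```python
def rolling_upgrade(vips_by_node, target_version):
    """Simulate a serialized rolling upgrade.

    vips_by_node maps node name -> list of VIPs it normally hosts.
    Returns (event_log, versions_by_node).
    """
    nodes = list(vips_by_node)
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least one peer to hold VIPs")
    versions = {node: "previous" for node in nodes}
    log = []
    for holder in nodes:  # the token visits exactly one node at a time
        peers = [n for n in nodes if n != holder]
        # Spread the holder's VIPs across its peers so clients stay connected.
        hosting = {vip: peers[i % len(peers)]
                   for i, vip in enumerate(vips_by_node[holder])}
        log.append((holder, "vips-migrated", hosting))
        # Stand-in for: pause services, swap root partitions, reboot.
        versions[holder] = target_version
        log.append((holder, "upgraded", target_version))
        # Token is released; this sketch assumes VIPs then return home.
    return log, versions
```

Because the loop upgrades a single token-holder per iteration, at most one node is ever out of service, which is the property that keeps backups and client sessions running.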
Figure 5.3: Rolling upgrade token-passing sequence across cluster nodes
sequenceDiagram
participant Helios
participant Cluster as Cluster Coordinator
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
Helios->>Cluster: Initiate upgrade to target version
Cluster->>Cluster: Run pre-upgrade checks
Cluster->>N1: Hand upgrade token
N1->>N2: Migrate VIPs to peers
N1->>N1: Pause services, swap root partition, reboot
N1->>Cluster: Healthy on new image, release token
Cluster->>N2: Hand upgrade token
N2->>N3: Migrate VIPs to peers
N2->>N2: Pause, swap root partition, reboot
N2->>Cluster: Healthy, release token
Cluster->>N3: Hand upgrade token
N3->>N1: Migrate VIPs to peers
N3->>N3: Pause, swap root partition, reboot
N3->>Cluster: Healthy, release token
Cluster->>Helios: Upgrade complete, all nodes on new version
Analogy: A Cohesity rolling upgrade is like a Roomba in a house full of guests. It works through one room (node) at a time while the guests (client I/O) carry on in the others. The Roomba doesn’t ask everyone to leave the house; it just navigates around them.
The 7.x UI introduces explicit pre-upgrade checks under the Upgrade tab and supports uploading a CRL (Certificate Revocation List) file when needed [Source: https://www.youtube.com/watch?v=-HjnGFgU_uA]. Architects should always:
- Open Platform > Cluster, confirm green health.
- Run pre-upgrade checks; fix any flagged issues (clock drift, certificate expiry, disk warnings).
- Initiate the upgrade from Platform > Admin > Upgrade Cluster.
- After completion, verify with iris_cli cluster get-version and spot-check a few backup jobs and a restore.
A noteworthy historical event: Cohesity 6.8.2_u1 migrated the underlying OS from CentOS 7.9 to RHEL 7.9 because of CentOS’s June 30, 2024 EOL. The upgrade was self-driven and low-risk because RHEL 7.9 is binary-compatible with CentOS 7.9 [Source: https://www.xiologix.com/20240701-cohesity-rhel-update/].
Adding and Removing Nodes
Cluster expansion is symmetric to bootstrap. New nodes are imaged, racked, cabled, and presented to the cluster via iris_cli discover free-nodes. The expansion command (UI-driven or CLI) appends the discovered nodes to the existing cluster, after which SpanFS rebalances chunk placement to spread load.
Removing a node is a multi-stage drain:
- Mark the node for removal in the UI or via iris_cli.
- SpanFS migrates chunk replicas off the node to maintain Replication Factor (RF) or Erasure Coding (EC) parity.
- Once drained, the node leaves the quorum and can be physically removed.
Architects must size for rebuild headroom: when a node fails or is removed, the cluster needs free capacity equal to the failed node’s data footprint to re-protect blocks. Without headroom, the cluster operates with degraded resiliency until capacity is added.
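The headroom rule can be checked with simple arithmetic. The Python sketch below is a simplification under stated assumptions: it compares the failed node's used capacity with the free space on the survivors and ignores RF/EC overhead, deduplication, and any reserved space. `has_rebuild_headroom` and its dictionary inputs are hypothetical.

```python
def has_rebuild_headroom(node_used_tib, node_capacity_tib, failed_node):
    """True if surviving nodes have enough free space to absorb the failed node's data.

    node_used_tib and node_capacity_tib map node name -> TiB used / TiB usable.
    """
    footprint = node_used_tib[failed_node]
    free_elsewhere = sum(
        node_capacity_tib[node] - used
        for node, used in node_used_tib.items()
        if node != failed_node
    )
    return free_elsewhere >= footprint
```

Four 80 TiB nodes at 60 TiB used each have exactly 60 TiB free on the three survivors, so losing one node can still be re-protected; at 70 TiB used each, it cannot.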
Disk and Node Replacement Procedures
Disk failures are common; node failures are rare. Both are handled via Cohesity’s hardware replacement workflow:
- Disk replacement: the failure is identified via a Helios alert or a failed disk reported by iris_cli cluster status. A field engineer hot-swaps the disk; the cluster auto-formats it, integrates it into SpanFS, and rebuilds chunk replicas onto it.
- Node replacement: more involved. The replacement node arrives pre-imaged at the same OS version. The architect drains the failed node, physically swaps it, and uses iris_cli (or the UI) to re-add the new node. SpanFS rebuilds replicas; rebuild time scales with node capacity and cluster network throughput.
For appliance customers, the Cohesity Hardware Refresh Service handles tech-refresh cycles (5-7 year lifespan typical) by overlapping old and new clusters during data migration [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-hardware-refresh-service-data-sheet-en.pdf].
Health Checks and Proactive Support
Cohesity emits a daily Heartbeat log bundle that uploads to Cohesity Support over HTTPS. Heartbeat includes cluster configuration, service health, recent alerts, and capacity metrics — enough for support engineers to triage proactively without requesting fresh log bundles. Helios then surfaces alerts globally; the operator’s role is to triage alerts in priority order:
| Severity | Examples | Action |
|---|---|---|
| Critical | Quorum loss, full capacity, multiple disk failures | Page on-call; engage support |
| Major | Single disk failure, replication lag, license expiry | Same-day investigation |
| Warning | Job failures, certificate expiry < 30 days | Plan resolution |
| Info | Successful upgrade, scheduled maintenance | Acknowledge |
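The severity table maps naturally to a triage ordering. In the Python sketch below, the alert dictionaries are a hypothetical shape (not the Helios alert schema), and `SEVERITY_ORDER`, `ACTIONS`, and `triage` are illustrative names built from the table above.

```python
# Severity ranks and actions taken from the table above.
SEVERITY_ORDER = {"Critical": 0, "Major": 1, "Warning": 2, "Info": 3}
ACTIONS = {
    "Critical": "Page on-call; engage support",
    "Major": "Same-day investigation",
    "Warning": "Plan resolution",
    "Info": "Acknowledge",
}


def triage(alerts):
    """Return alerts sorted most-severe first, each annotated with its action."""
    ordered = sorted(alerts, key=lambda alert: SEVERITY_ORDER[alert["severity"]])
    return [dict(alert, action=ACTIONS[alert["severity"]]) for alert in ordered]
```

Python's sort is stable, so alerts of equal severity keep their arrival order.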
Key Takeaway: Day-2 is dominated by rolling upgrades (one-click, zero-downtime via VIP failover and atomic root-partition swaps), node and disk replacement (drain-rebuild cycles requiring rebuild-capacity headroom), and proactive monitoring through the Heartbeat log bundle and Helios alerts.
Automation and APIs
Click-ops works for one cluster. At fleet scale (dozens of clusters across regions and clouds), every operation must be expressible as code. Cohesity exposes several programmable interfaces, grouped below into four families, each with its own audience and idioms.
iris_cli Command Groups
iris_cli is the on-cluster shell binary. Architects invoke it from any node’s CLI (or remotely via SSH) for tasks where the UI is too slow or scripting is required. Major command groups include:
| Group | Purpose | Example |
|---|---|---|
| cluster | Bootstrap, status, version | iris_cli cluster get-version |
| node | Per-node operations | iris_cli node list disk extend |
| disk | Disk inventory, extend | iris_cli disk list |
| interface | Network config | iris_cli interface list |
| vlan | VLAN/VIP management | iris_cli vlan list |
| partition | Cluster partition admin | iris_cli partition list |
| protection-job | Backup job control | iris_cli protection-job list |
| recovery | Restore operations | iris_cli recovery list |
The CLI Reference Guide remains the authoritative source, with versioned PDFs published per release (e.g., 7.3.2) [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].
REST API v1 vs. v2 and Helios APIs
Cohesity exposes three REST surface areas:
- REST API v1: cluster-local, organized around legacy resource names (jobs, runs, sources). Still in heavy use; some endpoints have no v2 equivalent.
- REST API v2: cluster-local, redesigned around object-oriented resources (protection groups, recoveries, sources). Preferred for new automation.
- Helios API: SaaS-side, scoped to a customer tenant and aggregated across all registered clusters. Enables fleet-level operations (create a policy across 50 clusters) and is the only path to Helios-only features (DataHawk, FortKnox, SiteContinuity orchestration).
Authentication patterns differ: cluster APIs accept username/password or API key against the cluster directly; Helios uses a per-tenant API key issued from the Helios console and includes a clusterId filter for fan-out calls.
Architects choose v2 for new code, v1 only when an endpoint is missing in v2, and Helios for any cross-cluster orchestration.
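That selection rule reduces to a short decision function. The Python sketch below only encodes the guidance in this section; `choose_api` and its boolean parameters are hypothetical and make no calls to any Cohesity endpoint.

```python
def choose_api(cross_cluster, helios_only_feature, available_in_v2):
    """Pick an API surface following the guidance in this section.

    cross_cluster: the operation spans more than one cluster.
    helios_only_feature: the feature exists only in Helios.
    available_in_v2: the needed endpoint exists in REST API v2.
    """
    if cross_cluster or helios_only_feature:
        return "helios"
    # Cluster-local work: prefer v2, fall back to v1 only when v2 lacks the endpoint.
    return "v2" if available_in_v2 else "v1"
```

For example, a single-cluster task whose endpoint is missing from v2 resolves to v1, while any fleet-wide policy rollout resolves to Helios.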
PowerShell Module and Ansible Collection
The Cohesity PowerShell module wraps both v1 and v2 endpoints with idiomatic PowerShell verbs (Get-CohesityProtectionJob, New-CohesityRecoveryRequest). Windows-centric teams often standardize on it for Windows Server, SQL, and M365 backup automation.
The Cohesity Ansible collection delivers the same automation as Ansible modules consumed in playbooks. It fits naturally into Linux-heavy and infrastructure-as-code shops where Ansible already orchestrates VMware, AD, and network changes.
A representative Ansible task:
- name: Create VMware protection group
  cohesity.dataprotect.cohesity_protection_group:
    cluster: "{{ cohesity_vip }}"
    username: "{{ cohesity_user }}"
    password: "{{ cohesity_pass }}"
    name: "tier1-vms-daily"
    environment: "VMware"
    policy: "Gold"
    sources:
      - vcenter01
    vm_tags:
      - "tier:1"
    state: present
Terraform Provider for Cohesity
The Terraform provider treats Cohesity resources (policies, protection groups, sources, views, RBAC roles) as declarative state. It is the right tool when:
- Backup configuration must be reproducible across dev/test/prod environments.
- Infrastructure pipelines (Terraform Cloud, GitHub Actions, GitLab CI) gate Cohesity changes alongside VMware and cloud changes.
- An audit trail of “who changed which protection policy when” is required — Terraform commits live in Git.
A typical Terraform stanza:
resource "cohesity_protection_policy" "gold" {
  name = "Gold"
  retention {
    unit     = "Days"
    duration = 30
  }
  incremental_schedule {
    unit      = "Hours"
    frequency = 4
  }
  full_schedule {
    unit = "Weeks"
    day  = "Sunday"
  }
}
Figure 5.4: Cohesity automation stack layered above REST API surfaces
graph TD
subgraph Tools[Operator-Facing Automation]
TF[Terraform Provider<br/>declarative state]
ANS[Ansible Collection<br/>playbook tasks]
PS[PowerShell Module<br/>Windows-centric]
IRIS[iris_cli<br/>on-cluster shell]
end
subgraph APIs[REST Surface Areas]
V1[REST API v1<br/>legacy resources]
V2[REST API v2<br/>object-oriented]
HEL[Helios API<br/>fleet-wide tenant scope]
end
subgraph Targets[Cluster Targets]
CL1[Cluster A]
CL2[Cluster B]
CL3[Cluster C]
end
TF --> V2
TF --> HEL
ANS --> V2
ANS --> V1
PS --> V2
PS --> V1
IRIS --> CL1
V1 --> CL1
V1 --> CL2
V2 --> CL1
V2 --> CL2
V2 --> CL3
HEL --> CL1
HEL --> CL2
HEL --> CL3
Analogy: Pick the API like you pick a kitchen tool. iris_cli is the chef’s knife: sharp, fast, on-cluster. REST is the food processor: bulk operations from outside. PowerShell is the rice cooker for Windows shops. Ansible and Terraform are the meal-prep system for the whole week.
Key Takeaway: Cohesity’s automation stack layers iris_cli (on-cluster shell) under REST APIs (v1 legacy, v2 modern, Helios fleet-wide), with PowerShell, Ansible, and Terraform wrappers for the configuration management style of the operating team.
Chapter Summary
Bootstrapping a Cohesity cluster is a tightly choreographed sequence: image nodes, access them via IPMI, configure the first node’s network with configure_network.sh or iris_cli, discover peers with iris_cli discover free-nodes, and atomically form the cluster (and its first partition) with iris_cli cluster create. The same mechanics apply across physical, Virtual Edition, and Cloud Edition deployments, but each form factor brings unique constraints — VMDK provisioning and DRS anti-affinity for VE, Azure core quotas and managed-disk multiples for CE, AWS Marketplace AMIs and S3 tiering for AWS CE.
Day-2 operations rely on Cohesity’s hallmark rolling upgrade: a distributed lock manager serializes per-node reboots, VIPs migrate so backups and replication never pause, and atomic root-partition swaps allow fast rollback. Capacity grows by adding nodes (or by disk extend on single-node VE), hardware replacement follows drain-rebuild semantics, and the daily Heartbeat log bundle plus Helios alerts give operators proactive visibility.
For automation at scale, architects choose among iris_cli, REST API v1/v2, the Helios API, PowerShell, Ansible, and Terraform — matching the tool to the team’s existing operating model rather than imposing a new one. The CCAE exam expects fluency across all six.
Key Terms
- Bootstrap: the one-shot operation that gives nodes IPs, discovers peers, and forms the initial cluster quorum and partition via iris_cli cluster create.
- iris_cli: Cohesity’s on-cluster command-line shell, organized into command groups (cluster, node, disk, interface, protection-job, recovery) and the authoritative scripting interface for cluster operations.
- Virtual Edition (VE): Cohesity SpanFS packaged as a virtual appliance (OVA/VHDX) for VMware, Hyper-V, AHV, KVM; runs as single-node (lab/ROBO, expand by disk resize) or 3+ node clusters.
- Cloud Edition (CE): Cohesity SpanFS packaged as cloud Marketplace VMs for AWS, Azure, GCP; minimum 3 nodes for production, with cloud-specific sizing constraints (e.g., Azure Standard_DS5_v2, 64 VMs/resource group).
- Helios API: SaaS-side REST surface scoped to a customer tenant, enabling fleet-wide operations across all registered clusters and exclusive access to DataHawk, FortKnox, and SiteContinuity orchestration.
- Rolling upgrade: Cohesity’s zero-downtime upgrade mechanism, in which a distributed lock token serializes per-node reboots while VIPs migrate to keep services online; atomic active/passive root-partition swaps enable fast rollback.
- Brick: a node-level fault domain used in chassis-aware placement; in dense multi-node-per-chassis hardware, brick mode tells SpanFS to treat each node as an independent fault domain so chunk replicas survive a chassis power failure.
Chapter 6: Identity, Access Management, and Multi-Tenancy
Securing a Cohesity DataPlatform is fundamentally a problem of identity. Who can log in? With what privileges? Against which resources? In a service-provider deployment, how do you guarantee that Tenant A cannot even see Tenant B’s backups? This chapter examines how Cohesity authenticates users (local, AD/LDAP, SAML SSO, MFA, API keys), authorizes them through role-based access control (RBAC) layered with access scopes, and how Organizations, View Boxes, and per-tenant VLANs combine to deliver multi-tenancy on a shared cluster.
The CCAE exam emphasizes architecture-level decisions: when to share a Storage Domain, when to dedicate one, which SAML attribute Cohesity uses for role mapping, and what AD FS cannot do that Okta and Azure AD can. The pitfalls are subtle and silent — we will mark each one explicitly.
Learning Objectives
By the end of this chapter you will be able to:
- Design RBAC models using Cohesity’s built-in roles and custom roles, layered with access scopes for least-privilege enforcement.
- Integrate Cohesity DataProtect and Helios with Active Directory, LDAP, and SAML 2.0 identity providers including Microsoft Entra ID (Azure AD), Okta, AD FS, and Ping.
- Architect multi-tenant deployments using Organizations, View Box (Storage Domain) isolation, per-tenant VLANs, and hierarchical Organizations.
- Apply the principle of least privilege across operators, tenant administrators, automation service accounts, and API consumers.
- Recognize and avoid common pitfalls — Login-vs-Email attribute precedence, nested-AD-group limitations, AD FS signed-auth-request restrictions, and silent SSO login rejections caused by missing default roles.
Authentication Sources
Authentication answers “are you who you claim to be?” Cohesity supports four paths: local users, AD/LDAP, SAML SSO, and API keys (with optional MFA layered on top). Most enterprise designs use AD or SAML for humans, local accounts for break-glass, and API keys for automation.
Local Users vs. AD/LDAP
A local user is an account whose password is stored directly in Cohesity. Local accounts provide a recovery path when the corporate IdP is unavailable — you do not want a domain controller outage to lock you out of your backup system. The original admin account created during bootstrap is local. Best practice is to keep one or two named break-glass admins, enforce strong passwords and MFA, and audit them closely [Source: https://www.cohesity.com/blogs/role-based-access-control-rbac-cohesity-dataprotect-4-0/].
Active Directory and LDAP are the workhorses for daily authentication. When joined to AD, Cohesity validates users against the domain controller. Group membership drives role assignment: assign roles to AD groups (e.g., cohesity-operators) rather than to individuals. Hybrid Azure AD environments running AAD Connect typically resolve users by sAMAccountName to keep on-premises and cloud names aligned [Source: https://docs.cohesity.com/baas/Helios/azure.htm].
SAML SSO with Modern Identity Providers
For organizations that have standardized on a cloud IdP — Microsoft Entra ID (formerly Azure AD), Okta, Ping Identity, JumpCloud, OneLogin, Duo SSO, RSA SecurID Access, ADSelfService Plus, IBM Security Verify, CyberArk Workforce Identity, or Thales SafeNet — Cohesity speaks SAML 2.0 [Source: https://docs.cohesity.com/baas/Helios/SingleSignOn.htm]. SAML establishes a trust triangle:
| Component | Role | Cohesity equivalent |
|---|---|---|
| Identity Provider (IdP) | Authenticates the user, issues a signed SAML assertion | Azure AD, Okta, Ping, AD FS |
| Service Provider (SP) | Consumes the assertion, makes authorization decisions | Cohesity cluster or Helios |
| User | Presents credentials to the IdP, redirected to SP | Backup admin, tenant operator |
Authentication can be IdP-initiated (user starts at the IdP portal and clicks the Cohesity tile) or SP-initiated (user clicks “Sign in with SSO” on the Cohesity login page) [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf].
When Cohesity receives a SAML assertion it performs four validations: signature verification, temporal validity (NotBefore / NotOnOrAfter), recipient validation against the cluster’s ACS URL, and identity-attribute extraction.
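The four checks can be sketched in a few lines of Python. This is a minimal illustration, not Cohesity’s implementation: the `Assertion` class and `validate_assertion` function are hypothetical, XML parsing and signature verification are assumed to happen elsewhere, and the signature result is simply passed in as a flag.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Assertion:
    """Hypothetical, already-parsed view of a SAML assertion."""
    signature_valid: bool        # outcome of XML signature verification (done elsewhere)
    not_before: datetime
    not_on_or_after: datetime
    recipient: str               # Recipient the IdP addressed the assertion to
    attributes: dict = field(default_factory=dict)


def validate_assertion(a: Assertion, acs_url: str, now: datetime) -> list:
    """Return the names of failed checks; an empty list means the assertion passes."""
    failures = []
    if not a.signature_valid:
        failures.append("signature")
    # Valid only while NotBefore <= now < NotOnOrAfter.
    if not (a.not_before <= now < a.not_on_or_after):
        failures.append("temporal validity")
    # The recipient must match the ACS URL exactly.
    if a.recipient != acs_url:
        failures.append("recipient")
    # An identity attribute (Login or Email) must be present.
    names = {name.lower() for name in a.attributes}
    if not names & {"login", "email"}:
        failures.append("identity attribute")
    return failures
```

Returning every failed check, rather than stopping at the first, mirrors how you would troubleshoot a rejected login: clock skew and an ACS URL mismatch often occur together.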
Figure 6.1: SAML SSO authentication flow (SP-initiated) between user, Cohesity, and the IdP.
sequenceDiagram
autonumber
participant U as User Browser
participant C as Cohesity (SP)
participant I as IdP (Okta / Azure AD)
U->>C: 1. GET /login (Sign in with SSO)
C-->>U: 2. 302 Redirect with SAMLRequest
U->>I: 3. SAMLRequest + user credentials
I->>I: 4. Authenticate user + MFA
I-->>U: 5. Signed SAML Response (assertion)
U->>C: 6. POST /idps/authenticate (SAMLResponse)
C->>C: 7. Validate signature, NotBefore/NotOnOrAfter, ACS URL
C->>C: 8. Extract Login/Email + Groups, map to Role + Scope
C-->>U: 9. Session cookie / Helios JWT
Configuration lives at Settings > Access Management > Single Sign-On > Configure SSO:
| Field | Source | Notes |
|---|---|---|
| Protocol | Cohesity | Choose SAML |
| SSO Domain | Architect | e.g., corp.example.com; routes users by email domain when multiple IdPs are configured |
| SSO Provider | Cohesity | Dropdown — Microsoft Entra ID, Okta, JumpCloud, OneLogin, Ping, Duo SSO, RSA SecurID, etc. |
| Single Sign-On URL | IdP | The IdP’s SSO endpoint |
| Provider Issuer ID | IdP | The IdP’s Entity ID |
| X.509 Certificate | IdP | Must be PEM format — Cohesity rejects DER/CER |
| Sign Auth Request | Optional | Requires uploading Cohesity’s public cert to IdP. AD FS does not support this with Cohesity |
| Default Role | Architect | Fallback for users not in any mapped SSO group; if neither default role nor SSO groups are configured, login is rejected |
| Access to Clusters | Architect | All clusters or limited subset |
| Assign to Organization | Optional | For multi-tenant scoping |
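The Default Role row is easier to remember as a decision procedure. The sketch below is an assumption-laden model rather than product code: `resolve_sso_roles` is a hypothetical helper, and it treats "no mapped group and no Default Role" as a rejected login, which is how this chapter describes the behavior.

```python
from typing import Optional


def resolve_sso_roles(user_groups, group_role_map, default_role: Optional[str]):
    """Return the set of roles for an SSO login, or None if the login is rejected."""
    # Only groups that appear in the assertion AND are mapped on the cluster grant roles.
    roles = {group_role_map[g] for g in user_groups if g in group_role_map}
    if roles:
        return roles
    if default_role is not None:
        return {default_role}   # fallback for users in no mapped group
    return None                 # nothing mapped and no Default Role: reject
```

For example, with `{"cohesity_acme_ops": "Operator"}` as the mapping, a user whose assertion carries only an unmapped group gets the Default Role if one is set, and is turned away otherwise.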
You give the IdP Cohesity’s SAML endpoint, which differs by deployment type:
- Self-managed cluster: https://<cluster_fqdn>/idps/authenticate
- Helios: https://helios.cohesity.com/v2/mcm/idp/authenticate
The Identifier (Entity ID) and Reply URL (ACS URL) must both equal that endpoint exactly. A mismatch produces the dreaded “Subject confirmation validation failed” error, which is by far the most common SAML setup failure [Source: https://www.veritas.com/support/en_US/article.100053273].
SAML pitfall #1: Login attribute beats Email. Cohesity expects either an Email or Login SAML attribute for user identity. If both are present, Cohesity uses Login for role mapping and ignores Email entirely [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf]. Architects who assume Email “wins” are surprised when role bindings appear to be applied against the wrong identity. SAML attribute names are not case-sensitive in Cohesity’s processing, but the precedence of Login over Email is strict.
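A short sketch makes the precedence rule concrete. `select_identity` is a hypothetical function that models only what this pitfall states: attribute names are matched without regard to case, and Login is chosen whenever it is present.

```python
def select_identity(attributes: dict) -> str:
    """Pick the identity value as described above: Login wins over Email."""
    # Normalize attribute NAMES only; values are left untouched.
    by_name = {name.lower(): value for name, value in attributes.items()}
    if "login" in by_name:
        return by_name["login"]
    if "email" in by_name:
        return by_name["email"]
    raise ValueError("assertion carries neither Login nor Email")
```

If role bindings were created against email addresses but the IdP also sends Login, this function returns the Login value, which is exactly the mismatch that surprises architects.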
SAML pitfall #2: Nested AD groups are not supported. When you configure the Groups claim in Azure AD, you must select “Groups assigned to the application.” This restricts the SAML response to only those groups directly assigned to the Cohesity app. Nested or sub-groups (a user being a member of cohesity-ops-na, which is itself a member of cohesity-ops-global) are not expanded by Azure AD into the assertion [Source: https://docs.cohesity.com/baas/Helios/azure.htm]. The fix is to assign the leaf group(s) the user actually belongs to directly to the application. The same advice applies to Okta: use a flat group filter (e.g., “Starts with cohesity_”) and avoid relying on group nesting.
SAML pitfall #3: AD FS does not support signed auth requests. Cohesity’s “Sign Auth Request” option signs the SP’s authentication request before it is sent to the IdP, which adds integrity protection. Okta and Azure AD support it (in Okta, set Signature Algorithm = RSA-SHA256 and Digest Algorithm = SHA256); AD FS does not support it with Cohesity [Source: https://docs.cohesity.com/baas/Helios/adfs.htm]. If your IdP is AD FS, leave this option off.
Azure AD integration starts at Azure AD > Enterprise applications > Create your own application > Integrate any other application [Source: https://docs.cohesity.com/baas/Helios/azure.htm]. Group-claim source attribute depends on posture:
- Hybrid Azure AD with AAD Connect (v1.2.70.0+): use sAMAccountName to match on-prem names.
- Cloud-only Azure AD: use Group ID (object identifiers).
Okta integration is at Okta admin > Applications > Create App Integration > SAML 2.0 [Source: https://docs.cohesity.com/baas/Helios/okta.htm]. Single Sign-On URL and Audience URI both equal the Cohesity ACS URL. Attribute Statements map Email -> user.email and Login -> user.login. Group Attribute Statements use name groups with a regex or “Starts with cohesity_” filter. Okta delivers the certificate in .cert format which must be converted to .pem before upload.
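Because Cohesity accepts only PEM, it helps to normalize whatever certificate file the IdP hands you before upload. The sketch below uses Python’s standard `ssl` module; `to_pem` is a hypothetical helper, and it assumes that any input not already starting with the PEM marker is raw DER. It re-encodes the bytes and does not verify that they form a valid X.509 certificate.

```python
import ssl

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def to_pem(cert_bytes: bytes) -> str:
    """Return PEM text for a certificate supplied as either PEM or DER bytes."""
    # PEM is ASCII text that starts with the BEGIN marker; pass it through unchanged.
    if cert_bytes.lstrip().startswith(PEM_HEADER.encode("ascii")):
        return cert_bytes.decode("ascii")
    # Otherwise assume raw DER and wrap it as base64 between PEM markers.
    return ssl.DER_cert_to_PEM_cert(cert_bytes)
```

A typical use is `to_pem(open("idp.cert", "rb").read())`, where `idp.cert` is a placeholder file name; write the result to a `.pem` file and upload that.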
MFA and API Key Authentication
MFA is best enforced at the IdP — Azure Conditional Access, Okta MFA, or Duo SSO. Cohesity also supports MFA for local users via TOTP. Compliance-driven deployments (HIPAA, PCI, FedRAMP) combine MFA with quorum approval (Chapter 11) for destructive operations.
API keys authenticate automation: Ansible, Terraform, PowerShell, CI/CD. An API key inherits exactly the privileges of the issuing user — there is no separate per-key permission set [Source: https://docs.cohesity.com/baas/data-protect/access-managment/manage-users-and-groups.htm]. For least-privilege automation, create a dedicated service-account user with a narrowly-scoped custom role, then issue the key from that account. A nightly recovery-validation script runs as svc-recover-validate with Recover only, not as an admin.
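The inheritance rule can be modeled in a few lines. This is a conceptual sketch only: `ServiceAccount`, `ApiKey`, and the privilege names are hypothetical and do not correspond to Cohesity API objects. The point it illustrates is that the key has no permission set of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAccount:
    name: str
    role: str
    privileges: frozenset   # illustrative privilege names, not Cohesity's catalog


@dataclass(frozen=True)
class ApiKey:
    label: str
    issuer: ServiceAccount

    def allows(self, privilege: str) -> bool:
        # No per-key permission set: the key defers entirely to its issuer.
        return privilege in self.issuer.privileges


svc = ServiceAccount("svc-recover-validate", "Recover-only", frozenset({"recover"}))
key = ApiKey("nightly-validation", svc)
```

Here `key.allows("recover")` is true and `key.allows("edit_policy")` is false. The only way to narrow a key is to narrow the account that issues it, which is why the dedicated service account comes first.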
Certificate-based authentication exists for cluster-to-cluster and cluster-to-IdP trust. Avoid wildcard certificates, use individual certs per cluster, and track expiry — IdP signing cert rotation without a Cohesity update is the #2 most common SSO failure mode [Source: https://www.cohesity.com/blogs/updating-ssl-certificates-on-cohesity-clusters/].
Key Takeaway: Authentication design is a layered choice. Local accounts exist for break-glass; AD/LDAP and SAML SSO handle daily human access; API keys handle automation and inherit the issuing user’s role. The dangerous defaults to remember are: SAML uses Login over Email when both are present, nested AD groups do not expand in SAML assertions, AD FS cannot sign auth requests, and an SSO login with no default role and no group mapping is rejected outright rather than allowed in.
RBAC and Roles
Authorization answers “what may you do?” Cohesity’s model has three primitives: principals (users/groups), roles (privilege sets), and access scopes (resource boundaries). They combine multiplicatively — an identity has a role, and that role applies only to resources within the assigned scope.
Built-in Roles
Cohesity DataProtect ships with a comprehensive set of built-in roles. The table below shows the canonical set [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/access-management.htm] [Source: https://docs.cohesity.com/baas/data-protect/access-managment/manage-users-and-groups.htm]:
| Role | Description | Typical Persona |
|---|---|---|
| Super Admin | Full access to all actions and workflows; manages other admins, roles, identity providers, and cluster configuration. | Cluster owner, platform team lead |
| Admin | Equivalent to Super Admin in many contexts; full management privileges. | Backup operations lead |
| Viewer | Read-only access across all workflows. Cannot run jobs, recover, or change configuration. | Auditor, security reviewer, observability tool |
| Operator | Viewer privileges plus the ability to run Protection Groups and create Recover Tasks. Cannot create/edit policies. | Daily backup operator |
| Data Security | Self-Service Data Protection plus the ability to create DataLock Views and set retention/expirations. | Compliance officer, ransomware response lead |
| Gaia Admin | Self-Service Gaia (search/AI) privileges: view and manage details and results. | Data discovery / e-discovery lead |
| Gaia Viewer | Query/read-only access in Gaia. | Investigator |
| High Classified | Privileged read access to fetch cluster details for specific API calls. | Auditing automation |
| SMB Backup Operator | SMB backup and restore privileges only. | Windows file-services admin |
| Self-Service | Viewer privileges plus the ability to manage Clones, Protection Groups, Policies, and Recover Tasks. | App-team self-service user |
| DR Admin | Viewer privileges plus the ability to create and manage DR workflows and tasks (failover, failback, runbooks). | DR architect, BCP team |
| Replication | Limited to setting up and replicating data to other clusters. | Cross-cluster service account |
| Cohesity Support Admin | Used by Cohesity Support to create a Super Admin if customer admin access is lost. | Vendor support break-glass |
The rough hierarchy: Viewer < Operator < Self-Service < DR Admin / Data Security < Admin / Super Admin. Operator can run existing groups; Self-Service can also create them. DR Admin specializes in failover/failback without permitting protection-policy edits — useful for issuing a DR runbook engineer a role that excludes backup-frequency changes.
Custom Roles and Granular Privileges
Architects designing for least privilege almost always create custom roles via Settings > Access Management > Roles > Add Custom Role and select privileges from a checklist [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/initial-setup/access-management.htm]. Access Management privileges are pre-selected by default; un-check them for non-admin roles to prevent privilege escalation.
Canonical custom-role examples:
- vSphere Backup Operator — Operator privileges scoped to VMware only.
- M365 Recovery Operator — Recovery-only on Microsoft 365 sources.
- NAS Self-Service — Self-Service on NAS protection groups only.
- Audit Read — Read-only plus audit-log download.
- Replication Service Account — Replication role only.
Custom roles are editable later and auditable like built-in roles [Source: https://www.cohesity.com/blogs/role-based-access-control-rbac-cohesity-dataprotect-4-0/].
Access Scopes — The Layer That Makes Least-Privilege Real
Roles alone tell Cohesity what an identity may do, but not where. Access Scopes layer on top of roles and constrain the resources a role applies to [Source: https://docs.cohesity.com/baas/data-protect/access-managment/access-scope.htm]:
| Scope Dimension | Purpose | Example |
|---|---|---|
| Source Level | A specific item: one vCenter, one SQL host, one NAS share. | “Operator on vCenter prod-vc01 only” |
| Source Type | A class of source. | “Operator on Microsoft 365 only” |
| Region | A geographic or logical region. | “DR Admin on Region eu-west only” |
| Service Level | A service: DataProtect, DR, etc. | “Self-Service on DataProtect only, no SmartFiles” |
Multiple scopes combine (“Operator on VMware AND NAS in Region us-east”). Auto Assign brings newly added matching resources into scope automatically — useful for tenants with growing inventories.
The combination produces real least privilege. Example: Role Operator + Access Scope (Source Type = VMware, Source Level = vCenter-A only) — the principal runs backups and recoveries only on vCenter-A and cannot browse anything else. This is the foundation of MSP tenant isolation.
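The vCenter-A example can be expressed as a two-part check: the role must grant the action, and the scope must cover the resource. The sketch below is illustrative only; `ROLE_PRIVILEGES`, `AccessScope`, and `is_allowed` are hypothetical names, and the privilege strings are a simplified stand-in for Cohesity’s real privilege list.

```python
from dataclasses import dataclass
from typing import Optional

ROLE_PRIVILEGES = {  # illustrative subset, not Cohesity's privilege catalog
    "Viewer": {"view"},
    "Operator": {"view", "run_protection_group", "recover"},
}


@dataclass(frozen=True)
class AccessScope:
    source_type: Optional[str] = None   # None means "any source type"
    source: Optional[str] = None        # None means "any source"

    def covers(self, source_type: str, source: str) -> bool:
        return (self.source_type in (None, source_type)
                and self.source in (None, source))


def is_allowed(role, scope, action, source_type, source) -> bool:
    """Both must hold: the role grants the action AND the scope covers the resource."""
    return (action in ROLE_PRIVILEGES.get(role, set())
            and scope.covers(source_type, source))
```

With `AccessScope("VMware", "vCenter-A")`, an Operator can recover on vCenter-A but not on vCenter-B, and still cannot edit policies anywhere, because that action is not in the role.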
Figure 6.2: RBAC hierarchy — principals bind to roles, roles carry privileges, access scopes constrain where the role applies.
graph TD
U1[User: alice@acme.com]
U2[User: svc-recover-validate]
G1[AD/SSO Group: cohesity_acme_ops]
G2[AD/SSO Group: cohesity_acme_dr]
R1[Role: Operator]
R2[Role: DR Admin]
R3[Custom Role: Recover-only]
P1[Privileges: Run PG, Recover, View]
P2[Privileges: Failover, Failback, Runbooks]
P3[Privileges: Recover only]
S1[Access Scope: Source Type = VMware]
S2[Access Scope: Region = us-east]
S3[Access Scope: Org = acme-corp]
U1 --> G1
U1 --> G2
G1 --> R1
G2 --> R2
U2 --> R3
R1 --> P1
R2 --> P2
R3 --> P3
R1 --> S1
R1 --> S3
R2 --> S2
R2 --> S3
R3 --> S1
R3 --> S3
Auditing Role Assignments
All role assignments and authentication events are logged and can be exported via Syslog or REST API to a SIEM [Source: https://docs.cohesity.com/baas/data-protect/audit-logs-dataprotect.htm]. Review assignments quarterly: identify Super Admin-equivalents, dormant service accounts, and drifted AD groups. A user with both individual and group role assignments inherits the union of both [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/integrate-azure-ad-with-cohesity-sso-white-paper-en.pdf] — useful for exceptions, but a privilege-creep risk if not reviewed.
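A quarterly review can start from an export of users, groups, and role bindings. The sketch below assumes that export has already been loaded into plain dictionaries; the function names and data shapes are hypothetical. It computes the union described above and flags accounts whose effective roles include an administrative role.

```python
def effective_roles(user, direct_roles, group_roles, memberships):
    """Union of roles bound to the user directly and via each of the user's groups."""
    roles = set(direct_roles.get(user, ()))
    for group in memberships.get(user, ()):
        roles |= set(group_roles.get(group, ()))
    return roles


def flag_privileged(users, direct_roles, group_roles, memberships,
                    privileged=frozenset({"Super Admin", "Admin"})):
    """Return users whose effective roles include a privileged role, sorted by name."""
    return sorted(
        u for u in users
        if effective_roles(u, direct_roles, group_roles, memberships) & privileged
    )
```

A user who is an Operator through a group and an Admin through a forgotten individual binding shows up in `flag_privileged`, which is the privilege-creep case the review is meant to catch.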
Key Takeaway: Cohesity authorization is role × access scope. Built-in roles cover common personas; custom roles and granular privileges cover the rest. Access scopes (Source Level, Source Type, Region, Service Level) layer on roles to restrict where the role applies, and they are the mechanism that makes both least privilege and MSP tenant isolation possible. A user with both individual and group role assignments inherits the union — review periodically.
Multi-Tenancy with Organizations
Multi-tenancy runs independent workloads on shared infrastructure with strict isolation. Cohesity’s primitive is the Organization (Org), surrounded by Storage Domains (View Boxes), Views, VLANs, and quotas — together delivering isolation at application, storage, network, and resource layers.
The Apartment Building Analogy
A Cohesity cluster is an apartment building. Each Organization is a unit with its own door and keys. Helios is the building manager, holding the floor plan and seeing every unit but ordinarily not entering them. View Boxes are the plumbing and electrical risers — multiple units may share a riser (shared View Box: lower cost, residual risk) or have dedicated risers (dedicated View Box: higher cost, stronger isolation). VLANs are the building’s intercom wiring — each unit gets a private channel so traffic for 3B never crosses 7A’s wires. Quotas are per-unit utility metering. Tenants see only their own apartment; the manager sees the building.
Organizations as the Isolation Primitive
An Organization is a logical container for tenant resources [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. The Cohesity blog “Multi-tenancy meets simplicity” emphasizes that Organizations let MSPs serve multiple customers from a single cluster without sacrificing the security boundary [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/].
Critical principle: Organizations remain logically isolated regardless of Storage Domain sharing settings [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. Even with shared View Boxes, a tenant admin sees only their own Org’s Storage Domains, Views, sources, Protection Groups, jobs, and reports. The UI is scoped.
Hierarchical Organizations let a parent Org contain child Orgs, each with their own Storage Domains, Views, and admins — common for large MSPs with reseller-managed customers under direct ones.
View Box Isolation — Shared vs. Dedicated
The Storage Domain (View Box) is the unit of storage policy: encryption keys, dedup scope, and compression are all configured per View Box, so View Box separation enforces cryptographic and deduplication boundaries [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf].
Shared vs. dedicated View Boxes per tenant:
| Dimension | Shared View Box | Dedicated View Box |
|---|---|---|
| Logical isolation between Orgs | Yes (Cohesity guarantee) | Yes |
| Physical co-residency | Tenants share underlying chunks | Tenant data is on its own domain |
| Encryption-key separation | Shared key per View Box | Per-tenant key (KMIP/KMS isolation possible) |
| Deduplication scope | Cross-tenant dedup → highest storage savings | Within-tenant dedup only → lower savings, no cross-tenant blob fingerprints |
| Compression | Shared compression policy | Per-tenant compression policy |
| Compliance posture | Suitable for low-sensitivity tenants | Suitable for regulated, sensitive, or competitor-tenant scenarios |
| Cost per tenant | Lower | Higher |
| Cluster-wide capacity efficiency | Higher | Lower |
| Use case | Internal departments, low-risk MSP tenants | Compliance-bound tenants, competing tenants, separate-key requirements |
MSPs typically use dedicated Storage Domains when tenants require physical separation, separate encryption keys, or compliance-driven data adjacency rules [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]. A tiered model: Bronze on shared View Boxes, Silver on shared with per-tenant View-level encryption, Gold on dedicated View Boxes with KMIP-backed per-tenant keys.
Resource Hierarchy and Quota Inheritance
Resources nest as Organization -> Storage Domain -> View [Source: https://www.penguinpunk.net/blog/wp-content/uploads/2018/11/cohesity_enable_multi-tenancy_v0.01.pdf]:
- Organization: Assigned Storage Domains and Views; tenant admins see only what is explicitly assigned.
- Storage Domain: Quotas (physical limits) and default logical View limits cascade to Views.
- View: Inherits or overrides Storage Domain quotas; configures protocols (SMB, NFS, S3) and permissions.
Quotas are hard — writes exceeding a quota are rejected, not just alerted [Source: https://developers.cohesity.com/v1-helios-latest/reference/createstoragedomain-1]. Silver tenants exceeding their 50 TB allocation get write failures, not surprise bills.
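The difference between a hard quota and an alert threshold is easy to show in code. This is a simplified model with hypothetical names (`effective_view_quota`, `admit_write`, `QuotaExceeded`); it assumes a View uses its own quota when one is set and otherwise takes the Storage Domain default, as the hierarchy above describes.

```python
from typing import Optional


class QuotaExceeded(Exception):
    """Raised when a write would push usage past a hard quota."""


def effective_view_quota(domain_default_tb: float,
                         view_override_tb: Optional[float]) -> float:
    # A View uses its own quota when one is set, else the Storage Domain default.
    return domain_default_tb if view_override_tb is None else view_override_tb


def admit_write(used_tb: float, write_tb: float, quota_tb: float) -> float:
    """Return the new usage, or raise if the write would exceed the hard quota."""
    if used_tb + write_tb > quota_tb:
        raise QuotaExceeded(
            f"{used_tb + write_tb:.1f} TB would exceed {quota_tb:.1f} TB")
    return used_tb + write_tb
```

A soft quota would log a warning and return the new usage anyway; the hard variant raises instead, which corresponds to the write failure a tenant sees.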
Network Isolation — VLAN per Organization
Application-layer isolation alone does not satisfy regulated tenants. VLAN-per-Organization extends isolation to the network layer [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/]. Tenant A’s traffic rides VLAN 100; Tenant B’s rides VLAN 200. The cluster supports multiple VIPs across VLANs so each tenant gets a DNS-resolvable management endpoint reachable only from their network.
This matters when source environments are themselves segregated — a HIPAA-bound VMware environment should send backup traffic over its regulated VLAN end-to-end, never crossing into a shared management network.
Figure 6.3: Multi-tenant isolation layers — Organization scopes UI, View Box scopes storage policy, VLAN scopes network, quotas scope capacity.
flowchart LR
subgraph Cluster[Cohesity Cluster - Shared Infrastructure]
direction LR
subgraph OrgA[Organization: acme-corp]
VBA[View Box: sd-shared-silver<br/>encryption + dedup scope]
VLA[VLAN 412 + Tenant VIP]
QA[Quotas: 25 TB FETB hard limit]
VWA[Views: acme-vmware<br/>acme-nas]
end
subgraph OrgB[Organization: globex-inc]
VBB[View Box: sd-dedicated-globex<br/>per-tenant KMIP key]
VLB[VLAN 520 + Tenant VIP]
QB[Quotas: 100 TB FETB hard limit]
VWB[Views: globex-sql<br/>globex-m365]
end
end
OrgA --> VBA --> VWA
OrgA --> VLA
OrgA --> QA
OrgB --> VBB --> VWB
OrgB --> VLB
OrgB --> QB
The multi-tenancy isolation summary table:
| Isolation Dimension | Mechanism | Trade-off |
|---|---|---|
| Identity / UI | Organization scoping | Always on; tenant admins see only their Org’s resources |
| Authorization | RBAC + Access Scopes | Scope by Org assignment, source, region, service |
| Storage policy / encryption | Dedicated View Box | Higher cost; lower dedup; stronger crypto isolation |
| Storage capacity efficiency | Shared View Box with quotas | Cross-tenant dedup; logical isolation only |
| Network | VLAN-per-Org + dedicated VIPs | Configuration overhead; required for regulated tenants |
| DNS | Per-Org management FQDN | Aligns with VLAN isolation |
| Quotas | Hard quotas at Storage Domain & View | Predictable backpressure; required for chargeback |
| Reporting | Per-Org reports and Helios scoping | Built-in; supports tenant-facing dashboards |
Per-Tenant Policies and Reporting
Each Organization carries its own policies, retention, and reports. Helios scopes per-Org reporting so tenant admins see their own SLA dashboard while the MSP sees aggregate fleet metrics — essential for chargeback.
Key Takeaway: Multi-tenancy in Cohesity rests on four isolation layers: Organizations (logical/UI scope, always enforced), Storage Domains (storage policy, encryption keys, dedup boundary — shared for efficiency, dedicated for isolation), VLANs-per-Organization (network-layer separation), and quotas (hard backpressure for chargeback). Architects choose the depth of isolation per tenant tier — Bronze tenants share, Gold tenants get dedicated everything.
Service Provider Patterns
Multi-tenancy is most stress-tested in the MSP/CSP world. This section synthesizes the patterns service providers use.
MSP/CSP Deployment Patterns
Two dominant patterns:
- Shared cluster, multiple Organizations — One cluster hosts many tenants. MSP gets economies of scale, tenants get logical isolation, design relies on RBAC + Access Scopes + (optionally) dedicated View Boxes + VLANs. Most common for SMB/mid-market MSPs.
- Dedicated cluster per tenant — Often a Cloud Edition per tenant. Eliminates shared-infrastructure concerns but loses scale economies. Used for the largest or most regulated tenants.
Hybrid patterns combine both — shared clusters for Bronze/Silver, dedicated for Gold. Helios unifies the fleet view across both [Source: https://www.cohesity.com/blogs/multi-tenancy-meets-simplicity/].
Self-Service Portals via Helios
The Self-Service role (or a custom equivalent) lets tenant admins create Protection Groups, modify Policies within MSP-imposed bounds, run on-demand backups, and recover — without ticketing the MSP.
Chargeback and Usage Metering
Quotas plus Helios reporting drive chargeback. Typical metering:
- FETB (Front-end TB) protected per Organization
- Back-end consumed capacity (post dedup/compression) per Storage Domain
- Protection Groups, Views, and recoveries as activity metrics
- Egress for cross-cluster replication or cloud archive
Pricing is typically base $/TB FETB by tier (Bronze/Silver/Gold) plus add-ons for replication, DataLock, and DataHawk. Helios reports export to billing via API.
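The pricing shape reduces to simple arithmetic. In the sketch below every rate is an invented placeholder, and charging add-ons per TB of FETB is an assumption made for illustration; real MSP price books vary.

```python
# Hypothetical $/TB FETB per month; real rates are set by each MSP.
BASE_RATE_PER_TB = {"Bronze": 20.0, "Silver": 35.0, "Gold": 60.0}
# Hypothetical add-on rates, also assumed to be charged per TB FETB.
ADDON_RATE_PER_TB = {"replication": 5.0, "datalock": 3.0, "datahawk": 8.0}


def monthly_charge(tier: str, fetb_tb: float, addons=()) -> float:
    """Tier base rate plus each add-on rate, multiplied by protected FETB."""
    rate = BASE_RATE_PER_TB[tier] + sum(ADDON_RATE_PER_TB[a] for a in addons)
    return round(rate * fetb_tb, 2)
```

With these placeholder rates, a Silver tenant protecting 25 TB with replication is billed (35 + 5) × 25 = 1000.00 per month.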
Tenant Onboarding Workflow
Figure 6.4: MSP tenant onboarding workflow — from contract to self-service portal in twelve steps.
flowchart TD
A[Tenant signs contract<br/>tier: Bronze/Silver/Gold] --> B[Day -2: Network prep<br/>VLAN + VIPs + firewall rules]
B --> C[Day -1: Storage Domain assignment<br/>shared sd-silver or dedicated sd-tenant]
C --> D[Day 0: Create Organization<br/>assign Storage Domain + VLAN + VIPs]
D --> E[Day 0: Create Views with quotas<br/>per-source-type Views]
E --> F[Day 0: Configure SAML SSO<br/>IdP metadata + Default Role]
F --> G[Day 0: Define custom roles<br/>+ Access Scopes + Auto Assign]
G --> H[Day 0: Map IdP groups to roles<br/>flat groups, no nesting]
H --> I[Day 1: Source registration<br/>vCenter, NAS, M365 over tenant VLAN]
I --> J[Day 1: Protection policies + groups<br/>from MSP templates]
J --> K[Day 1: Helios self-service portal<br/>scoped tenant view]
K --> L[Day 2: API key for automation<br/>scoped service account]
L --> M[Day 2: Document off-boarding plan<br/>revoke SAML, export audit, terminate Org]
Worked Example: MSP Onboards “Acme Corp”
Scenario. ProtectIT (MSP) runs a multi-tenant Cohesity cluster. New customer Acme Corp signs a Silver-tier contract: 25 TB FETB, dedicated VLAN, shared View Box, group-based RBAC via Acme’s Okta tenant. Acme has 150 VMs in vCenter and a 10 TB NAS share.
- Network prep (Day -2). Provision VLAN 412; allocate three IPs from 10.84.12.0/24 on VLAN 412 for tenant VIPs (management UI, backup VIP, reserved); configure firewall rules for Acme’s source IPs and Helios outbound HTTPS.
- Storage Domain assignment (Day -1). Place Acme on shared Storage Domain sd-shared-silver (Silver shares; Gold dedicates). Set a 30 TB physical quota at the Storage Domain level (25 TB sold + 20% buffer).
- Create Organization (Day 0). Via Settings > Multi-Tenancy > Organizations > Create Organization, create acme-corp. Assign sd-shared-silver, associate VLAN 412, assign tenant VIPs.
- Create Views with quotas (Day 0). Pre-create acme-vmware-backups (25 TB) and acme-nas-backups (5 TB), both on sd-shared-silver. Explicit View-level quotas prevent one View from consuming the other’s allocation.
- Configure SAML SSO (Day 0). Settings > Access Management > SSO > Configure SSO: Protocol = SAML; SSO Domain = acme.com; Provider = Okta; paste Acme’s SSO URL, Provider Issuer ID, and PEM cert. Check Assign to Organization = acme-corp. Set Default Role to a restrictive acme-default-deny so unmapped users get no access.
- Define custom roles and access scopes (Day 0). Create acme-backup-operator (Operator, scope = Source Types VMware + NAS in Acme Org) and acme-dr-admin (DR Admin, scoped to Acme Org). Enable Auto Assign so new Acme sources fall into scope.
- Map Okta groups (Day 0). Users > Add SSO Users & Groups, SSO Domain acme.com: map cohesity_acme_ops -> acme-backup-operator and cohesity_acme_dr -> acme-dr-admin. Acme’s Okta admin assigns these groups directly (not nested) to the SAML app.
- Source registration (Day 1). Acme operator (SAML-authenticated, scoped) registers vCenter and NAS sources; traffic rides VLAN 412 to the Acme VIP.
- Protection policies and groups (Day 1). Operator creates Protection Groups using MSP policy templates; backups land in the Acme Views.
- Helios self-service (Day 1). Acme operator logs into Helios; view is scoped to Acme. MSP sees aggregate fleet metrics.
- API key for automation (Day 2). Create service account acme-svc-recover-validate with the Recover-only role scoped to VMware in Acme Org; issue API key from that user. Key inherits exactly that ceiling.
- Off-boarding plan (Day 2). Document the reverse-onboarding: revoke SAML assignment, export audit logs, terminate Org (cascades to Views per contracted retention), reclaim VLAN 412, remove Okta group mappings.
This threads every primitive — Organizations, Storage Domains, Views, VLANs, quotas, custom roles, access scopes, SAML SSO, group mapping, default roles, API keys, and Helios — into a repeatable workflow.
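The quota figures in the Storage Domain and View steps can be sanity-checked before anything is created on the cluster. The following Python sketch is illustrative only: the function names are hypothetical and it calls no Cohesity API; only the numbers come from the Acme scenario.

```python
from typing import Dict


def storage_domain_quota_tb(sold_tb: float, buffer_pct: float) -> float:
    """Physical quota = contracted capacity plus a percentage buffer."""
    return sold_tb + sold_tb * buffer_pct / 100


def views_fit(view_quotas_tb: Dict[str, float], domain_quota_tb: float) -> bool:
    """True when the View quotas do not oversubscribe the Storage Domain quota."""
    return sum(view_quotas_tb.values()) <= domain_quota_tb


# Values from the Acme Corp scenario: 25 TB sold, 20% buffer.
acme_quota = storage_domain_quota_tb(sold_tb=25, buffer_pct=20)
acme_views = {"acme-vmware-backups": 25, "acme-nas-backups": 5}
acme_ok = views_fit(acme_views, acme_quota)
```

Running a check like this as part of the onboarding runbook catches a View quota that would silently exceed the tenant's contracted ceiling.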
Key Takeaway: Onboarding is the integration test of every chapter concept: VLAN, Storage Domain, Organization, Views with quotas, SAML SSO, custom roles + access scopes, group-based mapping, source registration, Protection Groups, Helios self-service, scoped API keys, and documented off-boarding. Skip any step and you have a silent isolation gap.
Chapter Summary
This chapter covered the lifecycle of identity and access in Cohesity. Authentication spans local users (break-glass), AD/LDAP (daily access), SAML SSO (Azure AD, Okta, Ping, AD FS, Duo), MFA at the IdP, and API keys that inherit the issuing user’s role. Three SAML pitfalls to remember: Cohesity uses Login over Email when both are present; nested AD groups do not expand in SAML assertions; and AD FS does not support signed auth requests with Cohesity.
Authorization rests on principals, roles (Super Admin, Admin, Operator, Viewer, Data Security, DR Admin, Replication, Self-Service, Gaia, SMB Backup Operator, plus custom), and access scopes (Source Level, Source Type, Region, Service Level). Role × scope produces real least privilege and MSP tenant isolation.
Multi-tenancy adds Organizations (logical tenant primitive), View Boxes (encryption/dedup boundary — shared for efficiency, dedicated for isolation), per-Org VLANs (network separation), and hard quotas (predictable backpressure and chargeback). Hierarchical Organizations let MSPs nest customer Orgs under reseller Orgs.
Service-provider patterns include shared-cluster-multi-Organization (common), dedicated-cluster-per-tenant (regulated/large), Helios self-service, and FETB-based chargeback. The twelve-step Acme onboarding example threads every primitive into a repeatable workflow.
For the CCAE exam, expect scenarios that combine these primitives: choosing between shared and dedicated View Boxes for regulated tenants; diagnosing SSO failures (missing default role, Login-vs-Email, nested groups, AD FS signed-auth-request); designing role-plus-scope combinations; and ordering MSP onboarding steps with VLAN-per-Org isolation.
Key Terms
- RBAC — Role-Based Access Control. Cohesity’s authorization model where principals (users/groups) are assigned roles, and roles are constrained by access scopes.
- Organization — Cohesity’s multi-tenancy primitive. A logical container for a tenant’s resources (Storage Domains, Views, sources, Protection Groups, users). Tenants in different Organizations are logically isolated regardless of underlying View Box sharing. Hierarchical Organizations are supported.
- Tenant — A customer (in MSP/CSP context) or a department (in enterprise context) hosted as an Organization on a shared Cohesity cluster.
- View Box (Storage Domain) — The unit of storage policy in Cohesity. Encryption keys, deduplication scope, and compression behavior are configured per View Box. Dedicated View Boxes give cryptographic and dedup isolation between tenants; shared View Boxes give cross-tenant deduplication and lower cost.
- View — A logical share within a View Box, exposing SMB, NFS, or S3 protocols. Views inherit Storage-Domain-level quotas and configuration but can override at the View level.
- Access Scope — The resource-boundary layer that constrains where a role applies. Scopes can be Source Level (a specific item), Source Type (a class), Region (geographic/logical), or Service Level (DataProtect, DR, etc.). Combined with roles to produce least privilege.
- SAML — Security Assertion Markup Language 2.0. The protocol Cohesity speaks to integrate with cloud IdPs (Azure AD, Okta, Ping, AD FS, etc.). Establishes a trust triangle between IdP, SP, and user; relies on PEM-format X.509 certificates.
- SSO — Single Sign-On. The user-experience outcome of SAML integration: one IdP login grants access to many SPs including Cohesity.
- MFA — Multi-Factor Authentication. Best applied at the IdP layer (Azure Conditional Access, Okta MFA, Duo) or via TOTP for local users. Combined with quorum approval for highly destructive actions.
- API key — A programmatic credential issued from a Cohesity user account. Inherits exactly the privileges of the issuing user, including the user’s role and access scopes. Least-privilege automation requires dedicated service-account users with narrowly-scoped custom roles.
- VLAN per Organization — Network-layer isolation pattern where each tenant Organization is assigned a dedicated VLAN and its own VIPs, ensuring tenant traffic never crosses into shared management or peer-tenant networks.
- Default Role — The fallback role assigned to SSO-authenticated users who match no SSO group mapping. If neither default role nor SSO groups are configured, login is rejected outright.
- Hierarchical Organization — A parent Organization containing child Organizations, enabling MSPs to host customer Organizations under a top-level service-provider Organization with cascading quota and policy enforcement.
- Helios — Cohesity’s SaaS control plane and the “building manager” in the multi-tenancy analogy. Provides global fleet view for MSPs and per-Organization scoped self-service for tenants.
Chapter 7: Data Protection: Sources, Policies, and Protection Groups
If the previous chapters built the platform — clusters, networks, identity — this chapter is where Cohesity finally earns its keep. Data Protection is the day-job: pulling backups from a sprawling, heterogeneous estate of hypervisors, file servers, databases, and SaaS tenants; storing those copies efficiently; and proving, at 3 a.m. on the worst day of someone’s career, that the data can come back. As an architect, your job is not to click “Protect” — it is to design a system where the right thing gets backed up, at the right cadence, with the right retention, automatically, even as the production estate churns underneath you.
The CCAE exam tests three intersecting constructs: Sources (what you protect), Policies (how often, how long, where copies go), and Protection Groups (the binding object that stitches the two together). Master those three nouns, and most of the data-protection blueprint falls into place.
Figure 7.1: End-to-end data protection object model — Source through Snapshots
flowchart TD
A[Source<br/>vCenter / Hyper-V / Prism / Physical / NAS / DB] --> B[Protection Group<br/>Membership: Static / Container / Tag]
B --> C[Policy<br/>SLA Contract]
C --> D[Schedule<br/>Frequency / RPO]
C --> E[Retention<br/>GFS Hierarchy + DataLock]
D --> F[Snapshots<br/>Local Cluster Storage]
E --> F
F --> G[Replication<br/>DR Cluster]
F --> H[Archive<br/>CloudArchive / S3 Glacier]
style A fill:#1f6feb,color:#fff
style C fill:#238636,color:#fff
style F fill:#8957e5,color:#fff
Learning Objectives
By the end of this chapter, you will be able to:
- Register and protect heterogeneous sources, including VMware vSphere, Microsoft Hyper-V, Nutanix AHV, physical Linux/Windows hosts, NAS systems via SMB/NFS/NDMP, and database engines such as Oracle, SQL Server, SAP HANA, and Exchange.
- Design protection policies that align with explicit RPO, RTO, and retention SLAs, using GFS-style hierarchical retention and a tiered Gold/Silver/Bronze model.
- Build Protection Groups using static membership, container-based auto-protection, and vSphere tag-based auto-protection — and choose between them based on operational maturity and audit requirements.
- Optimize backup performance and reduce production impact using SmartCopy storage-snapshot integration, Changed Block Tracking (CBT), proxy distribution, and per-datastore stream throttling.
- Differentiate application-consistent from crash-consistent backups and select the appropriate quiescing path per workload.
Source Registration and Discovery
Before Cohesity can protect anything, it must know that the source exists, hold credentials to talk to it, and understand its API surface. Source registration is the moment a “production system” becomes a “discoverable, protectable inventory” inside Cohesity.
vCenter, SCVMM, and Nutanix Prism Integration
For VMware environments, the primary handshake is at the vCenter level. You register vCenter once, and Cohesity walks the entire managed inventory — datacenters, clusters, hosts, resource pools, folders, datastores, tags, and individual VMs [Source: https://docs.cohesity.com/baas/data-protect/register-vmware-sources.htm]. Registration requires a service account with sufficient privileges to read inventory, snapshot VMs, and (for some restore paths) attach virtual disks. Most architects create a dedicated svc-cohesity account in vCenter rather than reusing a domain admin.
Critically, registration is the moment to set per-datastore stream caps. After Cohesity discovers all datastores, you can override global stream limits by enabling a Cap and setting a maximum number of concurrent backup streams per datastore [Source: https://docs.cohesity.com/baas/data-protect/register-vmware-sources.htm]. This single setting is one of the most commonly overlooked exam topics: a small, hot all-flash datastore hosting tier-1 transactional workloads should not be saturated by a 32-stream backup job hammering its queues. Cap the streams, and Cohesity self-throttles. For Microsoft Hyper-V, registration goes through SCVMM (or directly to standalone Hyper-V hosts), and Cohesity uses Resilient Change Tracking (RCT) instead of CBT for incremental detection. Nutanix AHV registers via Prism Element or Prism Central; Cohesity then uses Nutanix’s native snapshot APIs.
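To make the per-datastore cap concrete, here is a minimal Python sketch of how a cap might resolve against a global stream limit. It assumes an enabled cap simply replaces the global limit for that datastore; the function and the datastore names are hypothetical, and the 32-stream global value just echoes the example above.

```python
from typing import Dict, Optional


def effective_streams(global_limit: int, datastore_cap: Optional[int] = None) -> int:
    """Concurrent backup streams allowed against one datastore.

    None means no cap is enabled, so the global limit applies.
    An enabled cap overrides the global limit for that datastore.
    """
    return global_limit if datastore_cap is None else datastore_cap


# Hypothetical inventory: a hot all-flash datastore capped at 4, a bulk tier left uncapped.
caps: Dict[str, Optional[int]] = {"ds-flash-tier1": 4, "ds-bulk-01": None}
limits = {name: effective_streams(32, cap) for name, cap in caps.items()}
```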
Physical Agent for Linux and Windows
Not everything is virtualized, and Cohesity’s physical agent handles the rest. The Cohesity Agent is a lightweight binary that runs on Linux or Windows and provides three modes:
- File-based backup for individual filesets and directories.
- Volume-based (block) backup for full-system imaging, including bare-metal recovery.
- Application-aware backup for SQL Server, Oracle, SAP HANA, Exchange, SharePoint, and Active Directory, where the agent coordinates with the application’s quiescing API (VSS on Windows, RMAN on Oracle, VDI on SQL Server, Backint on SAP HANA).
Architects should plan agent rollout via configuration management (Ansible, SCCM, Puppet) rather than manual installs; in a fleet of 5,000 servers, a manual approach simply does not scale.
NAS Sources via SMB/NFS and NDMP
NAS protection has two main flavors. For modern NAS (NetApp ONTAP, Dell PowerScale/Isilon, Pure FlashBlade, generic Linux NFS exporters, Windows file servers), Cohesity registers the share over SMB or NFS and walks the namespace. For legacy or large enterprise NAS where snapshot-and-stream is preferable, Cohesity drives backups via NDMP — talking directly to the array’s tape-out protocol but redirecting the stream into Cohesity instead of physical tape. NDMP backups are typically faster and lighter on the array than client-side share crawls.
A useful design pattern: register NetApp filers via the array’s snapshot-API integration (similar in spirit to the Pure SmartCopy pattern discussed below) so Cohesity ingests from a SnapMirror destination or an array snapshot rather than competing with production NFS clients.
Database Sources: Oracle, SQL, SAP HANA, Exchange
Database protection is its own discipline (and Chapter 8 dives deeper), but at the source-registration layer the architect needs to know which native interface each engine registers through:
- Oracle: The Cohesity agent integrates with RMAN as a media-management library; you register the Oracle host and Cohesity provides the channel target.
- SQL Server: Registration uses VDI (Virtual Device Interface) for stream backups and is AAG-aware (Always-On Availability Groups) so Cohesity can target the preferred replica.
- SAP HANA: Registration plugs into the Backint API; HANA writes its backup stream directly to the Cohesity-provided endpoint.
- Exchange: Registration uses the VSS Exchange writer for application-consistent mailbox backups.
In each case, source registration captures the credentials and connection metadata; the protection logic — log backups, point-in-time recovery, granular item recovery — is configured later at the Protection Group and Policy layers.
Key Takeaway: Source registration is the inventory step. Register vCenter once and let auto-discovery handle the rest; for physical hosts, NAS, and databases, plan agent deployment and credential management as a fleet operation, and use per-datastore stream caps during registration to protect production I/O.
Policies and Schedules
A Cohesity Protection Policy is the SLA expressed in code. It encapsulates how often a backup runs (RPO), how long copies are kept (retention), where copies go (replication and archival targets), and any immutability rules. Critically, a single policy can express the entire lifecycle of a backup — from the first snapshot on local cluster storage all the way through replication to a DR site and archival to S3 Glacier seven years later [Source: https://docs.cohesity.com/baas/data-protect/policies.htm].
Frequency, Retention, and Lock Attributes
The minimum RPO Cohesity can express in a standard policy is 15 minutes for hypervisor-based backups using Redirect-on-Write (RoW) snapshots [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. Tighter RPOs, below 15 minutes, are achievable when the policy is integrated with primary array snapshots through SmartCopy, because Cohesity is no longer limited by hypervisor snapshot overhead; it is simply orchestrating the array’s native snapshot engine.
Retention is configured as “Keep for N days/weeks/months/years” and supports DataLock attributes for compliance and ransomware resilience. DataLock holds a backup immutable until its retention expires. Two flavors exist: Compliance Lock (truly immutable and legally enforceable; not even a cluster admin can delete the backup before expiry) and Governance Lock (soft-immutable; can be overridden by a quorum of admins). For workloads in SOX, HIPAA, or PCI scope, Compliance Lock is typically the required setting.
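The two lock flavors differ only in whether an override path exists before expiry. The sketch below models that rule in Python as this section describes it; the class and function names are hypothetical, and this is a conceptual model rather than how Cohesity implements DataLock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class LockMode(Enum):
    COMPLIANCE = "compliance"
    GOVERNANCE = "governance"


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    retention_days: int
    lock: Optional[LockMode] = None

    @property
    def expires_at(self) -> datetime:
        return self.taken_at + timedelta(days=self.retention_days)


def can_delete(snap: Snapshot, now: datetime, quorum_approved: bool = False) -> bool:
    """Conceptual early-deletion rule for a locked snapshot."""
    if now >= snap.expires_at:
        return True                 # retention elapsed, the lock no longer applies
    if snap.lock is LockMode.COMPLIANCE:
        return False                # no override path before expiry
    if snap.lock is LockMode.GOVERNANCE:
        return quorum_approved      # override only with quorum approval
    return True                     # unlocked snapshot
```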
Hierarchical Retention (Daily/Weekly/Monthly/Yearly)
Cohesity policies natively support the Grandfather-Father-Son (GFS) retention model. Beyond the base “Keep for” period, you can promote specific snapshots to extended retention buckets:
- The first successful snapshot of each day, retained for N days.
- The first successful snapshot of each week, retained for N weeks.
- The first successful snapshot of each month, retained for N months.
- The first successful snapshot of each year, retained for N years.
This is exactly how seasoned backup admins have thought for decades; Cohesity simply makes it a checkbox rather than a custom script. Combined with global variable-length deduplication, the storage cost of a 7-year monthly retention is far lower than naive arithmetic suggests, because the unchanged blocks are stored once across the entire chain.
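The promotion rule above amounts to bucketing successful runs by calendar period and keeping the earliest run in each bucket. A minimal Python sketch follows. It assumes weeks are ISO weeks (Monday start), which is an assumption of this example rather than documented Cohesity behavior, and `gfs_keepers` is a hypothetical name.

```python
from datetime import datetime
from typing import Dict, List


def gfs_keepers(successful_runs: List[datetime]) -> Dict[str, List[datetime]]:
    """Pick the first successful snapshot of each day, week, month, and year."""
    bucket_keys = {
        "daily": lambda t: t.date(),
        "weekly": lambda t: tuple(t.isocalendar())[:2],   # (ISO year, ISO week)
        "monthly": lambda t: (t.year, t.month),
        "yearly": lambda t: t.year,
    }
    keepers: Dict[str, List[datetime]] = {}
    for tier, key in bucket_keys.items():
        first: dict = {}
        for run in sorted(successful_runs):
            first.setdefault(key(run), run)   # earliest run wins its bucket
        keepers[tier] = list(first.values())
    return keepers


# Illustrative run history (timestamps are made up for the example).
runs = [
    datetime(2024, 1, 1, 0, 15), datetime(2024, 1, 1, 0, 30),
    datetime(2024, 1, 2, 0, 15), datetime(2024, 1, 8, 0, 15),
    datetime(2024, 2, 1, 0, 15),
]
kept = gfs_keepers(runs)
```

Each tier's keepers would then be held for that tier's own N days, weeks, months, or years; the retention clock itself is not modeled here.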
Policy Templates and Re-Use
A core architectural principle: one policy per SLA tier, not per workload. If you have 50 SQL servers, 200 file shares, and 1,200 VMs all in the “Gold” tier, they should all reference the same Gold policy. When the SLA changes — and it will — you edit one object instead of 1,450. Cohesity allows policies to be cloned and templated through the Helios global policy framework, so you can apply a single Gold definition across multiple clusters consistently [Source: https://docs.cohesity.com/baas/data-protect/policies.htm].
Time Zones and Blackout Windows
Schedules run in the cluster’s configured time zone; for global enterprises with clusters in Frankfurt, Singapore, and Virginia, ensure local schedules respect local maintenance windows. Cohesity supports blackout windows where backups are paused — for example, suspending VM backups during the nightly business close or a SAN firmware upgrade. Architects should document blackout policy in the runbook and confirm RPO is still achievable around the suspension.
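Whether an RPO survives a blackout window is a quick calculation worth putting in the runbook. The Python sketch below is a simplified model, not a Cohesity feature: it treats the window length as the minimum gap between two runs, so a window longer than the RPO is a guaranteed miss. The function names are hypothetical, and all times are assumed to be in the cluster's configured time zone.

```python
from datetime import time


def in_blackout(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the window; handles windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def window_minutes(start: time, end: time) -> int:
    """Length of the blackout window in minutes."""
    start_min = start.hour * 60 + start.minute
    end_min = end.hour * 60 + end.minute
    return (end_min - start_min) % (24 * 60)


def rpo_can_be_met(start: time, end: time, rpo_minutes: int) -> bool:
    """Necessary (not sufficient) condition: the window must not exceed the RPO."""
    return window_minutes(start, end) <= rpo_minutes
```

For example, a 22:00 to 02:00 blackout is compatible with a 4-hour Silver RPO only if a run completes immediately before the window opens, and it can never satisfy a 15-minute Gold RPO.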
Key Takeaway: Treat protection policies like SLA contracts: define one per service tier (Gold/Silver/Bronze), use GFS hierarchical retention, apply DataLock for immutability where compliance dictates, and reuse the same policy across many workloads instead of cloning per application.
The SLA Analogy
Think of a protection policy like an SLA contract that your backup service signs with the application teams. The contract specifies the deliverables (RPO, RTO), the warranty period (retention), the geographic redundancy clause (replication), and the long-term archive (compliance). A Protection Group, by contrast, is the customer roster — the list of accounts subscribed to that contract. The same Gold contract can be sold to a hundred customers (Protection Groups), and changing the contract terms automatically updates all subscribers. This separation is what makes Cohesity scale operationally: contracts and rosters are decoupled.
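The contract-and-roster split can be shown as a small data model in which each Protection Group holds a reference to a shared policy rather than a copy. This is a Python sketch of the idea only; the class fields and group names are hypothetical and do not mirror Cohesity's object schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Policy:
    name: str
    rpo_minutes: int
    local_retention_days: int


@dataclass
class ProtectionGroup:
    name: str
    policy: Policy                                   # a reference, not a copy
    members: List[str] = field(default_factory=list)


gold = Policy("gold", rpo_minutes=15, local_retention_days=30)
groups = [
    ProtectionGroup("pg-sql-gold", gold, ["sql-01", "sql-02"]),
    ProtectionGroup("pg-erp-gold", gold, ["erp-app-01"]),
]

# One edit to the contract is visible to every subscribed roster.
gold.local_retention_days = 45
retention_seen = {g.name: g.policy.local_retention_days for g in groups}
```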
Reference SLA Tier Design (Gold/Silver/Bronze)
The following table captures a battle-tested tiered policy reference that maps directly to common enterprise SLAs [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].
| Tier | Frequency (RPO) | Local Retention | Replication | Archive | Target RTO | Typical Workloads |
|---|---|---|---|---|---|---|
| Gold (mission-critical) | Every 15 min via storage snapshot integration (SmartCopy) | 30 days, app-consistent | Async replication to DR cluster, every cycle | CloudArchive Direct, monthly, 7+ years, DataLock Compliance | Minutes (Instant Mass Restore) | Tier-1 OLTP databases, ERP, EHR, payment systems |
| Silver (business-important) | Every 4–6 hours, hypervisor CBT + app quiesce | 14–30 days | Async daily to DR cluster | CloudArchive monthly, 3–5 years | Less than 1 hour | Internal apps, file shares, VDI gold images, mid-tier DBs |
| Bronze (tier-3, dev/test) | Daily, crash-consistent acceptable | 7–14 days | None or weekly to DR | CloudArchive Direct quarterly, 1 year | Less than 4 hours | Dev/test VMs, sandbox, ephemeral workloads |
A common architect mistake is to over-engineer Bronze (giving every dev VM a 15-minute RPO “just in case”) or under-engineer Gold (relying on hypervisor backups for a database whose change rate makes a 6-hour RPO meaningless). Match the tier to the actual business cost of data loss.
Figure 7.2: Tiered policy decision tree — RPO/RTO requirements drive Gold/Silver/Bronze selection
graph TD
A[Workload SLA Requirement] --> B{RPO needed?}
B -->|<= 15 min| C{RTO needed?}
B -->|4-6 hours| D[Silver Tier]
B -->|>= 24 hours| E[Bronze Tier]
C -->|Minutes<br/>Instant Mass Restore| F[Gold Tier]
C -->|< 1 hour| D
F --> G[SmartCopy + Pure<br/>30d local + DR replication<br/>7y archive + Compliance Lock]
D --> H[CBT/RCT Hypervisor<br/>14-30d local + DR daily<br/>3-5y archive]
E --> I[Daily crash-consistent<br/>7-14d local<br/>1y archive]
style F fill:#d4af37,color:#000
style D fill:#c0c0c0,color:#000
style E fill:#cd7f32,color:#fff
Protection Groups
A Protection Group is the binding object that connects a set of source objects to a single policy. It also holds operational settings such as proxy assignment, indexing options, pre/post scripts, application-quiesce flags, and exclude lists. The architectural decision that dominates Protection Group design is how membership is determined: statically, by container, or by tag.
Static Membership vs. Tag-Based Auto-Protection
Static membership means the administrator hand-picks individual VMs (or volumes, or shares, or databases) at job-creation time. The Protection Group’s scope never changes unless someone edits it. This is highly deterministic — only the explicitly listed objects are ever backed up — and it is auditable: a regulator can ask “is VM pci-cardvault-01 protected?” and the answer is a binary yes/no based on the membership list.
The risk is silent under-protection. A new VM provisioned by a junior engineer in a regulated environment may not appear in any Protection Group for weeks until someone notices. For PCI cardholder data, HIPAA-regulated systems, or any workload with a documented compliance scope, static membership combined with an out-of-band reconciliation report (provisioned VMs minus protected VMs) is the safer pattern [Source: https://www.penguinpunk.net/blog/cohesity-basics-auto-protect/].
Auto-protect binds a Protection Group to selected parent objects (datacenters, folders, clusters, hosts, resource pools) and supports vSphere tags for inclusion and exclusion [Source: https://knowledge.broadcom.com/external/article/324666/cohesity-data-protection-solution-for-vm.html]. New VMs added to a selected container are automatically swept into the next backup run using the assigned policy, and removed or deleted VMs naturally fall out of scope. This is operationally elegant: provisioning automation can drop a VM into the right vCenter folder and protection “just happens.”
Auto-Protect with vSphere Tags
Tag-based auto-protect is the most powerful and the most surprising. Cohesity’s tag logic has a non-obvious quirk that frequently appears on the CCAE exam:
- Adding tags one-by-one with exclude selected produces an OR operation — Cohesity excludes any VM with any of the listed tags.
- Adding multiple tags simultaneously in a single operation produces an AND operation — Cohesity excludes only VMs with all the listed tags [Source: https://www.youtube.com/watch?v=_067ubHLEt4].
This matters in practice. If you want to exclude everything tagged dev or lab from your Gold Protection Group, add them one-by-one (OR). If you want to exclude only VMs that are both dev and decommissioned, add them together (AND). Misreading this distinction has caused architects to either over-protect (tagging an entire dev fleet into Gold) or under-protect (silently excluding production workloads).
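The OR-versus-AND behavior is ordinary set logic, which the following Python sketch makes explicit. The function and its `mode` argument are hypothetical: "any" stands for tags added one-by-one, and "all" stands for tags added together in a single operation.

```python
from typing import Set


def is_excluded(vm_tags: Set[str], exclude_tags: Set[str], mode: str) -> bool:
    """Model of the exclude-tag quirk described above."""
    if mode == "any":      # OR: one matching tag is enough
        return bool(vm_tags & exclude_tags)
    if mode == "all":      # AND: the VM must carry every listed tag
        return bool(exclude_tags) and exclude_tags <= vm_tags
    raise ValueError(f"unknown mode: {mode}")


# A VM tagged only `dev`:
dev_vm = {"dev"}
excluded_by_or = is_excluded(dev_vm, {"dev", "lab"}, mode="any")
excluded_by_and = is_excluded(dev_vm, {"dev", "decommissioned"}, mode="all")
```

The same VM is excluded under the OR form and retained under the AND form, which is exactly the over-protect or under-protect trap.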
VM Folders have been supported as parent objects since Cohesity DataProtect 4.0, propagating settings hierarchically [Source: https://www.cohesity.com/blogs/cohesity-dataprotect-4-0-extends-management-integration-vmware-vsphere/]. This is useful when vSphere admins encode tenancy or app boundaries into folder hierarchy.
Figure 7.3: Auto-protect via vSphere tags — dynamic membership update flow
flowchart LR
A[vCenter Tag<br/>tier=gold] --> B[Cohesity Inventory Sync]
B --> C{Tag Filter<br/>Include / Exclude}
C -->|Match| D[Dynamic Membership<br/>Update]
C -->|No Match| E[Excluded from PG]
D --> F[Protection Group<br/>pg-gold-auto]
F --> G[Next Backup Run<br/>New VMs Auto-Swept]
H[New VM<br/>Provisioned + Tagged] -.-> A
I[Untagged VM<br/>Removed] -.-> E
style A fill:#1f6feb,color:#fff
style D fill:#238636,color:#fff
style F fill:#8957e5,color:#fff
When to Use Each Membership Model
| Membership Model | Best For | Risks | Audit Posture |
|---|---|---|---|
| Static | PCI/HIPAA/SOX-scoped VMs; small, slow-changing high-value sets | New VMs silently unprotected | Strongest — explicit list |
| Container auto-protect (folder/cluster/RP) | Well-organized vSphere with folder-per-business-unit | Folder reorganization can shift scope | Good — provided folder hygiene |
| Tag auto-protect | Cross-cutting concerns (tier=gold, app=sql) where folder structure is contested | Tag drift, AND/OR confusion, double-coverage | Moderate — requires tag governance |
A common hybrid pattern: use static membership for regulated workloads and tag-based auto-protect for everything else, with a nightly Helios report flagging any provisioned VM not covered by any Protection Group.
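The nightly reconciliation is a set difference between what is provisioned and what any Protection Group covers. Below is a minimal Python sketch, assuming the two inventories have already been exported as plain collections of VM names; the function name and input shape are hypothetical, and it does not call vCenter or Helios.

```python
from collections import Counter
from typing import Dict, Set


def coverage_report(provisioned: Set[str], groups: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Flag VMs covered by no Protection Group and VMs covered by more than one."""
    counts = Counter(vm for members in groups.values() for vm in members)
    return {
        "unprotected": provisioned - set(counts),
        "double_covered": {vm for vm, n in counts.items() if n > 1},
    }


# Illustrative inventories (names are made up).
report = coverage_report(
    provisioned={"pci-cardvault-01", "web-01", "web-02"},
    groups={
        "pg-pci-static": {"pci-cardvault-01"},
        "pg-gold-auto": {"web-01", "pci-cardvault-01"},
    },
)
```

Reporting double coverage alongside gaps also surfaces the overlap risk listed for tag-based auto-protect in the table above.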
Indexing Options and Search Implications
When a Protection Group runs, Cohesity can index the contents of the backup at the file level. Indexing enables global search (“find me every file named customer.csv across every backup, anywhere”) and self-service granular restore. The cost is metadata storage and ingest CPU. For VMs containing only block-level databases, indexing has minimal value and can be disabled. For file servers, NAS, and end-user VMs, indexing is essential. Make indexing a deliberate per-group decision: indexing everything by default tends to overspend on metadata, while selective indexing keeps search available where it is actually used.
App-Consistent vs. Crash-Consistent Backups
A crash-consistent backup captures the disk state as if the system had been suddenly powered off — the filesystem and any applications must rely on their own crash-recovery mechanisms (journal replay, transaction rollback) to come back cleanly. For modern journaled filesystems and databases this usually works, but it is not guaranteed.
An app-consistent backup pauses the application briefly so it can flush buffers, checkpoint state, and quiesce I/O before the snapshot is taken. On Windows, this is VSS (Volume Shadow Copy Service); on Linux, it is typically pre-/post-freeze scripts; for databases, it is the database’s own quiesce API (RMAN, VDI, Backint). The result is a backup the application can mount cleanly without recovery work — essential for databases where transaction-log replay from backup must be precise.
| Property | Crash-Consistent | App-Consistent |
|---|---|---|
| Application quiesce | None | Yes (VSS/agent/script) |
| Performance impact on source | Minimal | Brief pause (sub-second to several seconds) |
| Recovery cleanliness | Depends on app crash-recovery | Guaranteed clean |
| Required for DBs? | No (risky) | Yes |
| Required for general VMs? | Often acceptable | Recommended |
| Required for ephemeral/dev | Acceptable | Optional |
Architects should default to app-consistent for any VM running a database or transactional system, and crash-consistent only where quiesce overhead is unacceptable or the workload is genuinely stateless.
Key Takeaway: Static membership is auditable but brittle; auto-protect is operationally elegant but requires tag/folder governance. Pair static membership with regulated workloads and auto-protect with everything else, and always choose app-consistent backups for database and transactional workloads.
Performance and Concurrency
A perfectly designed policy is worthless if backups run hot enough to crash production. Performance tuning sits at the intersection of source impact, network bandwidth, proxy capacity, and cluster ingest throughput.
SmartCopy and Storage Snapshot Integration
SmartCopy is Cohesity’s snapshot-based copy and replication mechanism that integrates directly with primary storage arrays — most prominently Pure Storage FlashArray, but also NetApp, HPE Nimble/Primera, and Dell PowerStore via partner integrations [Source: https://www.cohesity.com/newsroom/press/cohesity-unveils-native-integration-pure/]. Rather than running a hypervisor-side or in-guest backup that competes with production I/O, Cohesity drives the array’s own snapshot APIs and ingests data from those snapshots.
The architecture flow is elegant [Source: https://www.cohesity.com/blogs/cohesity-pure-storage-data-protections-speed-flash/]:
- Discovery — Register the Pure FlashArray as a source in Cohesity; the cluster enumerates volumes and volume groups.
- Policy assignment — Assign a Protection Policy to the chosen Pure volumes; the policy defines snapshot frequency, retention on the array, retention on Cohesity, and optional replication and archive.
- Snapshot creation — At schedule time, Cohesity calls the Pure REST API to take a snapshot. Optional pre/post scripts quiesce SQL/Oracle/Exchange to make the snapshot app-consistent.
- Mount and read — Cohesity mounts the snapshot (typically via iSCSI) to the cluster, reads only changed blocks (using array-side change tracking), and ingests through inline dedupe and compression.
- Retention tiering — A few recent snapshots remain on Pure for instant restore at flash speed; older snapshots are aged out on the array but retained on Cohesity for long-term recovery.
- Recovery — Restore is volume-level back to any Pure FlashArray (same or DR site), file-level via SmartFiles mount, or cross-platform to native cloud VMs.
The exam-relevant point: SmartCopy enables sub-15-minute RPOs with zero hypervisor overhead and is the canonical Gold-tier mechanism for transactional databases sitting on Pure. It also means the Cohesity cluster does not need the bandwidth or proxy capacity to ingest a full hypervisor stream every 15 minutes — it ingests only the snapshot delta.
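The ordering of the snapshot steps matters: the application should be quiesced only for as long as the array needs to cut the snapshot, and it must be released even if the snapshot call fails. The Python sketch below illustrates that ordering with a hypothetical `ArrayClient` interface and an in-memory fake; none of these names correspond to a real Pure Storage or Cohesity client.

```python
from typing import Callable, Iterable, List, Protocol


class ArrayClient(Protocol):
    """Hypothetical array snapshot interface, for illustration only."""

    def take_snapshot(self, volume: str) -> str: ...
    def changed_blocks(self, snapshot_id: str) -> Iterable[bytes]: ...


def smartcopy_cycle(array: ArrayClient, volume: str,
                    quiesce: Callable[[], None], release: Callable[[], None],
                    ingest: Callable[[bytes], None]) -> str:
    quiesce()                          # pre-script: flush and pause application writes
    try:
        snapshot_id = array.take_snapshot(volume)
    finally:
        release()                      # post-script runs even if the snapshot call fails
    for block in array.changed_blocks(snapshot_id):
        ingest(block)                  # only the delta is read from the snapshot
    return snapshot_id


class FakeArray:
    """In-memory stand-in so the sketch runs without an array."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def take_snapshot(self, volume: str) -> str:
        self.log.append("snapshot")
        return f"{volume}.snap-1"

    def changed_blocks(self, snapshot_id: str) -> Iterable[bytes]:
        return [b"block-7", b"block-42"]


log: List[str] = []
ingested: List[bytes] = []
snapshot_id = smartcopy_cycle(
    FakeArray(log), "vol-sql01",
    quiesce=lambda: log.append("quiesce"),
    release=lambda: log.append("release"),
    ingest=ingested.append,
)
```

Reading the changed blocks happens after the release, so the application pause covers only the snapshot call and not the data transfer.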
Figure 7.4: SmartCopy with Pure FlashArray — orchestration sequence
sequenceDiagram
participant App as SQL/Oracle App
participant Cohesity as Cohesity Cluster
participant Pure as Pure FlashArray
participant Archive as CloudArchive (S3)
Cohesity->>App: Pre-script: Quiesce (VSS/RMAN)
App-->>Cohesity: Quiesce ACK
Cohesity->>Pure: REST API: Take Snapshot
Pure-->>Pure: Native array snapshot created
Pure-->>Cohesity: Snapshot ID
Cohesity->>App: Post-script: Release quiesce
Cohesity->>Pure: Mount snapshot (iSCSI)
Pure-->>Cohesity: Stream changed blocks only
Cohesity-->>Cohesity: Inline dedupe + compression
Note over Pure: Recent snapshots retained<br/>on flash for instant restore
Note over Cohesity: Older snapshots tier<br/>to Cohesity DataPlatform
Cohesity->>Archive: Tier monthly to S3 Glacier
A subtle but important note: “SmartFiles SmartCopy” is not a distinct Pure feature; the integration is implemented through Cohesity’s Protection Policies [Source: https://www.cohesity.com/blogs/cohesity-pure-storage-data-protections-speed-flash/]. SmartFiles can act as the immutable backup target with DataLock/WORM, while SmartCopy is the orchestration layer.
CBT/RCT and Incremental Forever
For non-array-integrated VM backups, Cohesity uses Changed Block Tracking (CBT) on VMware and Resilient Change Tracking (RCT) on Hyper-V. The hypervisor maintains a bitmap of changed blocks since the last backup; Cohesity reads only those changed blocks. Combined with global variable-length deduplication on ingest, this delivers an Incremental Forever model: a single full backup at job inception, then deltas only — and even those deltas are deduped against existing cluster data.
CBT can occasionally desynchronize (after a storage vMotion, a snapshot consolidation failure, or certain VMware patches), and Cohesity will then need to fall back to a full read or a CBT reset. Architects should monitor cbt_reset events and budget for occasional full reads on a small percentage of jobs.
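The fallback behavior can be pictured as a choice between reading the dirty blocks named in the change bitmap and reading everything when the bitmap cannot be trusted. This Python sketch is a simplified model with hypothetical names; real CBT and RCT queries return changed extents through hypervisor APIs rather than a Python list.

```python
from typing import List, Optional


def blocks_to_read(total_blocks: int, changed: Optional[List[bool]]) -> List[int]:
    """Return the block indexes to read for this run.

    `changed` is the change bitmap since the last backup. None, or a
    bitmap of the wrong length, models lost or reset tracking state
    and forces a full read.
    """
    if changed is None or len(changed) != total_blocks:
        return list(range(total_blocks))
    return [i for i, dirty in enumerate(changed) if dirty]


incremental = blocks_to_read(4, [False, True, False, True])
after_reset = blocks_to_read(4, None)
```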
Job Concurrency and Proxy Distribution
Cohesity backups run as parallel streams. Concurrency is governed at multiple layers:
- Policy-level concurrency — how many objects in a Protection Group run simultaneously.
- Cluster-level concurrency — total simultaneous streams across the cluster.
- Source-level concurrency — typically governed by the per-datastore stream cap discussed in the registration section.
- Proxy-level concurrency — for environments using physical or virtual backup proxies (rare in modern Cohesity, more common in legacy hybrid deployments).
Tuning concurrency is iterative: start with defaults, watch for source saturation (vCenter task queues, datastore latency, NAS array CPU), and adjust caps where bottlenecks appear. The most common mistake is to crank concurrency to maximize throughput on day one and then spend three months chasing latency complaints from the storage team.
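One way to reason about the layers above is that the effective stream count is bounded by the tightest cap at every layer. The sketch below is a simplified model under that assumption; the class, field names, and numbers are hypothetical and do not correspond to Cohesity settings.

```python
from dataclasses import dataclass


@dataclass
class ConcurrencyCaps:
    policy: int         # objects a Protection Group may run at once
    cluster_free: int   # streams the cluster still has available
    per_datastore: int  # source-level cap for each datastore


def admit(requested: dict[str, int], caps: ConcurrencyCaps) -> dict[str, int]:
    """Return how many streams each datastore gets, honoring every layer."""
    budget = min(caps.policy, caps.cluster_free)
    granted: dict[str, int] = {}
    for datastore, wanted in requested.items():
        streams = min(wanted, caps.per_datastore, budget)
        granted[datastore] = streams
        budget -= streams
    return granted


caps = ConcurrencyCaps(policy=10, cluster_free=40, per_datastore=4)
print(admit({"ds-01": 6, "ds-02": 6, "ds-03": 6}, caps))
# {'ds-01': 4, 'ds-02': 4, 'ds-03': 2}
```

In this model, raising the cluster-level number changes nothing because the policy and per-datastore caps bind first, which is why tuning one layer in isolation rarely produces the expected throughput gain.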
Throttling and QoS
Cohesity supports time-windowed bandwidth throttling for replication and archive traffic — for example, capping replication to 200 Mbps during business hours and lifting the cap overnight. Per-policy QoS lets you mark Gold backups higher-priority than Bronze on a shared cluster, so when contention arises, low-tier jobs slow first. This is essential when a single cluster serves multiple SLA tiers, which is the norm in real enterprise deployments.
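A time-windowed throttle is also a sizing input, because it determines how long a replication cycle takes to drain. The sketch below assumes a hypothetical business-hours window of 08:00 to 18:00 at the 200 Mbps cap from the example, and uses decimal units (1 GB = 8,000 megabits).

```python
from datetime import time
from typing import Optional

# (window start, window end, cap in Mbps); a hypothetical schedule
THROTTLE_WINDOWS = [(time(8, 0), time(18, 0), 200)]


def replication_cap_mbps(now: time) -> Optional[int]:
    """Return the cap that applies at `now`, or None when uncapped."""
    for start, end, cap in THROTTLE_WINDOWS:
        if start <= now < end:
            return cap
    return None


def transfer_hours(gigabytes: float, cap_mbps: int) -> float:
    """Hours needed to move `gigabytes` at a sustained cap."""
    megabits = gigabytes * 8 * 1000
    return megabits / cap_mbps / 3600


print(replication_cap_mbps(time(9, 30)))   # 200
print(replication_cap_mbps(time(23, 0)))   # None
print(transfer_hours(90, 200))             # 1.0
```

At 200 Mbps, 90 GB of post-dedupe change takes an hour to replicate, so a daytime change rate above that will build a backlog until the cap lifts overnight.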
Key Takeaway: Use SmartCopy with primary array snapshots for sub-15-minute RPOs without hypervisor overhead; use CBT/RCT incremental-forever for everything else; cap streams at the source (per-datastore) and tune concurrency iteratively rather than maximizing it on day one.
Worked Example: Designing a Gold Policy
Let us design a Gold policy end-to-end against a concrete requirement.
Requirement. A financial-services customer runs SQL Server transactional databases on Pure FlashArray volumes. The application owner demands:
- 15-minute RPO for the database.
- 30 days of daily on-cluster recovery points.
- 1 year of monthly archives for regulatory review.
- Cross-region replication to a DR cluster in a paired region.
- App-consistent backups (no crash-consistent compromises).
- Immutable backups for ransomware resilience.
Design.
- Source registration: Register the Pure FlashArray as a source in Cohesity. Register the SQL Server hosts as physical sources with the Cohesity Agent so pre/post scripts can run VSS-coordinated quiesce.
- Membership: Create a Protection Group `pg-sql-gold-payments` with static membership of the specific Pure volumes hosting the payment database files and logs. (Static, because PCI scope demands deterministic membership.)
- Frequency: Configure the policy `pol-gold-15min` with a 15-minute snapshot frequency using SmartCopy against the Pure FlashArray. Pre-script triggers SQL VSS quiesce; post-script releases.
- Local retention: 30 days, daily-extended retention pinned (the first successful snapshot of each day held for 30 days; intra-day snapshots aged out after 24 hours to control storage growth).
- Monthly archive: Add CloudArchive Direct to S3 with a Glacier lifecycle, monthly cadence, 1 year retention, Compliance DataLock enabled.
- Replication: Add an async replication target to the DR Cohesity cluster, replicating every cycle; DR cluster retains 7 days of recovery points for failover.
- Quiesce: App-consistent — pre-script triggers SQL VSS quiesce, post-script releases; if the script fails, the policy is configured to fail the run rather than fall back to crash-consistent (a deliberate choice for Gold).
- Indexing: Disabled at the volume level (block backups don’t benefit from file-level index); SQL granular recovery handled at the database layer through Cohesity’s SQL integration.
- Concurrency: Cap the Pure datastore at 8 streams during business hours, 16 overnight; replication throttled to 500 Mbps daytime, uncapped overnight.
The result: a single named policy (pol-gold-15min) bound to a single Protection Group (pg-sql-gold-payments) that delivers 15-minute RPO, near-instant recovery via mount-from-snapshot, a 1-year compliance archive, cross-region DR, and immutable archive copies for ransomware resilience. Adding the next Gold workload (say, the Oracle ERP database) is a matter of registering its Pure volumes and creating a second Protection Group bound to the same pol-gold-15min policy. The contract is reused; only the customer roster grows.
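The contract-and-roster relationship in this worked example can be modeled in a few lines. This is an illustrative data model only, not Cohesity's object schema: the field names, the volume names, and the second group name `pg-oracle-gold-erp` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    frequency_minutes: int
    daily_retention_days: int
    archive_retention_days: int
    replicate_to_dr: bool

    def snapshots_per_day(self) -> int:
        """Recovery points created per day at the configured frequency."""
        return 24 * 60 // self.frequency_minutes


@dataclass(frozen=True)
class ProtectionGroup:
    name: str
    policy: Policy
    volumes: tuple[str, ...]


gold = Policy("pol-gold-15min", 15, 30, 365, True)
payments = ProtectionGroup("pg-sql-gold-payments", gold, ("sql-data", "sql-log"))
erp = ProtectionGroup("pg-oracle-gold-erp", gold, ("erp-data", "erp-redo"))

print(gold.snapshots_per_day())       # 96
print(payments.policy is erp.policy)  # True: one contract, two rosters
```

The 96 snapshots per day are why the design ages out intra-day recovery points after 24 hours and pins only one per day for the 30-day window.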
Chapter Summary
This chapter unpacked the trio of objects that drive Cohesity data protection: Sources (registered hypervisors, hosts, NAS, databases), Policies (RPO/retention/replication/archive contracts), and Protection Groups (the binding object that subscribes a set of sources to a policy).
You learned how to register vCenter, SCVMM, Prism, physical hosts, NAS, and databases — and the importance of per-datastore stream caps set at registration time. You walked through the policy structure: GFS hierarchical retention, DataLock immutability, blackout windows, and the tiered Gold/Silver/Bronze SLA model. You compared static membership against container- and tag-based auto-protect, including the AND/OR tag logic that catches careless architects on the exam. You learned why app-consistent backups are non-negotiable for databases. And you traced how SmartCopy with Pure FlashArray (and similar primary-array integrations) enables 15-minute RPOs without hypervisor overhead, while CBT/RCT incremental-forever handles the rest of the estate.
Hold the analogy in mind: a policy is the SLA contract; a protection group is the customer roster. Design contracts once per service tier, reuse them across many rosters, and let the platform’s dedupe, CBT, and SmartCopy mechanics make the math work. In Chapter 8 we move from generic protection into application-aware backup and recovery, where the database engines and SaaS endpoints have their own quirks and quiesce paths.
Key Terms
- Protection Group: The binding object that associates a set of source objects (VMs, volumes, shares, databases) with a Protection Policy and execution settings (proxy, indexing, scripts).
- Policy: A named, reusable SLA definition specifying frequency (RPO), retention, replication, archive, and immutability rules. One policy is typically applied to many Protection Groups.
- RPO (Recovery Point Objective): The maximum acceptable data loss measured in time. Cohesity supports as low as 15 minutes for hypervisor backups and tighter for storage-snapshot-integrated sources.
- RTO (Recovery Time Objective): The maximum acceptable downtime. Cohesity targets minutes for Instant Mass Restore, scaling longer for archive recoveries.
- CBT (Changed Block Tracking): VMware’s mechanism for identifying blocks that have changed since the last snapshot, enabling incremental-forever backups. Hyper-V’s equivalent is RCT (Resilient Change Tracking).
- Auto-protect: Cohesity feature that dynamically tracks vCenter inventory containers (folders, clusters, hosts, resource pools) or vSphere tags so newly added VMs are automatically protected without administrator intervention.
- App-consistent: A backup that captures the source after applications have been quiesced (via VSS, RMAN, VDI, Backint, or pre/post scripts), guaranteeing clean recovery without crash-recovery work.
- SmartCopy: Cohesity’s snapshot-based copy and replication mechanism that integrates with primary storage arrays (Pure FlashArray, NetApp, HPE, Dell) to drive native array snapshots into the Cohesity DataPlatform with minimal production impact.
Chapter 8: Application-Aware Backup and Recovery Patterns
The CCAE exam expects an architect to do far more than schedule a nightly snapshot. Real-world Cohesity designs must speak the native protocols of Oracle, SQL Server, SAP HANA, Microsoft Exchange, and Microsoft 365 — and they must turn those backups into instantly usable recovery products: bootable VMs, mounted NFS exports, writable clones, and individually restorable mailbox items. This chapter walks through how Cohesity wires itself into each application stack, why those wires matter for RPO/RTO, and how the Instant Mass Restore and clone primitives transform a passive backup repository into an active recovery and dev/test platform.
Learning Objectives
By the end of this chapter, you should be able to:
- Design backup strategies for Oracle (RMAN), Microsoft SQL Server (VDI/AAG), SAP HANA (Backint), Exchange, and Microsoft 365 (Mailbox/OneDrive/SharePoint/Teams).
- Choose between target-side and source-side deduplication for database workloads and configure the corresponding RMAN channels, ports, and SBT libraries.
- Implement Cohesity Instant Mass Restore and contrast it with VMware-native vSphere Instant Recovery on scale, performance, and orchestration.
- Recover individual files, mailboxes, and database objects using indexed search and point-in-time recovery (PITR) workflows.
- Validate recoveries with run-books, automated test failovers, and clone-based dev/test pipelines that do not consume additional storage.
8.1 Database Workloads: Oracle, SQL Server, and SAP HANA
Database protection on Cohesity is fundamentally about meeting the database engine on its own terms. Oracle wants to drive its own backup via RMAN. SQL Server expects an application to call its VDI (Virtual Device Interface). SAP HANA mandates a Backint-compliant target. The common thread is that Cohesity does not pretend to be a generic file system to these engines — it presents itself through each engine’s native API so backups and restores remain application-consistent and supportable by the database vendor.
8.1.1 Oracle RMAN Integration
Oracle’s Recovery Manager (RMAN) is the canonical backup driver for Oracle databases. Cohesity integrates with RMAN by registering itself as an SBT (System Backup to Tape) target — the same interface RMAN uses for tape libraries — while still providing disk-class performance and global deduplication on the back end [Source: https://www.cohesity.com/resources/solution-brief/Cohesity-Oracle-Databases-Solution/].
The Cohesity Remote Adapter built into the DataPlatform consolidates RMAN scripts, schedules, and alerts under a single console, eliminating the per-host crontab sprawl that plagues legacy Oracle backup operations [Source: https://www.cohesity.com/blogs/title-streamlining-oracle-database-protection-recovery-cohesity-oracle-rman/]. When you create a protection job, Cohesity auto-selects an active single-instance Oracle node and configures the number of RMAN channels for the database object. For RAC clusters, the architect can manually pin specific nodes and tune the channel count and SBT library path. Channel count is the primary throughput lever: more channels increase parallelism but also drive up CPU and network load on the Oracle host.
Cohesity supports two deduplication paths for RMAN, and choosing between them is a CCAE-level design decision:
| Path | How RMAN Sees It | Where Dedupe Happens | When to Use |
|---|---|---|---|
| Target-side dedupe | Cohesity exports an NFS view; the Oracle host mounts it. RMAN writes backup pieces to the mount. | Inline on the Cohesity cluster as data lands. | Default for low-latency LANs, when CPU budget on the Oracle host is constrained, or when DBAs want zero source-side software. |
| Source-side dedupe | The Cohesity Oracle Source-Side Dedupe plugin is an SBT library installed on the Oracle host. | On the Oracle host before bytes traverse the network. | WAN-attached database servers, bandwidth-constrained sites, or when the Oracle host has spare CPU and the network is the bottleneck. |
Both paths leverage Cohesity’s global variable-length deduplication and compression, so the on-cluster footprint is identical regardless of which side did the fingerprinting [Source: https://www.cohesity.com/blogs/explaining-cohesitys-space-efficient-target-source-side-dedupe-integration-oracle-rman/].
Network requirements are precise and exam-relevant: the Cohesity Linux Agent on the Oracle host requires inbound TCP 50051 for backup operations and 59999 for self-monitoring. Miss either port and discovery silently degrades or RMAN sessions hang [Source: https://docs.cohesity.com/baas/data-protect/oracle-requirements.htm].
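A quick reachability probe run from a host on the cluster side of the firewall can confirm both ports before the first protection run. The sketch below uses only the Python standard library; the function names are hypothetical, and the `probe` parameter exists so the check can be exercised without a live agent.

```python
import socket
from typing import Callable

# Ports the text above lists for the Cohesity Linux Agent
AGENT_PORTS = {50051: "backup operations", 59999: "self-monitoring"}


def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False


def check_agent_ports(
    host: str, probe: Callable[[str, int], bool] = tcp_open
) -> dict[int, bool]:
    """Probe each required agent port and report reachability."""
    return {port: probe(host, port) for port in AGENT_PORTS}


def report(results: dict[int, bool]) -> list[str]:
    """Render one human-readable line per port."""
    return [
        f"{port} ({AGENT_PORTS[port]}): {'open' if ok else 'BLOCKED'}"
        for port, ok in results.items()
    ]
```

Calling `report(check_agent_ports("oracle-db01.example.com"))` (a placeholder host name) returns one line per port. A successful connect only proves the path is open; it does not prove that the agent behind it is healthy.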
Figure 8.1: Oracle RMAN backup data flow through SBT library to Cohesity cluster
sequenceDiagram
participant DBA as DBA / Scheduler
participant RMAN as Oracle RMAN
participant SBT as SBT Library<br/>(Cohesity plugin)
participant Agent as Cohesity Linux Agent<br/>(ports 50051/59999)
participant Cluster as Cohesity Cluster<br/>(SnapTree view)
DBA->>RMAN: BACKUP DATABASE PLUS ARCHIVELOG
RMAN->>RMAN: Allocate N channels<br/>(parallelism lever)
RMAN->>SBT: sbtopen / sbtwrite (backup pieces)
alt Source-side dedupe
SBT->>SBT: Variable-length fingerprint
SBT->>Agent: Send unique blocks only
else Target-side dedupe (NFS mount)
SBT->>Agent: Stream all blocks via NFS
Agent->>Agent: Inline dedupe at landing
end
Agent->>Cluster: Write to protection view
Cluster-->>Agent: Ack + catalog metadata
Agent-->>SBT: sbtwrite OK
SBT-->>RMAN: Piece complete
RMAN-->>DBA: Backup successful (catalog updated)
The integration supports full and incremental backups whether or not Oracle Block Change Tracking (BCT) is enabled, but BCT is strongly recommended for large databases because it limits incremental reads to changed blocks, dramatically shrinking the backup window. Archive log backups can be scheduled independently from datafile backups inside the same protection policy, which lets you hit aggressive RPO targets (e.g., 15-minute archive log backups) without running datafile backups that often.
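To make the channel-count lever concrete, the sketch below renders the kind of RMAN `RUN` block that an SBT-based backup uses, with one `ALLOCATE CHANNEL` per stream. It is illustrative only: Cohesity generates and manages its own RMAN scripts, and the library path shown is a placeholder rather than the real plugin location.

```python
def rman_backup_script(channels: int, sbt_library: str) -> str:
    """Render an RMAN RUN block that backs up through N SBT channels."""
    if channels < 1:
        raise ValueError("at least one channel is required")
    lines = ["RUN {"]
    for n in range(1, channels + 1):
        # Each channel is one parallel backup stream on the Oracle host.
        lines.append(
            f"  ALLOCATE CHANNEL c{n} DEVICE TYPE sbt_tape "
            f"PARMS 'SBT_LIBRARY={sbt_library}';"
        )
    lines.append("  BACKUP DATABASE PLUS ARCHIVELOG;")
    for n in range(1, channels + 1):
        lines.append(f"  RELEASE CHANNEL c{n};")
    lines.append("}")
    return "\n".join(lines)


# The library path below is a placeholder, not the real plugin location.
print(rman_backup_script(2, "/opt/example/libsbt_placeholder.so"))
```

Doubling `channels` doubles the number of concurrent streams that RMAN opens, which is where the extra CPU and network load on the Oracle host comes from.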
Key Takeaway: Cohesity speaks RMAN natively via the SBT interface. Architects choose between target-side dedupe (NFS mount, simple, LAN-friendly) and source-side dedupe (host-side plugin, WAN-friendly), tune RMAN channel count for parallelism, and always open ports 50051 and 59999 to the Cohesity Linux Agent.
8.1.2 SQL Server: VDI and Always On Availability Groups
For Microsoft SQL Server, Cohesity uses the Virtual Device Interface (VDI) — the same native API that Microsoft Backup, Veeam, and other enterprise products use. The Cohesity Windows Agent registers as a VDI client; SQL Server streams its own backup to the agent, which then writes to a Cohesity view. This produces a SQL-consistent backup that includes the transaction log chain needed for point-in-time recovery.
Always On Availability Groups (AAG) add a wrinkle: the same database exists on multiple replicas. Cohesity’s AAG-aware protection can target the preferred backup replica configured in the AAG, the primary replica, or the secondary replica with the highest backup priority. The protection policy understands that backing up from a secondary offloads I/O from the primary while still producing a usable backup chain. For point-in-time recovery, log backups are taken from the active log-shipping replica and Cohesity reconstructs the chain across replica failovers.
Granular SQL recovery options include:
- Full database restore — to original or alternate instance, with optional file relocation.
- Point-in-time restore — apply log backups up to a specific timestamp.
- Instant volume mount — present the backup as an iSCSI/SMB share so DBAs can attach it as a database without copying data, the SQL equivalent of VM Instant Recovery.
- Object-level recovery — extract individual tables or schemas via Cohesity’s database object-level recovery workflow.
A common architectural pattern: tier 1 OLTP databases run on AAG with a Cohesity-backed secondary replica taking log backups every 5 minutes; tier 2 reporting databases run as standalone instances with daily full + hourly log policies. Both use the same Cohesity protection policy template with retention overrides.
8.1.3 SAP HANA Backint Integration
SAP HANA’s official backup interface is Backint, a SAP-certified shared library that HANA dynamically loads when a backup is triggered. Cohesity provides a Backint agent that HANA loads, and HANA streams its native backup format directly to the Cohesity cluster. Because Backint is the only SAP-supported third-party backup path for HANA, this is non-negotiable for production SAP support: protecting HANA through crash-consistent VM snapshots will technically work but is not supported by SAP for production restores.
Backint integration supports:
- Data backups — full and incremental backups of HANA tenants and the system database.
- Log backups — continuous redo log shipping for PITR.
- Catalog backups — HANA’s backup catalog is replicated to Cohesity so it can be reconstructed during DR.
- Multi-tenant container databases (MDC) — each tenant is protected independently.
The same target-side vs. source-side dedupe distinction applies as with Oracle, though most HANA deployments use the target-side path because HANA hosts already run hot and DBAs are reluctant to add CPU load with source-side fingerprinting.
8.1.4 Log Backups and Point-in-Time Recovery
PITR is the differentiator between “I have a backup” and “I can restore my business to the moment before the bad transaction.” For Oracle, SQL Server, and HANA, Cohesity protection policies expose log backup frequency as an independent dial from full/incremental cadence. A typical tier-1 database policy looks like:
| Backup Type | Frequency | Retention |
|---|---|---|
| Full | Weekly (Sunday 22:00) | 4 weeks |
| Incremental | Daily (22:00) | 14 days |
| Log | Every 15 minutes | 7 days |
| Archive copy | Monthly | 7 years |
This pattern hits a 15-minute RPO with 7 days of fine-grained PITR, weekly full reset for chain hygiene, and a monthly archive copy for long-term compliance retention.
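Point-in-time recovery under a schedule like this amounts to selecting a restore chain. The sketch below is a simplified model, not Cohesity's recovery logic. It assumes that each incremental depends on the one before it, so every incremental since the last full is applied. It also assumes that the log backup covering the target time is the first one completing at or after that time. The function name and tuple layout are hypothetical.

```python
from datetime import datetime, timedelta

Backup = tuple[datetime, str]  # (completion time, "full" | "incremental" | "log")


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Pick the backups to apply, in order, to recover to `target`."""
    data = sorted(b for b in backups if b[1] != "log" and b[0] <= target)
    fulls = [b for b in data if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup precedes the target time")
    base = fulls[-1]
    chain = [base] + [b for b in data if b[1] == "incremental" and b[0] > base[0]]
    last_data_time = chain[-1][0]
    logs = sorted(b for b in backups if b[1] == "log" and b[0] > last_data_time)
    for log in logs:
        chain.append(log)
        if log[0] >= target:
            break  # this log covers the target; replay stops inside it
    return chain


t0 = datetime(2025, 1, 5, 22, 0)  # Sunday full
day = timedelta(days=1)
backups = [
    (t0, "full"),
    (t0 + day, "incremental"),
    (t0 + day + timedelta(minutes=15), "log"),
    (t0 + day + timedelta(minutes=30), "log"),
    (t0 + day + timedelta(minutes=45), "log"),
]
chain = restore_chain(backups, t0 + day + timedelta(minutes=20))
print([kind for _, kind in chain])  # ['full', 'incremental', 'log', 'log']
```

The number of log backups in the chain grows with the distance from the last data backup, which is one reason to keep the daily incremental even when logs are taken every 15 minutes.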
Key Takeaway: Each database engine has a canonical native API — RMAN for Oracle, VDI for SQL Server, Backint for HANA. Cohesity speaks all three. Independent log backup cadence is the lever that converts a daily backup into a minute-grained point-in-time recovery capability.
8.2 Microsoft 365 and SaaS Workloads
Microsoft 365 occupies a unique architectural position: the data lives entirely in Microsoft’s cloud, accessed only through Graph API and EWS, with platform-imposed throttling and a shared-responsibility model that explicitly puts third-party backup on the customer. Cohesity protects M365 through Cohesity DataProtect on-premises, Cohesity DataProtect as a Service (the SaaS offering), or a hybrid that integrates with Microsoft 365 Backup Storage (MBS) for high-throughput recovery [Source: https://www.cohesity.com/solutions/microsoft-365/].
8.2.1 Mailbox, OneDrive, SharePoint, and Teams Coverage
Cohesity protects four primary M365 workloads, each with workload-specific recovery semantics:
| Workload | Granularity of Recovery | Notable Behavior |
|---|---|---|
| Exchange Online (mailboxes) | Mailbox, folder, single message, attachment | Independent retention from Microsoft’s native; global keyword search across the mailbox corpus [Source: https://exchangesavvy.com/cohesitys-microsoft-365-backup-your-solution-for-rapid-recovery/]. |
| OneDrive for Business | File or folder with full ACL/permission fidelity | MBS integration delivers up to 3 TB/hour restore throughput by bypassing Graph throttling [Source: https://www.cohesity.com/resources/solution-brief/microsoft-365/]. |
| SharePoint Online | Site, document library, list item, page | Supports protecting all child objects of a site and restoring to original or alternate location. |
| Microsoft Teams | Channel messages, files, tabs, underlying SharePoint site | Important caveat: restoring a fully deleted Team requires that an admin manually re-create the Teams/Groups container in M365 first. Graph API does not let third-party apps recreate the Group object [Source: https://www.cohesity.com/blogs/the-practitioners-guide-to-microsoft-365-teams-and-groups-data-protection/]. |
Figure 8.3: Microsoft 365 protection via Graph API across Exchange, OneDrive, SharePoint, and Teams
sequenceDiagram
participant Cohesity as Cohesity DataProtect
participant Entra as Entra ID<br/>(service principal)
participant Graph as Microsoft Graph API
participant MBS as M365 Backup Storage<br/>(MBS APIs)
participant Tenant as M365 Workloads
Cohesity->>Entra: Acquire app-permission token
Entra-->>Cohesity: OAuth2 access token
Cohesity->>Graph: Enumerate users / sites / teams
Graph-->>Cohesity: Object inventory
par Mailbox protection
Cohesity->>Graph: Read messages (per-user parallel)
Graph->>Tenant: Exchange Online mailbox
Tenant-->>Graph: Items + attachments
Graph-->>Cohesity: Indexed mailbox data
and OneDrive / SharePoint
Cohesity->>MBS: Snapshot file content
MBS->>Tenant: OneDrive / SharePoint stores
Tenant-->>MBS: Files + ACLs (3 TB/hr)
MBS-->>Cohesity: Bulk content (bypasses throttling)
and Teams
Cohesity->>Graph: Channel messages + tabs
Graph->>Tenant: Teams + underlying SharePoint
Tenant-->>Graph: Messages, files, metadata
Graph-->>Cohesity: Teams payload
end
Cohesity->>Cohesity: Index + dedupe + write to view
Note over Cohesity,Graph: 429 backoff with exponential delay<br/>parallelize across users not requests
8.2.2 MFA, Graph API Limits, and Authentication
Cohesity registers the M365 tenant as a source using a registered Entra ID application (service principal) with the appropriate Graph API permissions. Modern authentication and MFA are mandatory for the consent flow, but the running protection job itself uses application-permission tokens that are not subject to interactive MFA — which is exactly what you want for unattended nightly backups.
Graph API throttling is the most common operational pain point. Microsoft applies per-tenant and per-app throttles measured in requests per minute. Cohesity mitigates throttling by:
- Parallelizing across users rather than across requests for a single user — Graph throttles per-user mailbox access more aggressively than fan-out enumeration.
- Backing off and retrying when 429 responses arrive, with exponential delays.
- Using Microsoft 365 Backup Storage (MBS) for OneDrive, SharePoint, and Teams files — MBS uses Microsoft’s storage-side restore APIs that bypass Graph throttling entirely. This is where the 3 TB/hour figure originates.
- Indexing on the Cohesity side so that search and selective restore do not require additional Graph calls during recovery.
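The back-off-and-retry behavior in the list above follows a standard pattern. The sketch below is a generic illustration, not Cohesity's scheduler: `request` is a hypothetical callable standing in for a single Graph call and returns `(status_code, headers, body)`. It honors the `Retry-After` header that Graph sends with a 429 when present and otherwise doubles the delay on each attempt.

```python
import random
import time
from typing import Callable

Response = tuple[int, dict[str, str], str]  # (status, headers, body)


def call_with_backoff(
    request: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `request` until it stops returning HTTP 429 or attempts run out.

    Any non-429 response is returned to the caller as-is.
    """
    for attempt in range(max_attempts):
        status, headers, body = request()
        if status != 429:
            return body
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        retry_after = headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
        sleep(delay + random.uniform(0, 0.1 * delay))  # jitter avoids lockstep retries
    raise RuntimeError("still throttled after retries")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In a per-user fan-out, each worker would wrap its own calls this way, so one throttled mailbox does not stall the others.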
8.2.3 Auto-Protection and Granular Restore
A critical design pattern is policy-driven auto-protection. When a new user is provisioned in the tenant, their mailbox and OneDrive are automatically discovered and added to the protection group — no manual onboarding ticket. This eliminates the operational drift problem that plagues static include-list backup designs [Source: https://www.cohesity.com/dm/tip-sheets/4-ways-to-back-up-microsoft-365/].
Granular restore options match the workload’s natural granularity. For Exchange, an admin can search globally for “subject contains ‘Q4 forecast’”, select the matching message from a specific user’s mailbox at a specific point in time, and restore it back to the original folder, an alternate folder, or an alternate mailbox entirely.
8.2.4 Salesforce and Other SaaS Adapters
Beyond M365, Cohesity adapters cover Salesforce (object-level metadata and data backup), Microsoft Entra ID (users, groups, configuration), and additional SaaS targets. These adapters share architectural traits: API-driven enumeration, indexed backups for search, throttling-aware schedulers, and granular restore to original or alternate tenants.
Key Takeaway: M365 protection lives or dies on three things: handling Graph API throttling (use MBS where possible), auto-protecting new users via policy, and remembering that Teams/Groups containers cannot be re-created by third-party tools — admin must recreate the shell before granular Teams restore can land.
8.3 Instant Recovery Mechanics
Backups become recovery products through one of two primitives: Instant Mass Restore (IMR) for VMs and Clone for VMs and databases. Both rely on Cohesity’s SnapTree metadata structure, which provides O(1) snapshot access regardless of snapshot depth — every snapshot is a fully hydrated, instantly mountable view, not a delta chain that requires walking [Source: https://www.cohesity.com/blogs/instant-recovery-unlimited-vms-point-time-distributed-resilient-data-store/].
8.3.1 Instant Mass Restore for VMs
Instant Mass Restore is the marquee feature for VMware bulk recovery. It works by presenting an NFS datastore from the Cohesity cluster directly to the ESX/ESXi hosts, registering VMs from the backup metadata, powering them on, and then Storage vMotion-ing them back to primary storage in the background [Source: https://www.cohesity.com/blogs/cohesity-instant-mass-restore-better-solution-to-an-old-problem/].
The five-step automated workflow:
- Present an NFS datastore to ESX/ESXi hosts — the Cohesity cluster acts as a scale-out NFS server exporting a view of the backup snapshot.
- Create new VMs from backup metadata, registering them with vCenter.
- Power on VMs from the temporary NFS datastore — workloads are live and serving users.
- Storage vMotion the running VMs back to primary storage at the customer’s chosen pace.
- Clean up the temporary NFS datastore after migration completes.
From clicking “Recover” to having recoveries in progress takes approximately 30 seconds. Subsequent steps are fully automated [Source: https://www.cohesity.com/blogs/cohesity-instant-mass-restore-better-solution-to-an-old-problem/].
Figure 8.2: Instant Mass Restore flow from Cohesity NFS export to primary storage
flowchart LR
A[Cohesity Cluster<br/>SnapTree snapshot] -->|NFS export<br/>scale-out| B[ESXi Hosts<br/>mount datastore]
B -->|register from<br/>backup metadata| C[vCenter<br/>VM inventory]
C -->|power on VMs<br/>~30 seconds| D[Live Workloads<br/>serving users]
D -->|Storage vMotion<br/>staggered| E[Primary Storage<br/>production array]
E -->|migration complete| F[Auto-cleanup<br/>NFS export removed]
style A fill:#1f6feb,stroke:#58a6ff,color:#fff
style D fill:#238636,stroke:#3fb950,color:#fff
style F fill:#6e40c9,stroke:#a371f7,color:#fff
8.3.2 IMR vs. VMware vSphere Native Instant Recovery
CCAE candidates must be able to articulate the difference clearly. Both approaches boot VMs from a backup-side surface, but the architectural assumptions differ:
| Dimension | Cohesity Instant Mass Restore | VMware vSphere Instant Recovery |
|---|---|---|
| Scale | Unlimited concurrent VMs; demonstrated to 200 VMs simultaneously [Source: https://www.cohesity.com/blogs/instant-recovery-unlimited-vms-point-time-distributed-resilient-data-store/] | Designed for one or a handful of VMs |
| Storage backing | Distributed scale-out cluster — runs production load directly | Single replica appliance — performance cliff under load |
| Migration | Automated Storage vMotion orchestrated by Cohesity | Manual Storage vMotion by operator |
| Cleanup | Automatic NFS export removal post-migration | Manual cleanup |
| Point-in-time | Any snapshot, O(1) access via SnapTree | Latest replica only |
| Performance | Cohesity-published testing showed 3x transactions/minute vs. Veeam-from-target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Accelerating-Instant-Recovery-with-Cohesity.pdf] | Backup target was not designed for production load |
The “mass” in Instant Mass Restore is the differentiator. If a ransomware event encrypts 200 VMs, IMR can start recovering all of them within about 30 seconds of the request, run them from the Cohesity NFS datastore, and Storage vMotion them back as primary storage capacity allows. vSphere’s native Instant Recovery is fine for a single test restore but cannot orchestrate a 200-VM mass recovery.
Strict consistency in Cohesity’s distributed file system ensures that even under concurrent recovery, every ESXi host sees a consistent view of the NFS export — a non-trivial requirement when dozens of hosts are mounting the same export simultaneously [Source: https://www.cohesity.com/blogs/strict-consistency-must-capability-vmware-instant-restores/].
8.3.3 VM Cloning and Dev/Test Pipelines
Clone is the lighter-weight cousin of IMR. Where IMR is intended for production recovery (and ends with Storage vMotion back to primary), Clone spins up a writable copy of any backup snapshot for non-production use — dev, test, training, forensics — and never migrates back. The clone consumes only the metadata footprint plus changed-block writes, courtesy of SnapTree.
Analogy: A Cohesity clone is to a backup what a `git branch` is to a commit: a cheap, isolated, fully writable workspace forked off a known-good point-in-time, where changes are tracked separately from the source. IMR is one step further, like a `git checkout` of that branch into a running production environment, with the migration step being the eventual `git merge` back into primary storage.
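The claim that a clone consumes only metadata plus changed-block writes can be shown with a toy model. This is a simplified sketch, not SnapTree: a clone holds a reference to its parent snapshot and a private map of overwritten blocks, and reads fall through to the parent for anything it has not written. The class names are hypothetical.

```python
class Snapshot:
    """Immutable point-in-time block map (block index -> data)."""

    def __init__(self, blocks: dict[int, bytes]):
        self._blocks = dict(blocks)

    def read(self, index: int) -> bytes:
        return self._blocks[index]


class Clone:
    """Writable view of a snapshot that stores only the blocks it changes."""

    def __init__(self, parent: Snapshot):
        self._parent = parent
        self._delta: dict[int, bytes] = {}

    def read(self, index: int) -> bytes:
        if index in self._delta:
            return self._delta[index]
        return self._parent.read(index)  # unchanged blocks are shared

    def write(self, index: int, data: bytes) -> None:
        self._delta[index] = data  # the parent snapshot is never modified

    def extra_bytes(self) -> int:
        """Storage consumed by this clone beyond the shared snapshot."""
        return sum(len(data) for data in self._delta.values())


snap = Snapshot({0: b"boot", 1: b"data"})
clone = Clone(snap)
clone.write(1, b"test")
print(clone.read(0), clone.read(1), snap.read(1), clone.extra_bytes())
# b'boot' b'test' b'data' 4
```

Discarding the clone means dropping its delta map; the backup snapshot it was forked from is untouched throughout.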
Common clone use cases:
- Dev/test database refresh — DBAs clone last night’s production Oracle backup into a dev instance, redact PII, and ship it to developers without consuming additional storage.
- Forensic investigation — security team clones a VM at the moment of suspected compromise and isolates it on a quarantine network for analysis.
- Patch testing — clone a production VM, apply the patch, validate, then discard the clone.
- Training environments — clone a representative slice of production for new-hire onboarding labs.
8.3.4 Mounting Backups for Application Teams
For application teams that need read-only access to historical data without a full restore, Cohesity can mount a backup snapshot directly:
- NFS/SMB mount of a file or VM backup view — analysts walk the file tree and grab specific files.
- Volume mount of a SQL or Oracle backup — DBAs attach the backup as a database for read-only queries.
- Browse-and-search via the Cohesity UI — indexed search across protected sources, with download or push-to-target options.
These mount workflows are read-only by default; if write access is needed, the team uses Clone instead.
8.3.5 Storage vMotion-Out from Cohesity
The final step of IMR — Storage vMotion-out — is where the architect’s design choices interact with VMware operations. The vMotion is executed by VMware, not Cohesity, which means migration speed is bounded by ESXi host CPU, vMotion network bandwidth, and primary storage write throughput. Cohesity orchestrates and monitors; it does not accelerate the underlying vMotion.
Best practices:
- Stagger the migration rather than vMotion all VMs simultaneously; vSphere’s default limits are 2 concurrent Storage vMotions per host and 8 per datastore.
- Reserve vMotion bandwidth on dedicated VMkernel ports.
- Plan primary storage capacity before initiating IMR — a 200-VM IMR will land 200 VMs’ worth of writes on primary storage during the migration window.
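The staggering and capacity checks above reduce to simple arithmetic that is worth running before a mass recovery. The sketch below is a planning aid under stated assumptions: the wave size of 16 and the 150 GB per-VM size are hypothetical inputs, and real wave sizing depends on host count, datastore layout, and the vSphere concurrency limits in force.

```python
def migration_waves(vms: list[str], wave_size: int) -> list[list[str]]:
    """Split the VM list into waves that are migrated one after another."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [vms[i:i + wave_size] for i in range(0, len(vms), wave_size)]


def landing_capacity_tb(vm_sizes_gb: dict[str, float]) -> float:
    """Primary storage (decimal TB) the migrated VMs will occupy."""
    return sum(vm_sizes_gb.values()) / 1000


# Hypothetical estate: 200 VMs of 150 GB each
vm_sizes = {f"vm-{n:03d}": 150.0 for n in range(1, 201)}
waves = migration_waves(sorted(vm_sizes), wave_size=16)
print(len(waves), len(waves[-1]), landing_capacity_tb(vm_sizes))  # 13 8 30.0
```

Thirteen waves and 30 TB of landing capacity are the two numbers to confirm with the VMware and storage teams before starting the recovery.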
Key Takeaway: Instant Mass Restore differs from VMware’s native Instant Recovery in three architectural dimensions: scale (unlimited vs. one), storage performance (distributed cluster runs production load vs. appliance performance cliff), and orchestration (automated end-to-end vs. manual). Clones are the dev/test sibling: `git branch` for backups.
8.4 Granular File and Item Recovery
Figure 8.4: Granular recovery decision tree — choosing the right Cohesity primitive
flowchart TD
Start[Recovery Request] --> Q1{Scope of loss?}
Q1 -->|Whole VM<br/>or many VMs| VM[Instant Mass Restore<br/>NFS mount + power-on<br/>+ Storage vMotion]
Q1 -->|Single file or<br/>file version| File[Indexed Search<br/>Yoda service<br/>restore to original/alt]
Q1 -->|Mailbox / message /<br/>attachment| Item{Exchange type?}
Q1 -->|Database object<br/>table / schema| DB{Engine?}
Item -->|On-prem Exchange| ItemOn[VSS + Exchange API<br/>mailbox/folder/message]
Item -->|Exchange Online| ItemCloud[Graph API + EWS<br/>restore to original or PST]
DB -->|SQL Server| SQL[Mount backup as DB<br/>Cohesity object browser<br/>bcp / INSERT-SELECT]
DB -->|Oracle| Oracle[RMAN RECOVER TABLE<br/>or clone + expdp/impdp]
DB -->|SAP HANA| HANA[Clone tenant<br/>extract via SQL export]
VM --> Done[Recovery complete]
File --> Done
ItemOn --> Done
ItemCloud --> Done
SQL --> Done
Oracle --> Done
HANA --> Done
style Start fill:#1f6feb,stroke:#58a6ff,color:#fff
style Done fill:#238636,stroke:#3fb950,color:#fff
style Q1 fill:#6e40c9,stroke:#a371f7,color:#fff
style Item fill:#6e40c9,stroke:#a371f7,color:#fff
style DB fill:#6e40c9,stroke:#a371f7,color:#fff
Bulk recovery — restoring a VM, mounting a database, presenting an NFS export — is only half the recovery story. The other half is granular: a single email, a specific file version, an individual database table. Cohesity’s indexed search architecture makes granular recovery a first-class workflow rather than an afterthought.
8.4.1 Indexed Search Across VMs and NAS
When indexing is enabled on a protection group, Cohesity walks file-system metadata at backup time and pushes filenames, paths, sizes, timestamps, and (optionally) full-text content into the Yoda search service that runs across the cluster. Administrators can then query:
- “Find all files matching `*.pst` modified in the last 30 days across all protected VMs.”
- “Locate `salary_2025.xlsx` in any backup of any VM in the Finance protection group.”
- “Show all versions of `appsettings.json` from the WebApp01 VM, with timestamps.”
Selected files restore directly to the original VM, an alternate VM, or download to the admin’s workstation. No full VM restore is required.
The indexing trade-off: full-text indexing adds CPU and storage overhead at backup time and inflates metadata footprint. Architects typically enable full indexing on user file shares and selective metadata-only indexing on database-heavy or system VMs.
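To make the metadata-only indexing idea concrete, here is a toy in-memory index queried with a filename glob and a recency cutoff. It is a sketch under stated assumptions: the `IndexedFile` record, the `search` function, and the sample entries are hypothetical and do not reflect how the Yoda service stores or queries its index.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatch
from typing import List, Optional


@dataclass(frozen=True)
class IndexedFile:
    vm: str
    path: str
    modified: datetime


def search(index: List[IndexedFile], pattern: str,
           modified_after: Optional[datetime] = None) -> List[IndexedFile]:
    """Return entries whose file name matches a glob and, optionally, a recency cutoff."""
    hits = []
    for entry in index:
        name = entry.path.rsplit("/", 1)[-1]
        if not fnmatch(name.lower(), pattern.lower()):
            continue
        if modified_after is not None and entry.modified < modified_after:
            continue
        hits.append(entry)
    return hits


now = datetime(2025, 6, 30)
index = [
    IndexedFile("FIN-01", "/home/ann/mail/archive.pst", now - timedelta(days=5)),
    IndexedFile("FIN-02", "/home/bob/mail/old.pst", now - timedelta(days=90)),
    IndexedFile("WebApp01", "/srv/app/appsettings.json", now - timedelta(days=1)),
]
recent_pst = search(index, "*.pst", modified_after=now - timedelta(days=30))
print([f.vm for f in recent_pst])
```

Only names, paths, and timestamps are consulted here, which is why metadata-only indexing is cheap relative to full-text indexing.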
8.4.2 Item-Level Recovery for Exchange
For Exchange (both on-premises and Online), item-level recovery operates at a finer granularity than file-level. The Cohesity Exchange agent reads the backup at the message-store level and exposes:
- Mailbox — restore an entire mailbox.
- Folder — restore a single folder (e.g., “Inbox/Archive/2024”).
- Message — restore an individual email and its attachments.
- Attachment — extract a specific attachment without restoring the message.
Restore destinations include the original mailbox, an alternate mailbox in the same Exchange organization, or a PST export for download.
8.4.3 Database Object-Level Recovery
For SQL Server, Cohesity supports object-level recovery to extract individual tables, schemas, or stored procedures from a database backup without restoring the entire database. The workflow:
- Mount the database backup as an attachable database on a recovery instance.
- Use the Cohesity object browser to navigate tables and schemas.
- Select the objects to extract.
- Cohesity scripts the extraction (typically via `bcp` or `INSERT ... SELECT` from the mounted source) into the target database.
For Oracle, object-level recovery typically uses RMAN’s RECOVER TABLE command or a clone-and-extract pattern: clone the backup into a temporary instance, export the desired tables with expdp, and import into the target.
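The mount-and-extract pattern described for SQL Server can be illustrated with a small stand-in. The sketch below uses Python's built-in `sqlite3` module purely as an analogy: an attached database file plays the role of the mounted backup, and a single `INSERT ... SELECT` copies one table into the target. The file names and the `orders` table are assumptions for the example, and none of this is Cohesity or SQL Server syntax.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
backup_path = os.path.join(workdir, "mounted_backup.db")

# Stand-in for the mounted backup: a database file holding the table to recover.
backup = sqlite3.connect(backup_path)
backup.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
backup.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 25.5), (3, 7.25)])
backup.commit()
backup.close()

# Target database where the table contents were lost (exists but is empty).
target = sqlite3.connect(os.path.join(workdir, "target.db"))
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# Attach the mounted backup next to the target, then copy just one object.
target.execute("ATTACH DATABASE ? AS recovered", (backup_path,))
target.execute("INSERT INTO orders SELECT id, amount FROM recovered.orders")
target.commit()
target.execute("DETACH DATABASE recovered")

restored_rows = target.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"restored rows: {restored_rows}")
target.close()
```

The point of the pattern is that only the selected object moves; the rest of the backup is read in place and never restored.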
8.4.4 Self-Service Restore Portals
For end users, Cohesity exposes self-service restore via Helios. A user can:
- Browse their own VM’s backup history.
- Restore a single file to their workstation without filing a ticket.
- View restore audit logs scoped to their own data.
RBAC ensures users only see their own VMs/files. Self-service is heavily used by dev teams wanting to roll back a single config file without involving infrastructure ops.
Key Takeaway: Indexed search converts the backup repository into a search engine. Combined with item-level recovery for Exchange, object-level recovery for SQL/Oracle, and self-service portals via Helios, granular recovery becomes a routine helpdesk task — not a 4-hour full-restore project.
8.5 Worked Example: Three Recovery Scenarios
To cement the patterns, work through three CCAE-style recovery scenarios that map to the same protection groups but exercise very different recovery paths.
Scenario A: Single Exchange mailbox recovery. A finance manager accidentally deleted her mailbox folder containing audit-related emails three weeks ago, past Exchange’s native retention.
- Open Cohesity Helios; navigate to Search.
- Filter by Source = `tenant.onmicrosoft.com`, User = `manager@example.com`, Type = Mailbox.
- Select the snapshot from 22 days ago (one day before the deletion).
- Browse to the deleted folder; select all messages.
- Restore to original mailbox, “Recovered_Audit” subfolder.
- Total time: ~5 minutes; data movement: ~80 MB; impact on other users: zero.
Scenario B: Full VM recovery after ransomware. An overnight ransomware event encrypted 47 VMs in the production cluster.
- Helios alerting (DataHawk anomaly detection on the protection groups) flags abnormal change rate at 03:47.
- SOC declares the incident at 04:15. Last known good backup = 22:00 the prior night.
- Architect triggers Instant Mass Restore on all 47 VMs from the 22:00 snapshot.
- ~30 seconds later, the 47 VMs are powering on from the Cohesity NFS datastore on a quarantine VLAN.
- Validation team confirms VMs are clean; production traffic is cut over.
- Storage vMotion to primary array begins, throttled to 2 concurrent migrations per host.
- Migration completes over the next 8 hours; NFS export auto-removes.
- RTO: under 1 hour for service restoration; RPO: 6 hours (last backup at 22:00).
Scenario C: Oracle PITR after a bad transaction. A developer ran DELETE FROM orders against production at 14:32. The database is 8 TB.
- Architect identifies last full backup (Sunday 22:00), nightly incrementals, and 15-minute archive log backups.
- Decision: PITR restore to the original host, recovering up to 14:31.
- Cohesity job invokes RMAN with the SBT library, restores the last full backup and the subsequent incrementals, and applies archive logs up to 14:31:59.
- Channel count: 8 (matches host CPU and network capacity).
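- Restore completes in 1 hour 12 minutes for 8 TB (about 1.85 GB/s sustained, which is more than a single 10 GbE link can carry, so the restore path must be a 2x 10 GbE bond or faster).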
- Database opens at 14:31 state; deleted rows are present; the bad DELETE is gone.
- RPO for this incident: 1 minute; RTO: 1 hour 12 minutes.
The same Cohesity protection environment delivered three radically different recovery products: a 5-minute item restore, a 1-hour mass VM recovery, and a 1-hour 8 TB database PITR.
8.6 Native App Integration Comparison
A consolidated table comparing how Cohesity integrates with each major application stack:
| Application | Native Interface | Agent/Plugin | Key Ports | Granular Recovery | PITR Support |
|---|---|---|---|---|---|
| Oracle | RMAN via SBT library | Cohesity Linux Agent + optional source-side dedupe plugin | 50051, 59999 | Object via RECOVER TABLE or clone-and-extract | Yes — archive log backups |
| SQL Server | VDI | Cohesity Windows Agent | 50051, 59999 | Table/schema via object-level recovery | Yes — log backups across AAG replicas |
| SAP HANA | Backint shared library | Cohesity Backint agent | 50051, 59999 | Table via clone-and-extract | Yes — log backups + catalog |
| Exchange (on-prem) | VSS + Exchange APIs | Cohesity Windows Agent | SMB/RPC | Mailbox/folder/message/attachment | Per-database log replay |
| Exchange Online | Graph API + EWS | Cloud-side service principal | HTTPS to Graph | Mailbox/folder/message/attachment | Snapshot granularity |
| OneDrive/SharePoint | Graph API + MBS | Service principal | HTTPS; MBS APIs | File/site/list-item with ACLs | Snapshot granularity |
| Microsoft Teams | Graph API + SharePoint | Service principal | HTTPS | Channel/message/file (Group must pre-exist for full-team restore) | Snapshot granularity |
| VMware VMs | VADP, CBT | None on guest (agentless) or VMware Tools quiescing | VMware ports | File-level via indexed search | Per-snapshot |
Chapter Summary
Application-aware backup is the place where architectural choices map directly onto user-visible outcomes. Cohesity integrates with Oracle through RMAN’s SBT interface (with target-side or source-side dedupe), with SQL Server through VDI (AAG-aware), and with SAP HANA through the SAP-mandated Backint API. Microsoft 365 protection covers Exchange Online, OneDrive, SharePoint, and Teams via Graph API and Microsoft 365 Backup Storage, with the critical caveat that fully deleted Teams require manual Group recreation before restore. Instant Mass Restore turns the backup cluster into a temporary primary datastore, recovering up to hundreds of VMs in approximately 30 seconds with full automation through Storage vMotion-back; this differs from VMware’s native Instant Recovery in scale, performance, and orchestration. Clones provide cheap, writable forks of any backup for dev/test, forensics, and patch validation. Granular recovery — file, mailbox item, database object — is enabled by Cohesity’s indexed search and self-service Helios portals. CCAE candidates should be able to map any recovery requirement (single email, mass VM event, database PITR) to the appropriate Cohesity primitive and explain the network, port, and policy prerequisites.
Key Terms
- RMAN — Oracle Recovery Manager. The native Oracle backup driver. Cohesity registers as an SBT (System Backup to Tape) target; RMAN channels stream backup pieces via either NFS mount (target-side dedupe) or the Cohesity source-side dedupe plugin.
- VDI — Virtual Device Interface. Microsoft SQL Server’s native API for third-party backup. Cohesity’s Windows Agent registers as a VDI client; SQL Server pushes consistent backup streams to the agent.
- AAG — Always On Availability Group. SQL Server’s HA/DR feature. Cohesity AAG-aware protection can target preferred, primary, or secondary replicas and reconstruct log chains across replica failovers.
- Backint — SAP-certified shared library interface for HANA backup. Backint is the only SAP-supported path for third-party backup tools in HANA production environments, and Cohesity integrates through its Backint agent.
- Instant Mass Restore (IMR) — Cohesity’s bulk VM recovery primitive. Presents an NFS datastore to ESXi hosts, registers and powers on VMs from backup metadata, and Storage vMotions them back to primary storage. Demonstrated to recover 200 VMs concurrently.
- Clone — A writable, point-in-time fork of a backup snapshot. Consumes only metadata plus changed blocks. Used for dev/test, forensics, training, and patch validation. Conceptually analogous to a `git branch`.
- Granular recovery — Recovery at sub-object granularity: a single email, a specific file, an individual database table. Enabled by Cohesity’s indexed search and Yoda service.
- M365 — Microsoft 365 (Exchange Online, OneDrive, SharePoint Online, Teams, Entra ID). Protected through Graph API, EWS, and Microsoft 365 Backup Storage (MBS), which enables up to 3 TB/hour restore throughput.
- MBS — Microsoft 365 Backup Storage. Microsoft’s storage-side backup/restore APIs that bypass Graph API throttling for OneDrive, SharePoint, and Teams files.
- SBT — System Backup to Tape. RMAN’s interface specification for tape-class backup targets. Cohesity uses SBT to integrate without requiring custom RMAN scripting.
- SnapTree — Cohesity’s metadata structure providing O(1) snapshot access regardless of snapshot depth. The foundation of IMR and Clone — every snapshot is a fully hydrated, instantly mountable view rather than a delta chain.
- CBT (Oracle) — Change Block Tracking, which Oracle itself calls Block Change Tracking (BCT). Reduces incremental backup time by limiting reads to changed blocks. Strongly recommended for large Oracle databases.
- PITR — Point-in-Time Recovery. Restoring a database to an arbitrary moment using full/incremental backups plus replayed transaction or archive logs.
Chapter 9: Replication, Disaster Recovery, and SiteContinuity
Disaster recovery is where backup architecture stops being a storage problem and becomes a business continuity problem. A backup that cannot be replicated, orchestrated, and recovered within the agreed Recovery Time Objective (RTO) is, from the point of view of a regulator or a CFO, no backup at all. This chapter walks an aspiring CCAE through the four levers Cohesity gives you to meet RPO, RTO, and recovery-site requirements: cluster-to-cluster replication, the SiteContinuity orchestration engine, cloud-native recovery options (CloudReplicate, CloudSpin, CloudArchive, CloudTier), and the bandwidth math that ties them all together.
Learning Objectives
By the end of this chapter you will be able to:
- Design active-passive and active-active replication topologies for one-to-one, one-to-many, many-to-one, and cross-cloud patterns.
- Calculate replication bandwidth requirements from front-end TB (FETB), daily change rate, deduplication efficiency, and replication window.
- Architect orchestrated DR with Cohesity SiteContinuity runbooks, including the full Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete state machine.
- Differentiate CloudSpin, CloudReplicate, CloudArchive, and CloudTier and justify the right option for a given recovery scenario.
- Identify the network ports, bonding modes, and MTU settings that must be present for replication to succeed at scale.
Replication Topologies
Cohesity replication is a Protection-Group-aware operation: the source cluster sends only unique, deduplicated, compressed chunks to the target cluster, and the target cluster keeps a fully addressable copy that can be searched, recovered, mounted, or orchestrated independently of the source [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. That property — the replica is a working cluster, not a tape — is what makes the four classic topologies viable.
One-to-One, One-to-Many, Many-to-One, and Cross-Cloud
The architect’s first decision is direction and fan-out. Cohesity supports four canonical topologies, summarized below.
| Topology | Pattern | Typical Use Case | Pros | Cons |
|---|---|---|---|---|
| One-to-one | Cluster A -> Cluster B | Two-site enterprise DR (primary + DR site) | Simple to operate; symmetric failback; predictable bandwidth | No tertiary copy; single point of DR failure |
| One-to-many | Cluster A -> Cluster B + Cluster C | Tier-1 workloads requiring an in-region DR copy plus a cross-region cyber-vault | Multiple recovery options; geographic diversity | Higher WAN cost; more policy maintenance |
| Many-to-one (fan-in) | Clusters A, B, C, D -> Hub Cluster | ROBO and branch consolidation to a central enterprise data center | Centralized retention, dedup, and reporting; lower licensing per spoke | Hub is a single failure domain; aggregate ingest must be sized carefully |
| Cross-cloud | On-prem -> AWS-hosted Cohesity Cloud Edition (or Azure) | Cloud-as-DR; eliminates secondary physical site | OPEX model; pay-as-you-grow; tight integration with CloudSpin | Egress costs on recall; cloud licensing premium |
A typical large enterprise blends all four: a hub-and-spoke many-to-one for ROBO consolidation, a one-to-one between primary data centers for orchestrated DR, and a cross-cloud one-to-many leg into AWS or Azure to provide a third “air-gapped” copy for ransomware resiliency.
Figure 9.1: Replication topology variants (1:1, 1:Many, Many:1, Cross-cloud)
flowchart LR
subgraph OneToOne["1:1 (Active-Passive DR)"]
A1[Cluster A<br/>Primary] -->|replicate| B1[Cluster B<br/>DR Site]
end
subgraph OneToMany["1:Many (Geographic Diversity)"]
A2[Cluster A<br/>Primary] -->|replicate| B2[Cluster B<br/>In-Region DR]
A2 -->|replicate| C2[Cluster C<br/>Cross-Region Vault]
end
subgraph ManyToOne["Many:1 (ROBO Fan-In)"]
S1[Spoke A<br/>ROBO] -->|replicate| H[Hub Cluster<br/>Central DC]
S2[Spoke B<br/>ROBO] -->|replicate| H
S3[Spoke C<br/>ROBO] -->|replicate| H
end
subgraph CrossCloud["Cross-Cloud (Cloud-as-DR)"]
OP[On-Prem Cluster] -->|replicate| CE[Cloud Edition<br/>AWS / Azure]
end
Replication Policies and Retention
Replication is configured on the Protection Policy, not on the Protection Group. A policy defines the local snapshot schedule and can add multiple “external” targets (replication target cluster, CloudArchive target, CloudTier), each with its own retention. The result is a single source-of-truth schedule that drives all secondary copies in lockstep [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].
Retention on the replication target is independent of the source: it is common to keep 14 daily on the primary cluster, 30 daily plus 12 monthly on the DR cluster, and 7 yearly on the CloudArchive target, all driven by one policy.
Key Takeaway: Cohesity replication topology decisions are driven by failure domain rather than bandwidth. Pick the topology — one-to-one, one-to-many, many-to-one, cross-cloud — that aligns with your blast-radius assumptions, then size WAN and retention to match.
Replication Mechanics and Tuning
Encrypted, Compressed, Deduplicated Wire Format
Cohesity replication is always encrypted in flight (TLS) and is deduplicated and compressed at the chunk level before the wire. The source cluster queries the target’s chunk fingerprint database and only ships chunks the target does not already have [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. This “global dedup on the wire” is the dominant reason Cohesity can hit aggressive RPOs over modest WAN links.
Bandwidth Throttling and Windowing
Three throttling surfaces exist [Source: https://docs.cohesity.com/baas/data-protect/manage-network-saas.htm] [Source: https://www.youtube.com/shorts/EPXqfK6qxF4]:
- SaaS Connector throttling for DataProtect-as-a-Service, configured per-connector with day/time windows. Direction can be split (upload vs. download).
- Source-agent throttling for individual physical hosts whose CPU or NIC cannot absorb the agent’s full streaming rate.
- Cluster-to-cluster Protection Group / policy windows that align replication runs to off-business hours.
A common gotcha: the SaaS Connector accepts throttle values in bytes per second, not bits per second, which is the opposite of the convention used by tools like Veritas AIR [Source: https://docs.cohesity.com/baas/data-protect/manage-network-saas.htm] [Source: https://www.veritas.com/support/en_US/article.100051869]. A “100 MB/s” throttle is therefore 800 Mbps on the wire. Entering “100” while thinking in megabits allows eight times the intended rate and can saturate the WAN.
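A pair of small helpers makes the unit conversion hard to get wrong. This is a minimal sketch using decimal units (1 Mbps = 1,000,000 bits per second); the function names are illustrative and are not part of any Cohesity tool.

```python
def mbps_to_bytes_per_second(mbps: float) -> int:
    """Convert a WAN budget in megabits per second to bytes per second."""
    return int(mbps * 1_000_000 / 8)


def bytes_per_second_to_mbps(bytes_per_second: float) -> float:
    """Convert a bytes-per-second throttle back to megabits per second."""
    return bytes_per_second * 8 / 1_000_000


# A 100 Mbps WAN budget corresponds to a 12.5 MB/s throttle.
print(mbps_to_bytes_per_second(100))          # 12500000
# A 100 MB/s throttle is 800 Mbps on the wire.
print(bytes_per_second_to_mbps(100_000_000))  # 800.0
```

Converting the WAN team's figure into bytes before typing it into a throttle field, rather than converting mentally, removes the factor-of-eight error.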
Network Architecture and Required Ports
For replication-target clusters and ROBO nodes, Cohesity recommends 2x 10 GbE LACP Bond Mode 4 with MLAG/vPC, providing 20 Gbps of combined management plus replication bandwidth per node [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements] [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf]. Higher-throughput sites use 25 GbE or 40/100 GbE, but 2x10 LACP is the baseline expectation.
Two distinct traffic classes must be sized:
| Traffic | Description | Sizing Rule |
|---|---|---|
| North-South | WAN replication egress/ingress, dedup-reduced | Throttle to fit available WAN; throughput scales linearly with node count |
| East-West | Intra-cluster RF/EC rebuild, metadata gossip | Non-blocking, non-oversubscribed switch fabric mandatory |
When multiple VLANs exist, a dedicated Interface Group / VLAN for replication isolates WAN traffic from front-end backup ingest [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf].
The required firewall ports between source and target clusters are [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf] [Source: https://www.youtube.com/watch?v=BkzPxpq7Swg]:
| Port | Protocol | Purpose |
|---|---|---|
| 443 | TCP | HTTPS / API control plane |
| 111 | TCP | Portmap / RPC |
| 20000 | TCP | Replication data channel |
| 24444 | TCP | Replication control / metadata |
Enable jumbo frames (MTU 9000) end-to-end on the replication path; a single device along the path that does not honor 9000-byte frames will silently fragment or drop and tank effective throughput [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements].
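Before the first replication run, it helps to confirm that the listed TCP ports are actually reachable from the source side. The sketch below attempts a plain TCP connect to each port from the table above; it only proves that a handshake completes, not that the replication service behind the port is healthy. The function name `check_ports` and the placeholder hostname in the usage note are assumptions.

```python
import socket
from typing import Dict, Iterable

# Ports taken from the firewall table above.
REPLICATION_PORTS = (443, 111, 20000, 24444)


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Attempt a TCP connect to each port; True means the handshake completed."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            results[port] = False
    return results
```

A call such as `check_ports("dr-cluster.example.com", REPLICATION_PORTS)` (with a real cluster VIP in place of the placeholder) returns a dictionary mapping each port to `True` or `False`, which can be pasted straight into a firewall change request.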
Initial Seed Strategies
The first replication cycle (“seed”) is, by definition, full — there is nothing on the target to dedup against. For 50 TB+ datasets over a constrained WAN this can take days or weeks. Two seed strategies exist:
- Wire seeding with extended window. Run the initial replication during a multi-day quiet window with relaxed throttles. Acceptable for moderate datasets and adequate WAN.
- Physical seed transport. Replicate to a portable or temporary cluster physically located at the source site, ship it, then re-home the replication target. Used when the dataset-to-WAN ratio makes wire seeding infeasible.
After the initial seed, daily replication shrinks to change rate minus dedup — typically single-digit percentages of FETB.
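A quick estimate shows when wire seeding stops being practical. The sketch below is a worst-case approximation that assumes no data reduction on the seed, decimal units, and a 50 percent sustainable utilization of the link; the function name and default are assumptions for illustration.

```python
def seed_days(dataset_tb: float, wan_gbps: float, utilization: float = 0.5) -> float:
    """Days needed to push an unreduced seed across the WAN (decimal units)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    usable_bytes_per_second = wan_gbps * 1e9 / 8 * utilization
    seconds = dataset_tb * 1e12 / usable_bytes_per_second
    return seconds / 86_400


for link_gbps in (1, 10):
    print(f"50 TB over {link_gbps} Gbps at 50%: {seed_days(50, link_gbps):.1f} days")
```

Under these assumptions a 50 TB seed takes about 9.3 days on a 1 Gbps link and under a day on 10 Gbps, which is the kind of gap that pushes a design toward physical seed transport.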
Replication Failure Handling
Cohesity replication is checkpointed: a partial run resumes from the last committed chunk rather than restarting. Persistent failures generate Helios alerts and pause the policy after configurable retry attempts. The architect’s job is to ensure the monitoring path (Helios -> SNMP/Syslog/Email) actually reaches an on-call engineer before replication lag exceeds RPO.
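The monitoring requirement boils down to comparing replication lag with the RPO and alerting before the two meet. The following sketch is a hypothetical check, not a Helios feature: the function names and the 75 percent warning threshold are assumptions.

```python
from datetime import datetime, timedelta


def replication_lag(last_success: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last fully committed replication run."""
    return now - last_success


def rpo_status(last_success: datetime, now: datetime, rpo: timedelta,
               warn_fraction: float = 0.75) -> str:
    """Classify lag as OK, WARN (approaching RPO), or BREACH (RPO exceeded)."""
    lag = replication_lag(last_success, now)
    if lag > rpo:
        return "BREACH"
    if lag > rpo * warn_fraction:
        return "WARN"
    return "OK"


now = datetime(2025, 1, 1, 12, 0)
rpo = timedelta(hours=4)
print(rpo_status(now - timedelta(hours=1), now, rpo))              # OK
print(rpo_status(now - timedelta(hours=3, minutes=30), now, rpo))  # WARN
print(rpo_status(now - timedelta(hours=5), now, rpo))              # BREACH
```

The warning band exists so that the on-call engineer is paged while there is still time to intervene, rather than after the RPO has already been missed.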
Key Takeaway: Replication mechanics are built on dedup, compression, and TLS, but they only work if you’ve sized 2x10 GbE LACP, opened TCP 443/111/20000/24444, enabled MTU 9000 end-to-end, and remembered that SaaS Connector throttles are in bytes per second.
Replication Bandwidth Math
The single formula every CCAE must memorize:
Required throughput = (FETB x daily change rate x (1 - dedup/compression)) / replication window
All four inputs are negotiable, and the architect’s job is to balance them against the WAN budget.
The Bytes-vs-Bits Gotcha
Network engineers speak in bits per second (Mbps, Gbps); storage engineers and Cohesity throttles speak in bytes per second (MB/s, GB/s). The conversion factor is 8 (plus a small percentage of TCP/IP overhead, conventionally ignored at this level).
| Wire Speed (bits) | Theoretical Bytes | 50% Utilization Target |
|---|---|---|
| 1 Gbps | 125 MB/s | 62.5 MB/s |
| 10 Gbps | 1,250 MB/s (1.25 GB/s) | 625 MB/s |
| 40 Gbps | 5,000 MB/s | 2,500 MB/s |
| 100 Gbps | 12,500 MB/s | 6,250 MB/s |
Cohesity guidance is to plan for 50 percent of nominal wire speed as the sustainable replication ceiling, leaving headroom for retransmits, other tenants on the link, and the inevitable noisy neighbor [Source: https://kb.expedient.com/docs/cohesity-client-premises-networking-requirements] [Source: https://www.cohesity.com/blogs/demonstrating-linear-scalability-cohesity-data-platform/].
Worked Example: 50 TB FETB, 5% Daily Change, 4-Hour Window
A common CCAE-style scenario. Inputs:
- FETB: 50 TB of front-end protected data.
- Daily change rate: 5 percent (typical mixed VM/file workload).
- Replication dedup/compression efficiency: 60 percent reduction on the wire (i.e., we keep 40 percent of the change).
- Replication window: 4 hours = 14,400 seconds.
Step 1 — Daily change in bytes:
50 TB x 0.05 = 2.5 TB of change per day
2.5 TB = 2,500 GB = 2,500,000 MB = 2,500,000,000,000 bytes (2.5 x 10^12)
Step 2 — Apply dedup/compression on the wire:
2.5 TB x (1 - 0.60) = 2.5 TB x 0.40 = 1.0 TB on the wire
1.0 TB = 1,000 GB = 1,000,000 MB
Step 3 — Divide by replication window in seconds:
1,000,000 MB / 14,400 s = ~69.4 MB/s required
Step 4 — Convert to bits per second for the WAN team:
69.4 MB/s x 8 = ~556 Mbps
Step 5 — Apply 50 percent utilization headroom:
556 Mbps / 0.50 = ~1.11 Gbps minimum WAN provisioning
So 50 TB FETB at 5 percent change with 60 percent on-wire reduction and a 4-hour window requires roughly 1.1 Gbps of provisioned WAN, which one 10 GbE link comfortably absorbs. Halving the window to 2 hours doubles the requirement to ~2.2 Gbps. Doubling the change rate to 10 percent (e.g., a database-heavy workload) doubles it again to ~4.4 Gbps. Because that figure already includes the 50 percent headroom, it corresponds to about 2.2 Gbps of sustained traffic and still fits within the 5 Gbps planning ceiling of a single 10 GbE link, but with little room for growth. At that point the architect should plan a move to 25 GbE, a longer RPO, or a second LACP bond.
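The five steps of the worked example collapse into one function, which makes it easy to rerun the math as inputs change. This is a sketch using decimal units (1 TB = 10^12 bytes); the function name `required_wan_gbps` and its parameter names are assumptions.

```python
def required_wan_gbps(fetb_tb: float, change_rate: float, reduction: float,
                      window_hours: float, utilization: float = 0.5) -> float:
    """WAN capacity to provision, in Gbps, using decimal units."""
    wire_bytes = fetb_tb * 1e12 * change_rate * (1 - reduction)
    bytes_per_second = wire_bytes / (window_hours * 3600)
    bits_per_second = bytes_per_second * 8        # convert to bits last
    return bits_per_second / utilization / 1e9    # add headroom, scale to Gbps


base = required_wan_gbps(50, 0.05, 0.60, 4)
print(f"4-hour window, 5% change:  {base:.2f} Gbps")
print(f"2-hour window, 5% change:  {required_wan_gbps(50, 0.05, 0.60, 2):.2f} Gbps")
print(f"2-hour window, 10% change: {required_wan_gbps(50, 0.10, 0.60, 2):.2f} Gbps")
```

The three calls reproduce the 1.11, 2.22, and 4.44 Gbps figures from the worked example.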
The RPO / Bandwidth / Dedup Equation
Rearranging the formula gives four knobs the architect can turn when the math doesn’t close:
| Knob | What It Means | Cost / Trade-off |
|---|---|---|
| Increase WAN bandwidth | Buy more circuit | $$$, lead time of weeks |
| Lengthen RPO (replication window) | Replicate every 8 hr instead of every 4 | Free, but business must accept |
| Reduce change rate | Tighter Protection Group scoping; exclude logs/scratch | Cheap, but bounded |
| Improve dedup on wire | Better source filtering, larger target retention pool | Modest gains, slow to realize |
If, after exhausting these knobs, the inequality (FETB x change x (1 - dedup)) <= (WAN bandwidth x window) still fails, the architecture must shift to a closer replication target (regional rather than transcontinental) or to a cloud-native pipe.
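The inequality can also be solved for the window, which answers the question "what is the shortest replication window this link can sustain?". The sketch below adds a utilization factor as an assumption on top of the inequality as written; passing `utilization=1.0` reproduces the bare inequality. Function and parameter names are illustrative.

```python
def min_window_hours(fetb_tb: float, change_rate: float, reduction: float,
                     wan_gbps: float, utilization: float = 0.5) -> float:
    """Shortest replication window (hours) that a given WAN link can sustain."""
    wire_bytes = fetb_tb * 1e12 * change_rate * (1 - reduction)
    usable_bytes_per_second = wan_gbps * 1e9 / 8 * utilization
    return wire_bytes / usable_bytes_per_second / 3600


def fits(fetb_tb: float, change_rate: float, reduction: float,
         wan_gbps: float, window_hours: float, utilization: float = 0.5) -> bool:
    """True when the daily change can be shipped inside the window."""
    needed = min_window_hours(fetb_tb, change_rate, reduction, wan_gbps, utilization)
    return needed <= window_hours


print(f"{min_window_hours(50, 0.05, 0.60, 1.0):.2f} hours on a 1 Gbps link")
print(fits(50, 0.05, 0.60, 1.0, window_hours=4))  # False
print(fits(50, 0.05, 0.60, 1.0, window_hours=8))  # True
```

With the worked-example inputs, a 1 Gbps link at 50 percent utilization needs about 4.44 hours, so a 4-hour window fails and an 8-hour window closes.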
Key Takeaway: Memorize the bandwidth equation, always do the math in bytes first, multiply by 8 last, and budget for 50 percent of wire speed. The bytes-vs-bits trap has cost more than one architect a six-figure WAN over-provisioning bill.
SiteContinuity Orchestration
Replication moves the data; SiteContinuity moves the application. SiteContinuity is Cohesity’s DR orchestration engine that turns “we have replicated VMs” into “we have a runnable runbook with a measurable RTO.”
The Runbook Analogy
Think of a SiteContinuity runbook as an emergency-evacuation plan for an office building. A good evacuation plan answers four questions in advance:
- Who goes first? (Dependency order — domain controllers and DNS before app servers; app servers before web tiers.)
- Where do they go? (Resource Profile — which target compute, which port group, which datastore.)
- What address do they get when they arrive? (Re-IP and VLAN mapping at the DR site.)
- How do you know everyone got out? (Validation steps — VM up, services running, smoke tests passing.)
Failback is going home after the storm: the same plan in reverse, but only after the building has been certified safe. You don’t sprint back into the office while the roof is still on fire — you “Prepare for Failback” first, which seeds the data back, then you actually move people.
Runbook Building Blocks
A SiteContinuity DR Plan is composed of [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failback.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/prepare-for-failback.htm]:
- DR Applications — logical groups of VMs that recover together with a defined boot order.
- Resource Profiles — reusable mappings of target vCenter, datastore, port group, and IP-customization rules at the recovery site.
- Failback Resource Set — a separate resource definition added via Edit > Add Resource Set on a DR Plan, used when failing back to the primary or to a brand-new cluster.
- Snapshot Selection — at execution time, the operator can accept the latest snapshot or override with a specific recovery point for explicit RPO control.
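The boot-order idea behind a DR Application is a dependency graph, and a topological sort turns it into power-on waves. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the VM names and dependency map are hypothetical and this is not how SiteContinuity stores a plan.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical DR Application: each VM maps to the VMs that must be up before it.
depends_on = {
    "dc01": set(),
    "dns01": set(),
    "sql01": {"dc01", "dns01"},
    "app01": {"sql01"},
    "web01": {"app01"},
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()  # raises graphlib.CycleError if dependencies are circular

boot_waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # VMs whose prerequisites are all up
    boot_waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(boot_waves, start=1):
    print(f"boot wave {number}: {wave}")
```

VMs in the same wave have no dependency on each other and can power on in parallel, which is where most RTO savings in a runbook come from.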
The SiteContinuity State Machine
Every DR Plan moves through a discrete set of states that gate which actions are available [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failover.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm] [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf]:
[Failover Ready] --(Failover)--> [Failover In Progress] --> [Failover Complete]
[Failover Complete] --(Prepare for Failback)--> [Prepare for Failback In Progress] --> [Failback Ready]
[Failback Ready] --(Failback)--> [Failback In Progress] --> [Failback Complete]
[Failback Complete] --(Prepare for Failover)--> [Failover Ready] (back to start)
Test variants (Test Failover, Test Failback) operate non-disruptively and do not change the underlying state of the production plan [Source: https://docs.cohesity.com/disaster-recovery/pdf/site-continuity-user-guide.pdf].
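One way to internalize the gating behavior is to model the plan as a small transition table. This is a study aid, not the product's implementation: the transient "In Progress" states are collapsed, and the assumption that Test Failover and Test Failback are offered in the Failover Ready and Failback Ready states follows Figure 9.2.

```python
# Action-driven transitions; transient "In Progress" states are collapsed.
TRANSITIONS = {
    ("Failover Ready", "Failover"): "Failover Complete",
    ("Failover Complete", "Prepare for Failback"): "Failback Ready",
    ("Failback Ready", "Failback"): "Failback Complete",
    ("Failback Complete", "Prepare for Failover"): "Failover Ready",
}

# Test actions are allowed in these states and leave the state unchanged.
TEST_ACTIONS = {
    ("Failover Ready", "Test Failover"),
    ("Failback Ready", "Test Failback"),
}


def next_state(state: str, action: str) -> str:
    """Return the state after an action, or raise if the action is not allowed."""
    if (state, action) in TEST_ACTIONS:
        return state
    if (state, action) not in TRANSITIONS:
        raise ValueError(f"'{action}' is not available in state '{state}'")
    return TRANSITIONS[(state, action)]


state = "Failover Ready"
for action in ("Test Failover", "Failover", "Prepare for Failback",
               "Failback", "Prepare for Failover"):
    state = next_state(state, action)
    print(f"{action:22} -> {state}")

# Skipping Prepare for Failback is rejected by the model.
try:
    next_state("Failover Complete", "Failback")
except ValueError as error:
    print(error)
```

The final call shows why the order matters: from Failover Complete, the only action the table accepts is Prepare for Failback.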
Figure 9.2: SiteContinuity DR Plan state machine
stateDiagram-v2
[*] --> FailoverReady
FailoverReady --> FailoverInProgress: Failover
FailoverInProgress --> FailoverComplete
FailoverComplete --> PrepareFailbackInProgress: Prepare for Failback
PrepareFailbackInProgress --> FailbackReady: reverse seed complete
FailbackReady --> FailbackInProgress: Failback
FailbackInProgress --> FailbackComplete
FailbackComplete --> FailoverReady: Prepare for Failover
FailoverReady --> FailoverReady: Test Failover (no state change)
FailbackReady --> FailbackReady: Test Failback (no state change)
note right of FailoverReady
Steady state - DR site
seeded and ready
end note
note right of FailbackReady
Reverse seed complete -
ready to return home
end note
Failover Procedure (Production Cutover)
The SiteContinuity workflow for an actual failover [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failover.htm] [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failover.htm]:
- Navigate to DR Plans > Disaster Recovery Plans.
- Select the desired plan, then Actions (kebab) > Failover.
- Choose a Resource Profile with the network mapping, IP customization, and target compute.
- Optionally enable Protect VMs at DR Site so recovered VMs are immediately added to a Protection Group on the DR Cohesity cluster — closing the backup gap that otherwise opens at the moment of failover.
- Confirm the snapshot (latest by default; can be overridden with an earlier RPO).
- Type YES to confirm and start the orchestrated workflow.
- Validate via DR vCenter: VM startup order, resource allocation, network connectivity, application functionality, and Protection Group state on the DR cluster.
Figure 9.3: Failover orchestration sequence (Helios -> Source -> Target -> VM Power-On)
sequenceDiagram
participant Op as Operator
participant Helios as Helios / SiteContinuity
participant Src as Source Cluster
participant Tgt as DR Target Cluster
participant vC as DR vCenter
participant VM as Recovered VMs
Op->>Helios: Trigger Failover (DR Plan)
Helios->>Helios: Validate Resource Profile + snapshot
Helios->>Src: Quiesce replication (if reachable)
Helios->>Tgt: Select latest replicated snapshot
Tgt->>Tgt: Mount SnapTree view
Helios->>vC: Register VMs from mounted view
vC->>VM: Apply IP customization + VLAN mapping
vC->>VM: Power on (boot order: DC, App, Web)
VM-->>vC: Services up
vC-->>Helios: Boot validation OK
Helios->>Tgt: Optionally protect VMs at DR site
Helios-->>Op: State = Failover Complete
Prepare for Failback
Before any production failback, the plan must transition cleanly back to a failback-ready state [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/prepare-for-failback.htm]:
- Confirm the plan is in Failover Complete.
- Choose Actions > Prepare for Failback, which drives reverse replication from the DR cluster back to the primary cluster.
- The plan moves through Prepare for Failback In Progress to Failback Ready when the reverse seed completes.
Failback Procedure
From the Failback Ready state [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm]:
- Select Actions > Failback.
- Pick the failback Resource Profile; confirm or override the snapshot (VADP or CDP).
- Type YES to confirm.
- Validate against the primary vCenter — VM order, resource attachments, networks, applications — and confirm the Protection Group on the primary cluster has resumed normal backups.
After Failback Complete, run Actions > Prepare for Failover to re-seed the DR site and return the plan to Failover Ready for the next event [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/actual-failback.htm].
For new-target failbacks (e.g., a rebuilt primary cluster after a true site loss), existing DR Plans must be deleted and recreated against the new target, and DR Applications must be redefined [Source: https://docs.cohesity.com/disaster-recovery/site-continuity/vmware-vms/failback.htm].
Network Re-IP and VLAN Mapping
At the DR site, VMs almost always need new IP addresses (different subnet) and possibly different VLANs. SiteContinuity’s IP customization runs guest-OS scripts during the failover boot to apply the new address. Architects must pre-stage:
- DR-site VLAN port groups that match production naming conventions (or define explicit mappings).
- Guest customization specs validated against current OS versions.
- DNS update strategy (dynamic DNS, scripted A-record updates, or GSLB-based geo-redirection).
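When pre-staging the mappings above, it can help to generate the expected DR address for each VM ahead of time. The following is a minimal planning sketch, not SiteContinuity's own customization logic: it assumes the production and DR subnets have the same prefix length and that each VM keeps its host offset within the subnet. The function name `remap_ip` and the sample subnets are hypothetical.

```python
import ipaddress


def remap_ip(prod_ip: str, prod_subnet: str, dr_subnet: str) -> str:
    """Map a production IP to the same host offset inside the DR subnet."""
    prod_net = ipaddress.ip_network(prod_subnet)
    dr_net = ipaddress.ip_network(dr_subnet)
    addr = ipaddress.ip_address(prod_ip)

    if addr not in prod_net:
        raise ValueError(f"{prod_ip} is not inside {prod_subnet}")
    if prod_net.prefixlen != dr_net.prefixlen:
        # Assumption of this sketch: both sites use equally sized subnets.
        raise ValueError("production and DR subnets must have the same prefix length")

    # Preserve the host portion of the address; swap only the network portion.
    offset = int(addr) - int(prod_net.network_address)
    return str(dr_net.network_address + offset)


if __name__ == "__main__":
    # Hypothetical example: production 10.10.20.0/24 maps to DR 172.16.20.0/24.
    print(remap_ip("10.10.20.45", "10.10.20.0/24", "172.16.20.0/24"))
```

A table produced this way can be reviewed against the DR-site port groups and DNS plan before any Test Failover is run.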
RTO and RPO Measurement
- RPO is bounded by replication frequency: if you replicate every 4 hours, worst-case RPO is 4 hours.
- RTO is bounded by runbook execution time — boot order serialization, IP customization, application warmup.
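The RTO bound can be estimated before a Test Failover gives you a measured figure. The sketch below is illustrative only: it assumes boot tiers run strictly in order (DC, then App, then Web), that VMs within a tier boot in parallel, and that IP customization and application warmup add fixed times. All durations are made-up inputs, and `estimate_rto_minutes` is a hypothetical helper.

```python
def estimate_rto_minutes(tiers, ip_customization_min, warmup_min):
    """Estimate runbook execution time in minutes.

    tiers: list of lists, one inner list of per-VM boot minutes per boot tier.
    Tiers are serialized; within a tier the slowest VM sets the pace
    (an assumption of this sketch).
    """
    boot_minutes = sum(max(tier) for tier in tiers if tier)
    return boot_minutes + ip_customization_min + warmup_min


if __name__ == "__main__":
    # Hypothetical boot times: DC tier, App tier, Web tier.
    tiers = [[4], [6, 5], [3, 3, 2]]
    print(estimate_rto_minutes(tiers, ip_customization_min=5, warmup_min=10))
```

Treat the result as a planning number; the recovery times SiteContinuity reports for Test Failovers remain the evidence for auditors.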
SiteContinuity reports actual recovery times for each Test Failover, which is how architects prove RTO compliance to auditors. Regular Test Failovers are the single dominant predictor of successful real-world recoveries [Source: https://www.cohesity.com/blogs/the-disaster-recovery-reality-check/] [Source: https://www.cohesity.com/glossary/disaster-recovery/]. SiteContinuity is licensed and deployed alongside DataProtect, sharing the underlying Protection Groups so the same backup snapshot used for granular recovery is also the source of the orchestrated failover [Source: https://www.dataguardworks.com/SiteContinuity.asp].
Key Takeaway: A SiteContinuity DR Plan is a state machine — Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete -> back to Failover Ready. Test Failover does not change the state. Skipping “Prepare for Failback” is the most common operational error and will cause Failback to fail.
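The state machine in the takeaway above can be modeled in a few lines. This is a study aid rather than Cohesity code: the in-progress states are collapsed into their end states, and the `PlanState` enum, `TRANSITIONS` table, and `apply_action` function are names invented for illustration.

```python
from enum import Enum


class PlanState(Enum):
    FAILOVER_READY = "Failover Ready"
    FAILOVER_COMPLETE = "Failover Complete"
    FAILBACK_READY = "Failback Ready"
    FAILBACK_COMPLETE = "Failback Complete"


# action -> (state the plan must be in, state the plan ends up in)
TRANSITIONS = {
    "Failover": (PlanState.FAILOVER_READY, PlanState.FAILOVER_COMPLETE),
    "Prepare for Failback": (PlanState.FAILOVER_COMPLETE, PlanState.FAILBACK_READY),
    "Failback": (PlanState.FAILBACK_READY, PlanState.FAILBACK_COMPLETE),
    "Prepare for Failover": (PlanState.FAILBACK_COMPLETE, PlanState.FAILOVER_READY),
}


def apply_action(state: PlanState, action: str) -> PlanState:
    """Return the next plan state, or raise if the action is not allowed."""
    if action == "Test Failover":
        return state  # test variants do not change the plan state
    if action not in TRANSITIONS:
        raise ValueError(f"unknown action: {action}")
    required, result = TRANSITIONS[action]
    if state is not required:
        raise ValueError(
            f"{action!r} requires {required.value!r}, plan is in {state.value!r}"
        )
    return result


if __name__ == "__main__":
    state = apply_action(PlanState.FAILOVER_READY, "Failover")
    try:
        apply_action(state, "Failback")  # skipped Prepare for Failback
    except ValueError as err:
        print(err)
```

The rejected call in the usage example is the common operational error the takeaway describes: attempting Failback while the plan is still in Failover Complete.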
Cloud-Based DR Options
Three Cohesity features deliver DR-style outcomes into hyperscaler clouds, and a fourth (CloudTier) is frequently confused with them. The CCAE exam will test the differences directly.
CloudReplicate — DR Replica as a Working Cluster
CloudReplicate replicates Protection Group snapshots from an on-prem Cohesity cluster to a Cohesity cluster running in the cloud (Cloud Edition in AWS or Azure) [Source: https://tekhead.it/blog/2016/04/cohesity-announces-cloud-integration-services/]. The destination is a fully functional Cohesity cluster: granular file recovery, Instant Mass Restore, View mounting, and SiteContinuity orchestration are all available on the cloud-side replica.
Use CloudReplicate when:
- The cloud is the DR site and you want to skip building a second physical data center.
- You want SiteContinuity-driven failover into the cloud with the same runbook semantics as on-prem-to-on-prem DR.
- You may eventually CloudSpin specific VMs to native EC2/Azure VMs at recovery time, but want to keep the option open.
CloudSpin — On-Demand Conversion to Native Cloud VMs
CloudSpin converts an on-prem (or cloud-resident) backup snapshot into a native cloud VM — an AWS EC2 instance with EBS volumes, or an Azure VM with Managed Disks [Source: https://www.cohesity.com/resources/solution-brief/manage-data-thats-fragmented-across-cloud/]. The conversion is an active operation: Cohesity rewrites disk format from its native SnapTree representation into the hypervisor format the cloud provider requires.
Use CloudSpin when:
- You need a dev/test clone running natively in the cloud (no Cohesity required at recovery time).
- You want lightweight cloud DR without standing up a full Cloud Edition cluster.
- You want to test cloud-failover scenarios without committing to permanent cloud infrastructure.
CloudSpin is active (you trigger it on demand); CloudReplicate is continuous (the policy drives it).
CloudArchive — Long-Term Retention to Object Storage
CloudArchive creates a separate archival copy in cloud object storage (S3, Azure Blob, GCP) for compliance and long-term retention [Source: https://www.cohesity.com/blogs/cloud-clear-cohesity-cloud-archival/] [Source: https://www.cohesity.com/solutions/long-term-retention-and-archival/]. It is driven by Protection Policy schedules (e.g., monthly archives for 7 years). The data stays in Cohesity’s deduplicated, compressed format; the source cluster keeps a full index/metadata copy locally so search and selective restore work without re-ingesting [Source: https://www.cohesity.com/resources/solution-brief/simplify-long-term-data-retention-and-archival/].
CloudArchive Direct is a variant that streams archives directly to cloud storage with only a minimal local footprint: the index stays on-prem, while parallel uploads send full data blocks to the cloud target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/archive-your-data-directly-with-cohesity-cloudarchive-direct-white-paper.pdf]. It is intended for organizations that have outgrown local archive capacity.
CloudTier — Capacity Overflow (Not a DR Tool)
CloudTier automatically tiers cold blocks (default >60 days, configurable) from on-prem nodes into cloud object storage when local capacity exceeds 80 percent [Source: https://www.cohesity.com/glossary/cloud-tiering/] [Source: https://www.penguinpunk.net/blog/cohesity-basics-cloud-tier/]. It is an invisible extension of the View Box — the data is moved, not duplicated.
Two critical operational facts:
- Once enabled on a View Box, CloudTier cannot be disabled. It is irreversible.
- CloudTier is not a DR copy. The cluster is still the only authoritative copy of the data; if the cluster dies, the cloud-tiered blocks are unrecoverable on their own.
Decision Matrix: CloudSpin vs. CloudReplicate vs. CloudArchive vs. CloudTier
| Feature | Purpose | Result in Cloud | Reversible? | Trigger | Recovery Speed | Primary Use Case |
|---|---|---|---|---|---|---|
| CloudSpin | Dev/test or DR with native cloud VMs | EC2 / Azure VM (hyperscaler-native) | Yes — VM is ephemeral | On-demand by user | Minutes (single VM) | Quick cloud spin-up; dev/test |
| CloudReplicate | DR replica as a working Cohesity cluster | Full Cohesity cluster in cloud | Yes | Policy schedule | Standard restore + IMR | Cloud-as-DR-site |
| CloudArchive | Compliance / long-term retention | Cohesity-format archive in object storage | Yes — separate copy | Policy schedule | Slow (rehydrate from object) | 7-year compliance retention |
| CloudArchive Direct | LTR with minimal local footprint | Cohesity archive in object storage; index local only | Yes | Policy schedule | Slow | LTR when local archive exhausted |
| CloudTier | Capacity relief for cold local data | Object-storage extension of View Box | No | Auto when local capacity > threshold | Transparent (cluster fetches) | Capacity overflow only |
The CCAE-style trick question: “Customer wants the cloud to be both their archive target and their DR target with minimum cost.” Wrong answer: CloudTier (not a DR tool). Right answer: CloudReplicate for DR + CloudArchive for compliance retention, sharing one external cloud target where possible.
Figure 9.4: Cloud DR option decision tree
graph TD
Start[Cloud Use Case?] --> Q1{Primary Goal?}
Q1 -->|Disaster Recovery| Q2{Need Cohesity features<br/>at recovery time?}
Q1 -->|Long-Term Retention| Q3{Local archive<br/>capacity available?}
Q1 -->|Capacity Relief| CT[CloudTier<br/>WARNING: Irreversible<br/>NOT a DR tool]
Q2 -->|Yes - IMR, search,<br/>SiteContinuity| CR[CloudReplicate<br/>Full Cohesity cluster<br/>in AWS/Azure]
Q2 -->|No - just need<br/>native cloud VM| CS[CloudSpin<br/>EC2/Azure VM<br/>on-demand conversion]
Q3 -->|Yes| CA[CloudArchive<br/>Cohesity-format<br/>in object storage]
Q3 -->|No - exhausted| CAD[CloudArchive Direct<br/>Index local, data<br/>streamed direct to cloud]
CR --> Tip1[Use with SiteContinuity<br/>for full DR runbook]
CS --> Tip2[Active trigger; ideal<br/>for dev/test clones]
CA --> Tip3[Policy-driven; slow<br/>rehydrate on restore]
style CT fill:#5a1f1f,stroke:#ff6b6b,color:#fff
style CR fill:#1f3a5a,stroke:#58a6ff,color:#fff
style CS fill:#1f3a5a,stroke:#58a6ff,color:#fff
style CA fill:#1f3a5a,stroke:#58a6ff,color:#fff
style CAD fill:#1f3a5a,stroke:#58a6ff,color:#fff
Recovery Into VMware Cloud and Azure VMware Solution
VMware Cloud on AWS (VMC) and Azure VMware Solution (AVS) are first-class targets for Cohesity DR because they expose a native vCenter that SiteContinuity can drive directly — no CloudSpin conversion required, no Cloud Edition cluster required. The trade-off is the cost of running the VMC/AVS SDDC versus the cost of a Cloud Edition cluster.
Comparing Cost / Recovery Options
| Recovery Option | Cost Profile | Recovery Speed | Operational Familiarity |
|---|---|---|---|
| Second physical DC + Cohesity | Highest CAPEX, lowest OPEX | Fastest | Highest (same as production) |
| Cloud Edition + CloudReplicate + SiteContinuity | Medium OPEX | Fast (SiteContinuity-driven) | High (same SiteContinuity workflow) |
| VMC / AVS + CloudReplicate | High OPEX | Fast | Highest (native vCenter) |
| CloudSpin only (no Cloud Edition) | Lowest OPEX | Slow (per-VM conversion) | Lower (different from production) |
| CloudArchive + manual restore | Cheapest | Slowest | Lowest |
Key Takeaway: CloudReplicate gives you a working cluster, CloudSpin gives you a native VM, CloudArchive gives you compliance retention, and CloudTier gives you capacity relief — only the first three are DR options, and CloudTier is irreversible once enabled.
Chapter Summary
- Topology choice — one-to-one, one-to-many, many-to-one (fan-in), or cross-cloud — is driven by failure-domain assumptions, not by bandwidth.
- Replication is encrypted, deduplicated, and compressed on the wire. Throughput scales linearly with node count.
- The 2x10 GbE LACP baseline, jumbo frames end-to-end, and the four required ports (TCP 443, 111, 20000, 24444) are non-negotiable.
- The bandwidth formula, (FETB x change rate x (1 - dedup)) / window, is the single most important calculation in this domain. Always work in bytes first, then multiply by 8. Plan for 50 percent of wire speed.
- SaaS Connector throttles are bytes per second, not bits per second; mixing up the two is a common bug.
- SiteContinuity orchestrates DR via a state machine: Failover Ready -> Failover Complete -> Prepare Failback -> Failback Ready -> Failback Complete. Test variants do not change the state.
- A runbook is an evacuation plan: define DR Applications (boot order), Resource Profiles (target compute/network), and a Failback Resource Set. Failback is going home after the storm — only after Prepare for Failback completes the reverse seed.
- CloudReplicate, CloudSpin, CloudArchive, CloudTier are four distinct tools for four distinct problems: DR replica cluster, on-demand native VM, compliance retention, and capacity overflow respectively. CloudTier is irreversible and is not a DR tool.
- Test Failovers are the dominant predictor of real-world recovery success. Run them on a schedule, not just before audits.
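The bandwidth formula from the summary can be worked as a short calculation. This sketch makes several assumptions that are not in the source: FETB is taken as decimal terabytes (10^12 bytes), `dedup` is the fraction of changed data eliminated before transmission (0 to 1), and the sample inputs are invented. The function names are hypothetical.

```python
def required_mbps(fetb_tb, change_rate, dedup, window_hours):
    """(FETB x change rate x (1 - dedup)) / window, expressed in megabits per second."""
    bytes_to_send = fetb_tb * 1e12 * change_rate * (1 - dedup)  # work in bytes first
    bytes_per_second = bytes_to_send / (window_hours * 3600)
    return bytes_per_second * 8 / 1e6  # then multiply by 8 to get bits


def link_to_provision_mbps(required, usable_fraction=0.5):
    """Size the link so the required rate fits in 50 percent of wire speed."""
    return required / usable_fraction


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB front-end, 5% daily change, 50% reduction, 8-hour window.
    need = required_mbps(100, 0.05, 0.5, 8)
    print(f"required: {need:.1f} Mbps, provision: {link_to_provision_mbps(need):.1f} Mbps")
```

With these inputs the transfer needs roughly 694 Mbps sustained, so a 1 GbE link is too small once the 50 percent planning rule is applied.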
Key Terms
- Replication — Cluster-to-cluster movement of deduplicated, compressed, encrypted Protection Group snapshots, driven by Protection Policy schedules and consumed by DR, retention, and orchestration workflows.
- SiteContinuity — Cohesity’s DR orchestration product that drives runbook-based failover and failback for VMware VMs, consuming the same underlying snapshots used for granular recovery.
- Runbook — A SiteContinuity DR Plan composed of DR Applications (VM groups with boot order), Resource Profiles (target compute/network/IP mappings), and optional Failback Resource Sets; the operational analog to an emergency evacuation plan.
- CloudSpin — On-demand conversion of a Cohesity backup snapshot into a native cloud VM (AWS EC2 with EBS, or Azure VM with Managed Disks) for dev/test or lightweight cloud DR.
- CloudReplicate — Continuous policy-driven replication from an on-prem Cohesity cluster to a Cohesity Cloud Edition cluster running in AWS or Azure; the destination remains a fully functional Cohesity cluster.
- Failover — The orchestrated cutover of a DR Plan from the primary site to the DR site, transitioning the plan from Failover Ready through Failover In Progress to Failover Complete.
- Failback — The orchestrated return of a DR Plan from the DR site to the primary site, executed only after Prepare for Failback has successfully reverse-seeded data and moved the plan to Failback Ready.
- RTO (Recovery Time Objective) — The maximum acceptable elapsed time between disaster declaration and full application recovery; bounded by SiteContinuity runbook execution time including boot order and IP customization.
- RPO (Recovery Point Objective) — The maximum acceptable amount of data loss measured in time; bounded by replication frequency (e.g., 4-hour replication = 4-hour RPO worst case).
Chapter 10: Cloud Integration: Archive, Tier, Replicate, and Spin
Cohesity exposes four distinct cloud integration patterns: CloudArchive for long-term retention, CloudTier for capacity extension, CloudReplicate for cluster-to-cluster replication into a Cohesity Cloud Edition, and CloudSpin for converting on-prem backups into native cloud VMs. These four features rely on the same plumbing — the External Target abstraction — but solve different architectural problems with different cost profiles, recovery semantics, and IAM surfaces.
This chapter walks an architect through choosing among the four, configuring the AWS S3 Glacier and Azure Blob targets that most CCAE candidates will see on the exam, and modeling the egress and recall costs that quietly dominate cloud TCO.
Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate CloudArchive, CloudArchive Direct, CloudTier, CloudReplicate, and CloudSpin by purpose, data movement model, and recovery semantics.
- Configure External Targets to AWS S3 (including Glacier and Deep Archive), Azure Blob (including Archive tier), GCP, and S3-compatible on-prem object stores.
- Apply storage-class lifecycle policies correctly — keeping bucket-side rules and cluster-side retention separated and aligned.
- Estimate egress, retrieval-request, and rehydration charges, and design retention and recall scenarios that avoid surprise invoices.
- Recognize the IAM and RBAC minimums required for each target: the CloudFormation Template for AWS, Storage Blob Data Contributor for Azure, and the role of setBlobTier/action in tier transitions.
Analogy: Garage cleanup vs. offsite storage rental. Think of your cluster as a two-car garage. CloudTier is the garage cleanup: you mark anything older than 90 days and haul those boxes to a self-storage unit. The boxes are gone from the garage — you got the floor space back — but you can drive over and fetch them when needed (slowly, for a small fee). CloudArchive is the offsite records vault: you photocopy critical documents and ship the copies to an archival facility. The originals stay in the garage; the archival facility is the durable, regulator-friendly second copy. The most common CCAE exam mistake is conflating these two.
Section 10.1: CloudArchive — Long-Term Retention to Object Storage
10.1.1 Long-Term Retention to Object Storage
CloudArchive is a copy-out mechanism. The Cohesity cluster keeps a complete local snapshot — chunk files, blob files, and metadata — and additionally writes a deduplicated, compressed copy to a registered external target. The local cluster remains authoritative for indices and metadata so that catalog operations (browse, search, restore) can be answered without rehydrating cloud objects unnecessarily [Source: https://www.cohesity.com/solutions/long-term-retention-and-archival/].
CloudArchive is driven by Protection Policies: each policy may attach one or more Archival actions, each referencing an External Target with its own retention horizon. A typical pattern: daily incrementals retained 30 days on cluster, weekly fulls 90 days on cluster, and monthly fulls retained 7 years in S3 Glacier Deep Archive via CloudArchive. Only the third tier leaves the cluster [Source: https://www.cohesity.com/resources/solution-brief/simplify-long-term-data-retention-and-archival/].
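The typical pattern above can be written down as data to make the "only the third tier leaves the cluster" point explicit. This is a hypothetical model for study purposes, not the Cohesity policy schema or API; `RetentionTier` and `tiers_leaving_cluster` are invented names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionTier:
    name: str
    retention_days: int
    location: str  # "cluster" or the name of an External Target


# The three tiers from the typical pattern described in the text.
TYPICAL_POLICY = [
    RetentionTier("daily incremental", 30, "cluster"),
    RetentionTier("weekly full", 90, "cluster"),
    RetentionTier("monthly full", 7 * 365, "S3 Glacier Deep Archive"),
]


def tiers_leaving_cluster(policy):
    """Return the names of tiers whose copies are written to an external target."""
    return [tier.name for tier in policy if tier.location != "cluster"]


if __name__ == "__main__":
    print(tiers_leaving_cluster(TYPICAL_POLICY))
```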
10.1.2 Encryption and Immutability Options
CloudArchive honors the cluster’s encryption posture end-to-end. Data leaves the cluster over TLS 1.2+ and is encrypted at rest with AES-256. When the target is AWS S3, Cohesity can also use SSE-S3 or SSE-KMS server-side encryption, the latter requiring kms:Encrypt, kms:Decrypt, and kms:GenerateDataKey permissions on the customer-managed key [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].
Immutability for ransomware-resistant archives uses S3 Object Lock (AWS) or the equivalent Immutable Blob Storage (Azure). Cohesity sets per-object retention via the s3:PutObjectRetention API as objects are written. Object Lock must be enabled at bucket creation — it cannot be retrofitted without contacting AWS Support [Source: https://aws.amazon.com/blogs/apn/how-to-turn-archive-data-into-actionable-insights-with-cohesity-and-aws/].
10.1.3 Indexing for Cloud-Archived Snapshots
The Cohesity index (handled by Yoda) stays on the cluster. That has two consequences architects must internalize:
- You can browse and search archived snapshots without paying retrieval fees. The metadata is local; only the actual chunk data lives in Glacier.
- If the originating cluster is destroyed, a replacement cluster must rebuild the index from cloud-resident metadata before browse, search, and restore are fast again. This is one of the principal differences between CloudArchive (cluster-authoritative) and FortKnox cyber vaulting (Cohesity-managed, see Chapter 11).
10.1.4 Direct Archive vs. Archive on Policy
Two operational variants exist:
- CloudArchive (standard) — the default; archive copy is created via a Protection Policy attached to a Protection Group.
- CloudArchive Direct — a streaming variant for pure archival workloads. Data flows through the cluster but is not retained as a full local copy — only metadata and index live on cluster, while bulk data is streamed directly to the external target [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/archive-your-data-directly-with-cohesity-cloudarchive-direct-white-paper.pdf].
CloudArchive Direct is appropriate when the local cluster does not need to be the recovery hot tier — e.g., decommissioned apps retained only for compliance.
Key Takeaway: CloudArchive is your long-term retention copy to cheap object storage. The cluster keeps the local copy; the cloud holds the second copy. Use it when retention horizons exceed the cluster’s economic capacity sweet spot, or when compliance demands a separate, off-cluster copy.
Section 10.2: CloudTier — Capacity Extension for Cold Blocks
10.2.1 Capacity Tiering for Cold Blocks
CloudTier is a move operation, not a copy. The cluster’s tiering engine continuously profiles block heat: blocks not accessed for the configured threshold are migrated out to cloud object storage, freeing local capacity. The cluster retains a pointer (a stub) so the namespace appears unchanged — when the data is needed, it is rehydrated transparently [Source: https://www.cohesity.com/glossary/cloud-tiering/].
This is the garage-cleanup behavior from the chapter analogy. The block was in the garage. Now it is in the storage unit. There is no second copy.
Because tiering moves data, CloudTier is irreversible without a recall operation. If you tier 500 TB out to S3 Standard-IA, you cannot simply “untier” by toggling a setting — you must recall the data, which counts as a full read against the object store and incurs egress and request charges.
10.2.2 Tiering Thresholds and Recall Behavior
Tiering is policy-driven on a per-View Box basis. Common thresholds:
- Age-based: “tier any block not read in 90 days.”
- Capacity-based: “begin tiering when the View Box exceeds 80% utilization.”
- Hybrid: combine both — tier opportunistically by age, urgently by capacity pressure.
When a recall is required (e.g., a restore operation reads a tiered block), the cluster fetches the object, repopulates it locally, and returns the read. The recall completes transparently from the application’s perspective, but the latency profile changes — tiered blocks pay a one-time round trip to the cloud.
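The hybrid threshold (tier by age opportunistically, by capacity pressure urgently) can be sketched as a selection routine. This is a simplified illustration and not the cluster's actual tiering engine: it assumes blocks are considered coldest-first and uses the 90-day and 80% example thresholds from the list above. `select_blocks_to_tier` is a hypothetical name.

```python
def select_blocks_to_tier(blocks, used_bytes, capacity_bytes,
                          age_days=90, max_utilization=0.80):
    """Pick block IDs to tier out.

    blocks: iterable of (block_id, size_bytes, days_since_last_read).
    A block is selected if it is older than the age threshold, or if the
    View Box is still above the capacity threshold when the block is reached.
    """
    selected = []
    remaining = used_bytes
    # Coldest blocks first, so capacity pressure evicts the least-read data.
    for block_id, size, idle_days in sorted(blocks, key=lambda b: b[2], reverse=True):
        over_capacity = remaining / capacity_bytes > max_utilization
        if idle_days >= age_days or over_capacity:
            selected.append(block_id)
            remaining -= size
    return selected


if __name__ == "__main__":
    blocks = [("a", 10, 120), ("b", 10, 30), ("c", 10, 5)]
    print(select_blocks_to_tier(blocks, used_bytes=90, capacity_bytes=100))
    print(select_blocks_to_tier(blocks, used_bytes=95, capacity_bytes=100))
```

In the second call the cluster is still above 80% after the age-based pass, so a block read only 30 days ago is tiered as well. That is the situation that sets up recall latency later.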
10.2.3 Performance Impact Considerations
Architects must reserve local capacity headroom to avoid tiering hot data. Common pitfalls:
- Aggressive thresholds (e.g., “tier anything older than 7 days”) tier blocks that will be read by month-end backups, causing recall storms.
- Disaster recall storms — rehydrating 100 TB of tiered data through WAN bandwidth and cloud egress quotas can dominate RTO. Plan this scenario explicitly.
- Cluster cache effect — recalled blocks repopulate the local tier, so recurring workloads stop hitting the cloud after the first recall.
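The disaster recall scenario is worth quantifying before it happens. The sketch below estimates transfer time and egress cost for a bulk recall. It assumes decimal units, a link that sustains only a fraction of wire speed (50 percent by default), and a per-GB egress price that you supply; the price in the example is a placeholder, not a quoted rate.

```python
def recall_hours(data_tb, link_gbps, usable_fraction=0.5):
    """Hours needed to pull data_tb back through a link of link_gbps."""
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * usable_fraction
    return bits / usable_bits_per_second / 3600


def egress_cost(data_tb, price_per_gb):
    """Egress charge for recalling data_tb at a caller-supplied price per GB."""
    return data_tb * 1000 * price_per_gb


if __name__ == "__main__":
    hours = recall_hours(100, link_gbps=1.0)
    print(f"100 TB over 1 Gbps at 50% efficiency: {hours:.0f} h ({hours / 24:.1f} days)")
    # Placeholder price; substitute the provider's current rate.
    print(f"egress at $0.09/GB: ${egress_cost(100, 0.09):,.0f}")
```

Under these assumptions a 100 TB recall over a 1 Gbps link takes about 18.5 days, which is why the recall path has to be part of the RTO discussion.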
10.2.4 Tier vs. Archive Trade-offs
| Dimension | CloudTier | CloudArchive |
|---|---|---|
| Data movement | Move (single copy) | Copy (two copies — local + cloud) |
| Local footprint | Reduced | Unchanged (full local + cloud) |
| Reversibility | Recall required | Local copy still authoritative |
| Typical destination class | S3 Standard-IA, Azure Cool | S3 Glacier, Azure Archive |
| Driver | Cluster running out of capacity | Compliance / LTR retention |
| Recall cost exposure | High — every restore pulls from cloud | Low — restores read local; cloud only on disaster |
The two patterns are complementary, not mutually exclusive. A common enterprise design tiers cold backup blocks to S3 Standard-IA (CloudTier) and copies monthly fulls to Glacier Deep Archive (CloudArchive). The tier reduces local footprint; the archive provides the compliance-grade second copy [Source: https://www.cohesity.com/blogs/leverage-cloud-long-term-archival-with-cohesity/].
Key Takeaway: CloudTier moves cold blocks to free local capacity; CloudArchive copies snapshots out for retention and durability. If your driver is “cluster is full,” tier. If your driver is “regulator requires 7-year retention,” archive. If both — do both.
Section 10.3: CloudReplicate and CloudSpin — Cloud as a DR Plane
10.3.1 CloudReplicate to a Cohesity Cloud Edition
CloudReplicate is conceptually identical to cluster-to-cluster replication (Chapter 9), except the destination cluster is a Cohesity Cloud Edition running inside AWS, Azure, or GCP. The replicated data lands on a fully-functional Cohesity cluster — just like the source — so all DataPlatform features (instant mass restore, indexing, granular search) are available on the cloud side.
CloudReplicate is the right answer when:
- The DR strategy requires a functioning Cohesity control plane in the cloud (e.g., to run Helios apps, recover into VMware Cloud on AWS, or serve restores into native services).
- RTO requirements exclude a slow rehydration from object storage.
- Compliance permits the cluster’s normal feature set in the cloud (DataLock, indexing, etc.).
The cost profile is meaningfully higher than CloudArchive — you are paying for cluster compute (EC2 instances or Azure VMs running Cohesity), local SSD/EBS, and replication network — but the recovery posture is dramatically better.
10.3.2 Converting Backups to Native Cloud VMs (CloudSpin)
CloudSpin converts an on-prem VM backup into a native cloud VM — an EC2 instance, an Azure VM, or a GCE instance — without requiring a Cohesity cluster on the destination side. The operator picks a VM backup (typically a VMware or Hyper-V VM), specifies the target cloud account, network, and instance type, and Cohesity:
- Reads the VM backup from local cluster (or recalls from CloudArchive if needed).
- Converts the disk format (VMDK or VHDX → AMI for AWS, managed disk for Azure).
- Boots the VM into the target VPC/VNet with the chosen instance shape.
CloudSpin is the right answer for:
- Test/dev cloud bursting — spin a copy of a production VM in the cloud for a stress test, then destroy it.
- Forensic investigation — boot a known-clean snapshot in an isolated VPC for malware analysis.
- Cloud migration trial runs — validate that a workload runs in EC2 before committing to lift-and-shift.
CloudSpin is not a continuous DR replication mechanism: each spin is a discrete conversion job. Contrast this with CloudReplicate, where the cloud cluster is continuously hydrated with the latest snapshots.
10.3.3 Network and IAM Prerequisites
For both CloudReplicate and CloudSpin, the Cohesity cluster needs:
- Outbound HTTPS (port 443) to the cloud control plane endpoints.
- IAM credentials with permission to create EC2/Azure VM resources, manage EBS/managed disks, and configure VPC/VNet network interfaces.
- VPC/VNet design with appropriate subnets, security groups, and route tables for the spun VMs.
- For CloudReplicate, a registered Cloud Edition cluster as the replication destination — it must be reachable from the source cluster over the network.
The IAM minimums for CloudSpin in AWS include ec2:RunInstances, ec2:CreateVolume, ec2:AttachVolume, ec2:CreateImage, ec2:RegisterImage, iam:PassRole, plus the S3 actions to read the backup objects if they were archived to S3 [Source: https://www.cohesity.com/partners/aws/].
10.3.4 Test Recovery and Clean-Up
A discipline the CCAE exam emphasizes: every cloud DR mechanism must be tested without affecting production. SiteContinuity (Chapter 9) wraps CloudSpin and CloudReplicate operations in runbooks that allow:
- Test failover — spin VMs in an isolated VPC for validation, then tear down without touching the production destination.
- Planned failover — orchestrated cutover with re-IP and DNS updates.
- Failback — reverse replication once the primary site is recovered.
Clean-up matters because spun cloud VMs accrue compute charges as long as they run. Always include a destroy step in your runbook.
Key Takeaway: CloudReplicate gives you a hot Cohesity cluster in the cloud — full feature parity, paid by the hour. CloudSpin gives you a one-shot native cloud VM, useful for bursting and testing but not for continuous DR. Pick the one whose recovery model matches your RTO and budget.
Section 10.4: The Decision Matrix — Which Cloud Integration to Use When
The single most exam-relevant artifact in this chapter is the decision matrix below. Memorize it.
| Capability | CloudArchive | CloudArchive Direct | CloudTier | CloudReplicate | CloudSpin |
|---|---|---|---|---|---|
| Primary purpose | Long-term retention copy | Streaming archive (low local footprint) | Capacity extension | Cluster-to-cluster cloud DR | Convert backup to native cloud VM |
| Data movement | Copy (local + cloud) | Stream (metadata local, data in cloud) | Move (single copy in cloud) | Copy to remote Cohesity cluster | Convert and boot |
| Local copy retained? | Yes | No (metadata only) | No (stub remains) | Yes | Yes |
| Reversible? | N/A — both copies exist | Limited — no local copy | No — must recall | N/A — both clusters live | N/A — VM is independent after spin |
| Typical destination class | S3 Glacier / Azure Archive | S3 Glacier / Azure Archive | S3 Std-IA / Azure Cool | EC2/Azure VM (Cloud Edition) | Native EC2/Azure VM |
| Driver | Compliance / LTR | Cold-only retention | Cluster running full | Cloud DR with Cohesity features | Cloud burst / test / forensics |
| Recovery speed | Hours (Glacier rehydration) | Hours | Seconds for local blocks; added recall latency on first read of a tiered block | Seconds (warm cluster) | Minutes (boot the VM) |
| Cost profile | Lowest $/GB-month, plus retrieval fees | Even lower (no local copy) | Mid; egress on recall | Highest (running cluster) | Mid (per-spin, then VM hourly) |
| Configured at | Inventory > External Targets (Archival) | Same, with Direct flag | Inventory > External Targets (Tiering) | Replication settings + remote cluster | Recover > Cloud Spin |
Exam tip: “Cluster at 85% capacity, customer wants cold backups online 90 more days, budget tight” → CloudTier (driver is capacity). “7-year compliance requirement, ransomware-resistant copies off-cluster” → CloudArchive with Object Lock. “Quarterly DR test in AWS without standing up another Cohesity cluster” → CloudSpin.
Figure 10.1: Cloud integration option decision tree
flowchart TD
Start([What is the primary driver?]) --> Q1{Cluster running<br/>out of capacity?}
Q1 -->|Yes| Tier[CloudTier<br/>Move cold blocks<br/>S3 Std-IA / Azure Cool]
Q1 -->|No| Q2{Long-term<br/>retention copy<br/>required?}
Q2 -->|Yes| Q3{Need local copy<br/>for fast restore?}
Q3 -->|Yes| Archive[CloudArchive<br/>Copy to Glacier/Archive<br/>Local + Cloud copies]
Q3 -->|No| Direct[CloudArchive Direct<br/>Stream to cloud<br/>Metadata only on cluster]
Q2 -->|No| Q4{Need warm<br/>cluster in cloud<br/>for DR?}
Q4 -->|Yes| Replicate[CloudReplicate<br/>To Cohesity Cloud Edition<br/>Full feature parity]
Q4 -->|No| Q5{Need native<br/>cloud VM from<br/>backup?}
Q5 -->|Yes| Spin[CloudSpin<br/>Convert to EC2/Azure VM<br/>Burst / forensics / migration]
Q5 -->|No| Reassess([Reassess requirements])
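The decision tree in Figure 10.1 translates directly into a function, which can be a useful way to drill exam scenarios. This is a sketch of the figure's logic only; the function and parameter names are hypothetical, and like the figure it returns a single answer even though the patterns can be combined in real designs.

```python
def choose_cloud_integration(*, capacity_pressure=False, needs_ltr_copy=False,
                             needs_local_copy_for_fast_restore=False,
                             needs_warm_cloud_cluster=False,
                             needs_native_cloud_vm=False):
    """Walk the Figure 10.1 questions in order and return the matching feature."""
    if capacity_pressure:
        return "CloudTier"
    if needs_ltr_copy:
        if needs_local_copy_for_fast_restore:
            return "CloudArchive"
        return "CloudArchive Direct"
    if needs_warm_cloud_cluster:
        return "CloudReplicate"
    if needs_native_cloud_vm:
        return "CloudSpin"
    return "Reassess requirements"


if __name__ == "__main__":
    # "Cluster at 85% capacity, budget tight"
    print(choose_cloud_integration(capacity_pressure=True))
    # "7-year compliance requirement, local restores still needed"
    print(choose_cloud_integration(needs_ltr_copy=True,
                                   needs_local_copy_for_fast_restore=True))
    # "Quarterly DR test in AWS without another Cohesity cluster"
    print(choose_cloud_integration(needs_native_cloud_vm=True))
```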
Section 10.5: Configuring CloudArchive to AWS S3 Glacier — The Five Steps
The AWS Glacier flow is the most exam-tested external target configuration. Cohesity documents a five-step pattern [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].
Figure 10.2: AWS S3 setup workflow — five-step configuration
flowchart LR
A[Step 1<br/>Register<br/>External Target<br/>Inventory > Targets<br/>Purpose: Archival] --> B[Step 2<br/>IAM via CFT<br/>CloudFormation<br/>least-privilege role<br/>+ KMS policy]
B --> C[Step 3<br/>Bucket Policy<br/>Allow Cohesity role<br/>PutObject/GetObject<br/>RestoreObject<br/>Object Lock enabled]
C --> D[Step 4<br/>Lifecycle Rule<br/>Std → IA → Glacier<br/>→ Deep Archive<br/>Bucket-side, not<br/>Cohesity]
D --> E[Step 5<br/>Bind to<br/>Protection Policy<br/>Add Archival action<br/>Set retention<br/>Validate + run]
style A fill:#1f6feb,color:#fff
style B fill:#1f6feb,color:#fff
style C fill:#1f6feb,color:#fff
style D fill:#1f6feb,color:#fff
style E fill:#238636,color:#fff
10.5.1 Step 1 — Register the External Target
Navigate to Inventory > External Targets > Register External Target in Cohesity Dashboard or Helios. Configure:
- Name: descriptive, e.g., S3-Glacier-Archive-Prod
- Purpose: Archival (not Tiering; a single wrong selection here defines the target’s whole behavior)
- Provider: AWS > S3
- Bucket name, region, AWS Access Key, Secret Key
- Storage class: Glacier or Deep Archive (or Standard if relying on a bucket-side lifecycle rule)
Cohesity 7.1+ supports the full Glacier API family — Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive [Source: https://aws.amazon.com/blogs/storage/storing-data-with-aws-partner-solutions-and-amazon-s3-glacier-instant-retrieval/].
10.5.2 Step 2 — IAM via the Cohesity CloudFormation Template
Cohesity publishes a CloudFormation Template (CFT) that creates a least-privilege IAM role for the cluster. Run it from the AWS Console; it provisions a role with these actions [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm]:
s3:PutObject
s3:GetObject
s3:DeleteObjectVersion
s3:RestoreObject ← required to recall from Glacier
s3:PutLifecycleConfiguration
s3:GetLifecycleConfiguration
s3:GetBucketObjectLockConfiguration
s3:PutObjectRetention ← required for Object Lock / WORM
iam:SimulatePrincipalPolicy
kms:Encrypt
kms:Decrypt
kms:GenerateDataKey ← required for SSE-KMS
The CFT-generated role is the right answer on the exam — never grant s3:* or ec2:* to the Cohesity principal. If a customer-managed KMS key encrypts the bucket, the Cohesity role ARN must also be added to the KMS key policy, not just the bucket policy.
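The "never grant s3:* or ec2:*" rule is easy to check mechanically when reviewing a policy document. The helper below is an illustrative review aid, not part of any Cohesity or AWS tooling; it assumes the policy has already been loaded into a Python dict in the standard IAM JSON shape, and `find_wildcard_actions` is a hypothetical name.

```python
def find_wildcard_actions(policy):
    """Return every Allow action in an IAM policy document that contains a wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]

    findings = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single action string
            actions = [actions]
        findings.extend(action for action in actions if "*" in action)
    return findings


if __name__ == "__main__":
    too_broad = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:*", "kms:Decrypt"], "Resource": "*"}
        ],
    }
    print(find_wildcard_actions(too_broad))
```

An empty result does not prove the policy is least-privilege; it only confirms that no action is wildcarded.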
10.5.3 Step 3 — Bucket Policy
The CFT applies a bucket policy authorizing the Cohesity role explicitly:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCohesityArchive",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::<ACCOUNT>:role/<COHESITY-ROLE>"},
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:RestoreObject",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}
If immutability is required, enable S3 Object Lock at bucket creation time. It cannot be retrofitted on an existing bucket without AWS Support intervention. With Object Lock on, Cohesity will set per-object retention via s3:PutObjectRetention, producing WORM copies that resist ransomware and rogue-admin deletion [Source: https://aws.amazon.com/blogs/apn/how-to-turn-archive-data-into-actionable-insights-with-cohesity-and-aws/].
10.5.4 Step 4 — S3 Lifecycle Rule for Glacier Transition
The S3 lifecycle rule lives on the bucket, not in Cohesity. This separation is crucial: Cohesity manages retention (how long the logical object lives, and whether it is locked); the bucket lifecycle manages storage class (which physical tier it sits in while it lives).
A typical rule:
Rule:
  ID: ToDeepArchive
  Status: Enabled
  Filter:
    Prefix: cohesity/
  Transitions:
    - Days: 30
      StorageClass: GLACIER
    - Days: 180
      StorageClass: DEEP_ARCHIVE
  # Optional cleanup matching Cohesity retention horizon
  Expiration:
    Days: 2555  # 7 years
Architects must respect the minimum storage durations below. Deleting objects before the minimum triggers an early-deletion charge equal to the storage cost of the remaining minimum days [Source: https://docs.cohesity.com/baas/data-protect/protect-amazon-s3.htm].
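The early-deletion rule can be sketched as a small calculator. This is a minimal model, assuming a 30-day billing month and the approximate us-east-1 list prices quoted in this chapter; the function and dictionary names are illustrative, and real invoices should be checked against current AWS pricing.

```python
# Minimum storage durations (days) and approximate $/GB-month prices
# as quoted in this chapter; treat both as assumptions to re-verify.
STORAGE_CLASSES = {
    "STANDARD":     {"min_days": 0,   "price": 0.023},
    "STANDARD_IA":  {"min_days": 30,  "price": 0.0125},
    "GLACIER_IR":   {"min_days": 90,  "price": 0.004},
    "GLACIER":      {"min_days": 90,  "price": 0.0036},
    "DEEP_ARCHIVE": {"min_days": 180, "price": 0.00099},
}


def early_deletion_charge(storage_class, size_gb, days_stored):
    """Approximate charge for deleting before the minimum duration.

    Modeled as the prorated storage cost of the remaining minimum days,
    using a 30-day month.
    """
    info = STORAGE_CLASSES[storage_class]
    remaining_days = max(0, info["min_days"] - days_stored)
    return size_gb * info["price"] * remaining_days / 30


charge = early_deletion_charge("DEEP_ARCHIVE", 10_000, 60)
print(f"Early-deletion charge: ${charge:,.2f}")
```

In this example, 10 TB deleted from Deep Archive after 60 days still owes the remaining 120 days, or about $39.60. The charge is small per event but recurs on every object a shortened retention policy expires early.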
10.5.5 Step 5 — Bind to a Protection Policy and Validate
In Cohesity, edit a Protection Policy (or create one), add an Archival action that targets the new external target, and define the retention. Attach the policy to a Protection Group, validate connectivity from the External Targets page, and trigger an on-demand archive job.
Figure 10.3: CloudArchive to S3 Glacier — end-to-end sequence
sequenceDiagram
participant Cluster as Cohesity Cluster
participant Policy as Protection Policy
participant S3 as S3 Bucket (Standard)
participant Lifecycle as S3 Lifecycle Rule
participant Glacier as S3 Glacier / Deep Archive
Policy->>Cluster: Trigger archive job<br/>(monthly fulls)
Cluster->>Cluster: Dedupe + compress<br/>+ AES-256 encrypt
Cluster->>S3: PutObject (TLS 1.2+)<br/>SSE-KMS server-side
Cluster->>S3: PutObjectRetention<br/>(Object Lock WORM)
S3-->>Cluster: 200 OK + ETag
Note over S3,Lifecycle: Object lives in Standard<br/>per bucket-side rule
Lifecycle->>S3: Day 30: transition GLACIER
Lifecycle->>Glacier: Day 180: transition<br/>DEEP_ARCHIVE
Note over Cluster,Glacier: Recall path on disaster
Cluster->>Glacier: RestoreObject<br/>(Standard, 12h)
Glacier-->>S3: Rehydrate to<br/>temporary copy
Cluster->>S3: GetObject
S3-->>Cluster: Restored chunks<br/>(egress fees apply)
10.5.6 Glacier Pricing and Minimum Retention
| Storage Class | $/GB-month (us-east-1) | Min Retention | Retrieval Time (Standard) | Use Case |
|---|---|---|---|---|
| S3 Standard | ~$0.023 | none | n/a | Active backups, hot recovery |
| S3 Standard-IA | ~$0.0125 | 30 days | n/a | CloudTier destination |
| S3 Glacier Instant Retrieval | ~$0.004 | 90 days | milliseconds | Rare-but-fast archive recall |
| S3 Glacier Flexible Retrieval | ~$0.0036 | 90 days | 3–5 hours (Standard) | Default Glacier tier |
| S3 Glacier Deep Archive | ~$0.00099 | 180 days | 12 hours | Multi-year compliance |
Cost-optimization heuristics:
- Retention under 90 days: stay in Standard or Standard-IA; Glacier early-deletion fees will erase the savings.
- Retention 90 days to 1 year: Glacier Flexible Retrieval or Glacier Instant Retrieval.
- Retention beyond 1 year: Deep Archive — at $0.00099/GB-month, 100 TB costs ~$99/month versus ~$2,300/month at Standard.
- Always model retrieval-request fees in addition to storage. A 100 TB recall from Deep Archive is roughly $0.02/GB plus per-request fees, easily $2,000+ per recall event.
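The heuristics above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the thresholds simply restate the bullets, and the prices are the approximate figures from the table.

```python
def recommend_class(retention_days):
    """Map a retention period to a storage class using the heuristics above."""
    if retention_days < 90:
        return "STANDARD or STANDARD_IA"
    if retention_days <= 365:
        return "GLACIER or GLACIER_IR"
    return "DEEP_ARCHIVE"


def monthly_storage_cost(size_tb, price_per_gb_month):
    """Monthly at-rest cost, using 1 TB = 1,000 GB as the chapter does."""
    return size_tb * 1000 * price_per_gb_month


for days in (30, 180, 2555):
    print(days, "days ->", recommend_class(days))

print(f"100 TB Deep Archive: ${monthly_storage_cost(100, 0.00099):,.0f}/month")
print(f"100 TB Standard:     ${monthly_storage_cost(100, 0.023):,.0f}/month")
```

Storage cost is only one of the inputs; retrieval and request fees still have to be added before the recommendation is final.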
Key Takeaway: The five-step AWS pattern — register target, run CFT for IAM, apply bucket policy, set lifecycle rule, bind to Protection Policy — is the most testable workflow in Chapter 10. Memorize who manages what: Cohesity owns retention; the bucket lifecycle owns storage class.
Section 10.6: Configuring CloudArchive to Azure Blob Storage
Azure’s permission model differs structurally from AWS. There is no JSON IAM policy; you assign RBAC roles to a service principal, managed identity, or user [Source: https://docs.cohesity.com/baas/data-protect/aws-requirements-s3.htm].
10.6.1 The Minimum Role: Storage Blob Data Contributor
The single role a CCAE candidate must remember is Storage Blob Data Contributor. It grants:
- Read, write, and delete on blobs.
- The setBlobTier/action data action — required for tier transitions to Archive.
- Sufficient for backup, archive, list, and restore against the target container.
What it does not grant: control-plane operations like creating storage accounts. That is Storage Account Contributor, which is rarely needed for Cohesity (the customer typically pre-provisions the account).
A common exam trap: assigning Reader or Storage Account Contributor alone. These are control-plane roles — they let you see and configure the storage account but do not grant data-plane access to blobs. The cluster will fail to write objects with confusing 403 errors. Storage Blob Data Contributor is the data-plane role.
10.6.2 Authentication Model — Entra ID Service Principals Preferred
Cohesity supports three Azure auth methods, in this order of preference:
- Microsoft Entra ID (formerly Azure AD) service principal with RBAC — the recommended pattern. Create a service principal in Entra, assign it Storage Blob Data Contributor scoped to the container or storage account, and provide Cohesity the tenant ID, client ID, and client secret. If the cluster is running in Azure (Cloud Edition), use a managed identity to avoid storing secrets at all.
- Shared Access Signature (SAS) — time-limited, scoped tokens. Suitable for short-lived integrations or where RBAC is restricted, but rotation is operator-managed and silent expiry causes failed archives. Avoid for production.
- Storage Account Access Keys — full account-level access. Easiest to configure, hardest to revoke, and the highest blast radius if leaked.
The Entra service principal pattern is the production answer. Always prefer it on the exam.
10.6.3 The setBlobTier Data Action and the Azure Archive Tier
To move a blob to Azure’s Archive access tier — the cheapest tier, comparable to S3 Deep Archive — the principal must have the data action:
Microsoft.Storage/storageAccounts/blobServices/containers/blobs/setBlobTier/action
This action is included in Storage Blob Data Contributor, which is why that role is sufficient. If you build a custom role for least privilege, do not forget this action, or tier transitions will fail silently.
Two mechanisms drive the tier change to Archive:
- Lifecycle Management Policy on the storage account — analogous to S3 lifecycle rules. Example: “move blobs older than 30 days to Cool, then 90 days to Archive.”
- setBlobTier API call — direct per-blob tier setting, useful for Cohesity-driven scripted transitions.
Rehydration from the Archive tier takes up to 15 hours (Standard priority) or up to 1 hour (High priority, additional cost). Cohesity’s restore UI lets the operator pick the rehydration priority [Source: https://www.cohesity.com/glossary/cloud-tiering/].
10.6.4 Private Endpoints — Production Networking
For production deployments, lock down the Blob endpoint to private networking:
- Provision an Azure Private Endpoint on the storage account’s blob sub-resource.
- Approve the private endpoint connection in Networking > Private endpoint connections.
- Ensure the Cohesity cluster is on a VNet that can resolve privatelink.blob.core.windows.net via Azure Private DNS or a custom DNS forwarder.
- If public access is permitted at all, lock the storage account firewall to the Cohesity public IPs or to the VNet/subnet via service tags (Storage.Blob).
A common production pitfall: private endpoint configured but DNS pointed at the public Blob endpoint. The cluster’s traffic falls back to the public IP, which is then blocked by the storage account firewall, and archives fail. Verify DNS resolution explicitly during cutover.
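One way to verify DNS during cutover is to resolve the Blob endpoint from a host that shares the cluster's DNS path and confirm that only private addresses come back. The sketch below uses only the Python standard library; the function names are illustrative and the storage account name in the usage note is hypothetical.

```python
import ipaddress
import socket


def resolve_ips(hostname, port=443):
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(ip_strings):
    """True only if the list is non-empty and every address is private."""
    return bool(ip_strings) and all(
        ipaddress.ip_address(ip).is_private for ip in ip_strings
    )


def check_private_endpoint(hostname):
    """Report whether hostname currently resolves to private addresses only."""
    try:
        ips = resolve_ips(hostname)
    except socket.gaierror as exc:
        return f"{hostname}: resolution failed ({exc})"
    verdict = "private endpoint in use" if all_private(ips) else "PUBLIC address returned"
    return f"{hostname}: {ips} -> {verdict}"
```

Calling `check_private_endpoint("mystorageaccount.blob.core.windows.net")` from a host on the cluster's VNet should report a private address when the Private DNS zone is linked correctly. A public address in the output is the misconfiguration described above.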
10.6.5 Custom Role JSON for Reference
For environments where Storage Blob Data Contributor is too broad, a custom role can be defined:
{
"Name": "Cohesity Blob Archive",
"Actions": [
"Microsoft.Storage/storageAccounts/blobServices/containers/read",
"Microsoft.Storage/storageAccounts/blobServices/containers/write"
],
"DataActions": [
"Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
"Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
"Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
"Microsoft.Storage/storageAccounts/blobServices/containers/blobs/setBlobTier/action"
],
"AssignableScopes": [
"/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}"
]
}
This is the least-privilege equivalent of Storage Blob Data Contributor scoped to a single account.
Key Takeaway: For Azure, Storage Blob Data Contributor is the data-plane minimum role. Always prefer Entra ID service principals over SAS tokens or account keys. Always plan for private endpoints in production. The setBlobTier/action data action is what makes Archive transitions work.
Section 10.7: Storage Classes and Cost Modeling — The Egress Worked Example
10.7.1 S3 and Azure Class Mapping
| Use Case | AWS S3 Class | Azure Blob Tier | Min Retention | Latency |
|---|---|---|---|---|
| Active backups, fast recovery | S3 Standard | Hot | none | ms |
| Cold backups (CloudTier target) | S3 Standard-IA | Cool | 30 days | ms |
| Archive — fast occasional recall | S3 Glacier Instant Retrieval | (no exact equivalent) | 90 days | ms |
| Archive — multi-hour recall ok | S3 Glacier Flexible Retrieval | Cold (preview in regions) | 90 days | 3–5 h (AWS); ms (Azure Cold) |
| Deepest archive — multi-year | S3 Glacier Deep Archive | Archive | 180 days | up to 12 h (AWS); up to 15 h (Azure) |
Figure 10.4: Storage class taxonomy — AWS S3 and Azure Blob tiers
graph TD
Root[Cloud Object Storage Classes]
Root --> AWS[AWS S3]
Root --> AZ[Azure Blob]
AWS --> AStd[S3 Standard<br/>Hot active backups<br/>~$0.023/GB-mo<br/>ms latency]
AStd --> AIA[S3 Standard-IA<br/>CloudTier target<br/>~$0.0125/GB-mo<br/>30-day min]
AIA --> AGIR[Glacier Instant Retrieval<br/>~$0.004/GB-mo<br/>90-day min, ms recall]
AGIR --> AGFR[Glacier Flexible Retrieval<br/>~$0.0036/GB-mo<br/>90-day min, 3-5h recall]
AGFR --> ADA[Glacier Deep Archive<br/>~$0.00099/GB-mo<br/>180-day min, 12h recall]
AZ --> ZHot[Hot tier<br/>Active workloads<br/>ms latency]
ZHot --> ZCool[Cool tier<br/>30-day min<br/>ms latency]
ZCool --> ZCold[Cold tier<br/>90-day min<br/>ms latency]
ZCold --> ZArch[Archive tier<br/>180-day min<br/>up to 15h rehydrate<br/>setBlobTier/action]
style ADA fill:#0d4429,color:#fff
style ZArch fill:#0d4429,color:#fff
style AStd fill:#1f6feb,color:#fff
style ZHot fill:#1f6feb,color:#fff
10.7.2 The Three Cost Components
Every cloud target has three cost layers that an architect must model separately:
- Storage — $/GB-month for the data at rest. Cheapest in Deep Archive or Azure Archive.
- Requests — per-API-call fees for PUT, GET, RESTORE, etc. Glacier and Archive request fees can dominate when many small objects are archived.
- Egress — outbound data transfer when restoring data back on-prem (or to a different region). Often the largest single charge.
Egress is the silent budget killer. AWS charges roughly $0.09/GB egress to Internet (us-east-1, list price; first 100 GB/month free). Azure egress is similarly priced. Within-region transfer to another AWS service is typically free; cross-region transfer is roughly $0.02/GB.
10.7.3 Worked Example: 100 TB Disaster Recall
Scenario: a customer archived 500 TB of monthly fulls to S3 Glacier Deep Archive over four years. A ransomware event destroys the production environment. The customer must recall 100 TB to a new on-prem cluster within 48 hours.
| Cost Component | Calculation | Cost |
|---|---|---|
| Storage at rest, 4 years (upper bound: assumes the full 500 TB is held for all 48 months; $0.00099/GB-month = $0.99/TB-month) | $0.99/TB-month × 500 TB × 48 months | ~$23,760 |
| Restore (Standard retrieval at $0.02/GB) | $0.02 × 100,000 GB | ~$2,000 |
| Restore requests (~$0.025 per 1,000 PUT/RESTORE requests; ~10M objects) | $0.025 × 10,000 | ~$250 |
| Egress to Internet ($0.09/GB) | $0.09 × 100,000 GB | ~$9,000 |
| Total recall event (one-time) | $2,000 + $250 + $9,000 | ~$11,250 |
| Plus 4 years of storage already paid | from the first row | ~$23,760 |
Architectural lessons:
- Egress dominates the recall event. Recalling within AWS (e.g., into EC2 in the same region) makes egress effectively free.
- Storage is cheap; recall is not. Always design recall destinations to keep traffic in-region when possible.
- Bandwidth may be the real constraint. 100 TB over a 1 Gbps WAN link takes ~9 days; AWS Snowball Edge ships 80 TB devices for offline recall.
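The worked example can be reproduced with a short calculator, which makes it easy to rerun the scenario with a customer's own numbers. The default rates are the approximate list prices used in the table, the function names are illustrative, and decimal units (1 TB = 1,000 GB) are assumed throughout.

```python
GB_PER_TB = 1000  # decimal units, matching the chapter's arithmetic


def recall_cost(recall_tb, objects, retrieval_per_gb=0.02,
                request_per_1000=0.025, egress_per_gb=0.09):
    """Break a recall event into retrieval, request, and egress charges."""
    gb = recall_tb * GB_PER_TB
    costs = {
        "retrieval": gb * retrieval_per_gb,
        "requests": objects / 1000 * request_per_1000,
        "egress": gb * egress_per_gb,
    }
    costs["total"] = sum(costs.values())
    return costs


def transfer_days(recall_tb, link_gbps, efficiency=1.0):
    """Days needed to move recall_tb over a link at the given utilization."""
    bits = recall_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400


costs = recall_cost(recall_tb=100, objects=10_000_000)
for name, value in costs.items():
    print(f"{name:>10}: ${value:,.0f}")
print(f"1 Gbps transfer time: {transfer_days(100, 1):.1f} days")
```

Note that the 48-hour target in the scenario would need roughly 4.6 Gbps of sustained throughput, so the bandwidth constraint, not the bill, is what forces an in-region or offline recall design.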
10.7.4 Lifecycle Policies and Rehydration Windows
Best practices for lifecycle and retention alignment:
- Match Cohesity’s retention to the bucket’s lifecycle expiration, with a small safety margin (e.g., expire bucket objects 30 days after Cohesity’s last expected retention day) so orphans are cleaned up but Cohesity-managed objects are never prematurely deleted.
- Stage transitions — go Standard → Standard-IA at 30 days → Glacier at 90 days → Deep Archive at 180 days, rather than jumping straight to Deep Archive. This avoids early-deletion charges if a Protection Policy retention is shortened mid-flight.
- Test rehydration windows quarterly. A 12-hour rehydration on a 50 TB recall is a real RTO cost that should appear in the DR runbook.
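A staged transition plan can be sanity-checked before it is applied to the bucket. The sketch below is a hypothetical validator, not an AWS or Cohesity tool: it confirms that each class is held for at least its minimum duration before the next transition or the expiration day.

```python
MIN_DAYS = {"STANDARD": 0, "STANDARD_IA": 30, "GLACIER": 90, "DEEP_ARCHIVE": 180}


def validate_stages(transitions, expiration_day):
    """Check that each class is held for at least its minimum duration.

    transitions: list of (day, storage_class) sorted by day.
    Returns a list of human-readable problems; empty means the plan is clean.
    """
    problems = []
    for i, (day, cls) in enumerate(transitions):
        if i + 1 < len(transitions):
            next_day = transitions[i + 1][0]
        else:
            next_day = expiration_day
        held = next_day - day
        if held < MIN_DAYS[cls]:
            problems.append(f"{cls}: held {held} days, minimum is {MIN_DAYS[cls]}")
    return problems


staged = [(30, "STANDARD_IA"), (90, "GLACIER"), (180, "DEEP_ARCHIVE")]
print(validate_stages(staged, expiration_day=2585))

rushed = [(30, "GLACIER"), (60, "DEEP_ARCHIVE")]
print(validate_stages(rushed, expiration_day=200))
```

The staged plan from the bullet above passes cleanly, while the rushed plan reports two violations. The expiration day of 2585 assumes a 7-year Cohesity retention plus the 30-day safety margin.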
Key Takeaway: Storage cost is the headline; egress and request fees are the surprise. Always model the recall scenario, not just the resting state, and prefer recovery destinations that keep traffic in-region.
Chapter Summary
Cohesity’s four cloud integration patterns solve four distinct architectural problems:
- CloudArchive copies snapshots to cheap object storage for long-term retention; the cluster keeps the local copy and remains authoritative. CloudArchive Direct streams directly to the target without the local full copy.
- CloudTier moves cold blocks out to free local capacity — no second copy, irreversible without recall. Driver is on-cluster footprint, not retention.
- CloudReplicate replicates to a Cohesity Cloud Edition in AWS/Azure/GCP — best RTO, highest cost.
- CloudSpin converts a backup to a native cloud VM on demand — for bursting, forensics, and migration trials, not continuous DR.
The five-step AWS pattern: register External Target (Archival), run the Cohesity CFT for IAM, apply the bucket policy, set the S3 lifecycle rule (respecting 90-day Glacier and 180-day Deep Archive minimums), and bind to a Protection Policy. For Azure, the minimum role is Storage Blob Data Contributor (which includes setBlobTier/action); always prefer Entra service principals and Private Endpoints in production.
Egress is the cost layer most likely to surprise — a 100 TB recall to on-prem costs about $9,000 in egress alone, often more than a year of resting storage. Design recall destinations to stay in-region, and rehearse rehydration windows in runbooks. When in doubt on the exam, ask: is the driver local capacity (tier), long-term retention (archive), continuous cloud DR (replicate), or one-shot cloud VM (spin)?
Key Terms
- CloudArchive — Copies snapshots from the cluster to a registered external target (object storage, NFS, or tape) for LTR. Cluster keeps the local copy.
- CloudArchive Direct — Streaming variant that omits the full local copy, keeping only metadata on cluster while streaming bulk data to the target.
- CloudTier — Moves cold blocks from the cluster to cloud object storage to free local capacity. Single-copy move, not a copy; recall required to read.
- CloudReplicate — Replication from on-prem Cohesity cluster to a Cloud Edition cluster running in a public cloud, providing full DataPlatform features.
- CloudSpin — Converts a VM backup into a native cloud VM (EC2, Azure VM, GCE) on demand. Used for bursting, testing, and forensic isolation.
- External Target — Registered storage destination (object store, NFS, tape) in Cohesity Inventory. Typed at registration as Archival or Tiering.
- S3 Glacier — AWS archival class family: Glacier Instant Retrieval (90-day min, ms recall), Flexible Retrieval (90-day min, 3–5 h), Deep Archive (180-day min, 12 h).
- Azure Archive — Azure Blob’s coldest tier, comparable to Glacier Deep Archive. Up to 15 h rehydration. Requires setBlobTier/action to enter.
- Lifecycle policy — Bucket-side (S3) or storage-account-side (Azure) rule transitioning objects between classes by age. Distinct from Cohesity retention, which governs lifetime.
- Storage Blob Data Contributor — Minimum Azure RBAC role for Cohesity to read/write/delete/tier blobs. The data-plane role; control-plane roles alone are insufficient.
- CloudFormation Template (CFT) — Cohesity-published AWS automation that creates a least-privilege IAM role and bucket policy for S3 archival.
- S3 Object Lock — AWS bucket feature for WORM immutability; must be enabled at bucket creation. Cohesity uses s3:PutObjectRetention to set per-object retention.
- Egress — Outbound transfer from a cloud region to Internet or another region; typically $0.09/GB. Largest variable cost in recall scenarios.
- Rehydration — Restoring an archived object to a readable class. Deep Archive: up to 12 h Standard; Azure Archive: up to 15 h Standard.
Chapter 11: Security, Encryption, and Ransomware Resilience
Backups used to be the last thing an attacker thought about. Today they are the first. Modern ransomware operators have learned that destroying recovery points is the fastest way to force payment, and they routinely spend days or weeks inside an environment specifically hunting for backup admin credentials before triggering encryption. For a Cohesity architect, this changes the design conversation completely. Security is no longer a hardening checklist applied after the cluster is built — it is the central design axis around which encryption, immutability, isolation, detection, and recovery are organized. This chapter walks through that axis end-to-end, from FIPS-validated cryptography on individual disks to multi-cloud cyber vaults that cannot be touched even by a fully compromised root account.
Learning Objectives
By the end of this chapter you will be able to:
- Apply defense-in-depth across hardware, OS, software, and identity layers on a Cohesity cluster.
- Configure FIPS-validated encryption at rest and in transit using software encryption, Self-Encrypting Drives (SEDs), and external KMIP/KMS providers.
- Architect immutability with DataLock, WORM semantics, and legal hold workflows enforced by a Security Officer role.
- Differentiate Cohesity FortKnox cyber vaulting from CloudArchive and explain when each is appropriate.
- Design ransomware detection and clean-room recovery patterns using Cohesity DataHawk, including anomaly detection, threat intelligence, and BigID-powered classification.
- Layer DataLock + FortKnox + DataHawk into a coherent threat-defense architecture for a regulated workload.
Figure 11.1: Defense-in-Depth Layers across the Cohesity Security Stack
flowchart LR
HW[Hardware<br/>SED + FIPS modules] --> OS[OS<br/>Hardened Linux]
OS --> SW[Software<br/>SpanFS + TLS]
SW --> ID[Identity<br/>SSO + MFA + Quorum]
ID --> DATA[Data Immutability<br/>DataLock + WORM]
Encryption at Rest and In Transit
Encryption is the foundation. If the disks walk out of the data center, if a replication packet is captured on the wire, or if a cloud archive bucket is misconfigured, the data must remain unreadable. Cohesity supports two parallel encryption approaches at rest — software encryption performed by SpanFS and hardware encryption performed by Self-Encrypting Drives — and TLS for every byte that leaves the cluster.
Software Encryption vs. Self-Encrypting Drives
Software encryption (sometimes called “AES at the SpanFS layer”) is performed by the Cohesity software itself before data is written to disk. Every chunk that lands in a chunk file is encrypted with AES-256 using a Data Encryption Key (DEK) that is itself wrapped by a Key Encryption Key (KEK). The advantage is portability: software encryption works on any node — physical, virtual, or cloud — regardless of the underlying drive technology. The cost is a small CPU overhead, typically absorbed by AES-NI hardware acceleration on modern Intel and AMD processors.
Self-Encrypting Drives (SEDs) push encryption down into the drive’s firmware. The drive itself holds the Media Encryption Key (MEK) and refuses to release plaintext without an authentication key. SEDs are appealing for compliance because the cryptographic boundary is the physical drive — pulling a drive out of a chassis and walking away with it yields ciphertext. Cohesity’s SED-equipped nodes ship as a hardware option on supported appliances and ReadyNodes.
In practice, architects choose between them based on three factors:
| Factor | Software Encryption | Self-Encrypting Drives (SED) |
|---|---|---|
| Where it runs | SpanFS / Cohesity software | Drive firmware (FIPS 140-2/3 validated) |
| Form factor | All (physical, VE, Cloud Edition) | Physical appliances and ReadyNodes only |
| Performance impact | Small (AES-NI accelerated) | None on the host |
| Key management | KMIP, internal KMS, or AWS/Azure KMS | KMIP authentication key, drive holds MEK |
| Crypto-erase | Re-key wipes data logically | PSID revert wipes drive instantly |
| Typical use case | Mixed environments, VE, cloud | High-compliance physical sites, fast decommission |
A useful analogy: software encryption is like keeping your valuables in a locked safe inside your house — the safe goes with you wherever you live. SEDs are like buying a house where every room has its own combination lock built into the wall — only available in certain houses, but you do not need to bring the safe yourself.
Customer-Managed Keys with KMIP and KMS
Encryption is only as strong as the key custody model. Cohesity supports an internal key manager for small deployments, but at enterprise scale the assumption is that keys live in a customer-controlled Key Management System (KMS) and are fetched by the cluster over the Key Management Interoperability Protocol (KMIP). Common KMIP integrations include Thales CipherTrust, Entrust KeyControl, IBM Guardium / SKLM, and HashiCorp Vault (via its KMIP secrets engine). Cloud key services such as AWS KMS and Azure Key Vault are also supported, but they are reached through their own native APIs rather than KMIP.
The flow looks like this:
- The cluster generates DEKs locally for each chunk file.
- DEKs are wrapped using a KEK that lives only in the external KMS.
- To read or write, the cluster calls the KMS over KMIP/TLS to wrap or unwrap the DEK.
- If the KMS is unreachable or revokes the KEK, encrypted data on the cluster becomes inaccessible — a powerful kill-switch in a breach scenario.
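The four steps above describe envelope encryption, and the key hierarchy can be sketched in a few lines. This is a teaching model only: `ToyKms` is a hypothetical in-process stand-in for an external KMIP server, and the SHA-256 counter-mode keystream is a toy substitute for AES-256 chosen so the example runs on the standard library alone. It must not be used to protect real data.

```python
import hashlib
import secrets


def _keystream_xor(key, nonce, data):
    """Toy stream cipher (SHA-256 in counter mode). Stand-in for AES-256 only."""
    out = bytearray()
    for block in range((len(data) + 31) // 32):
        pad = hashlib.sha256(key + nonce + block.to_bytes(8, "big")).digest()
        chunk = data[block * 32:(block + 1) * 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class ToyKms:
    """Hypothetical external KMS: the KEK never leaves this object."""

    def __init__(self):
        self._kek = secrets.token_bytes(32)
        self.revoked = False

    def _check(self):
        if self.revoked:
            raise PermissionError("KEK revoked: DEKs can no longer be unwrapped")

    def wrap(self, dek):
        self._check()
        nonce = secrets.token_bytes(16)
        return nonce + _keystream_xor(self._kek, nonce, dek)

    def unwrap(self, wrapped):
        self._check()
        nonce, body = wrapped[:16], wrapped[16:]
        return _keystream_xor(self._kek, nonce, body)


def write_chunk(kms, plaintext):
    """Generate a DEK locally, encrypt the chunk, store only the wrapped DEK."""
    dek = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(dek, nonce, plaintext)
    return {"wrapped_dek": kms.wrap(dek), "nonce": nonce, "ciphertext": ciphertext}


def read_chunk(kms, record):
    """Ask the KMS to unwrap the DEK, then decrypt the chunk in memory."""
    dek = kms.unwrap(record["wrapped_dek"])
    return _keystream_xor(dek, record["nonce"], record["ciphertext"])


kms = ToyKms()
record = write_chunk(kms, b"backup chunk contents")
print(read_chunk(kms, record))
kms.revoked = True  # simulate KEK revocation; read_chunk() now raises PermissionError
```

The point of the model is the custody split: the stored record holds only ciphertext and a wrapped DEK, so once the KMS refuses to unwrap, nothing on the cluster can recover the plaintext.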
For CCAE design questions, watch for clues that point to customer-managed keys: regulated industries (healthcare, financial services, government), explicit mentions of “key escrow,” “BYOK,” “HYOK,” or “separation of duties between storage admin and security admin.” All of these are KMIP signals.
Figure 11.2: KMIP Key Management Flow between Cohesity and the External KMS
sequenceDiagram
participant Cluster as Cohesity Cluster
participant KMS as KMIP / KMS Server
participant Disk as SpanFS Chunk File
Cluster->>Cluster: Generate local DEK
Cluster->>KMS: Request KEK wrap (KMIP/TLS)
KMS-->>Cluster: Wrapped DEK released
Cluster->>Disk: Encrypt chunk with DEK
Note over Cluster,KMS: On read: cluster requests unwrap
Cluster->>KMS: Unwrap DEK request
KMS-->>Cluster: Plaintext DEK (in-memory)
Cluster->>Disk: Decrypt chunk
Note over KMS: Revoke KEK = global kill-switch
TLS for Management and Replication
In transit, every interface that leaves the node is TLS-protected. Management UI and REST API traffic uses TLS 1.2 or 1.3 with administrator-installed CA-signed certificates (avoid the self-signed defaults in production). Replication between source and target clusters is encrypted, compressed, and deduplicated on the wire. Cloud archive traffic to S3, Azure Blob, or GCS uses HTTPS with the cloud provider’s TLS endpoint.
For the exam, remember that protocols like SMB and NFS — used by SmartFiles — have their own encryption modes. SMB3 supports per-session encryption (SMB Encryption); NFSv4.1 with Kerberos krb5p provides privacy. These are configured per View, not globally, and are an architect’s lever when a tenant needs encrypted client traffic without VPN overhead.
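On the client side, automation that calls the management REST API should enforce the same floor rather than trusting defaults. A minimal sketch using Python's standard `ssl` module follows; the function name and the idea of passing a private CA bundle are illustrative assumptions, not a Cohesity-documented procedure.

```python
import ssl


def strict_client_context(ca_file=None):
    """Build a client TLS context that verifies certificates and requires TLS 1.2+.

    ca_file: optional path to the CA bundle that signed the cluster certificate;
    when omitted, the system trust store is used.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = strict_client_context()
print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

The resulting context can be passed to `urllib.request.urlopen(url, context=ctx)`. A cluster still presenting a self-signed default certificate will fail verification, which is the desired behavior in production.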
FIPS 140-2 / 140-3 Mode
For US federal customers and many regulated industries, the cluster must operate in FIPS mode, which forces all cryptographic operations through FIPS 140-2 (or the newer 140-3) validated modules. Enabling FIPS mode on a Cohesity cluster:
- Restricts cipher suites for TLS to FIPS-approved algorithms.
- Disables non-compliant algorithms (e.g., MD5, SHA-1 for signatures).
- Forces SEDs to operate in their FIPS-validated configuration.
- Requires that the external KMS also be FIPS-validated for end-to-end compliance.
FIPS mode is a cluster-wide setting, not per-View, and is most easily enabled at deployment time. Toggling FIPS on a brownfield cluster is supported but requires a validated change window because some services restart.
Key Takeaway: Cohesity offers parallel encryption paths — software (universal, AES-NI accelerated) and SED (hardware-rooted, physical-only) — both manageable via KMIP-attached KMS. Combined with TLS for all in-transit paths and FIPS 140-2/140-3 mode for regulated workloads, encryption gives the architect complete cryptographic separation between data, keys, and operators. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]
Immutability and DataLock
Encryption protects confidentiality. Immutability protects existence. A ransomware actor who has stolen storage admin credentials can decrypt and re-encrypt as they like — what they cannot do, against an immutable backup, is delete it. Cohesity’s immutability story has two layers: the SpanFS baseline that applies to every snapshot ever taken, and the DataLock policy layer that turns selected snapshots into time-bound WORM objects no one can remove.
SpanFS Baseline Immutability
By default, every Cohesity backup snapshot is stored in SpanFS as a read-only, immutable object. The original “gold copy” cannot be mounted, modified, encrypted in place, or deleted by any external system, application, or process. When a workload needs read/write access to a backup — for instant recovery, dev/test refresh, or sandbox investigation — Cohesity creates a zero-cost, redirect-on-write clone. The clone is read/write; the gold copy is not. This single property defeats the most common ransomware pattern, which is to enumerate backup files and overwrite them in place. [Source: https://www.cohesity.com/blogs/how-backup-immutability-defends-against-ransomware-attacks/]
DataLock Policies and WORM Semantics
DataLock takes baseline immutability and adds a hardened, time-bound, role-bound enforcement layer. A DataLock policy is applied to a Protection Group (or to specific snapshots) by a designated Security Officer — a separate Cohesity role distinct from the regular cluster administrator. Once the policy is applied:
- The snapshot enters Write Once, Read Many (WORM) state for the configured retention period.
- In Compliance mode (described below), it cannot be deleted or modified by any user, including the cluster administrator, the Security Officer who applied the lock, or any account with full privileges. The lock is enforced by the platform itself.
- In Compliance mode the lock cannot be shortened, removed, or “talked out of” before the timer expires; it can only be extended. Governance mode reserves a controlled override for the Security Officer alone.
- DataLock applies equally to copies that are tiered or archived to cloud targets (CloudArchive, FortKnox), so the WORM property follows the data. [Source: https://www.cohesity.com/resources/solution-brief/counter-ransomware-attacks-with-cohesity/]
Think of DataLock as a safe deposit box with a time lock. The bank manager set the timer this morning and locked the door behind you. Even if the manager comes back at noon and wants to open it — even if they are the most senior person in the bank — they physically cannot. The vault opens when the timer says it opens, not before. That is exactly what DataLock does to a snapshot.
Figure 11.3: DataLock Lifecycle States
stateDiagram-v2
[*] --> Created: Snapshot taken
Created --> Locked: Security Officer applies DataLock
Locked --> Locked: Extend retention (allowed)
Locked --> Expired: Retention timer ends
Locked --> LegalHold: Legal hold applied
LegalHold --> Locked: Hold released
Expired --> Released: Snapshot deletable
Released --> [*]
note right of Locked
WORM enforced
Cannot be shortened
Cannot be deleted
end note
Compliance Lock vs. Governance Lock
DataLock is offered in two flavors that map to two different regulatory postures:
| Mode | Who can shorten retention | Typical use case | Regulatory mapping |
|---|---|---|---|
| Governance | Security Officer can shorten or remove the lock | Internal policy, operational immutability | Best-practice data-protection hygiene |
| Compliance | Nobody — not even the Security Officer | SEC 17a-4, FINRA, HIPAA retention floors | Strict regulatory immutability |
Governance mode gives the security team an override path for legitimate exceptions; Compliance mode locks the door even on the Security Officer. For a CCAE scenario question, the giveaway phrase for Compliance mode is anything resembling “regulator requires that no individual, regardless of role, be able to delete records before retention expires.” For Governance mode, the phrase is typically “internal policy” or “data protection officer needs the ability to override in extraordinary circumstances.”
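The difference between the two modes can be captured in a toy state model. The class below is purely illustrative (it is not a Cohesity API, and the role string is a hypothetical label); it encodes the rules from the table: extension is always allowed, shortening is limited to the Security Officer in Governance mode, and Compliance mode refuses everyone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LockedSnapshot:
    """Toy model of a DataLock-protected snapshot (not a Cohesity API)."""
    mode: str                 # "governance" or "compliance"
    locked_until: datetime
    legal_hold: bool = False

    def extend(self, new_until):
        """Extending retention is always allowed."""
        if new_until <= self.locked_until:
            raise ValueError("extend() requires a later date")
        self.locked_until = new_until

    def shorten(self, new_until, role):
        """Only a Security Officer in governance mode may shorten the lock."""
        if self.mode == "compliance":
            raise PermissionError("compliance lock cannot be shortened by anyone")
        if role != "security_officer":
            raise PermissionError("only the Security Officer may override")
        self.locked_until = min(self.locked_until, new_until)

    def can_delete(self, now):
        """Deletable only after the timer expires and with no legal hold."""
        return now >= self.locked_until and not self.legal_hold


now = datetime(2025, 1, 1)
snap = LockedSnapshot("governance", now + timedelta(days=365))
snap.shorten(now + timedelta(days=30), role="security_officer")
print(snap.locked_until, snap.can_delete(now))
```

Switching `mode` to `"compliance"` makes the same `shorten()` call raise `PermissionError`, which mirrors the regulator-driven scenario wording.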
Legal Hold and Snapshot Deletion Approval
Legal hold is a related but distinct workflow. While DataLock is a time-bound lock applied at policy creation time, legal hold is an indefinite freeze applied when litigation or investigation is anticipated. Legal hold extends retention until the hold is explicitly released, regardless of the DataLock timer.
For sensitive operations that fall outside the locked window — deleting a Protection Group, changing a retention policy, removing an external target, or releasing a legal hold — Cohesity supports quorum approval, a multi-person workflow in which the action stays pending until N of M approvers (typically 2 of 3 or 3 of 5) have signed off. Quorum approval is the architectural answer to “what if a single privileged account is compromised?” The compromised account can request the deletion; it cannot finish it without independent approval from accounts the attacker does not control. For DR and DataLock removal, quorum approval is typically combined with MFA-enforced logins to make credential theft alone insufficient.
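The N-of-M idea is simple enough to model directly. The class below is an illustrative sketch, not a Cohesity API; the rule that a requester cannot approve their own request is an assumption added here to reflect the "independent approval" requirement described above.

```python
class QuorumRequest:
    """Toy N-of-M approval gate (illustrative, not a Cohesity API)."""

    def __init__(self, action, requester, approvers, required):
        if required > len(approvers):
            raise ValueError("required approvals exceed approver pool")
        self.action = action
        self.requester = requester
        self.approvers = set(approvers)
        self.required = required
        self.approved_by = set()

    def approve(self, user):
        """Record an approval; the requester cannot approve their own request."""
        if user == self.requester:
            raise PermissionError("requester cannot self-approve")
        if user not in self.approvers:
            raise PermissionError(f"{user} is not an authorized approver")
        self.approved_by.add(user)

    @property
    def executable(self):
        """True once enough distinct approvers have signed off."""
        return len(self.approved_by) >= self.required


req = QuorumRequest("delete protection group", "alice",
                    approvers=["bob", "carol", "dave"], required=2)
req.approve("bob")
print(req.executable)
req.approve("carol")
print(req.executable)
```

Because approvals are stored in a set, a single compromised approver clicking twice still counts once; the attacker needs a second, independent account to cross the threshold.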
Key Takeaway: SpanFS makes every snapshot immutable; DataLock makes selected snapshots undeletable for a specified duration. Governance mode allows a controlled override by the Security Officer; Compliance mode binds everyone, including the Security Officer who applied the lock. Combined with legal hold and quorum approval, immutability transforms backups from a deletable target into a durable, time-locked recovery substrate. [Source: https://www.cohesity.com/blogs/guarding-against-ransomware-requires-more-than-just-detection/]
Cyber Vaulting with Cohesity FortKnox
DataLock prevents deletion. It does not, by itself, protect against scenarios in which an attacker has unfettered network access to the cluster and unlimited time. For that, the architect needs isolation — physical and operational separation between the production estate and a tertiary copy of the data. Cohesity FortKnox is the answer.
What FortKnox Is
FortKnox is a SaaS-delivered cyber vault — Cohesity calls it Data Isolation and Recovery as a Service, or DIRaaS — that stores an immutable, isolated tertiary copy of backup data in a Cohesity-managed cloud tenant on AWS, Azure, or Google Cloud. The customer does not deploy or maintain vault infrastructure; they subscribe, point source clusters at the service, and configure vaulting policies. [Source: https://www.cohesity.com/resources/datasheet/cohesity-fortknox/]
If FortKnox is the Swiss bank vault — a service operated for you by a separate institution, in a separate jurisdiction, with multi-person approval to open the door — then CloudArchive is the warehouse rented across town: cheap, capacious, and reachable any time you have the key.
The Virtual Air Gap
The defining architectural feature of FortKnox is the virtual air gap. The network connection between the source Cohesity cluster and the FortKnox vault is opened only during a configurable transfer window, just long enough to ingest a new vaulted copy, and is then severed. Outside the transfer window, the vault is operationally unreachable from the production cluster — there is no live network path an attacker can ride from a compromised cluster to the vaulted copies. This contrasts with CloudArchive’s persistent connection, which is appropriate for ongoing tiering and retention but offers no isolation guarantee. [Source: https://www.cohesity.com/blogs/going-beyond-the-air-gap-data-isolation-and-recovery-for-the-modern-era/]
Figure 11.4: FortKnox Cyber Vault Flow with Virtual Air Gap and Quorum Recovery
flowchart TD
SRC[Source Cohesity Cluster] --> WIN{Transfer<br/>Window Open?}
WIN -->|02:00-04:00| OPEN[Air Gap Opens]
OPEN --> REPL[Replicate Snapshot<br/>to FortKnox SaaS Vault]
REPL --> CLOSE[Air Gap Closes]
CLOSE --> VAULT[(Isolated<br/>WORM Vault<br/>AWS / Azure / GCP)]
VAULT --> RREQ[Recovery Request]
RREQ --> QUORUM{Quorum<br/>2-of-3 Approved?}
QUORUM -->|No| DENY[Operation Blocked]
QUORUM -->|Yes| RECOVER[Recover to Source /<br/>Alternate Cluster /<br/>Cloud Target]
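The window check at the top of the flow can be expressed as a small predicate. The sketch below is illustrative: the real gate is enforced by the FortKnox service, and the function name `air_gap_open` and the 02:00 to 04:00 defaults (taken from the figure) are assumptions for the example.

```python
from datetime import time


def air_gap_open(now: time, start: time = time(2, 0), end: time = time(4, 0)) -> bool:
    """Return True if `now` falls inside the transfer window [start, end).

    Also handles windows that wrap past midnight, such as 23:00 to 01:00.
    """
    if start <= end:
        return start <= now < end
    # Wrapped window: open late in the evening or early in the morning.
    return now >= start or now < end
```

For example, `air_gap_open(time(3, 0))` is true, while `air_gap_open(time(12, 0))` is false: for the other 22 hours of the day there is no path to the vault.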
Mandatory Multi-Person Quorum
FortKnox enforces multi-person quorum approval for sensitive operations — recoveries, retention changes, vault configuration changes — at the vault level. Typically two or more authorized users must approve before the operation proceeds. The control is purpose-built for insider-threat and stolen-credential scenarios: a single privileged account is never enough to exfiltrate, destroy, or release vaulted data. [Source: https://aws.amazon.com/blogs/apn/defending-against-ransomware-with-aws-and-cohesity-fortknox/]
Defense Layers Inside the Vault
Once data lands in FortKnox, it inherits and extends Cohesity’s defense model:
- Physical and tenant separation. The vault lives in a Cohesity-managed cloud tenant, in a separate trust domain from the customer’s production cluster.
- Network and operational isolation. Virtual air gap, separate identity plane, separate management.
- WORM immutability. Every vaulted snapshot is locked, with the same DataLock semantics that apply on-prem.
- ML-based anomaly detection. Anomaly scoring runs on vaulted data, not just on production backups.
- Flexible recovery targets. Recover to the original source cluster, an alternate cluster, or directly into a target cloud, supporting a wide range of disaster scenarios.
FortKnox vs. CloudArchive
This comparison is one of the most testable items in the chapter:
| Dimension | FortKnox (Cyber Vault SaaS) | CloudArchive |
|---|---|---|
| Primary purpose | Isolated, immutable cyber-recovery vault | Long-term cloud tiering / archive |
| Connectivity | Virtual air gap; network open only during transfer windows | Persistent connection to cloud target |
| Approval model | Mandatory multi-person quorum for recoveries / critical actions | Standard MFA + RBAC, no vault-level quorum |
| Operating model | Cohesity-managed SaaS (DIRaaS), no customer infrastructure | Customer-configured external target (S3/Blob/GCS) |
| Cloud | AWS, Azure, GCP — Cohesity tenant | Customer’s own buckets in any supported cloud |
| Use case | Ransomware/cyber-recovery “third copy” in 3-2-1-1-0 | Cost-optimized long-term retention and compliance archive |
| Cost profile | Subscription, premium for isolation + service | Storage + egress, customer-controlled |
A useful CCAE heuristic: when the scenario uses words like “cyber recovery,” “isolated copy,” “air gap,” “ransomware blast radius,” or “regulatory mandate to keep an offline copy,” the answer is FortKnox. When the scenario uses words like “long-term retention,” “7-year archive,” “tape replacement,” or “cold storage,” the answer is CloudArchive. They are complementary; many enterprises use both.
Key Takeaway: FortKnox is a SaaS cyber vault (DIRaaS) that adds three controls CloudArchive does not have — virtual air gap, mandatory multi-person quorum, and Cohesity-managed isolation across AWS/Azure/GCP — making it the right answer for ransomware-resilient tertiary copies. CloudArchive remains the right answer for long-term retention and cost optimization. [Source: https://www.cohesity.com/blogs/cohesity-fortknox-is-now-available-on-google-cloud/]
Ransomware Detection and Recovery with DataHawk
Encryption protects confidentiality, immutability prevents deletion, and FortKnox provides isolation, but none of those tells you that an attack is happening. Detection and clean-room recovery are the job of Cohesity DataHawk.
What DataHawk Does
DataHawk is the AI/ML-driven security service inside the Cohesity Data Cloud. It packages three capabilities — ransomware anomaly detection, threat intelligence-based malware hunting, and BigID-powered data classification — into a single SaaS offering whose job is to answer the three questions that arise during any cyber incident:
- Is there an attack in progress? Anomaly detection.
- Where is the malware, and which recovery point is clean? Threat intelligence and YARA scanning.
- What sensitive data was exposed? BigID classification.
[Source: https://www.cohesity.com/blogs/introducing-cohesity-datahawk/]
Figure 11.5: DataHawk Three-Pillar Architecture
graph TD
DH[Cohesity DataHawk<br/>AI/ML Security SaaS]
DH --> AD[Anomaly Detection]
DH --> TI[Threat Intelligence]
DH --> CL[BigID Classification]
AD --> AD1[Entropy analysis]
AD --> AD2[Change-rate baselines]
AD --> AD3[Clean snapshot recommendation]
TI --> TI1[100K+ IOCs daily]
TI --> TI2[YARA + CrowdStrike feeds]
TI --> TI3[Malware hash matching]
CL --> CL1[200+ patterns]
CL --> CL2[50+ compliance policies]
CL --> CL3[PII / PHI / PCI / GDPR]
Anomaly Detection via Entropy and Change Rate
DataHawk continuously analyzes backup snapshots and produces an anomaly strength score for each one based on machine-learning models trained on per-workload baseline behavior. The features the models inspect include:
- Data entropy. Encrypted-in-place data has a near-uniform byte distribution; normal application data does not. A sudden rise in average entropy across a snapshot is a strong signal of mass encryption.
- File and object change rates. A workload that normally changes 2% of its files per night and suddenly changes 80% is likely under attack, not getting busier.
- Write/modification patterns. Bulk file extension changes (e.g., everything ending in .locked or .crypt) and sudden new file creation/deletion patterns are flagged.
- Per-workload baselines. A SQL transaction log workload has very different normal behavior than a user fileshare; DataHawk learns each.
Anomalous snapshots are flagged in the anti-ransomware dashboard. The same scoring drives the clean snapshot recommendation that points administrators at the last-known-good recovery point, which is critical because a naive restore from “the most recent backup” will often restore the encryption itself. [Source: https://www.cohesity.com/blogs/cohesity-ransomware-detection-machine-learning-models/]
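To make the entropy and change-rate signals concrete, here is a minimal sketch of both calculations. It is not DataHawk's model, which is a trained ML system; the function names and the simple z-score baseline are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def change_rate_zscore(history: list[float], today: float) -> float:
    """How many standard deviations today's change rate is from the baseline."""
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = math.sqrt(variance) or 1e-9  # avoid dividing by zero on a flat baseline
    return (today - mean) / std
```

Encrypted or well-compressed data scores close to 8 bits per byte, while text and most structured application data score noticeably lower. A workload whose nightly change rate hovers around 2% and then jumps to 80% produces a z-score far outside any reasonable threshold.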
Threat Intelligence and YARA
Rather than asking customers to author and maintain their own YARA rules, DataHawk ships an automated, continuously updated threat-intelligence feed of more than 100,000 indicators of compromise (IOCs) refreshed daily from 160,000+ sources, including curated YARA rules, CrowdStrike Falcon Intelligence, and Cohesity-curated default libraries. When DataHawk scans backup data, it identifies:
- Malware hashes present in the snapshot.
- The specific files that contain them.
- The variant or family involved.
This converts “we were hit, restore everything from yesterday” into “we were hit, here are the 312 infected files and the time window of compromise — restore those specifically.” [Source: https://www.cohesity.com/blogs/cohesity-datahawk-continuing-the-ai-ml-transformation-of-data-security-and-management/]
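The hash-matching step is conceptually simple, as the sketch below shows. It walks a restored directory tree and reports files whose SHA-256 digest appears in a set of known-bad hashes. The function names and the idea of scanning a mounted or restored copy on a local path are assumptions for illustration; DataHawk performs its scans as a service against backup data.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_infected(root: Path, ioc_hashes: set[str]) -> list[Path]:
    """Return every file under `root` whose SHA-256 matches a known IOC hash."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and sha256_of(p) in ioc_hashes
    )
```

The output is a list of specific file paths, which is what turns a whole-system restore into a targeted one.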
BigID-Powered Data Classification
DataHawk integrates the BigID classification engine to discover and classify sensitive data inside backup snapshots. The engine combines regular expressions, named-entity recognition, AI/ML classifiers, 200+ predefined patterns, and 50+ out-of-the-box compliance policies (PII, PHI/HIPAA, PCI payment data, GDPR, and more).
After an anomaly hits or malware is found, classification reports tell the responder exactly which categories of sensitive data lived in the affected files. This is essential for:
- Breach notification timelines (HIPAA 60 days, GDPR 72 hours).
- Regulatory reporting accuracy.
- Incident scoping (“did the attacker reach PHI?”).
[Source: https://www.cohesity.com/platform/data-classification/]
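Pattern-based classification, the simplest of the techniques listed above, can be illustrated with a few regular expressions plus a checksum. This is a toy sketch and not the BigID engine: the three patterns, their labels, and the `classify` function are assumptions, and production classifiers add context, validation, and ML to cut false positives.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CARD_CANDIDATE": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum used by payment card numbers; ignores spaces and dashes."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return len(digits) >= 13 and checksum % 10 == 0


def classify(text: str) -> dict[str, int]:
    """Count matches per sensitive-data category found in `text`."""
    hits: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if label == "CARD_CANDIDATE":
            # Keep only digit runs that pass the Luhn check.
            matches = [m for m in matches if luhn_valid(m)]
            label = "PAYMENT_CARD"
        if matches:
            hits[label] = len(matches)
    return hits
```

A report built from counts like these is what lets a responder answer "did the affected files contain PHI or card data?" without opening every file.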
Clean-Room Recovery Pattern
Even with isolated, immutable, classified backups, restoring straight back to production is risky: malware may still be resident in the production environment or dormant in the restored data. The clean-room recovery pattern is:
- Identify the clean recovery point using DataHawk’s ML recommendation.
- Provision an isolated environment — an alternate cluster, an alternate VLAN, or a recovery-only AWS/Azure VPC.
- Restore from FortKnox or DataLock-protected snapshot into the clean room.
- Run threat intelligence scans, AV scans, and integrity validation against the restored data.
- Cut over only after validation passes; otherwise iterate further back in time.
The clean-room is the “operating room” of recovery: sterile, instrumented, and isolated until you are sure the patient is no longer contagious.
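The "iterate further back in time" logic can be sketched as a loop over candidate recovery points. The snapshot identifiers and the `validators` callables are hypothetical stand-ins for the threat-intelligence scan, AV scan, and integrity checks that would run inside the clean room after each restore.

```python
from typing import Callable, Iterable, Optional


def find_clean_snapshot(
    snapshots_newest_first: Iterable[str],
    validators: list[Callable[[str], bool]],
) -> Optional[str]:
    """Return the newest snapshot that passes every validation check.

    Each validator takes a snapshot ID and returns True when the restored
    copy looks clean. Returns None if no candidate passes.
    """
    for snapshot in snapshots_newest_first:
        if all(check(snapshot) for check in validators):
            return snapshot
    return None
```

In practice the candidate list would start at the DataHawk-recommended recovery point rather than the most recent backup, which shortens the search.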
Key Takeaway: DataHawk turns backups into a security telemetry source with three layered capabilities — entropy/change-rate anomaly detection, daily-refreshed threat intelligence with 100K+ IOCs, and BigID classification using 200+ patterns and 50+ policies. Combined with clean-room recovery, this implements the NIST Detect/Respond/Recover functions across the entire backup estate. [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]
Hardening, Compliance, and Defense-in-Depth
The cluster itself must be hardened before any of the above features matter. A weakly secured admin plane is the easiest way for an attacker to dismantle the rest.
Identity, MFA, and Quorum
- Strong identity. Integrate with AD or LDAP; do not run on local accounts in production. SAML SSO with an enterprise IdP (Okta, Azure AD, Ping) is preferred for centralized lifecycle management.
- MFA enforcement. Mandatory for all administrative roles, including the Security Officer. MFA defeats the most common credential-theft attack pattern.
- Role separation. Cluster admin, Security Officer, audit reviewer, and tenant operators are distinct roles. The Security Officer is the only role that can apply or extend DataLock policies.
- Quorum approval. Required for any operation that could destroy data — DataLock removal, retention shortening, vault configuration changes, Protection Group deletion.
Audit Logging and SIEM Integration
Every administrative action and every security-relevant event is logged. Architects should:
- Forward audit logs to an enterprise SIEM (Splunk, Sentinel, Chronicle, QRadar) via Syslog or webhook.
- Retain logs in the SIEM beyond the cluster’s retention window.
- Build alerts on high-risk events: DataLock policy changes, quorum approval requests, failed MFA attempts, role assignment changes, KMIP fetch failures.
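On the SIEM side, the alerting rules amount to filtering and counting events. The sketch below shows the idea in Python; the `event_type` strings and the event dictionary shape are hypothetical placeholders, not Cohesity's actual audit log schema, so map them to the real field names in your environment.

```python
from collections import Counter

# Hypothetical event type names; substitute the values your audit feed emits.
HIGH_RISK_EVENTS = {
    "datalock_policy_change",
    "quorum_approval_request",
    "mfa_failure",
    "role_assignment_change",
    "kmip_fetch_failure",
}


def high_risk(events: list[dict]) -> list[dict]:
    """Keep only events whose type is on the high-risk list."""
    return [e for e in events if e.get("event_type") in HIGH_RISK_EVENTS]


def repeated_mfa_failures(events: list[dict], threshold: int = 5) -> set[str]:
    """Users with at least `threshold` failed MFA attempts in the batch."""
    counts = Counter(
        e["user"] for e in events if e.get("event_type") == "mfa_failure"
    )
    return {user for user, n in counts.items() if n >= threshold}
```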
Compliance Frameworks
Cohesity is engineered to map cleanly to multiple frameworks. Key alignment points for CCAE:
| Framework | Cohesity controls that support it |
|---|---|
| HIPAA | DataLock (retention floors), encryption at rest/in transit, audit logging, BigID PHI classification |
| PCI-DSS | FIPS-validated encryption, KMIP/KMS separation, MFA, role separation |
| FedRAMP | FIPS 140-2/3 mode, audit logging, identity integration with FedRAMP-authorized IdPs |
| SEC 17a-4 / FINRA | DataLock Compliance mode (WORM), legal hold, immutable audit trail |
| GDPR | BigID classification (PII), data subject access via search, retention controls |
Worked Example: 500 TB Hospital Layered Defense
To make all this concrete, work through a realistic CCAE-style scenario. A regional hospital system has 500 TB of front-end data: 280 TB of imaging (PACS), 120 TB of EHR databases, 60 TB of clinician fileshares, and 40 TB of M365 (Exchange, OneDrive, SharePoint). Compliance requirements include HIPAA, a state-level seven-year retention floor for medical records, and a board-mandated ransomware recovery posture after a peer hospital was hit last year. Clinical systems require an RPO of 4 hours and an RTO of 24 hours.
Step 1 — Cluster sizing and encryption. Provision a Cohesity all-flash cluster on FIPS-validated SED-equipped ReadyNodes at the primary site, sized for 500 TB FETB with a 3% daily change rate and erasure coding (4:2). Enable FIPS 140-2 mode at deployment. Configure a KMIP integration with the customer’s existing Thales CipherTrust Manager so that all DEKs are wrapped by KEKs that live outside the cluster — pulling a node, or even pulling all the disks, yields ciphertext.
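Two of the Step 1 inputs translate directly into arithmetic. The sketch below assumes that "4:2" means four data stripes plus two parity stripes, and it deliberately ignores deduplication, compression, retention, and growth headroom, all of which a real sizing exercise must include.

```python
def daily_change_tb(front_end_tb: float, daily_change_rate: float) -> float:
    """Logical data changed per day, before any data reduction."""
    return front_end_tb * daily_change_rate


def ec_overhead(data_stripes: int, parity_stripes: int) -> float:
    """Raw-to-usable multiplier for an erasure-coding scheme."""
    return (data_stripes + parity_stripes) / data_stripes


def raw_for_logical(logical_tb: float, data_stripes: int, parity_stripes: int) -> float:
    """Raw capacity consumed by `logical_tb` of stored data under EC."""
    return logical_tb * ec_overhead(data_stripes, parity_stripes)
```

For the hospital, `daily_change_tb(500, 0.03)` gives roughly 15 TB of changed data per day, and `ec_overhead(4, 2)` is 1.5, so each terabyte that survives data reduction consumes 1.5 TB of raw capacity while tolerating two simultaneous failures within a stripe.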
Step 2 — Protection policies. Create three policies: Clinical-Tier1 (4-hour RPO, 14 daily / 4 weekly / 12 monthly / 7 yearly retention), Fileshares-Tier2 (24-hour RPO, 30 daily / 12 monthly / 7 yearly), and M365-Tier3 (daily, 7 yearly). All three carry DataLock in Compliance mode for the seven-year retention floor required by state medical-records law — neither the cluster admin nor the Security Officer can release records before the timer expires.
Step 3 — Replication and tertiary copy. Replicate to a secondary Cohesity cluster at a regional DR site for fast same-day recovery. In parallel, configure FortKnox vaulting on AWS for daily clinical and weekly fileshare snapshots. Configure the FortKnox transfer window for 02:00–04:00 local; outside that window, the virtual air gap is closed. Configure a quorum policy of 2-of-3 approvers (CISO, VP Infrastructure, Compliance Officer) for any FortKnox recovery, retention change, or vault configuration change.
Step 4 — Detection and classification. Subscribe to DataHawk. Anomaly detection runs against every backup; the anti-ransomware dashboard is reviewed daily by the SecOps team. Threat intelligence scans every snapshot for the 100,000+ IOCs in the daily-refreshed feed. BigID classification is configured with HIPAA, PII, and PCI policies and runs against PACS, EHR, and fileshare snapshots so that any incident can be scoped against actual PHI exposure within the 60-day breach notification clock.
Step 5 — Hardening. All admin accounts are sourced from Azure AD via SAML SSO with mandatory MFA. The Security Officer role is held by the CISO and the deputy CISO only. Audit logs stream to Microsoft Sentinel; alerts fire on DataLock changes, FortKnox quorum requests, and KMIP fetch failures. Quorum approval is enabled for Protection Group deletion, retention shortening, and external-target removal.
Step 6 — Recovery rehearsal. Quarterly, the team performs a clean-room recovery rehearsal: spin up an isolated VLAN, restore a representative EHR database from FortKnox using the DataHawk-recommended clean snapshot, run AV and integrity scans, validate database consistency, and confirm RTO. The runbook is owned by the SRE team and reviewed by the CISO.
The result is a layered defense — encryption at the disk and key-management layer, immutability at the snapshot layer, isolation at the FortKnox vault layer, detection at the DataHawk layer, and identity hardening across all admin paths — that survives a full compromise of the production cluster and can demonstrably recover within 24 hours.
Chapter Summary
Security is the central design axis of a modern Cohesity architecture, not an afterthought. The chapter built up a five-layer defense:
- Encryption — software AES via SpanFS or hardware via SEDs, both keyed through customer-controlled KMIP/KMS, all over TLS, optionally in FIPS 140-2/140-3 mode.
- Immutability — SpanFS baseline plus DataLock (Compliance or Governance), applied by the Security Officer role; the lock cannot be removed or shortened during the lock window.
- Isolation — FortKnox SaaS cyber vault with virtual air gap, mandatory multi-person quorum, and Cohesity-managed tenancy on AWS, Azure, or GCP.
- Detection and classification — DataHawk’s anomaly detection (entropy, change rate), threat intelligence (100K+ IOCs daily-refreshed), and BigID classification (200+ patterns, 50+ policies).
- Hardening — MFA, SSO, role separation, quorum approval, audit logging to SIEM, and clean-room recovery rehearsals.
For the CCAE exam, focus on the precise distinctions: software vs. SED encryption, Governance vs. Compliance DataLock, FortKnox vs. CloudArchive, and the order in which DataHawk’s three capabilities answer the three incident questions. Architecting these layers together — not just turning each on individually — is what defines an expert-level design.
Key Terms
- DataLock — Cohesity’s WORM immutability policy, applied by a Security Officer to Protection Groups or snapshots. It is time-bound and, during the lock window, cannot be removed even by the Security Officer who applied it.
- WORM (Write Once, Read Many) — Storage semantics in which data, once written, cannot be modified or deleted until a retention timer expires.
- KMIP (Key Management Interoperability Protocol) — OASIS-standard protocol Cohesity uses to fetch and manage keys from external KMS providers (Thales, Entrust, HashiCorp, etc.).
- FIPS — Federal Information Processing Standard; FIPS 140-2 and the newer 140-3 specify validated cryptographic modules required for US federal and many regulated workloads.
- DataHawk — Cohesity’s AI/ML SaaS security service combining anomaly detection, threat intelligence (100K+ IOCs daily-refreshed), and BigID classification (200+ patterns, 50+ policies).
- FortKnox — Cohesity’s SaaS cyber vault (Data Isolation and Recovery as a Service / DIRaaS) on AWS, Azure, or GCP, featuring a virtual air gap and mandatory multi-person quorum.
- Anomaly detection — DataHawk’s ML-driven analysis of entropy, change rate, and write patterns to flag suspicious snapshots and recommend the last-known-good recovery point.
- Cyber Vault — An isolated, immutable tertiary copy of backup data, designed to survive full compromise of the production environment and provide a clean recovery substrate.
- Quorum approval — Multi-person approval workflow (e.g., 2-of-3) required for sensitive operations such as DataLock removal, FortKnox recoveries, or Protection Group deletion, ensuring a single compromised account cannot destroy data.
Chapter 12: SmartFiles: Files, Objects, and Unstructured Data Services
For most enterprises, the largest pool of “data sprawl” is unstructured: engineering home directories, render farm scratch, surveillance video, M&E project folders, genomics datasets, build artifacts, and a rapidly growing tide of S3 buckets that started as developer experiments and ended up in production. Cohesity SmartFiles is the product that turns the same DataPlatform you already use as a backup target into a primary, multi-protocol unstructured-data service. For the CCAE candidate, SmartFiles is not a separate appliance to learn; it is a consumption mode of the cluster you have already designed. Carry the SpanFS-and-View-Box mental model forward from Chapter 2, and SmartFiles becomes mostly a question of policy choices: which protocols, which QoS, which quotas, which tier-down rules.
This chapter walks the architecture from the bottom (SpanFS) up through the View, the View Box / Storage Domain, the protocol surface (SMB3, NFSv3/v4, S3), the governance layer (quotas, QoS, tiering), the data services (snapshots, replication, audit, ICAP AV), and finally the migration playbook for replacing or absorbing legacy NetApp and Dell/EMC Isilon estates.
Learning Objectives
By the end of this chapter you will be able to:
- Architect SmartFiles for primary file and object workloads on top of an existing Cohesity cluster, choosing the right View Box and Storage Domain shape per workload class.
- Compare SMB3, NFSv3/v4, and S3 access semantics on a Cohesity View, including cross-protocol identity mapping and locking.
- Apply quotas, QoS policies, and hot/cold tiering policies to Views and explain why QoS must be selected up front.
- Design data protection for SmartFiles, including snapshots, DR replication, file audit logging, and ICAP-based antivirus scanning.
- Plan migrations from legacy NetApp or Isilon using a combination of cold-data tiering, the Cohesity NAS File Migration Service, and backup-driven cross-filer restores.
SmartFiles Architecture
From SpanFS to View: One File System, Many Faces
SmartFiles is not a separate product riding on top of Cohesity — it is a way of consuming SpanFS, the same distributed file system that holds backup snapshots, archived databases, and replicated VMs. SpanFS exposes a single global namespace across every node in the cluster with strict consistency, and it stripes data across distributed volumes so there is no single point of failure [Source: https://www.cohesity.com/platform/spanfs/]. That single-namespace property is what lets the same logical object be a file to an NFS client, a share to an SMB client, and an object to an S3 client at the same time, without copying data into protocol-specific silos [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
The central SmartFiles construct is the View. A View is a logical container that lives inside a View Box (Storage Domain) and that can be exposed simultaneously as:
- An NFS export, with mount-point / volume semantics (NFSv3 and NFSv4 are both supported).
- An SMB share, with Windows file share semantics (SMB3 with signing and encryption).
- An S3 bucket, where files become S3 objects keyed by their path in the View.
A single View, in other words, is a file share and a bucket and an NFS export — pointing at the same underlying SpanFS objects. A file written by an SMB client appears as an object in the S3 namespace of the same View, and as an NFSv4 file at the same logical path, with permissions translated across protocols using AD/LDAP identity mapping [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf].
Figure 12.1: SmartFiles architecture from SpanFS up through the protocol surface.
flowchart LR
SpanFS[SpanFS<br/>Distributed File System<br/>Single Global Namespace]
VB[View Box /<br/>Storage Domain<br/>Policy Boundary]
V[View<br/>Logical Container]
SMB[SMB3<br/>Windows Shares]
NFS[NFSv3 / NFSv4<br/>UNIX Mounts]
S3[S3<br/>Object Buckets]
SpanFS --> VB
VB --> V
V --> SMB
V --> NFS
V --> S3
SMB -.same data.-> NFS
NFS -.same data.-> S3
Analogy: The Multilingual Restaurant Menu. Think of a View as the menu in a multilingual restaurant. The food in the kitchen — the actual SpanFS data — is the same regardless of which language you order in. The English menu is SMB3, the French menu is NFSv4, and the Mandarin menu is S3. A vegetarian customer (an ACL) asking for “no meat” gets the same treatment whether they say it in English or French because the kitchen has one set of dietary rules, not three. Legacy NAS architectures, by contrast, run three separate kitchens and pretend the menus are translations of each other. SmartFiles runs one kitchen.
View Boxes / Storage Domains: The Policy Boundary
The View Box — which newer documentation calls a Storage Domain — is the policy container for the Views inside it. The View Box is where you define:
- Storage efficiency: deduplication on/off, inline vs. post-process, compression algorithm.
- Resiliency: Replication Factor 2/3 or erasure coding (e.g., 4:2, 6:2).
- Encryption: software vs. SED, KMS provider.
- Tiering policy: which cloud or remote tier cold blocks move to.
- Default quotas and quota alert limits for child Views.
Views inherit these settings from their parent View Box. They are scoped to the View Box’s available capacity unless overridden with explicit per-View quotas [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. A common architectural pattern is to maintain at least two Storage Domains — one tuned for backup landing (high dedupe, erasure-coded, HDD-biased) and one tuned for primary file/object workloads (SSD-biased, lower dedupe ratio target) — so that backup ingest cannot starve a busy SmartFiles user share.
+--------------------------------------------------------------+
| Cohesity Cluster (SpanFS, single global namespace) |
| |
| +----------------------+ +-------------------------+ |
| | Storage Domain: | | Storage Domain: | |
| | "BackupTarget" | | "SmartFiles-Primary" | |
| | RF2 + EC 4:2 | | RF2, SSD-biased | |
| | Inline dedupe | | Post-process dedupe | |
| | | | | |
| | View: vmware-bk | | View: media-projects | |
| | View: oracle-bk | | View: build-artifacts | |
| | View: m365-bk | | View: home-dirs | |
| +----------------------+ +-------------------------+ |
+--------------------------------------------------------------+
Protocol Surface: SMB3, NFSv3/v4, S3
SmartFiles supports several deliberate combinations of protocol exposure on a single View [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf]:
| Mode | NFS | SMB | S3 | Typical Use |
|---|---|---|---|---|
| Multi-protocol R/W | Read/Write | Read/Write | Read-only | Existing NAS workload that wants modern apps to consume via S3 API without mutate rights |
| File-only | R/W | R/W | Off | Pure user/home-directory or build farm consolidation |
| S3-only | Off | Off | R/W | Cloud-native app, Splunk SmartStore target, container backups |
| Writable S3 clone | R/W (live View) | R/W (live View) | R/W (instant clone) | Analytics or ML pipelines that need a writable object copy without disturbing the live file workload |
The “writable S3 clone” pattern is worth pausing on. Allowing parallel writes from S3 against a live NFS/SMB View creates locking and consistency problems that no amount of identity mapping can paper over (S3 has no real notion of a byte-range lock). Cohesity sidesteps this by spawning an instant clone — a zero-copy SnapTree clone of the View that is exposed as a writable S3 bucket. The original file workload stays clean, and the analytics pipeline gets its own object-writable namespace [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf].
Cross-Protocol Identity and Locking
Multi-protocol access only works if identity translates cleanly between worlds:
- NFS uses UID/GID numbers (and, in NFSv4, optionally name@domain principals).
- SMB3 uses Windows Security Identifiers (SIDs) and Kerberos principals.
- S3 uses bucket policies and IAM-style access keys.
Cohesity bridges these using AD/LDAP integration: an AD user maps to a SID for SMB3 access and to a UID/GID for NFS access on the same View, so a file created over SMB by CONTOSO\alice is owned by the corresponding POSIX UID when seen from NFS. S3 access uses access keys that can be tied to the same identity directory.
Cross-protocol locking is enforced inside SpanFS: if an SMB3 client holds an exclusive oplock on a file, an NFS client attempting a conflicting access is blocked or denied per the locking semantics SpanFS exposes. Architects designing mixed-Linux/Windows workloads should still test the specific lock contention patterns of their applications — multi-protocol locking removes the surprise but does not magically remove the contention.
Figure 12.2: Multi-protocol access — same data, three protocols, one View.
sequenceDiagram
participant SMB as SMB3 Client<br/>(Windows)
participant View as Cohesity View<br/>(SpanFS)
participant NFS as NFSv4 Client<br/>(Linux)
participant S3 as S3 Client<br/>(Cloud App)
SMB->>View: Write file \\share\report.csv (CONTOSO\alice)
View->>View: Map SID to UID/GID via AD/LDAP
View->>View: Persist to SpanFS, dedupe + compress
NFS->>View: read /mnt/share/report.csv
View-->>NFS: Same bytes, POSIX UID-owned
S3->>View: GET bucket/report.csv
View-->>S3: Same bytes via S3 API
Note over View: One file. One copy on SpanFS.<br/>Three protocol faces.
Performance Characteristics
Because all three protocols land in the same View, every byte benefits from the same SpanFS data services [Source: https://futurumgroup.com/wp-content/uploads/documents/EGL2_Cohesity_SmartFiles-2.pdf]:
- Global, variable-length deduplication across the entire cluster, not just within a share.
- Compression (including Zstandard).
- Unlimited zero-copy snapshots and clones via SnapTree.
- Multi-tier placement across NVMe/SSD, HDD, and S3-compatible cloud, transparently to clients.
The architectural payoff is one of the biggest selling points against scale-out NAS competitors: a file that is also an S3 object is dedup’d, compressed, snapshotted, and tiered exactly once.
Key Takeaway: SmartFiles is built on SpanFS, with the View as the multi-protocol logical container and the View Box / Storage Domain as the policy boundary. The same View can expose SMB3, NFSv3/v4, and S3 against the same data, with AD/LDAP-mediated identity mapping and SpanFS-enforced cross-protocol locking. Data services (dedupe, compression, snapshots, tiering) apply once at the SpanFS layer regardless of which protocol the client used.
Quotas, QoS, and Tiering
Once a View exists, three governance knobs determine whether it stays a good neighbor on a multi-tenant cluster: quotas (capacity), QoS (performance), and tiering (placement). A fourth, Storage Domain defaults, sets the floor for all of them.
Quotas: Capacity Governance at View and User Scope
SmartFiles supports both per-View quotas and per-user quotas inside a View, with audit logs of usage and Helios REST endpoints (getViewUserQuotas, top quotas by usage) to drive reporting and chargeback [Source: https://developers.cohesity.com/v1-helios-latest/reference/getviewuserquotas-1].
Storage-Domain defaults are typically configured via the CLI parameters default-view-quota (in GiB) and default-view-quota-alert-limit. Newly created Views inherit these defaults unless an architect overrides them at the View level [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf].
A subtlety the CCAE exam can test: Cohesity’s public documentation does not sharply distinguish “soft” from “hard” quotas in the NetApp sense (with grace periods, etc.). Instead, you should think of:
- The quota itself as the enforced cap.
- The alert-limit as the soft trigger — the value at which an operator gets a warning.
Set the alert below the quota by a comfortable margin (e.g., alert at 80%, quota at 100%) so that operators have time to react before writes start failing [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/academy/cohesity-smartfiles-administration-en.pdf]. Storage Domain capacity itself is the hard ceiling: per-View quotas are governance, not protection against a runaway domain.
Figure 12.3: Quota and policy hierarchy — Storage Domain default to View override to user quota.
graph TD
SD[Storage Domain<br/>default-view-quota = 10 TiB<br/>alert-limit = 8 TiB]
V1[View: home-dirs<br/>inherits domain default]
V2[View: media-projects<br/>OVERRIDE: 50 TiB / alert 40 TiB]
V3[View: render-scratch<br/>OVERRIDE: 200 TiB / alert 160 TiB]
U1[User quota: alice<br/>500 GiB cap]
U2[User quota: bob<br/>500 GiB cap]
U3[User quota: build-svc<br/>2 TiB cap]
SD --> V1
SD --> V2
SD --> V3
V1 --> U1
V1 --> U2
V2 --> U3
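The inheritance and alert logic in the hierarchy above can be modelled in a few lines. This is a conceptual sketch, not the cluster's enforcement code: the `Quota` type and both function names are assumptions, and the numbers in the usage note come from Figure 12.3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quota:
    limit_tib: float   # enforced cap
    alert_tib: float   # soft trigger that warns operators


def effective_quota(domain_default: Quota, view_override: Optional[Quota]) -> Quota:
    """A View uses its own override if one is set, else the domain default."""
    return view_override if view_override is not None else domain_default


def quota_status(used_tib: float, quota: Quota) -> str:
    """Classify usage as 'ok', 'alert', or 'over-quota'."""
    if used_tib >= quota.limit_tib:
        return "over-quota"
    if used_tib >= quota.alert_tib:
        return "alert"
    return "ok"
```

With a domain default of 10 TiB and an 8 TiB alert limit, a View holding 8.5 TiB reports `alert` if it inherits the default, but `ok` if it carries the 50 TiB / 40 TiB override.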
QoS Policies: Workload-Aware Placement and Throttling
QoS in SmartFiles is selected at View creation time and steers two things: which storage tier (SSD vs. HDD) the View prefers, and how aggressively foreground IO competes with background tasks like dedupe and garbage collection. The shipping predefined policies include [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]:
| QoS Policy | Tier Bias | IO Profile | Designed For |
|---|---|---|---|
| Backup Target Low | HDD-heavy | Large sequential / mixed small-block | Backup landing zones, secondary storage |
| TestAndDev High | SSD-optimized | Transactional, low-latency | Active dev/test, VDI-style workloads |
| (Archive / general purpose) | HDD/cold | Capacity-oriented | Cold archive Views, file shares with relaxed latency |
Two CCAE-flavored design rules around QoS:
- Pick the QoS at View creation. Changing it on a busy View is non-trivial and may require data movement; design up front based on the workload class.
- Match QoS to workload, not to who paid for it. Putting an active SMB user share on “Backup Target Low” tanks latency. Putting a backup target on “TestAndDev High” wastes SSD on workloads that are mostly write-then-archive.
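Because QoS is chosen once, some teams encode the mapping in their provisioning tooling so that nobody picks a policy ad hoc. The following is a minimal sketch under stated assumptions: the workload-class labels and the `pick_qos` helper are hypothetical, and only the two policy names come from the table above.

```python
# Hypothetical workload classes mapped to the two policy names listed in the table.
QOS_BY_WORKLOAD = {
    "backup-landing-zone": "Backup Target Low",
    "active-user-share": "TestAndDev High",
    "dev-test": "TestAndDev High",
}


def pick_qos(workload_class: str) -> str:
    """Return the QoS policy for a workload class, refusing unclassified workloads."""
    try:
        return QOS_BY_WORKLOAD[workload_class]
    except KeyError:
        raise ValueError(
            f"classify workload {workload_class!r} before creating the View"
        ) from None


print(pick_qos("active-user-share"))  # TestAndDev High
```

Failing loudly on an unknown class is deliberate: it forces the design conversation to happen before the View exists, which is when the choice is cheap.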
Tiering: Hot/Cold Placement Across Local and Cloud
SmartFiles applies policy-driven tiering across SSD, HDD, and cloud object targets (AWS S3, Azure Blob, GCS, or any S3-compatible object store). Cold blocks move out without breaking the namespace: clients still see the file or object at the same path, and access triggers a transparent recall [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]. Tiering is configured at the Storage Domain or View level and applies to all protocols simultaneously: a file tiered to the cloud behaves identically whether it is then read via SMB, NFS, or the S3 API.
Architects should explicitly model:
- Recall latency — first read after tier-down will hit cloud egress latency and may incur recall costs.
- Working set sizing — local SSD/HDD must still hold the active working set; tiering is for cold data, not hot.
- Egress cost — repeated recalls of the same dataset can dwarf the storage savings.
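The egress point above is easy to quantify with a back-of-the-envelope model. The sketch below is an assumption-laden illustration: `monthly_tiering_delta` is a hypothetical helper, and every price is a placeholder rather than a real list price.

```python
def monthly_tiering_delta(cold_tb, local_cost_per_tb, cloud_cost_per_tb,
                          recalled_tb, egress_cost_per_tb):
    """Net monthly saving from tiering; a negative result means recalls cost more than tiering saves."""
    storage_saving = cold_tb * (local_cost_per_tb - cloud_cost_per_tb)
    recall_cost = recalled_tb * egress_cost_per_tb
    return storage_saving - recall_cost


# Placeholder prices per TB per month (local, cloud) and per TB recalled (egress).
print(monthly_tiering_delta(400, 20.0, 10.0, recalled_tb=10, egress_cost_per_tb=90.0))  # 3100.0
print(monthly_tiering_delta(400, 20.0, 10.0, recalled_tb=50, egress_cost_per_tb=90.0))  # -500.0
```

With these placeholder inputs, recalling 50 TB a month is enough to erase the storage saving on a 400 TB cold tail, which is why the working set must be sized to stay local.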
Worked Example: Media-and-Entertainment Workflow
Consider a post-production facility with two distinct workloads sharing one Cohesity cluster:
- Editorial team — about 30 video editors using Avid/Premiere over SMB3 against an active project share. Latency-sensitive, small numbers of large files, frequent timeline scrubs that read the same media segments repeatedly. Working set is ~20 TB at any time; long tail of finished projects is ~400 TB.
- Render farm — 200 Linux render nodes pulling assets over NFSv4 and writing rendered frames as objects via S3. Throughput-sensitive, highly parallel reads, append-heavy writes. Sustained ingest of ~5 GB/s during render windows.
A reasonable design:
Storage Domain: "Media-Primary" (SSD-biased, RF2, post-process dedupe)
  +-- View: "edit-projects"
  |     Protocols: SMB3 R/W, NFSv4 R/W, S3 read-only (for archive readers)
  |     QoS: TestAndDev High (SSD priority, low latency)
  |     Quota: 50 TB, alert 40 TB
  |     Tiering: cold blocks > 90 days idle -> S3 (Standard-IA)
  |
  +-- View: "render-scratch"
        Protocols: NFSv4 R/W, S3 R/W
        QoS: General-purpose (HDD-biased, throughput-optimized)
        Quota: 200 TB, alert 160 TB
        Tiering: cold blocks > 30 days idle -> S3 (Glacier Instant Retrieval)
Editors get SSD-class latency on the active project share via “TestAndDev High”. The render farm gets capacity-class HDD throughput on a separate View whose IO profile cannot starve the editors. Both Views share the same SpanFS dedupe domain — so when the same source plate is referenced from both Views, it is stored once. Cold finished projects tier off to S3 transparently; an editor opening a 6-month-old project pays a one-time recall latency, but the file system path does not change.
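One sanity check worth running on this design is how long the render window can last before the scratch View hits its alert and quota. The sketch below uses decimal units and assumes the quota counts logical data and that nothing is deleted or tiered during the window; `hours_to_reach` is a hypothetical helper.

```python
def hours_to_reach(capacity_tb: float, ingest_gb_per_s: float) -> float:
    """Hours of sustained ingest needed to write capacity_tb (1 TB = 1000 GB)."""
    seconds = (capacity_tb * 1000) / ingest_gb_per_s
    return seconds / 3600


print(round(hours_to_reach(160, 5), 1))  # 8.9  hours to the 160 TB alert
print(round(hours_to_reach(200, 5), 1))  # 11.1 hours to the 200 TB quota
```

At a sustained 5 GB/s, the render-scratch View raises its alert after roughly nine hours and reaches the quota after roughly eleven, so longer render windows need either a cleanup job or a larger quota.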
Cohesity Insight Reports
SmartFiles surfaces capacity, top quota consumers, file age distribution, and access-pattern analytics through Helios reporting and the Insight family of reports. Architects use these for chargeback, for justifying tier-down policy choices to data owners, and for sizing migrations off legacy NAS (see next section).
Key Takeaway: SmartFiles governance has three knobs — quotas, QoS, and tiering — anchored to Storage-Domain defaults. Quotas are enforced caps with a separate alert-limit acting as the soft warning. QoS is chosen at View creation and is hard to change later, so map it to the workload class (Backup Target Low for landing zones, TestAndDev High for active SSD workloads). Tiering moves cold blocks to cloud transparently across all protocols.
Data Protection for SmartFiles
A primary file/object service that loses data is not a service. SmartFiles inherits the full DataPlatform protection stack, with a few SmartFiles-specific data services on top.
Snapshots and Policies
Every View can be snapshotted on a schedule using the same Protection Policies covered in Chapter 7 — frequency, hierarchical retention (daily / weekly / monthly / yearly), and lock attributes. SnapTree gives near-zero overhead for snapshots, so retention windows can be aggressive without paying capacity penalties [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]. Snapshots are mountable as read-only Views, which makes “previous versions” workflows straightforward for SMB users.
For ransomware resilience, combine snapshot policies with DataLock (Chapter 11) so that a locked snapshot cannot be deleted until its lock period has elapsed or a quorum has approved the deletion. SmartFiles is a particularly attractive ransomware target precisely because it holds primary files; immutable snapshots are not optional for production deployments.
DR Replication for SmartFiles
Views replicate to a remote cluster using the same replication engine that DataProtect uses, with encryption, compression, and dedupe-aware transfer over the wire. Replication policies are attached to the View’s Protection Policy. For active-active patterns, the same View name on the remote side is presented as a read-only mirror that can be promoted on failover; for active-passive, snapshots and live data are pushed continuously and the remote View is brought online during a SiteContinuity-orchestrated failover.
Architectural notes for SmartFiles DR:
- AD/LDAP must be reachable from the DR site for SMB and NFSv4 access to resolve identities post-failover.
- DNS / VIP planning matters more than for backup workloads — clients are connecting on protocol VIPs, so the failover plan must include either VIP movement or DNS record updates.
- S3 endpoint URLs must be planned end-to-end; cloud-native applications often hard-code the bucket endpoint and need a redirection strategy.
File Audit Logging
SmartFiles ships native file audit logging that records per-event activity (open, read, write, rename, delete, ACL change) on Views. Audit events are forwarded to syslog or to SIEM platforms, replacing the bolt-on third-party audit appliances that traditional NetApp/Isilon shops ran in front of their NAS. Audit logging is also a control for ransomware detection: an unusual rate of rename-then-delete on a user share is a classic encryption signature.
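The rename-then-delete signal can be approximated with a sliding-window counter over the forwarded audit stream. This is a sketch of the idea, not Cohesity's detection logic: the `(timestamp, user, operation)` event tuple, the function name, and the default window and threshold are all assumptions.

```python
from collections import defaultdict, deque


def flag_suspicious_users(events, window_s=60, threshold=100):
    """Flag users whose rename/delete count within window_s seconds reaches threshold.

    events: iterable of (timestamp_seconds, user, operation), sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent rename/delete events
    flagged = set()
    for ts, user, op in events:
        if op not in ("rename", "delete"):
            continue
        window = recent[user]
        window.append(ts)
        # Drop events that have fallen out of the sliding window.
        while window and ts - window[0] > window_s:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(user)
    return flagged
```

In practice the threshold would be tuned per share, since a bulk project cleanup by a legitimate user produces the same pattern at a lower rate.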
Anti-Virus and ICAP Integration
SmartFiles integrates antivirus scanning natively via ICAP (Internet Content Adaptation Protocol). When ICAP AV is enabled on a View, write paths fan out to one or more configured ICAP servers (Trellix, Symantec, Sophos, etc.) for scanning before the data is committed [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/cohesity-smartfiles-beyond-scale-out-nas-solution-brief-en.pdf]. This replaces the architecture pattern where customers ran a separate ICAP-broker appliance fronting Isilon or NetApp: one fewer hop, one fewer thing to license.
For the CCAE exam, remember:
- ICAP is a synchronous scan; very high-throughput workloads must size ICAP scanner pools accordingly or scope ICAP to specific Views (e.g., user shares but not render scratch).
- ICAP scan results integrate with audit logs and DataHawk where deployed.
- Cohesity’s ICAP support is a feature of SmartFiles itself; it is not an additional appliance.
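Because the scan is synchronous, scanner-pool sizing is a throughput division plus headroom. The sketch below is a rough planning aid under stated assumptions: `icap_scanners_needed` is a hypothetical helper, and the throughput figures are placeholders that would come from the AV vendor's sizing guide and the View's measured write rate.

```python
import math


def icap_scanners_needed(peak_write_mb_s: float, per_scanner_mb_s: float, spare: int = 1) -> int:
    """Scanners needed to keep up with peak writes, plus spare capacity for a failed node."""
    if per_scanner_mb_s <= 0:
        raise ValueError("per-scanner throughput must be positive")
    return math.ceil(peak_write_mb_s / per_scanner_mb_s) + spare


# Placeholder numbers: 1,200 MB/s peak writes, 250 MB/s per scanner, N+1 headroom.
print(icap_scanners_needed(1200, 250))  # 6
```

If the result is impractically large for a View such as render scratch, that is the signal to scope ICAP to the user-facing Views instead.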
Key Takeaway: SmartFiles inherits SpanFS-level snapshots and replication and adds three native data services that have historically been bolt-ons in legacy NAS estates: file audit logging, ICAP-based AV, and DataLock-immutable snapshots. The replacement of bolted-on audit and AV products is one of the most common business cases for migrating from NetApp/Isilon to SmartFiles.
Migration and Modernization
Few customers buy SmartFiles for greenfield workloads. The CCAE-relevant scenarios are almost always migrations or modernizations: replacing a NetApp filer at end-of-life, absorbing an Isilon estate as part of a vendor consolidation, or onboarding cloud-native S3 workloads that started life on AWS.
NAS File Migration Service: The Packaged Path
Cohesity sells a packaged Professional Services engagement called the NAS File Migration Service for full NetApp/Isilon cutovers. The service covers cluster preparation, migration planning, the cutover itself, and end-state documentation, and is sized at approximately 30 TB per migration event [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-nas-file-migration-service-data-sheet-en.pdf]. For estates significantly larger than 30 TB, plan multiple cutover events — by share, by department, or by data classification.
Transparent Cold-Data Tiering: The Coexistence Path
When the legacy NAS is not yet at end-of-life, SmartFiles can absorb just the cold data without disturbing hot workloads. SmartFiles scans the source NAS using its built-in file analytics, classifies data by access pattern, and policy-tiers cold blocks to the Cohesity cluster (or directly to cloud) — without rehydration on access [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/solution-brief/Cohesity-SmartFiles-Solution-Brief.pdf]. The legacy NAS keeps serving hot data; SmartFiles silently absorbs the cold tail and exposes it at the same logical path. This is often used to defer a NAS refresh by years and to make the eventual cutover smaller.
Backup-Driven Migration: The Unconventional Path
Because Cohesity is also a NAS backup target, an unconventional but supported path is to back up the source NAS to Cohesity and then restore cross-filer — for example, restoring an Isilon backup directly into a SmartFiles View, or even into a different NAS array entirely [Source: https://www.cohesity.com/blogs/how-to-conquer-nas-backup-and-recovery-challenges/]. This bypasses traditional robocopy / rsync / NetApp-native tooling for many use cases and is particularly useful when permission preservation across SMB/NFS is critical.
Cohesity SmartFiles vs. NetApp ONTAP and Dell PowerScale (Isilon)
| Capability | Cohesity SmartFiles | NetApp ONTAP | Dell PowerScale (Isilon) |
|---|---|---|---|
| Primary architecture | Distributed SpanFS, hyperconverged | Dual-controller HA pairs (cluster of pairs) | Distributed OneFS scale-out NAS |
| Single namespace | Yes, cluster-wide | Per SVM (vserver) | Yes, cluster-wide |
| Multi-protocol on same data | NFSv3/v4 + SMB3 + S3 simultaneously | NFS + SMB + S3 (S3 is bolt-on, separate bucket per SVM) | NFS + SMB + (S3 via separate ECS or via OneFS S3) |
| Global dedupe | Yes, variable-length, cluster-wide | Volume/aggregate scoped | Limited; per-volume |
| Cold-tier to public cloud | Yes, transparent recall, all protocols | FabricPool (block-level, ONTAP-only mechanism) | CloudPools (Isilon-specific) |
| Native ICAP AV | Yes, built in | Yes (Vscan) | Yes (CAVA / ICAP) |
| Native file audit | Yes | FPolicy | Audit subsystem |
| Snapshot model | SnapTree, unlimited, zero-copy | WAFL snapshots, capacity-efficient | OneFS snapshots |
| Immutability | DataLock (WORM) | SnapLock | SmartLock |
| Backup target capability | Native (same product, same cluster) | Possible but not the design center | Possible but not the design center |
| Single platform for backup + primary | Yes — that is the design center | No | No |
The critical row is the last one. NetApp and Isilon are excellent primary NAS platforms, but they are not designed to also be your backup target. SmartFiles’ architectural argument is consolidation: one platform for backup, files, objects, archive, and DR.
Application Refactoring with S3
A meaningful subset of SmartFiles deployments are not file-server replacements at all — they are S3 endpoints for cloud-native applications that need an on-prem object store. Splunk SmartStore, container image registries, ML training datasets, IoT telemetry sinks, and Veeam-style backup targets all routinely consume S3-only Views. For these, treat the View as a bucket and forget the file protocol surface entirely.
Lift-and-Shift Patterns
For lift-and-shift modernization, a common sequence looks like:
- Discovery. Run SmartFiles analytics against the source filers; produce a hot/warm/cold map per share.
- Decision tree. For each share, choose tier-off (keep source, move cold) vs. full cutover (NAS Migration Service) vs. backup-driven restore.
- Identity and AV. Stand up AD/LDAP integration and ICAP scanner pools before any cutover, not after.
- QoS placement. Map each share to a QoS class — primary user share to TestAndDev High or general-purpose, archive to Backup Target Low, render scratch to general-purpose throughput.
- Cutover windows. Plan cutover events at the 30-TB-per-event sizing of the packaged service; use replication seeding to minimize cutover time.
- Decommission. Retire the source filer once the SmartFiles View is authoritative and snapshots have aged into the new retention windows.
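The cutover-window step reduces to simple arithmetic once the discovery map exists. The following sketch assumes each share is cut over as its own stream at roughly 30 TB per event; the share names and sizes are a hypothetical estate, and `cutover_plan` is an illustrative helper.

```python
import math

EVENT_TB = 30  # approximate per-event sizing quoted for the packaged service


def cutover_plan(share_sizes_tb):
    """Return {share: number_of_cutover_events}, at least one event per share."""
    return {
        share: max(1, math.ceil(size / EVENT_TB))
        for share, size in share_sizes_tb.items()
    }


shares = {"finance": 12, "engineering": 75, "media-archive": 400}  # hypothetical estate, in TB
plan = cutover_plan(shares)
print(plan)                # {'finance': 1, 'engineering': 3, 'media-archive': 14}
print(sum(plan.values()))  # 18
```

This count is conservative: small shares could be batched into a single event, and shares handled by tier-off or backup-driven restore drop out of the total altogether.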
Figure 12.4: NAS migration workflow — three paths from legacy filer to SmartFiles.
flowchart TD
Legacy[Legacy NAS<br/>NetApp ONTAP / Dell Isilon]
Disc[Discovery & Analytics<br/>SmartFiles file scan<br/>hot/warm/cold map]
Decision{Decision tree<br/>per share}
Cutover[Full Cutover Path<br/>NAS File Migration Service<br/>~30 TB per event]
Tier[Cold-Data Tier-Off Path<br/>policy-driven tiering<br/>legacy keeps hot data]
Backup[Backup-Driven Path<br/>NAS backup + cross-filer<br/>restore preserves ACLs]
Prep[Pre-cutover prep<br/>AD/LDAP integration<br/>ICAP scanner pools<br/>QoS classification]
SF[SmartFiles View<br/>SMB3 + NFSv4 + S3<br/>on Cohesity DataPlatform]
Decom[Decommission legacy<br/>after retention aged]
Legacy --> Disc
Disc --> Decision
Decision --> Cutover
Decision --> Tier
Decision --> Backup
Cutover --> Prep
Backup --> Prep
Tier --> SF
Prep --> SF
SF --> Decom
Key Takeaway: SmartFiles offers three migration paths — full cutover via the 30-TB-per-event NAS File Migration Service, transparent cold-data tier-off that lets the legacy NAS keep running, and backup-driven cross-filer restore. The strategic argument against NetApp/Isilon is consolidation: SmartFiles is the same platform that holds your backups, replicates to DR, and tiers to cloud, with native file audit and ICAP AV replacing bolt-on appliances.
Chapter Summary
SmartFiles turns the Cohesity DataPlatform into a multi-protocol primary unstructured-data service without adding any new hardware or any new file system. It is built on SpanFS, the same distributed file system that holds your backups, and its central construct is the View: a logical container exposed simultaneously as SMB3, NFSv3/v4, and S3 against the same data. Views live inside View Boxes / Storage Domains that set policy — efficiency, encryption, resiliency, default quotas, tiering — and that act as fault-isolation boundaries between workload classes.
Governance is three knobs: quotas (with separate alert-limits as the soft trigger), QoS (predefined policies like Backup Target Low and TestAndDev High, chosen at View creation and hard to change later), and tiering (transparent hot/cold placement across SSD, HDD, and S3-compatible cloud, applied uniformly across all protocols). Identity is mediated by AD/LDAP so a single AD user maps to SMB SIDs, NFS UID/GID, and S3 access keys against the same View. Cross-protocol locking is enforced inside SpanFS.
Data services that legacy NAS estates traditionally bolted on — file audit logging, ICAP antivirus, DataLock immutability — are native to SmartFiles. So is the protection stack: snapshots via SnapTree, replication via the same engine DataProtect uses, and DR orchestration via SiteContinuity.
For migrations, three patterns dominate: the packaged NAS File Migration Service (~30 TB per cutover event), transparent cold-data tier-off that defers a refresh without disturbing hot workloads, and backup-driven cross-filer restore when permission preservation matters. The architectural value proposition against NetApp ONTAP and Dell PowerScale (Isilon) is consolidation: one platform for backup, files, objects, archive, and DR, with global dedupe across what used to be siloed estates.
For the CCAE exam, anchor your answers in the View / View Box / SpanFS hierarchy, remember that QoS is sticky and must be chosen up front, and recognize the three-path migration playbook. Most scenario questions about SmartFiles reduce to: which protocol(s) on which View, which QoS, which Storage Domain, which tiering policy, and which protection policy — in that order.
Key Terms
- SmartFiles — Cohesity’s primary file and object data service, built on SpanFS, that exposes Views as multi-protocol containers (SMB3 + NFSv3/v4 + S3) on the same DataPlatform cluster used for backup, archive, and DR.
- View — The unified logical container in SpanFS that data lives in. A View can be exposed simultaneously as an NFS export, an SMB share, and an S3 bucket against the same underlying data.
- View Box (also Storage Domain) — The policy container that holds Views and defines defaults for storage efficiency, encryption, resiliency (RF/EC), tiering, and default quotas. Acts as the workload-class fault-isolation boundary.
- SMB3 — The SMB protocol level supported by SmartFiles, including signing and encryption. Used for Windows file share semantics.
- NFSv4 — The newer NFS version supported alongside NFSv3 on SmartFiles Views; supports name@domain principals and stateful locks.
- S3 — S3-compatible object API exposed on Views, where files become objects keyed by their path. Supports read-only, read-write, and writable-clone modes per View.
- Quota — Capacity governance enforced per-View or per-user-within-a-View. The quota itself is the enforced cap; a separate alert-limit acts as the soft warning before hard enforcement.
- QoS — Predefined policies (e.g., Backup Target Low, TestAndDev High) selected at View creation that steer tier preference (SSD vs. HDD) and IO prioritization. Non-trivial to change after the View is in production.
- ICAP — Internet Content Adaptation Protocol. The mechanism SmartFiles uses for native AV scanning by integrating with external ICAP scanner pools (Trellix, Symantec, Sophos, etc.), replacing the bolt-on ICAP-broker appliances common in legacy NAS estates.
Chapter 13: Helios SaaS, Marketplace Apps, and Automation
If Chapters 1 through 12 taught you how to design, deploy, secure, and protect data on individual Cohesity clusters, this chapter zooms out to the operating model an architect actually inherits in production: a fleet. Real CCAE-scale customers run anywhere from three or four clusters to several hundred, scattered across data centers, public clouds, and edge sites. They cannot afford to log in to each cluster individually, and they certainly cannot afford for the same Gold protection policy to mean three different things in three different regions. The architectural answer is the Helios SaaS control plane plus the Marketplace and the automation stack — REST API v2, Terraform, Ansible, and PowerShell — that turn the fleet into a single managed estate.
Think of it this way: a single Cohesity cluster is a server. Helios is the cloud console for the whole fleet, and the automation tools are the deployment pipelines that keep that fleet in compliance with whatever the architecture document says it should be.
Learning Objectives
By the end of this chapter, you will be able to:
- Architect global fleet management with Helios across multi-cluster, multi-region deployments, including dark-site (self-managed Helios) variants.
- Use Helios reporting, dashboards, federated RBAC, and global search effectively for both operations and compliance use cases.
- Position Cohesity DataProtect delivered as a Service (DMaaS) — including data residency, region selection, and subscription implications — relative to self-managed clusters.
- Deploy and govern Marketplace apps using the Apollo container runtime, AppSpec YAML, and view-mediated data access; identify vetted apps such as SentinelOne, ClamAV, Splunk, and Imanis.
- Automate operations with REST API v2, the cohesity/cohesity Terraform provider, the cohesity.dataprotect Ansible collection, and the cross-platform PowerShell module, and choose the right tool for each job.
Helios SaaS Control Plane
Cohesity Helios is delivered as a SaaS service that aggregates and centralizes management of every Cohesity cluster a customer owns, regardless of whether those clusters live on-premises, in a public cloud, or in a hybrid topology [Source: https://www.cohesity.com/products/helios/]. From an architect’s standpoint, Helios is the single pane of glass that turns a fleet of independent clusters into one logical, policy-governed estate.
Helios Architecture and Tenancy Model
Architecturally, Helios is a multi-tenant cloud service hosted by Cohesity. Each customer gets a Helios account (a tenant), and clusters are bound to that account during onboarding. Connectivity from cluster to Helios is outbound HTTPS over TCP/443 — Cohesity does not require any inbound openings on customer firewalls, which is exactly what enterprise security architects want to hear [Source: https://docs.cohesity.com/]. The cluster establishes a long-lived TLS session to Helios, sends telemetry and accepts pushed configuration, and exposes a relay channel that lets a Helios admin click “Launch Cluster UI” and proxy into the per-cluster console without a VPN.
| Helios Layer | Responsibility | Lives Where |
|---|---|---|
| Helios UI / API | Single pane of glass; entry point for admins, APIs, dashboards | Cohesity SaaS |
| Helios services (telemetry, search index, reporting) | Aggregate cluster metadata, run anomaly detection | Cohesity SaaS |
| Cluster agent / Helios connector | Outbound HTTPS tunnel from each cluster | On each managed cluster |
| Data plane (SpanFS, Bridge, Magneto) | Stores and protects data; Helios does not see data, only metadata | On each managed cluster |
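Before onboarding, it is worth confirming that the cluster network really can open an outbound TCP/443 session, since that is the only path the connector uses. The sketch below is a generic reachability probe using Python's standard library; it is not a Cohesity tool, and the hostname you pass is whatever endpoint your firewall team has allow-listed.

```python
import socket


def can_connect_outbound(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A call such as `can_connect_outbound("helios.example.com")` (placeholder hostname) only proves that the TCP handshake succeeds; it does not validate TLS inspection or proxy behavior, which need a separate test.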
Figure 13.1: Helios SaaS control plane topology — outbound HTTPS aggregation of multi-cluster, multi-cloud, and DMaaS estates
flowchart LR
subgraph OnPrem["On-Premises Data Centers"]
C1[Cluster A<br/>NYC]
C2[Cluster B<br/>Frankfurt]
C3[Cluster C<br/>Singapore]
end
subgraph Cloud["Public Cloud"]
CE1[Cloud Edition<br/>AWS us-east-1]
CE2[Cloud Edition<br/>Azure westeurope]
end
subgraph DMaaS["Cohesity-Operated"]
DM[DMaaS Tenant<br/>region-pinned]
FK[FortKnox<br/>Cyber Vault]
end
C1 -- "outbound HTTPS<br/>TCP/443" --> Helios
C2 -- "outbound HTTPS<br/>TCP/443" --> Helios
C3 -- "outbound HTTPS<br/>TCP/443" --> Helios
CE1 -- "outbound HTTPS" --> Helios
CE2 -- "outbound HTTPS" --> Helios
DM --> Helios
FK --> Helios
Helios[("Helios SaaS<br/>Control Plane<br/>UI / API / Reporting<br/>DataHawk / Anomaly")]
Helios -- "HTTPS" --> Browser[Admin Browser /<br/>API Client]
The crucial separation is that Helios is a control plane only. Backup data, deduplication chunks, and SpanFS metadata never leave the cluster. What flows to Helios is operational metadata — job status, capacity, alerts, source inventory, policy IDs — which is the basis for both the global dashboard and the AI/ML anomaly detection [Source: https://www.cohesity.com/products/helios/].
Onboarding Clusters to Helios
Onboarding is intentionally trivial: in the cluster UI under Settings → Helios Registration, the admin enters the Helios account credentials, Helios issues a registration token, and the cluster establishes the outbound tunnel. From that moment on, the cluster appears in the Helios dashboard. Re-registering or detaching a cluster is equally a one-click operation from the Helios side. For dark sites or air-gapped environments where outbound HTTPS is forbidden, Cohesity offers Helios Self-Managed, a customer-hosted variant of the Helios services that runs on customer infrastructure and provides equivalent fleet management without depending on Cohesity’s SaaS [Source: https://www.cohesity.com/products/helios/]. Self-Managed Helios is the standard answer for federal classified environments, certain regulated banks, and intelligence-sector deployments.
Global Dashboards and SLA Reporting
Helios presents a unified, real-time view that aggregates health, capacity, protection status, SLA compliance, and performance metrics from every managed cluster. Operators drill from a fleet-level summary down to a single job on a single node without switching tools. The reporting engine produces customizable reports covering backup success rates, RPO/RTO compliance, storage consumption, growth trends, chargeback by tenant or organization, and audit/compliance evidence; reports can be scheduled and exported, which is essential for ITGC and SOC 2 audit cycles [Source: https://www.cohesity.com/products/helios/].
Federated global search lets an admin search for a VM, file, mailbox object, or database backup by name across the entire fleet; Helios resolves which cluster(s) hold the data. This is foundational for incident response (you don’t have to know in advance which cluster has the backup) and for legal/eDiscovery workflows that span sites.
Federated RBAC layers on top: granular roles can be scoped to specific clusters, regions, organizations, or object types, and can be sourced from Okta, Azure AD/Entra ID, Active Directory, or any SAML 2.0 IdP. A regional admin who can only see and recover within EMEA clusters is a one-line RBAC scope, not a per-cluster configuration project.
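Conceptually, a scoped role is a filter applied to the fleet inventory. The sketch below models that idea only; it is not the Helios RBAC schema, and `RoleScope`, `can_access`, and the cluster and region labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleScope:
    """Hypothetical scope: an empty set means 'no restriction on this dimension'."""
    regions: frozenset = frozenset()
    clusters: frozenset = frozenset()


def can_access(scope: RoleScope, cluster_name: str, cluster_region: str) -> bool:
    """A cluster is visible only if it passes every restriction the scope defines."""
    region_ok = not scope.regions or cluster_region in scope.regions
    cluster_ok = not scope.clusters or cluster_name in scope.clusters
    return region_ok and cluster_ok


emea_admin = RoleScope(regions=frozenset({"EMEA"}))
print(can_access(emea_admin, "cluster-b-frankfurt", "EMEA"))  # True
print(can_access(emea_admin, "cluster-a-nyc", "AMER"))        # False
```

The design point is that the scope travels with the role, so adding a new EMEA cluster to the fleet requires no per-cluster RBAC work.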
Helios-Only Features (DataHawk, FortKnox, Anomaly Detection)
Several Cohesity capabilities are Helios-resident by design — they have no on-cluster equivalent, because they need fleet-wide telemetry or cloud-side compute:
- DataHawk — ML-based ransomware detection, threat intelligence (YARA), and data classification, scoring backup ingest patterns across the fleet to surface anomalies.
- FortKnox — managed cyber vault that air-gaps an immutable copy of backups in a Cohesity-operated cloud tenant; it is provisioned, monitored, and recovered through Helios.
- Cross-cluster anomaly detection — sudden change-rate spikes on one cluster correlated with similar patterns elsewhere, a signal you can only compute fleet-wide.
Key Takeaway: Helios is the SaaS control plane that converts N independent Cohesity clusters into one managed estate. Connectivity is outbound HTTPS only; backup data never leaves the cluster; and global dashboards, federated RBAC, global search, and Helios-only features (DataHawk, FortKnox) are the architectural payoff. For dark sites, Helios Self-Managed delivers the same model on customer infrastructure.
Helios as a Service (HaaS) and DMaaS
Helios is not just a console; it is also the entry point and management surface for Cohesity’s Data Management as a Service offerings. DMaaS shifts the operating model from “I run the cluster” to “I subscribe to backup outcomes” — Cohesity operates the underlying clusters in the cloud, and the customer consumes them through Helios.
DataProtect Delivered as a Service
DMaaS bundles DataProtect, replication, archive, and recovery as a fully managed SaaS service [Source: https://www.cohesity.com/products/data-management-as-a-service/]. The customer points sources (VMs, M365 tenants, databases, NAS) at the DMaaS endpoint; Cohesity provisions the underlying SpanFS capacity, runs the backup jobs, manages upgrades, and meters consumption. There is no cluster to bootstrap, no node to replace, and no version to upgrade — the SLA covers the platform.
| Operating Model | What the Customer Owns | What Cohesity Owns |
|---|---|---|
| Self-managed DataProtect (on-prem cluster) | Hardware, OS, network, cluster software, policies, sources | Software releases, support, Helios SaaS |
| Self-managed DataProtect (Cloud Edition) | Cloud VM/IaaS bill, cluster software ops, policies | Software, Helios SaaS |
| DMaaS (DataProtect as a Service) | Sources, policies, RBAC, data residency choice | Cluster, capacity, upgrades, SLA, infrastructure |
| FortKnox cyber vault | Vault policies, recovery decisions | Vault infrastructure (immutable, air-gapped) |
Region Selection and Data Residency
DMaaS is provisioned into specific cloud regions (AWS or Azure, depending on offering and customer choice). The architect’s job is to map regulatory boundaries — GDPR for EU data, data sovereignty laws in countries like Germany, Switzerland, India, Australia, and Canada — to a region selection. Because backup data carries the same residency obligations as primary data, picking a US region for an EU tenant’s M365 backups is not an option you can quietly hand-wave past an auditor. Helios surfaces region choice during DMaaS subscription and pins data residency for the lifetime of the tenant.
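Since the region is pinned for the lifetime of the tenant, a design review should check the residency mapping before anyone subscribes. The following sketch is illustrative only: the approved-region table is a placeholder that a compliance team would supply, the region labels are borrowed from Figure 13.1 and say nothing about actual DMaaS region availability, and `residency_violations` is a hypothetical helper.

```python
# Placeholder policy table: regions approved per jurisdiction by the compliance team.
APPROVED_REGIONS = {
    "EU": {"Azure westeurope"},
    "US": {"AWS us-east-1"},
}


def residency_violations(tenants, approved=APPROVED_REGIONS):
    """Return the (tenant, jurisdiction, region) entries whose region is not approved.

    tenants: iterable of (tenant_name, jurisdiction, chosen_region).
    """
    violations = []
    for name, jurisdiction, region in tenants:
        if region not in approved.get(jurisdiction, set()):
            violations.append((name, jurisdiction, region))
    return violations


plan = [
    ("emea-m365", "EU", "AWS us-east-1"),     # EU data placed in a US region
    ("emea-vms", "EU", "Azure westeurope"),
]
print(residency_violations(plan))  # [('emea-m365', 'EU', 'AWS us-east-1')]
```

An unknown jurisdiction is treated as a violation on purpose, so that a missing policy entry blocks the subscription rather than silently passing.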
Subscription and Licensing Implications
DMaaS is sold on a subscription basis (typically per FETB-month or per workload tier), versus the perpetual or term license model common to self-managed DataProtect. This shifts the economics from CapEx + ongoing maintenance to pure OpEx. Architects sizing DMaaS apply the same FETB and change-rate inputs from Chapter 3 but must additionally model:
- Egress — recall traffic leaving the cloud, billed by the cloud provider through Cohesity’s pass-through.
- Long-term retention — DMaaS includes integration with cloud archive tiers (S3 Glacier, Azure Archive); cost optimization here is identical in spirit to the CloudArchive content from Chapter 10.
- Replication scope — replication between DMaaS regions or between DMaaS and self-managed clusters is supported and factored into subscription tiers.
On-Prem vs. SaaS Operating Models — Comparison
| Concern | On-Prem DataProtect | DMaaS |
|---|---|---|
| Cluster ops (upgrades, hardware) | Customer | Cohesity |
| Capacity planning | Customer (sizing tool, refresh cycles) | Cohesity (elastic) |
| Network ingress | LAN-speed to cluster | WAN egress to cloud (mind change rate) |
| Data residency | Wherever you put the cluster | Region selection at subscription time |
| Cost model | CapEx + maintenance | OpEx subscription |
| Best fit | Large, dense, predictable workloads; latency-sensitive recoveries | M365, branch offices, cloud-native apps, fast time-to-value |
Key Takeaway: DMaaS is DataProtect delivered as a managed SaaS service through Helios. Architects choose a region for residency, subscribe by FETB/workload, and consume backup as an outcome rather than an appliance. The same Helios console manages DMaaS tenants and self-managed clusters side by side, which is the architectural enabler for hybrid fleets.
Marketplace Apps
The Cohesity Marketplace is the storefront and delivery mechanism for first- and third-party applications that run directly on the Cohesity DataPlatform, leveraging the Cohesity App framework [Source: https://www.cohesity.com/marketplace/]. The architectural value proposition is compute at data: instead of egressing backup data to a separate analytics, AV, or compliance system, applications run in containers next to SpanFS where the data already lives, eliminating data movement, copy proliferation, and the egress cost/latency tax.
App Framework, Apollo, and Isolation
The Cohesity App Framework is built on containers. Apps are packaged as Docker images plus a Cohesity AppSpec — a Kubernetes-style YAML descriptor extended with Cohesity-specific fields [Source: https://developer.cohesity.com/docs/get-started-apps]. The cluster’s container runtime, Apollo (introduced in the Pegasus 6.3 release line), executes the image. The AppSpec declares resources, view mounts, network requirements, and lifecycle hooks. Each app runs in its own Docker container — a clean isolation boundary that prevents one Marketplace app from interfering with another or with cluster services.
A minimal AppSpec looks like this (excerpted from the Cohesity App SDK documentation):
apiVersion: cohesity.com/v1
kind: App
metadata:
  name: clamav-scanner
  version: 1.4.0
spec:
  image: cohesity-marketplace/clamav:1.4.0
  resources:
    cpu: "2"
    memory: "4Gi"
  views:
    - name: vm-backups
      mountPath: /mnt/backups
      mode: ReadOnly
  network:
    egress: false
  lifecycle:
    onStart: /opt/clamav/scan.sh
Figure 13.2: Marketplace app deployment sequence — Admin to Helios to Apollo runtime to container, with SDK callbacks
sequenceDiagram
participant Admin as Admin
participant Helios as Helios UI
participant Cluster as Target Cluster
participant Apollo as Apollo Docker Runtime
participant Container as App Container
participant Views as SpanFS Views
Admin->>Helios: Browse Marketplace, select app
Admin->>Helios: Accept EULA, choose target cluster(s)
Helios->>Cluster: Push AppSpec YAML + image ref
Cluster->>Apollo: Register app, enforce resources
Apollo->>Container: docker run (image, CPU/mem quotas)
Container->>Views: Mount authorized views (NFS/SMB, ReadOnly)
Container->>Container: Run lifecycle.onStart hook
Container->>Cluster: Management SDK call (REST API v2)
Cluster-->>Container: Snapshot / metadata response
Container->>Helios: Telemetry / status (egress=false respected)
Helios-->>Admin: App status: Running
Three security properties are worth pulling out:
- views. The only sanctioned data path. The app sees backup data through view mounts (NFS/SMB-backed), restricted to the views the admin authorizes. There is no direct SpanFS access.
- network.egress: false. Apps can be locked into air-gapped operation, supporting dark-site and high-security deployments. This is exactly how SentinelOne scanning works without phoning home.
- resources. Apollo enforces CPU/memory quotas declared in the AppSpec, so a misbehaving app cannot starve the cluster.
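The three properties can be checked mechanically before an AppSpec is submitted. The following is a minimal sketch, assuming the AppSpec has already been parsed into a Python dict with the same shape as the excerpt shown earlier; `appspec_findings` is a hypothetical helper for illustration, not part of any Cohesity SDK.

```python
# Hypothetical pre-deployment check; the dict mirrors the AppSpec excerpt above.
def appspec_findings(spec: dict) -> list[str]:
    """Return a list of findings; an empty list means all three properties hold."""
    findings = []
    body = spec.get("spec", {})

    # 1. Views are the only data path, and each mount should be read-only.
    views = body.get("views", [])
    if not views:
        findings.append("no view mounts declared")
    for view in views:
        if view.get("mode") != "ReadOnly":
            findings.append(f"view {view.get('name')} is not ReadOnly")

    # 2. Air-gapped operation requires egress to be explicitly false.
    if body.get("network", {}).get("egress") is not False:
        findings.append("network.egress is not false")

    # 3. Quotas can only be enforced if CPU and memory are declared.
    resources = body.get("resources", {})
    for key in ("cpu", "memory"):
        if key not in resources:
            findings.append(f"resources.{key} is missing")
    return findings


clamav = {
    "metadata": {"name": "clamav-scanner", "version": "1.4.0"},
    "spec": {
        "resources": {"cpu": "2", "memory": "4Gi"},
        "views": [{"name": "vm-backups", "mountPath": "/mnt/backups", "mode": "ReadOnly"}],
        "network": {"egress": False},
    },
}
print(appspec_findings(clamav))
```

An empty result means the spec satisfies all three checks; a reviewer could run the same function in a CI gate before the image is pushed for vetting.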
Marketplace Access and Vetted Apps
Apps are browsed and installed through the Helios UI or the public storefront at ccs-integration-marketplace.cohesity.com [Source: https://ccs-integration-marketplace.cohesity.com]. The deployment workflow is straightforward: select the target cluster(s), review and accept the EULA, then deploy. Installation is fleet-aware — an architect can push the same app to a curated subset of clusters via Helios.
| App | Purpose | Typical Use Case |
|---|---|---|
| SentinelOne | Endpoint/AV scanning of backup data, no internet egress required | Validate that backups are clean before relying on them for recovery |
| ClamAV | Open-source antivirus scanning of NAS and VM snapshots | Cost-effective AV for compliance check-the-box |
| Sophos | Commercial antivirus alternative to ClamAV | Enterprise environments standardized on Sophos |
| Splunk Enterprise | Log analytics and SIEM ingest running on the cluster | Search audit logs and backup metadata in place |
| Imanis Data | NoSQL/Hadoop backup integration (since acquired by Cohesity) | MongoDB, Cassandra, Hadoop, Couchbase backup |
Custom App Development Overview
Cohesity ships two SDKs for developers building custom apps [Source: https://developer.cohesity.com/docs/get-started-apps]:
- App SDK: provides primitives like cohesity_mount for mounting Cohesity Views into the container so the app can read/scan/analyze backup data directly.
- Management SDK: exposes the Cohesity REST API surface from inside the container so apps can drive cluster operations (start scans, create snapshots, fetch metadata).
The publish flow is:
- Build a Docker image bundling the app and the Cohesity SDKs.
- Author an AppSpec YAML and validate with appspecvalidator.
- Submit for vetting (developer@cohesity.com). Cohesity reviews for security, stability, and resource behavior before listing.
- On approval, the app appears in Marketplace for customer deployment.
The vetting gate is the first defense; container isolation is the second; view-scoped mounts are the third. Together they form a defense-in-depth posture that makes “third-party code on my backup cluster” an acceptable architectural decision, not a compliance red flag.
Key Takeaway: Marketplace apps run as Docker containers under Apollo, declared via AppSpec YAML, isolated from each other and from cluster services, and limited to admin-authorized view mounts as the only data path. This enables compute-at-data analytics, AV scanning, and SIEM ingest without egressing backups, with vetting + isolation + view scope providing defense in depth.
Automation Stack
Helios and Marketplace solve the interactive operating model. The programmatic operating model is REST API v2 plus the language-specific wrappers — Terraform, Ansible, and PowerShell. At CCAE scale, automating policy and protection-group management is the only scalable path; manually clicking through hundreds of policies across dozens of clusters is both error-prone and untraceable.
REST API v2
All higher-level tools sit on top of the cluster’s REST API v2, supported on cluster versions 6.3.1+ [Source: https://developer.cohesity.com/]. The API covers protection groups, policies, sources, storage domains, views, alerts, recoveries, and tenants. Direct REST is the right choice when no higher-level wrapper exists yet, or for purpose-built integrations with ITSM (ServiceNow), SIEM (Splunk, Sentinel), or custom self-service portals.
Figure 13.3: REST API v2 surface map — auth and the principal resource hierarchy under the cluster control plane
graph TD
Auth["/v2/mcm/access-tokens<br/>(Bearer token / Helios API key)"] --> Root["REST API v2 Root<br/>cluster 6.3.1+"]
Root --> Policies["/data-protect/policies<br/>(RPO, retention, archive, replication)"]
Root --> Groups["/data-protect/protection-groups<br/>(jobs bound to policies)"]
Root --> Sources["/data-protect/sources<br/>(vCenter, M365, NAS, DBs)"]
Root --> Views["/file-services/views<br/>(NFS/SMB/S3 namespaces)"]
Root --> Storage["/storage-domains<br/>(dedup/encryption domains)"]
Root --> Recoveries["/data-protect/recoveries<br/>(restore tasks)"]
Root --> Alerts["/monitoring/alerts<br/>(events, severities)"]
Root --> Reports["/reports<br/>(SLA, capacity, audit)"]
Root --> Tenants["/multi-tenancy/tenants<br/>(orgs, RBAC)"]
Groups -. "policy_id" .-> Policies
Groups -. "source_id" .-> Sources
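Authentication uses a bearer token obtained from the cluster's access-token endpoint, or a Helios API key for fleet-wide calls. The v1 API exposes token creation at /public/accessTokens; the example below uses the v2-style path shown in Figure 13.3, so confirm the exact path against the API reference for your cluster version. The password is read from a dedicated COHESITY_PASSWORD environment variable; avoid $PWD, which the shell reserves for the current working directory. A simple example fetching all protection groups:
# -k skips TLS certificate verification; use it only against lab clusters.
TOKEN=$(curl -sk -X POST https://cluster.example.com/v2/mcm/access-tokens \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"'"$COHESITY_PASSWORD"'","domain":"LOCAL"}' \
  | jq -r .accessToken)
# Present the token as a bearer credential on every subsequent call.
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://cluster.example.com/v2/data-protect/protection-groups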
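For integrations written in a general-purpose language, the same two-step flow can be sketched with the Python standard library. This is an illustrative sketch only: it reuses the placeholder host and the endpoint paths from the curl example without verifying them, and the helper names `token_request` and `groups_request` are hypothetical. The requests are built but not sent, so the structure can be inspected offline.

```python
import json
import urllib.request

BASE_URL = "https://cluster.example.com"  # same placeholder host as the curl example


def token_request(username: str, password: str, domain: str = "LOCAL") -> urllib.request.Request:
    """Build (but do not send) the POST that exchanges credentials for a token."""
    body = json.dumps({"username": username, "password": password, "domain": domain})
    return urllib.request.Request(
        f"{BASE_URL}/v2/mcm/access-tokens",  # path copied from the curl example
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def groups_request(token: str) -> urllib.request.Request:
    """Build the authenticated GET for the protection-group listing."""
    return urllib.request.Request(
        f"{BASE_URL}/v2/data-protect/protection-groups",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


req = token_request("admin", "example-password")
print(req.get_method(), req.full_url)
```

To execute the calls, pass each request to `urllib.request.urlopen`, parse the first response as JSON, and feed its `accessToken` field (the same field the `jq` filter extracts) into `groups_request`.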
Terraform Provider (cohesity/cohesity)
Published on the HashiCorp Terraform Registry as cohesity/cohesity and listed on the Cohesity Marketplace [Source: https://registry.terraform.io/providers/cohesity/cohesity/latest/docs], the Terraform provider is the right tool for declarative cluster-side configuration: storage domains, views, policies, protection groups, RBAC, replication targets. It is the source of truth checked into Git.
provider "cohesity" {
  cluster_vip      = "10.0.0.10"
  cluster_username = "admin"
  cluster_password = var.cohesity_password
}
Architects declare the full protection topology as code, run terraform plan to preview drift, and apply changes uniformly across environments. This is foundational for CI/CD-driven backup management and for keeping non-prod and prod policies aligned.
Ansible Collection (cohesity.dataprotect)
Installed with ansible-galaxy collection install cohesity.dataprotect, the Ansible collection is the right tool for source-side mutation: rolling agent installs across fleets, registering sources, and triggering on-demand jobs [Source: https://github.com/cohesity/ansible-collection]. Notable modules:
- cohesity_uda_protection_group: register, remove, start, and stop universal data adapter protection groups [Source: https://galaxy.ansible.com/ui/repo/published/cohesity/dataprotect/].
- Source/agent modules: install and register Cohesity backup agents on Windows, Linux, and macOS targets.
Ansible is push-based and idempotent; it integrates cleanly with Tower / Ansible Automation Platform (AAP) for RBAC and audit.
PowerShell Module
Cohesity publishes a cross-platform PowerShell module (Windows, Linux, macOS via PowerShell 7+) that wraps REST API v2 in cmdlets [Source: https://cohesity.github.io/cohesity-powershell-module/]. It is the natural choice for Windows-centric shops automating protection groups, policies, recoveries, and reporting from existing PowerShell-based tooling and runbooks.
Connect-CohesityCluster -Server cluster.example.com -Credential $cred
Get-CohesityProtectionGroup | Where-Object { $_.policyId -eq $goldPolicy.id }
Choosing the Right Tool — Comparison Table
| Dimension | Terraform | Ansible | PowerShell | Raw REST API v2 |
|---|---|---|---|---|
| Paradigm | Declarative IaC | Imperative + idempotent push | Imperative scripting | Direct HTTP |
| State management | Yes (.tfstate) | None (each run re-evaluates) | None | None |
| Best for | Cluster config: policies, groups, views, RBAC, replication | Source-side: agent installs, source registration, ad-hoc jobs | Windows ops, ad-hoc reporting, integration with existing PS tooling | ITSM/SIEM webhooks, custom portals, gap-fill where wrappers lag |
| CI/CD fit | Excellent (plan/apply, drift detection) | Good (AAP / Tower) | Moderate (Jenkins/PS pipelines) | Excellent (any HTTP-aware tool) |
| Drift detection | First-class | Re-runs converge but no diff | Manual | Manual |
| Skill curve | Medium (HCL, state) | Low–medium (YAML) | Low for Windows admins | High for non-trivial flows |
| Typical owner | Platform / SRE team | Server / source team | Windows ops team | Integration / dev team |
Figure 13.4: Automation tool selection decision tree — choosing among Terraform, Ansible, PowerShell, and raw REST
flowchart TD
Start{What are you<br/>automating?} --> Q1{Cluster-side config?<br/>policies, views,<br/>protection groups}
Q1 -- Yes --> Q2{Need declarative<br/>state + drift detection?}
Q2 -- Yes --> TF["Terraform<br/>cohesity/cohesity provider<br/>(.tfstate, plan/apply, Git)"]
Q2 -- No --> Q5
Q1 -- No --> Q3{Source-side push?<br/>agent installs,<br/>source registration}
Q3 -- Yes --> AN["Ansible<br/>cohesity.dataprotect collection<br/>(idempotent, AAP/Tower)"]
Q3 -- No --> Q4{Windows-native shop<br/>or ad-hoc reporting?}
Q4 -- Yes --> PS["PowerShell Module<br/>(PS 7+, cross-platform,<br/>Connect-CohesityCluster)"]
Q4 -- No --> Q5{Webhook from<br/>ITSM/SIEM, or gap<br/>in wrappers?}
Q5 -- Yes --> REST["Raw REST API v2<br/>(curl, any HTTP client,<br/>ServiceNow/Splunk)"]
Q5 -- No --> PS
The architect-level rule of thumb: Terraform for cluster-side IaC, Ansible for source-side push, PowerShell for Windows-native ops, REST when nothing else fits. The three wrappers are thin layers over REST API v2, so no tool boxes you in: each is a different ergonomic surface on the same control plane.
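The decision tree in Figure 13.4 is small enough to express as a function, which can be handy when documenting tool choices in a design review. A minimal sketch follows; the function and parameter names are illustrative, and the branches simply mirror the figure.

```python
def pick_tool(cluster_side: bool, needs_state: bool = False, source_push: bool = False,
              windows_or_adhoc: bool = False, webhook_or_gap: bool = False) -> str:
    """Walk the Figure 13.4 decision tree and return the suggested tool."""
    if cluster_side:
        # Q2: declarative state and drift detection point to Terraform.
        if needs_state:
            return "Terraform"
    elif source_push:
        # Q3: agent installs and source registration point to Ansible.
        return "Ansible"
    elif windows_or_adhoc:
        # Q4: Windows-native shops and ad-hoc reporting point to PowerShell.
        return "PowerShell"
    # Q5 is reached from "Q2 = No" or "Q4 = No".
    return "REST API v2" if webhook_or_gap else "PowerShell"


print(pick_tool(cluster_side=True, needs_state=True))
print(pick_tool(cluster_side=False, source_push=True))
```

Note that the tree falls back to PowerShell when no other branch matches, exactly as the figure does.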
Worked Example: Terraform Module — Gold Policy + Protection Group
Here is a concrete CCAE-style worked example: a Terraform module that creates a “Gold” protection policy (1-hour RPO, 30-day local retention, 1-year archive) and a protection group bound to it for a set of VMware VMs. This is the kind of artifact that lives in Git, is reviewed in PRs, and is applied via CI to every cluster in the fleet to guarantee identical Gold semantics everywhere.
# modules/gold-vmware/main.tf
terraform {
  required_providers {
    cohesity = {
      source  = "cohesity/cohesity"
      version = "~> 1.2"
    }
  }
}

variable "cluster_vip" { type = string }
variable "cluster_username" { type = string }
# HCL single-line blocks may hold only one argument, so this one is multi-line.
variable "cluster_password" {
  type      = string
  sensitive = true
}
variable "vcenter_source_id" { type = number }
variable "vm_object_ids" { type = list(number) }
variable "archive_target_id" { type = number }

provider "cohesity" {
  cluster_vip      = var.cluster_vip
  cluster_username = var.cluster_username
  cluster_password = var.cluster_password
}

# 1. Gold Policy: 1h RPO, 30d local, 1y archive
resource "cohesity_protection_policy" "gold" {
  name        = "Gold-1h-30d-1y"
  description = "Tier 1: 1h RPO, 30d local retention, 1y archive"

  backup_policy {
    regular {
      incremental {
        schedule {
          unit = "Hours"
          hour_schedule {
            frequency = 1
          }
        }
      }
      retention {
        unit     = "Days"
        duration = 30
      }
    }
  }

  remote_target_policy {
    archival_targets {
      target_id = var.archive_target_id
      schedule {
        unit      = "Weeks"
        frequency = 1
      }
      retention {
        unit     = "Years"
        duration = 1
      }
    }
  }
}

# 2. Protection Group bound to the Gold Policy
resource "cohesity_protection_group" "tier1_vms" {
  name        = "Tier1-VMware-Gold"
  policy_id   = cohesity_protection_policy.gold.id
  environment = "kVMware"

  vmware_params {
    source_id               = var.vcenter_source_id
    object_ids              = var.vm_object_ids
    app_consistent_snapshot = true
    indexing_policy {
      enable_indexing = true
    }
  }

  start_time {
    hour   = 22
    minute = 0
  }
}

output "policy_id" { value = cohesity_protection_policy.gold.id }
output "group_id" { value = cohesity_protection_group.tier1_vms.id }
A consuming root module then wires it up per cluster:
module "frankfurt_gold" {
  source            = "./modules/gold-vmware"
  cluster_vip       = "10.20.0.10"
  cluster_username  = "admin"
  cluster_password  = var.frankfurt_password
  vcenter_source_id = data.cohesity_source.frankfurt_vcenter.id
  vm_object_ids     = data.cohesity_vmware_vms.frankfurt_tier1.ids
  archive_target_id = data.cohesity_external_target.frankfurt_archive.id
}

module "singapore_gold" {
  source = "./modules/gold-vmware"
  # ...same variables, different cluster...
}
The architectural payoff: every Cohesity cluster in the fleet has a Gold-1h-30d-1y policy that means exactly the same thing — 1-hour incremental, 30-day local retention, 1-year archive, app-consistent VMware snapshots, indexed for search, weekly archive cadence. A terraform plan after any drift instantly shows the deviation. A change request to extend retention to 45 days is a one-line PR that cascades to every cluster on merge.
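To make the drift idea concrete, the sketch below compares a desired Gold definition against what each cluster reports. It is only a conceptual stand-in for what a plan step surfaces: the field names (`rpo_hours`, `local_retention_days`, `archive_retention_years`) and the `fleet` data are hypothetical and do not reflect the provider schema or any real cluster output.

```python
# Desired Gold semantics, expressed with hypothetical field names.
GOLD = {"rpo_hours": 1, "local_retention_days": 30, "archive_retention_years": 1}


def policy_drift(desired: dict, observed: dict) -> dict:
    """Return {field: (desired, observed)} for every field that differs."""
    return {
        field: (want, observed.get(field))
        for field, want in desired.items()
        if observed.get(field) != want
    }


# Illustrative per-cluster state; Singapore was hand-edited to 45 days.
fleet = {
    "frankfurt": {"rpo_hours": 1, "local_retention_days": 30, "archive_retention_years": 1},
    "singapore": {"rpo_hours": 1, "local_retention_days": 45, "archive_retention_years": 1},
}

for cluster, observed in fleet.items():
    print(cluster, policy_drift(GOLD, observed))
```

An empty dict means the cluster matches the declared policy; any other result names the field, the declared value, and the value found on the cluster.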
Key Takeaway: Terraform owns declarative cluster-side IaC (policies, protection groups, views, RBAC); Ansible owns source-side push (agents, source registration); PowerShell handles Windows-native ops; raw REST fills gaps and powers ITSM/SIEM webhooks. The three wrappers all sit on REST API v2, so pick the right ergonomics for the job and treat backup configuration as code in Git with CI/CD gates.
Chapter Summary
Helios is the architectural answer to fleet sprawl. As a SaaS control plane with outbound-HTTPS-only connectivity, it converts an arbitrary number of independent Cohesity clusters into one managed estate with shared dashboards, federated RBAC, global search, anomaly detection, and centralized policy authoring. For environments where SaaS is not acceptable, Helios Self-Managed delivers the same model on customer infrastructure for dark sites and air-gapped deployments. DMaaS extends the same Helios surface into a fully managed subscription model where Cohesity operates the cluster — region selection pins data residency, and the customer consumes backup as an outcome rather than an appliance.
The Marketplace brings third-party compute to where the data already lives: Apollo (the cluster’s Docker container runtime) executes apps declared by Kubernetes-style AppSpec YAML, isolated by container boundaries, scoped to admin-authorized view mounts, and optionally air-gapped from the internet. Vetted apps like SentinelOne, ClamAV, Splunk, and Imanis are installed fleet-wide through Helios; custom apps go through a SDK + AppSpec + Cohesity vetting pipeline.
The automation stack is the programmatic counterpart: REST API v2 is the bedrock, and Terraform (cohesity/cohesity), Ansible (cohesity.dataprotect), and the PowerShell module are the language-specific wrappers. Architects pair Terraform for declarative cluster-side IaC with Ansible for source-side push, use PowerShell for Windows-native operational work, and call REST directly when nothing else fits — all backed by Git, CI/CD, and policy review. The worked Terraform module for a Gold policy and protection group illustrates how a single, version-controlled artifact can guarantee identical SLA semantics across every cluster in a global fleet.
Carrying forward: in Chapter 14 we use Helios as the lens for monitoring, alerting, and triage — the operational consequence of having all this telemetry centralized.
Key Terms
- Helios — Cohesity’s SaaS control plane that aggregates management of all clusters into a single pane of glass; communicates with clusters over outbound HTTPS only.
- Helios Self-Managed — customer-hosted variant of Helios for dark sites and air-gapped environments, delivering equivalent fleet management without SaaS dependence.
- DMaaS (Data Management as a Service) — Cohesity-operated, subscription-delivered DataProtect (and adjacencies like FortKnox) consumed through Helios with region-pinned data residency.
- Marketplace — the storefront and delivery mechanism for first- and third-party apps (SentinelOne, ClamAV, Splunk, Sophos, Imanis) that run directly on Cohesity clusters via the App Framework.
- App framework / Apollo / AppSpec — the container runtime (Apollo, Docker-based) and Kubernetes-style YAML descriptor (AppSpec) that together package, deploy, and isolate Marketplace apps with declared resources, view mounts, and network policy.
- REST API v2 — the canonical Cohesity programmatic interface (cluster 6.3.1+) covering protection groups, policies, sources, views, alerts, and recoveries; the foundation under all higher-level wrappers.
- Ansible (cohesity.dataprotect): Cohesity’s official Ansible collection for push-based, idempotent source-side automation (agent installs, source registration, on-demand jobs).
- Terraform (cohesity/cohesity): Cohesity’s HashiCorp-registry Terraform provider for declarative, state-managed cluster-side IaC (policies, protection groups, views, RBAC, replication targets).
- PowerShell module: cross-platform (PS 7+) Cohesity cmdlet library wrapping REST API v2; the natural fit for Windows-centric shops and ad-hoc operational scripting.
- DataHawk / FortKnox — Helios-resident security capabilities: ML-based ransomware detection / classification (DataHawk) and managed cyber-vault immutable cloud copy (FortKnox), neither of which has an on-cluster equivalent.
Chapter 14: Performance, Monitoring, and Troubleshooting
If the previous chapters explained how to design and operate a Cohesity estate when everything is healthy, this chapter is about what to do when it is not. A backup that ran in 45 minutes last week is now taking six hours. A replication target is two days behind. A node has gone dark and the cluster is alerting about quorum risk. The CCAE-level architect is expected to do more than restart services and hope — they are expected to localize bottlenecks, generate the right diagnostic artifacts, integrate alerts into the customer’s operational tooling, and engage Cohesity Support effectively when escalation is warranted.
The discipline that ties all of this together is structured triage: walk the data path end to end, measure each segment, and find the lowest sustained throughput point. Whether the symptom is a slow backup, a lagging replica, or a degraded recovery, the investigation always reduces to the same question — where in the chain is the bottleneck, and what evidence proves it?
Learning Objectives
By the end of this chapter you will be able to:
- Diagnose performance bottlenecks across the source -> network -> ingest -> NVRAM -> writer data path and identify which segment is the limiter.
- Use Cohesity statistics, iris_cli, the Siren UI, and Helios alerts to triage incidents methodically.
- Generate scoped log bundles and engage Cohesity Support with the artifacts they need on the first round-trip.
- Configure Helios alerting across email/SMTP, SNMP, syslog, and webhook channels, including SMTP validation.
- Recognize and respond to common failure modes — backup re-runs, replication lag, disk and node failures, and network partitions — and design resiliency for each.
Performance Bottleneck Analysis
A Cohesity backup is a multi-stage assembly line. Bytes are read from a source (a VM disk, a NAS share, a database), pushed across the network through a proxy or agent, ingested by a Cohesity node, journaled into NVRAM, and finally destaged onto SSD or HDD by the writer service. Each stage has a maximum sustainable throughput, and the slowest stage sets the throughput of the whole line. That is the definition of a bottleneck, and finding it is the architect’s primary skill in this chapter.
The Assembly-Line Analogy
Picture a five-station automotive assembly line. Station 1 attaches the chassis (source read), station 2 ships sub-assemblies between buildings (network), station 3 receives parts at the factory dock (cluster ingest), station 4 stages parts into a buffer area (NVRAM), and station 5 mounts them on the vehicle (writer/disk). If station 3 can only accept one chassis per minute while station 1 can produce three, you will see chassis piling up at the dock — but if you stand at station 5 and stare at the empty workstation, you will conclude (incorrectly) that the line is “slow.” The fix is not to add more workers at the end; it is to identify which station is the actual choke point and rebalance there.
Cohesity bottlenecks behave identically. A symptom at the writer (high write latency, growing queues) does not mean the writer is the problem — it may mean the source is feeding faster than the cluster can persist, or that NVRAM destage is back-pressuring upstream. Without measurements at each station, you are guessing.
The Five-Stage Data Path
[Source App/VM] -> [Network/Proxy] -> [Cluster Ingest] -> [NVRAM Journal] -> [Writer -> SSD/HDD]
    Stage 1            Stage 2            Stage 3             Stage 4              Stage 5
Each stage has characteristic symptoms when it is the limiter:
| Stage | Symptom Pattern | Telltale Metric |
|---|---|---|
| 1. Source | Low source read MB/s, idle network, idle writers | Source read latency high; CBT/RCT slow; storage array saturated |
| 2. Network | Source ready, cluster idle, low end-to-end MB/s | iperf below link speed; retransmits; wrong VLAN/uplink |
| 3. Ingest | Network saturated but cluster CPU/proxy queues high | Proxy concurrency exhausted; hypervisor NBD/HotAdd contention |
| 4. NVRAM | Bursty throughput, periodic stalls, destage backpressure | NVRAM journal utilization high, destage queue growing |
| 5. Writer | Network healthy, NVRAM filling, write latency rising | Writer latency, disk queue depth, SSD/HDD saturation |
The most common Cohesity-side culprit is the target disk writer spending excessive time persisting data to the underlying tier — high writer latency or growing write queues even while network ingest looks healthy. NVRAM behavior is the leading indicator: incoming write batches land in NVRAM-backed journals before destaging, so saturating NVRAM or seeing destage backpressure is a strong signal that the cluster (not the source) is the limiter [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].
Figure 14.1: Bottleneck triage decision tree across the five-stage data path
flowchart TD
Start([Slow backup symptom]) --> NetCheck{iperf >= link speed?}
NetCheck -->|No| Stage2[Stage 2: Network<br/>Fix VLAN / uplink / MTU]
NetCheck -->|Yes| SrcCheck{Source MB/s and<br/>writer MB/s both low<br/>and balanced?}
SrcCheck -->|Yes| Stage1[Stage 1: Source<br/>Investigate array,<br/>hypervisor, agent]
SrcCheck -->|No| NvramCheck{Writer latency high<br/>or NVRAM saturated?}
NvramCheck -->|Yes| Stage45[Stages 4-5: NVRAM / Writer<br/>Cluster-side limiter:<br/>capacity, sizing]
NvramCheck -->|No| ConcCheck{Per-stream healthy<br/>but total throughput low?}
ConcCheck -->|Yes| Stage3[Stage 3: Ingest / Concurrency<br/>Add proxies, raise concurrency,<br/>split groups]
ConcCheck -->|No| Bundle[Generate Siren log bundle<br/>Engage Cohesity Support]
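The triage order in Figure 14.1 can be captured as a short function so that the checks are always applied in the same sequence. This is a sketch under stated assumptions: the `Observations` fields are hypothetical booleans that an engineer fills in after reading iperf output and job statistics; nothing here queries a cluster.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    network_at_link_speed: bool        # iperf result close to nominal link speed
    read_write_low_and_balanced: bool  # source and writer MB/s both low, roughly equal
    writer_or_nvram_pressure: bool     # high writer latency or saturated NVRAM
    streams_ok_total_low: bool         # each stream healthy, aggregate throughput low


def triage(o: Observations) -> str:
    """Apply the Figure 14.1 checks in order and name the limiting stage."""
    if not o.network_at_link_speed:
        return "Stage 2: network"
    if o.read_write_low_and_balanced:
        return "Stage 1: source"
    if o.writer_or_nvram_pressure:
        return "Stages 4-5: NVRAM / writer"
    if o.streams_ok_total_low:
        return "Stage 3: ingest / concurrency"
    return "Collect log bundle and engage Support"


print(triage(Observations(True, True, False, False)))
```

The ordering matters: the network check comes first because a failed baseline invalidates every downstream measurement.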
Establishing a Network Baseline with iperf
Before blaming Cohesity, prove the substrate works. The standard baseline test is iperf3 between a source host (or backup proxy) and one or more Cohesity nodes. Run it bidirectionally over the same NIC and VLAN that backup traffic uses. If iperf reports 9.4 Gb/s on a nominal 10 GbE link, the network is healthy. If it reports 940 Mb/s, you have just found a 1 GbE path masquerading as 10 GbE, and no amount of backup tuning will fix it.
A real-world Cohesity case showed dramatically improved backup times after forcing all Cohesity backup traffic onto a dedicated 10 GbE VLAN and a dedicated vSphere VMkernel port, instead of letting it traverse multiple shared paths [Source: https://www.cohesity.com/blogs/optimizing-cohesity-and-vsphere-networking/]. The vSwitch teaming policy and VMkernel binding (NBD/HotAdd traffic) frequently route backups onto the wrong uplink without any UI indication. iperf surfaces this in 30 seconds.
A 1 GbE versus 10 GbE path makes a 10x difference in achievable ingest, which alone explains a large fraction of “slow job” tickets.
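A quick way to turn an iperf reading into a verdict is to compare it with the nominal link speed. The sketch below assumes a 90% threshold for "healthy", which is an illustrative cut-off rather than a Cohesity recommendation; the function names are hypothetical.

```python
def link_efficiency(measured_gbps: float, nominal_gbps: float) -> float:
    """Fraction of the nominal link speed that iperf actually achieved."""
    return measured_gbps / nominal_gbps


def classify_link(measured_gbps: float, nominal_gbps: float, healthy_ratio: float = 0.9) -> str:
    """Label the path 'healthy' or 'degraded' (healthy_ratio is an assumed threshold)."""
    if link_efficiency(measured_gbps, nominal_gbps) >= healthy_ratio:
        return "healthy"
    return "degraded"


# The two readings discussed above, both on a nominal 10 GbE link.
print(classify_link(9.4, 10.0))
print(classify_link(0.94, 10.0))
```

The second reading lands at roughly a tenth of the nominal speed, which is the signature of a 1 GbE hop somewhere in the path.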
The SOBR Performance Trap (140 vs 20-30 MB/s)
When third-party tools like Veeam present Cohesity as a Scale-Out Backup Repository (SOBR), throughput can collapse — from roughly 140 MB/s on a single-node Cohesity repository down to 20-30 MB/s once SOBR is enabled with the default placement policy. The remediation is to set the SOBR placement policy to Performance rather than Data Locality, restoring parallel ingest across nodes [Source: https://forums.veeam.com/veeam-backup-replication-f2/veeam-9-5-and-cohesity-t49454.html].
This is a textbook case where the cluster looks underused and the source looks healthy, yet end-to-end throughput is dismal. The bottleneck is in the upstream construct’s placement logic, not in Cohesity itself.
| SOBR Placement Policy | Behavior | Throughput |
|---|---|---|
| Data Locality (default) | Pin a backup chain to one extent | ~20-30 MB/s — single node serializes |
| Performance | Stripe across SOBR extents in parallel | ~140 MB/s — nodes work in parallel |
Other Common Source-Side and Path Limiters
- Antivirus on gateway / proxy servers. Physical-agent proxies and NAS scan hosts inspecting in-flight backup streams will throttle throughput severely. Exclude Cohesity processes and the staging directories from on-access scanning.
- NAS incremental performance. Historically slow on certain DataProtect versions; reportedly improved in 6.4.0. Always cross-check release notes during triage [Source: https://www.peerspot.com/questions/what-needs-improvement-with-cohesity-dataprotect].
- S3/object target tuning. For cloud-tier targets, increasing the multipart chunk size (e.g., s3cmd --multipart-chunk-size-mb, or the AWS CLI multipart_chunksize setting) can yield significant gains when the cloud-write pipeline is the choke point.
- Hypervisor CBT/RCT regressions. When CBT is reset or corrupted, the next backup falls back to a full read of the disk, dropping effective throughput by an order of magnitude.
Worked Example: A Backup Running at 20 MB/s
A protection group that normally completes 4 TB in 6 hours (~190 MB/s aggregate) is now running at 20 MB/s and projecting a 60-hour runtime. Walk the path:
- Establish a baseline. Run iperf3 from the vSphere proxy host to three Cohesity node primary IPs. Result: 9.3 Gb/s sustained. Network substrate is healthy. Eliminate Stage 2 as the prime suspect.
- Inspect job stats in Helios and iris_cli. Source read MB/s is 22, writer MB/s is 21, writer latency is normal, NVRAM is not saturated. Reads and writes are balanced and low, so the cluster is not being fed. The bottleneck is upstream of ingest.
- Check the source. Open vCenter; the protected VM disks live on a datastore whose backing array shows queue depth 95% and read latency 40 ms. Stage 1 is the limiter: the source array cannot deliver bytes faster.
- Check for proxy/path issues. Confirm the backup proxy is using NBD over the dedicated 10 GbE VMkernel and not falling back to NBDSSL on the management network. It is correct.
- Look for known-issue overlays. Review release notes for the running DataProtect build. No CBT regression listed.
- Conclusion and remediation. The bottleneck is the source storage array under contention from a non-backup workload (a SQL re-index running concurrently). Reschedule the protection group outside the re-index window, or move the affected VMs to a less contended datastore. Do not collect a Cohesity log bundle — Cohesity is not the limiter and Support cannot accelerate someone else’s storage array.
The lesson: bottleneck = lowest sustained throughput point in the chain. The metric pattern (low read, low write, idle queues) localized the problem in two minutes, before any logs were collected.
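The throughput and runtime figures in this example are easy to sanity-check. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); with binary units the results come out about 10% higher, so the rounded figures quoted in the scenario sit between the two interpretations.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption; binary TiB would be 2**40)
MB = 10**6   # decimal megabyte, in bytes


def throughput_mb_s(total_bytes: float, hours: float) -> float:
    """Average MB/s needed to move total_bytes in the given number of hours."""
    return total_bytes / (hours * 3600) / MB


def runtime_hours(total_bytes: float, mb_per_s: float) -> float:
    """Hours needed to move total_bytes at a sustained MB/s rate."""
    return total_bytes / (mb_per_s * MB) / 3600


baseline = throughput_mb_s(4 * TB, 6)   # normal run: 4 TB in 6 hours
degraded = runtime_hours(4 * TB, 20)    # same data at 20 MB/s
print(f"{baseline:.0f} MB/s baseline, {degraded:.1f} h projected")
```

Running the same two functions against a job's own size and window gives a quick expected-rate figure to compare with the live job statistics.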
Job Concurrency and Queue Analysis
Even when individual stages are healthy, concurrency can starve a job. A protection group with 200 VMs and only four backup proxies will queue 196 VMs while four run. Adding proxies, increasing per-proxy concurrency settings, or splitting the protection group into parallel groups can lift end-to-end completion time dramatically without changing per-stream MB/s.
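The queueing effect can be estimated with simple arithmetic. The sketch below assumes every VM takes about the same time and that each proxy runs a fixed number of concurrent streams, both simplifications; `backup_waves` is a hypothetical helper.

```python
import math


def backup_waves(vm_count: int, proxies: int, streams_per_proxy: int = 1) -> int:
    """Number of sequential 'waves' needed when only proxies * streams run at once."""
    slots = proxies * streams_per_proxy
    return math.ceil(vm_count / slots)


# 200 VMs on four single-stream proxies versus eight proxies with two streams each.
print(backup_waves(200, 4))
print(backup_waves(200, 8, 2))
```

Going from 4 slots to 16 cuts the number of waves from 50 to 13 without changing per-stream throughput at all, which is why concurrency is checked separately from MB/s.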
For SmartFiles workloads, I/O profiling shifts from MB/s to IOPS and latency. A SmartFiles SMB share serving small-file metadata-heavy workloads will saturate at low MB/s but high IOPS. The fix is rarely “more bandwidth”; it is more nodes, more flash, or a redesign of the access pattern.
Key Takeaway: Bottlenecks are localized by walking the source -> network -> ingest -> NVRAM -> writer path and finding the lowest sustained throughput point. Always establish a network baseline with iperf, watch for SOBR placement-policy traps (140 vs 20-30 MB/s), and let metric divergence — not assumptions — point to the limiter.
Monitoring and Alerting
Once you understand bottlenecks in principle, you need a continuous monitoring fabric so the next incident finds you before the customer does. Cohesity’s primary alerting plane is Helios, which aggregates events from every registered cluster and fans them out to email, SNMP, syslog, and webhook channels.
Alert Categories and Severities
Helios groups alerts by category (cluster health, protection, replication, capacity, security, hardware, etc.) and severity (Critical, Warning, Info). The severity dimension is the architect’s main lever against alert fatigue: route Critical events to the on-call channel, Warning to the operations queue, Info to a SIEM archive [Source: https://docs.cohesity.com/baas/data-protect/alerts/alerts.htm].
| Severity | Typical Examples | Recommended Routing |
|---|---|---|
| Critical | Node down, quorum risk, replication failure, data unavailable | Email + SNMP + PagerDuty webhook |
| Warning | Disk predictive failure, capacity > 80%, job missed SLA | Email + Syslog (SIEM) |
| Info | Job completed, snapshot expired, configuration change | Syslog only (SIEM archive) |
Configuring Alert Notification Rules
In Helios (or DataProtect as a Service), the workflow is:
- Navigate to Health > Notification.
- Click Create > New Alert Notification Rule.
- Set the rule Notification Name and Filters (category, severity, alert name, cluster scope).
- Choose the delivery method: email, SNMP, syslog, or webhook.
- For email, specify To, Cc, Subject.
- Save. Matching alerts trigger automatically [Source: https://docs.cohesity.com/baas/data-protect/alerts/configure-alert-notification-rule.htm].
The same alert can fan to multiple channels — create one rule per channel with the same filter set, or different filters per channel for tiered routing. Programmatically, the endpoint is createAlertNotificationRule on the Helios API [Source: https://developers.cohesity.com/v1-helios-latest/reference/createalertnotificationrule-1].
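The severity routing table and the one-rule-per-channel pattern can be combined into a small generator that produces the rule definitions to create. This is a sketch only: the dictionary keys (`name`, `severity`, `channel`) are hypothetical placeholders, not the request schema of createAlertNotificationRule.

```python
# Severity-to-channel routing taken from the recommended routing table.
ROUTING = {
    "Critical": ["email", "snmp", "webhook"],
    "Warning": ["email", "syslog"],
    "Info": ["syslog"],
}


def build_rules(routing: dict) -> list[dict]:
    """Expand the routing table into one rule per (severity, channel) pair."""
    rules = []
    for severity, channels in routing.items():
        for channel in channels:
            rules.append({
                "name": f"{severity.lower()}-{channel}",  # hypothetical naming scheme
                "severity": severity,
                "channel": channel,
            })
    return rules


rules = build_rules(ROUTING)
for rule in rules:
    print(rule["name"])
```

Keeping the routing table as data means a change in on-call tooling is a one-line edit that regenerates the full rule set.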
Figure 14.2: Helios alerting fan-out across notification channels to recipients
sequenceDiagram
participant Cluster as Cohesity Cluster
participant Helios as Helios Aggregator
participant Rule as Notification Rule<br/>(category + severity filters)
participant SMTP as SMTP / Email
participant SNMP as SNMP Trap (NMS)
participant Syslog as Syslog / SIEM
participant Webhook as Webhook (HTTPS JSON)
participant Recipient as On-call / Ops / Security
Cluster->>Helios: Cluster event<br/>(node down, replication fail, etc.)
Helios->>Rule: Match against filters
Rule-->>Helios: Severity = Critical
par Fan-out to all matching channels
Helios->>SMTP: Email (To/Cc/Subject)
SMTP->>Recipient: Inbox notification
and
Helios->>SNMP: Trap with Cohesity MIB OIDs
SNMP->>Recipient: NMS event
and
Helios->>Syslog: Forward event
Syslog->>Recipient: SIEM correlation
and
Helios->>Webhook: POST JSON payload
Webhook->>Recipient: PagerDuty / ServiceNow / SOAR
end
Email/SMTP Configuration
Email delivery requires the cluster (or Helios) to know how to reach an SMTP relay. The relevant API surface is:
| API | Purpose | Notes |
|---|---|---|
| PUT /v2/clusters/smtp | Update SMTP config | Server, port (465 SMTPS, 587 STARTTLS), auth credentials. Requires CLUSTER_MODIFY privilege. |
| GET /v2/clusters/smtp | Retrieve SMTP config | Audit current settings. |
| POST /validate | Test SMTP | Send a test message to a recipient address [Source: https://developers.cohesity.com/v1-helios-latest/reference/validatesmtpconfiguration]. |
Always run validate after changes. Silent SMTP relay breakage — expired credentials, blocked outbound 587, certificate chain failure — is one of the most common causes of “we never got the alert” tickets. The validate endpoint catches it before an incident does [Source: https://developers.cohesity.com/v1-cluster-7.3/reference/updatesmtpconfiguration].
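Alongside the cluster-side validate call, an independent probe from an admin workstation can confirm that the relay itself answers on the expected port. The sketch below uses only Python's standard smtplib and ssl modules; the port-to-mode mapping follows the table above, and treating every other port as unencrypted is an assumption. It is not a substitute for the Cohesity validate endpoint, because it tests the path from wherever the script runs rather than from the cluster.

```python
import smtplib
import ssl


def smtp_mode(port: int) -> str:
    """Map a relay port to the transport security mode named in the table above."""
    if port == 465:
        return "smtps"      # implicit TLS from the first byte
    if port == 587:
        return "starttls"   # plaintext greeting, then upgrade to TLS
    return "plain"          # assumption: other ports are treated as unencrypted


def probe_relay(host: str, port: int, timeout: float = 10.0) -> str:
    """Connect, negotiate TLS when the port implies it, and send NOOP.

    Raises a socket, SSL, or SMTP exception when the relay is unreachable
    or its certificate chain does not verify.
    """
    mode = smtp_mode(port)
    context = ssl.create_default_context()  # enables certificate verification
    if mode == "smtps":
        with smtplib.SMTP_SSL(host, port, timeout=timeout, context=context) as relay:
            relay.noop()
    else:
        with smtplib.SMTP(host, port, timeout=timeout) as relay:
            if mode == "starttls":
                relay.starttls(context=context)
            relay.noop()
    return mode
```

A call such as probe_relay("smtp.example.internal", 587) (a made-up host name) either returns "starttls" or raises, which covers the blocked-port and certificate-chain cases listed above. It does not exercise credentials, so expired credentials still need the validate call.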
SNMP Trap Integration
For customers with mature NMS environments (SolarWinds, OpenNMS, IBM Netcool), SNMP is the preferred channel. Review the existing trap configuration with getHeliosSnmpAlertsConfig (a read operation, as the name indicates) and tie SNMP delivery to alert filters through createAlertNotificationRule [Source: https://developers.cohesity.com/v1-helios-latest/reference/getheliossnmpalertsconfig-1]. Supply the management station IPs, the community string or authentication credentials (depending on SNMP version), and the trap filters. Validate by triggering a known alert and confirming the NMS decodes it correctly using the Cohesity MIB.
Syslog and SIEM
Syslog is the standard integration path for SIEM platforms — Splunk, QRadar, Sentinel, Chronicle. Configure through createAlertNotificationRule with the syslog target (server, port, optional facility). For full coverage, pair alert syslog forwarding with audit log forwarding so security teams can correlate operational events with administrative actions.
Webhook Integration
Webhook output ships a structured JSON payload (alert name, severity, cluster-identifying information, and other context) to an HTTPS endpoint. This is the integration point for modern tooling: PagerDuty, ServiceNow, Slack/Teams via custom routers, or SOAR platforms like Cortex XSOAR.
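A custom router that sits between the webhook and tools such as PagerDuty or Slack usually does little more than parse the JSON and pick a destination by severity. The sketch below illustrates that step in Python. The field names alertName, severity, and clusterName, and the target labels, are hypothetical; inspect a real payload before relying on any of them.

```python
import json

# Hypothetical destinations keyed by severity; adjust to the local tooling.
SEVERITY_TARGET = {
    "Critical": "pagerduty",
    "Warning": "ops-queue",
    "Info": "siem-archive",
}


def route_webhook(body: bytes) -> dict:
    """Parse a webhook body and return the destination plus a one-line summary."""
    payload = json.loads(body)
    severity = payload.get("severity", "Info")          # hypothetical field name
    alert = payload.get("alertName", "unknown alert")    # hypothetical field name
    cluster = payload.get("clusterName", "unknown cluster")  # hypothetical field name
    return {
        # Unknown severities fall back to the operations queue rather than being dropped.
        "target": SEVERITY_TARGET.get(severity, "ops-queue"),
        "summary": f"[{severity}] {alert} on {cluster}",
    }
```

Falling back to a human-monitored queue for unrecognized severities is a deliberate choice here: a payload format change should produce noise, not silence.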
Validation Pattern (Memorize This)
- Configure SMTP/SNMP/syslog/webhook targets at the cluster or Helios level.
- Create an alert notification rule with tight filters (start with one severity).
- Trigger or simulate a matching event.
- Confirm receipt at the email inbox / NMS / syslog server / webhook endpoint.
- Run the explicit POST /validate endpoint for SMTP to catch silent relay breakage.
Capacity and SLA Reports
Beyond alerts, Helios provides SLA reports showing protection compliance per protection group, source, and cluster — what percentage of protected objects met their RPO over a reporting window. SLA reports are the artifact a CISO or IT director will demand quarterly; they are also the leading indicator of systemic problems that no single alert reveals. A protection group that drifts from 99.5% to 96% over six weeks is missing windows even though every individual job “succeeded” eventually after retries.
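The drift described above is simple arithmetic on the report's numbers. The Python sketch below shows one way to compute a compliance percentage and flag a multi-week decline; the one-point threshold and the weekly sample values are illustrative assumptions, not Helios defaults.

```python
from typing import Sequence


def compliance_pct(objects_meeting_rpo: int, protected_objects: int) -> float:
    """Percentage of protected objects that met their RPO in the window."""
    if protected_objects <= 0:
        raise ValueError("protected_objects must be positive")
    return 100.0 * objects_meeting_rpo / protected_objects


def is_drifting(weekly_pct: Sequence[float], max_drop: float = 1.0) -> bool:
    """True when compliance fell by more than max_drop points across the window."""
    if len(weekly_pct) < 2:
        return False
    return weekly_pct[0] - weekly_pct[-1] > max_drop


# Six weekly readings sliding from 99.5% to 96%, as in the example above.
six_weeks = [99.5, 98.9, 98.2, 97.5, 96.7, 96.0]
```

With these readings is_drifting(six_weeks) is true even though no single week looks alarming on its own, which is exactly the pattern a per-job alert does not surface.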
Key Takeaway: Helios fans alerts to email, SNMP, syslog, and webhook channels via filterable notification rules. Configure SMTP with PUT /v2/clusters/smtp, always run the validate endpoint to catch silent relay failures, and tier severity routing to prevent alert fatigue. SLA reports surface chronic drift that individual alerts miss.
Logs and Diagnostics
Alerts tell you something is wrong; logs tell you what. Cohesity’s diagnostic story revolves around three artifacts: the Siren UI for log bundle generation, the iris_cli command-line tool for cluster-side queries and management, and the time capsule directory where bundles are staged.
iris_cli: The Supported CLI
iris_cli is the supported CLI surface for cluster management operations. Authenticate with:
iris_cli -server <cluster-IP> -username=admin -password=<pwd>
It is documented in the Cohesity CLI Reference Guide (e.g., the 7.3.2 release) and is the same admin context used for SSL certificate updates, protection management, and many support workflows [Source: https://mirror.vcu.edu/pub/cohesity/docs/Cohesity%20CLI%20Reference%20Guide%207.3.2.pdf]. For the exam, iris_cli is the CLI to name when a question asks about cluster-side actions [Source: https://nshielddocs.entrust.com/interops-docs/cohesity-kc/cli.html].
Common command groups include cluster operations (cluster status, cluster nodes list), protection (protection-runs list, protection-jobs list), and stats queries useful during triage.
Service Logs
Cohesity is a microservices platform. Each major service writes its own logs:
| Service | Responsibility | When to Inspect |
|---|---|---|
| iris | UI/control plane | UI errors, login failures, REST API issues |
| Bridge | I/O data path / SpanFS front-end | Read/write latency, NFS/SMB issues |
| Magneto | Backup orchestration | Job failures, scheduling, source registration |
| Apollo | Garbage collection, replication, indexing | GC stalls, replication lag, index issues |
| Stats | Metrics aggregation | Missing dashboards, metric gaps |
| Yoda | Search/index service | Search failures, indexing slowness |
| Gandalf | Configuration management | Cluster config issues |
| Nexus | Cluster networking control | Network path/route issues |
Targeting the right service when you generate a bundle keeps the artifact small and Support’s analysis fast.
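The service table can be turned into a small lookup that suggests which services to select for a bundle. This Python sketch is only an illustration: the keyword lists are paraphrased from the "When to Inspect" column, and the matching is naive substring search, not anything Siren itself performs.

```python
from typing import Dict, List

# Keywords paraphrased from the "When to Inspect" column of the table above.
SERVICE_KEYWORDS: Dict[str, List[str]] = {
    "iris": ["ui error", "login failure", "rest api"],
    "Bridge": ["read latency", "write latency", "nfs", "smb"],
    "Magneto": ["job failure", "scheduling", "source registration"],
    "Apollo": ["gc stall", "replication lag", "index issue"],
    "Stats": ["missing dashboard", "metric gap"],
    "Yoda": ["search failure", "indexing slow"],
    "Gandalf": ["cluster config"],
    "Nexus": ["network path", "route"],
}


def services_for(symptom: str) -> List[str]:
    """Return the services whose keywords appear in a free-text symptom."""
    text = symptom.lower()
    return [
        service
        for service, keywords in SERVICE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
```

An empty result means the symptom did not match any keyword, in which case a broader bundle (or a conversation with Support) is the safer choice.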
Generating a Log Bundle via Siren
The primary on-cluster tool for log collection is the Siren log analysis page, reached at https://<cluster-VIP-or-node-IP>/siren. From the Siren landing page, click Cluster Support Bundle to start collection [Source: https://www.youtube.com/watch?v=b3sn69irplo]. The dialog lets you scope:
- Nodes — uncheck “Select all” to scope the bundle to specific node IPs (useful when one node is misbehaving and you want a smaller bundle).
- Services — pick only the services relevant to the symptom (iris for UI issues, Bridge for I/O path, Magneto for backup orchestration, etc.).
- Log level — verbosity threshold for the captured logs.
- Time range — defaults to the last 24 hours; widen for older incidents, narrow to the precise incident window when you can.
- Include hardware logs — pulls syslog and hardware diagnostics (firmware, IPMI/BMC, SMART, chassis events).
The Time Capsule Path
Generated bundles land on the cluster as “time capsules” under:
/home/cohesity/data/timecapsules
Bundle size typically ranges from a few MB up to 2-3 GB, depending on services and time range. Large bundles usually mean too many services or too wide a window — re-scope and regenerate. After Siren produces the bundle, copy it from the timecapsules directory and upload it to the location Support specifies (typically a per-case secure upload URL). Automation can use the uploadFilePackage API on the Cohesity Developer Portal [Source: https://developers.cohesity.com/v1-cluster-7.3/reference/uploadfilepackage].
Figure 14.3: Heartbeat telemetry and Siren log bundle lifecycle to Cohesity Support
flowchart LR
subgraph Cluster[Cohesity Cluster]
HB[Heartbeat<br/>continuous telemetry]
Trigger([Siren UI Trigger<br/>Cluster Support Bundle])
Scope[Scope inputs:<br/>nodes / services /<br/>log level / time range /<br/>hardware logs]
TC[(Time capsule<br/>/home/cohesity/data/<br/>timecapsules)]
end
HBEP[Cohesity Heartbeat<br/>endpoint HTTPS/443]
Upload[Secure upload URL<br/>uploadFilePackage API]
Support[Cohesity Support<br/>Proactive + Reactive]
HB -- continuous --> HBEP --> Support
Trigger --> Scope --> TC
TC -- copy / upload --> Upload --> Support
Heartbeat: The Continuous Diagnostic
Beyond on-demand bundles, Cohesity clusters emit a Heartbeat stream — a continuous, lightweight diagnostic feed reporting cluster health, version, configuration, and key metrics back to Cohesity. Heartbeat is what lets Cohesity Proactive Support spot brewing issues (failing disks, capacity creep, configuration drift) before they cause outages. Architects should ensure Heartbeat egress (HTTPS/443 to the Cohesity Heartbeat endpoint) is open, otherwise proactive support is blind.
Practical Bundle Hygiene
- Scope tightly. Pick the minimum services and the narrowest time window covering the incident. Smaller bundles upload faster over customer egress.
- Capture both healthy and unhealthy nodes. For cluster-wide issues, include at least one good node alongside the bad one for comparison.
- Record context separately. Note exact UTC timestamps, protection job names/IDs, and recent change events (upgrades, hardware swaps, firmware updates). Support correlates faster when this is in the case notes alongside the bundle [Source: https://www.cohesity.com/content/dam/cohesity/agreements-docs/cohesity-global-support-and-services-handbook-en.pdf].
- Use iris_cli for scripted hygiene. For repeated troubleshooting, scripting iris_cli logins lets you cleanly capture cluster state alongside the Siren bundle.
For the exam, memorize the trio: Siren (UI generator) -> timecapsules (/home/cohesity/data/timecapsules) -> upload to Support. The bundle’s primary inputs are node set, services, log level, time range, and hardware logs toggle.
Audit Logs and Security Events
In addition to operational logs, Cohesity emits audit logs capturing administrative actions — who logged in, what they changed, who approved a snapshot deletion. Audit logs forward to syslog/SIEM for compliance and incident investigation [Source: https://docs.cohesity.com/baas/data-protect/audit-logs-dataprotect.htm]. They are the artifact a security auditor will demand during a HIPAA, PCI, or FedRAMP review.
Key Takeaway: Use Siren on the cluster to generate scoped log bundles (nodes, services, time range, hardware logs); bundles land in /home/cohesity/data/timecapsules. iris_cli is the named CLI for cluster-side operations. Heartbeat provides continuous proactive telemetry. Always scope tightly and pair the bundle with precise UTC timestamps when engaging Support.
Common Failure Modes
The final section catalogs the failure modes a CCAE-level architect must recognize and respond to. For each, there is a characteristic signature, an immediate action, and a longer-term design lever.
Figure 14.4: Common failure modes taxonomy with signatures and architectural levers
graph TD
Root[Common Cohesity Failure Modes]
Root --> Backup[Backup Job Failures]
Root --> Repl[Replication Lag]
Root --> Disk[Disk Failure]
Root --> Node[Node Failure]
Root --> Part[Network Partition]
Backup --> B1[Stale credentials /<br/>CBT reset / locked files]
Backup --> B2[Lever: policy design,<br/>retry rules, SLA reports]
Repl --> R1[WAN underprovisioned /<br/>throttle misalignment /<br/>target ingest saturated]
Repl --> R2[Lever: bandwidth windows,<br/>change-rate sizing]
Disk --> D1[SpanFS RF/EC rebuild<br/>Heartbeat opens case]
Disk --> D2[Lever: schedule replacement,<br/>capacity headroom]
Node --> N1[Reduced capacity / perf<br/>Quorum risk if multiple]
Node --> N2[Lever: fault domain awareness<br/>chassis / rack / site]
Part --> P1[Paxos quorum split<br/>NTP drift / latency spikes]
Part --> P2[Lever: dual-homed nodes,<br/>redundant ToR, cluster VLAN]
Backup Job Failures and Re-runs
Backup jobs fail for many reasons — source unreachable, credentials expired, snapshot quota exceeded, hypervisor CBT reset, network blip during stream. Cohesity’s default behavior is to retry within the window: a transient failure that resolves before the next scheduled run is invisible to most stakeholders. Persistent failures escalate.
| Failure Pattern | Likely Cause | Action |
|---|---|---|
| First-run failures only | Stale source credentials, recently-changed VM | Refresh source credentials; re-discover |
| Random failures across many sources | Network/proxy intermittency | Check proxy health, network path |
| Same source fails repeatedly | Source-specific issue (CBT, agent, locked file) | Reset CBT; reinstall agent; investigate locked file |
| Cluster-wide failures spike | Cluster service issue, upgrade in progress | Check cluster health; review change log |
The architectural lever is policy design: tight RPOs combined with aggressive retry rules will mask intermittent failures, while loose policies surface them as SLA misses.
Replication Lag
Replication lag, in which the target cluster’s most recent successfully replicated snapshot trails the source cluster’s latest local snapshot, is the canonical “silent” DR failure. The protection job is succeeding locally; replication is just not keeping up.
Causes:
- WAN bandwidth insufficient for the daily change rate. The classic miscalculation: sized for steady-state daily change but not for full-resync scenarios, monthly fulls, or seasonal peaks.
- Replication policy throttling windows misaligned with actual change rates.
- Target cluster ingest saturated by other replication or local backup load.
- Encrypted/compressed replication competing with backup CPU on undersized clusters.
Action: in Helios, the SLA report and replication dashboards expose lag per protection group. Recovery is rarely instant: if you are 48 hours behind, you need a window where replication throughput exceeds the change rate to catch up. Architects design bandwidth throttle windows that yield to backup ingest during the active backup window and run replication at full bandwidth overnight.
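The catch-up condition is easy to quantify: the backlog shrinks only at the rate by which replication throughput exceeds new change. The Python sketch below works through that arithmetic; the 10 GB/h change rate and 30 GB/h replication throughput are made-up illustration values.

```python
def catch_up_hours(backlog_gb: float,
                   replication_gb_per_hr: float,
                   change_gb_per_hr: float) -> float:
    """Hours needed to clear a replication backlog while new change keeps arriving.

    Returns infinity when replication is not faster than the change rate,
    because the backlog can then never shrink.
    """
    net_rate = replication_gb_per_hr - change_gb_per_hr
    if net_rate <= 0:
        return float("inf")
    return backlog_gb / net_rate


# 48 hours behind at 10 GB/h of change is a 480 GB backlog.
backlog = 48 * 10
hours = catch_up_hours(backlog, replication_gb_per_hr=30, change_gb_per_hr=10)
```

Under these assumed numbers the cluster needs 24 hours of unthrottled replication to catch up, which is why a short overnight window alone may not be enough after a long outage.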
Disk and Node Failures
A single disk failure on a Cohesity node is a non-event: SpanFS rebuilds the affected data in the background from RF replicas or erasure-coding parity, and the architect’s only action is to schedule disk replacement. Heartbeat usually opens the support case automatically.
A node failure is more consequential. The cluster continues to operate (assuming RF and EC tolerances are not exceeded), but capacity, performance, and resiliency are all reduced until the node is recovered or replaced. Quorum loss — too many nodes down at once — halts the cluster. Architects design fault domain awareness (chassis, rack, site) into the cluster from day one to avoid correlated failures taking the cluster below quorum.
| Failure | Cluster Impact | Time to Recover |
|---|---|---|
| Single disk | Background rebuild; no outage | Hours (background) |
| Single node (RF2) | Reduced redundancy; rebuild begins | Hours to days for rebuild |
| Multiple nodes (within tolerance) | Performance degraded; rebuild contention | Days |
| Quorum loss | Cluster halts; data unavailable | Recovery operation; potential restore |
Cluster Network Partition Events
A network partition — a portion of the cluster losing connectivity to another portion — is the most dangerous failure mode. SpanFS uses Paxos-based metadata with strict consistency and quorum; the side without quorum cannot serve writes. If the partition persists, jobs targeting that side fail; replication across the partition lags; and management UI may show inconsistent state from different nodes.
Detection: Heartbeat alerts, node-up/node-down alerts, intra-cluster latency spikes, and NTP drift warnings (often the first symptom of an underlying network problem).
Action:
- Identify the partition boundary using iris_cli cluster status from multiple nodes.
- Check the physical network: switch, uplink, VLAN.
- If the partition is healed quickly, the cluster auto-recovers; if not, generate a Siren bundle scoped to Bridge, Apollo, and Gandalf and engage Support before attempting any manual remediation.
- Document the event and review fault domain design — was the partition along an unexpected boundary?
The architectural lever is network design: dual-homed nodes, redundant top-of-rack switches, dedicated cluster interconnect VLANs separate from client traffic, and BGP/LACP configurations that fail over predictably.
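To reason about which side of a partition keeps serving writes, a simple majority model is a useful first approximation. The Python sketch below assumes quorum means strictly more than half of the cluster's nodes; that is the textbook Paxos majority rule and is an assumption here, not a statement of Cohesity's exact quorum implementation.

```python
from typing import Dict


def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Simple-majority quorum: strictly more than half of all nodes must be reachable."""
    if total_nodes <= 0 or not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("node counts out of range")
    return reachable_nodes > total_nodes // 2


def partition_outcome(side_a: int, side_b: int) -> Dict[str, bool]:
    """Report which side of a two-way partition retains quorum."""
    total = side_a + side_b
    return {
        "side_a": has_quorum(side_a, total),
        "side_b": has_quorum(side_b, total),
    }
```

Under this model a 3/2 split of a five-node cluster leaves the three-node side writable, while an even 2/2 split of a four-node cluster leaves neither side with quorum, which is one reason fault domain boundaries deserve attention at design time.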
Decision Tree: Bottleneck Classification
When the symptom is “slow,” use this decision tree to classify quickly:
Is the network healthy? (iperf throughput close to link speed)
  No  -> Stage 2 (Network): fix VLAN/uplink/MTU first
  Yes -> next
  |
Are source read MB/s and writer MB/s both low and balanced?
  Yes -> Stage 1 (Source): investigate source array, hypervisor, agent
  No  -> next
  |
Is writer latency high or NVRAM saturated?
  Yes -> Stages 4-5 (NVRAM/Writer): cluster-side limiter; check capacity, sizing
  No  -> next
  |
Is per-stream throughput healthy but total throughput low?
  Yes -> Concurrency: add proxies, raise concurrency, split groups
  No  -> Generate log bundle, engage Support
This tree does not replace measurement, but it sequences the questions in the order most likely to find the bottleneck fast.
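For rehearsal, the same tree can be written as a short function that asks the questions in order. This Python sketch mirrors the four questions above; the boolean parameter names are invented for illustration, and deciding each answer still requires real measurements.

```python
def classify_bottleneck(network_healthy: bool,
                        source_and_writer_low_and_balanced: bool,
                        writer_latency_high_or_nvram_saturated: bool,
                        per_stream_ok_but_total_low: bool) -> str:
    """Walk the decision tree top to bottom and return the first matching verdict."""
    if not network_healthy:
        return "Stage 2 (Network)"
    if source_and_writer_low_and_balanced:
        return "Stage 1 (Source)"
    if writer_latency_high_or_nvram_saturated:
        return "Stages 4-5 (NVRAM/Writer)"
    if per_stream_ok_but_total_low:
        return "Concurrency"
    return "Generate log bundle, engage Support"
```

The early returns encode the ordering: a network problem is ruled out before any source or cluster-side conclusion is drawn.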
Key Takeaway: Common failure modes — backup re-runs, replication lag, disk/node failure, network partition — each have a characteristic signature, an immediate action, and a longer-term design lever. Architects design fault domain awareness, throttle windows, and network redundancy before the failure, not after.
Chapter Summary
Performance and troubleshooting are the operational disciplines that separate a well-architected Cohesity deployment from a fragile one. The CCAE-level architect must think like a process engineer: measure each stage of the data path, identify the lowest-throughput point, and apply the lever that lifts it. The five-stage pipeline (source -> network -> ingest -> NVRAM -> writer) is the mental model for every “slow backup” ticket, and metric divergence — not assumptions — localizes the bottleneck.
Tooling is straightforward but must be rehearsed. iris_cli is the supported CLI; the Siren UI generates support log bundles into /home/cohesity/data/timecapsules; Heartbeat streams continuous diagnostics back to Cohesity for proactive support. Helios fans alerts to email, SNMP, syslog, and webhook channels via filterable notification rules — and SMTP changes must always be followed by a POST /validate to catch silent relay breakage.
Common failure modes — backup re-runs, replication lag, disk and node failures, network partitions — each have characteristic signatures and architectural levers. The architect’s job is to build the levers (fault domain awareness, throttle windows, redundant networking, tiered alerting) into the design before the incident, then triage with discipline when the incident arrives.
For the exam, internalize three drills:
- The bottleneck triage drill — given a slow-backup symptom, walk source -> network -> ingest -> NVRAM -> writer and name the metric that proves your verdict.
- The log bundle drill — name Siren, name the timecapsules path, name the five scoping inputs (nodes, services, log level, time range, hardware logs).
- The Helios alerting drill — name the four channels (email, SNMP, syslog, webhook), the SMTP API (PUT /v2/clusters/smtp), the validation step (POST /validate), and the rule creator (createAlertNotificationRule).
Key Terms
- Bottleneck — The lowest sustained throughput point in the source -> network -> ingest -> NVRAM -> writer chain. The whole pipeline’s throughput is set by its bottleneck; finding it is the architect’s primary triage skill.
- Heartbeat — Continuous lightweight diagnostic stream emitted by Cohesity clusters back to Cohesity for proactive support. Requires HTTPS/443 egress to function.
- iris_cli — The supported Cohesity command-line interface for cluster management operations. Authenticated with iris_cli -server <ip> -username=admin -password=<pwd> and documented in the Cohesity CLI Reference Guide.
- Log bundle — Packaged collection of service logs, hardware data, and cluster state for a defined time window across selected nodes. Generated via the Siren UI and staged in /home/cohesity/data/timecapsules before upload to Cohesity Support.
- Alert — A categorized, severity-tagged event emitted by a cluster (or aggregated through Helios) and routed to email, SNMP, syslog, or webhook subscribers via notification rules.
- SLA report — Helios report that measures protection compliance per protection group, source, and cluster against the policy’s RPO over a reporting window. Surfaces chronic drift that individual alerts miss.
- NVRAM — Non-volatile RAM journal in front of the SpanFS disk tier. Incoming writes land in NVRAM before destaging to SSD/HDD; saturation or destage backpressure is a leading signal of a cluster-side write bottleneck.
Chapter 15: End-to-End Architecture Scenarios and Exam Synthesis
The previous fourteen chapters built the vocabulary, mechanics, and design patterns of the Cohesity Data Cloud one layer at a time — SpanFS internals, sizing math, networking, identity, protection policies, replication, cloud integration, security, SmartFiles, Helios, and troubleshooting. The Cohesity Certified Architect Expert (CCAE) exam, however, almost never asks about a single layer in isolation. It asks you to assemble layers into a coherent design that satisfies a business problem under hard constraints. This final chapter does three things: it walks through three full reference architectures end-to-end, it decodes the exam blueprint and scenario question pattern in detail, and it gives you a 30-day plan plus a test-day playbook so you walk into the proctoring session knowing exactly how to spend your 90 minutes [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
Learning Objectives
By the end of this chapter, you should be able to:
- Synthesize multi-domain Cohesity architectures that combine DataProtect, SmartFiles, SiteContinuity, FortKnox, DataHawk, and Helios for enterprise-scale scenarios.
- Trace explicit business requirements (RPO, RTO, retention, compliance, residency, budget) through to specific Cohesity design choices and defend each choice against plausible alternatives.
- Apply scenario-based reasoning to CCAE-style architecture questions, recognizing the four-option pattern and identifying the constraint each distractor violates.
- Build a final 30-day study plan keyed to the published domain weights (22 / 35 / 18 / 13 / 12 percent) and execute a disciplined test-day strategy.
Scenario 1: Global Enterprise with Multi-Region DR
The Business Problem
A multinational manufacturer operates three primary data centers (Dallas, Frankfurt, Singapore) plus 42 branch and plant sites distributed across the Americas, EMEA, and APAC. The CIO has set the following targets: Tier-0 ERP and MES workloads need an RPO of 15 minutes and an RTO of 60 minutes; Tier-1 file shares and VMs need RPO of 4 hours and RTO of 4 hours; Tier-2 archive and dev/test workloads need RPO of 24 hours and an RTO of 24 hours; all data must be retained for seven years for tax and audit purposes; cross-region replication must survive the loss of any one regional data center; the security team requires a SaaS control plane for fleet visibility but will not allow production data to be archived to a third-party SaaS vault.
Topology Choice: Hub-and-Spoke Per Region with One-to-Many Across Regions
The reference pattern that maps cleanly to this requirement set is hub-and-spoke within each region and one-to-many across regions [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf]. Each branch and plant site (the spokes) deploys a Cohesity Robo Edition or small Virtual Edition cluster sized for local backups; spokes replicate inbound to the regional hub (Dallas, Frankfurt, or Singapore). Each regional hub then replicates Tier-0 and Tier-1 protected data to one of the other two regions — Dallas replicates to Frankfurt, Frankfurt to Singapore, Singapore to Dallas — a triangular one-to-many mesh that survives any single regional loss [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/modern-data-security-management-topologies-guide-for-it-leaders-white-paper-en.pdf].
| Topology Element | Decision | Rationale |
|---|---|---|
| Branch protection | Robo Edition / small VE cluster per site | Local backup for fast restore, reduces WAN backup traffic [Source: https://blogs.cisco.com/datacenter/disaster-recovery-solutions-for-the-edge-with-hyperflex-and-cohesity] |
| Branch-to-hub | Many-to-one inbound replication | Centralizes recovery, audit, retention; matches ROBO best practice [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf] |
| Hub-to-hub | One-to-many triangular replication | Survives loss of any single region; geographic separation [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf] |
| Long-term retention | CloudArchive to S3 with Glacier lifecycle | 7-year retention without consuming hot capacity [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/actualtech-multicloud-data-protection-and-recovery-white-paper-en.pdf] |
| Control plane | Helios SaaS (multi-tenant) | Global SLA reporting, capacity prediction, fleet upgrades [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/] |
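The survival claim for the hub-to-hub layer can be checked mechanically. The Python sketch below models the replication direction stated in the text (Dallas to Frankfurt, Frankfurt to Singapore, Singapore to Dallas) and confirms that every hub's data still has at least one live copy after any single regional loss. It is a toy model of the topology only and ignores capacity, lag, and policy scope.

```python
from typing import Dict, List

# Replication direction as described in the scenario text.
REPLICATES_TO: Dict[str, str] = {
    "Dallas": "Frankfurt",
    "Frankfurt": "Singapore",
    "Singapore": "Dallas",
}


def surviving_copies(lost_region: str) -> Dict[str, List[str]]:
    """For each hub's data, list the sites that still hold a copy after one region fails."""
    return {
        hub: [site for site in (hub, target) if site != lost_region]
        for hub, target in REPLICATES_TO.items()
    }


def survives_any_single_loss() -> bool:
    """True when every hub's data has a live copy no matter which one region is lost."""
    return all(
        holders
        for region in REPLICATES_TO
        for holders in surviving_copies(region).values()
    )
```

For example, losing Dallas leaves Dallas data only in Frankfurt, so the model also makes the reduced redundancy after a regional loss visible.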
Figure 15.1: Global enterprise hub-and-spoke topology with triangular cross-region replication and CloudArchive offsite
flowchart LR
subgraph Americas
S1[Branch Spoke A1]
S2[Branch Spoke A2]
DC1[(Dallas Hub)]
S1 --> DC1
S2 --> DC1
end
subgraph EMEA
S3[Branch Spoke E1]
S4[Branch Spoke E2]
DC2[(Frankfurt Hub)]
S3 --> DC2
S4 --> DC2
end
subgraph APAC
S5[Branch Spoke P1]
S6[Branch Spoke P2]
DC3[(Singapore Hub)]
S5 --> DC3
S6 --> DC3
end
DC1 <--> DC2
DC2 <--> DC3
DC3 <--> DC1
DC1 -.7-yr archive.-> CA[(CloudArchive S3 Glacier)]
DC2 -.7-yr archive.-> CA
DC3 -.7-yr archive.-> CA
HS{{Helios SaaS Control Plane}} -.fleet mgmt.-> DC1
HS -.fleet mgmt.-> DC2
HS -.fleet mgmt.-> DC3
Translating SLAs to Policies
The architect builds three policy templates in Helios — Tier-0, Tier-1, Tier-2 — and applies them through Protection Groups rather than per-job customization, which is the design discipline the exam expects [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
| SLA Tier | Backup Frequency | Replication Frequency | Cloud Archive | Retention |
|---|---|---|---|---|
| Tier-0 (ERP/MES) | Every 15 min (CBT) | Continuous to nearest peer region | Monthly to S3 Glacier | 7 years |
| Tier-1 (File / VM) | Every 4 hours | Every 4 hours, off-peak window | Quarterly to S3 Glacier | 7 years |
| Tier-2 (Dev/Archive) | Daily | Daily | Monthly to S3 Deep Archive | 7 years |
For Tier-0 RTO of 60 minutes the architect leverages Instant Mass Restore at the regional hub, which mounts protected VMs directly from SpanFS so applications come up in minutes rather than hours [Source: https://www.cohesity.com/solutions/instant-mass-restore/]. SiteContinuity runbooks orchestrate the failover sequence — power-on order, IP re-mapping, dependency groups (database before app, app before web), and post-failover validation — and a non-disruptive test failover runs quarterly into an isolated network bubble to keep audit evidence current [Source: https://www.cohesity.com/blogs/recovering-from-a-data-disaster-with-cohesity/].
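A quick consistency check between the CIO's RPO targets and the policy table catches template mistakes before they reach production. The Python sketch below encodes the three tiers with the RPO values from the business problem and the backup frequencies from the table; the Tier class and its field names are illustrative and unrelated to any Cohesity policy object.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int           # target from the business requirements
    backup_every_minutes: int  # frequency from the policy template


TIERS = [
    Tier("Tier-0", rpo_minutes=15, backup_every_minutes=15),
    Tier("Tier-1", rpo_minutes=4 * 60, backup_every_minutes=4 * 60),
    Tier("Tier-2", rpo_minutes=24 * 60, backup_every_minutes=24 * 60),
]


def rpo_violations(tiers: List[Tier]) -> List[str]:
    """Names of tiers whose backup interval is longer than the RPO allows."""
    return [t.name for t in tiers if t.backup_every_minutes > t.rpo_minutes]
```

All three templates pass, but each runs exactly at its RPO, so any run that overruns its interval becomes an SLA miss; that lack of slack is worth calling out in a design review.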
Capacity, Bandwidth, and Cost Sanity Check
A useful analogy here: think of the regional hubs as regional water reservoirs. Each branch site is a small upstream tank; water flows downhill into the regional reservoir at low pressure (background backup with WAN-optimized, deduplicated streams). Reservoirs then exchange water across long pipes (cross-region replication) only for the most critical workloads, because long pipes are expensive. The cloud archive is the underground aquifer — slow to recall but cheap and effectively infinite. An architect who tries to send every drop directly to the aquifer overbuilds bandwidth; one who lets every reservoir drain only locally fails the multi-region survival requirement.
WAN bandwidth is dimensioned around the change rate of Tier-0 and Tier-1 data, not total front-end TB. If Tier-0 produces 200 GB of unique change daily after Cohesity’s deduplication and compression, that is roughly 18.5 Mbps sustained over 24 hours (200 GB × 8 bits per byte ÷ 86,400 s), so a 25 Mbps committed cross-region link absorbs the load with about a quarter of the link left as headroom for catch-up, provided replication can run around the clock. Oversubscribing the link with Tier-2 traffic would be a typical exam distractor.
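The sizing arithmetic behind this claim is worth rehearsing, because the answer changes sharply with the replication window. The Python sketch below converts a daily change volume into a sustained link rate, using decimal units (1 GB = 10^9 bytes); the 8-hour window is an illustrative assumption for comparison.

```python
def required_mbps(change_gb_per_day: float, window_hours: float = 24.0) -> float:
    """Sustained link rate in Mbps needed to move one day's change within the window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    bits = change_gb_per_day * 1e9 * 8
    return bits / (window_hours * 3600) / 1e6


around_the_clock = required_mbps(200)           # about 18.5 Mbps
overnight_only = required_mbps(200, 8.0)        # about 55.6 Mbps
headroom_fraction = 1 - around_the_clock / 25   # share of a 25 Mbps link left over
```

With round-the-clock replication a 25 Mbps link carries 200 GB per day with roughly 26% of the link spare, but squeezing the same change into an 8-hour window would need more than twice the link.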
RBAC and Helios Fleet Management
A four-tier RBAC model is implemented globally: a Global Architect role with cross-cluster admin, a Regional Operator role limited to one regional hub plus its spokes, a Tenant Operator role for business units that self-serve restores, and a Read-Only Auditor role for compliance. Helios provides the global pane for capacity prediction, predictive disk-failure analytics, and fleet-wide policy compliance reporting [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/].
Key Takeaway: A global enterprise design layers hub-and-spoke (intra-region) with one-to-many (inter-region) replication, drives every workload through tiered Helios policies, uses CloudArchive for long-term retention without hot-capacity bloat, and orchestrates recovery through SiteContinuity runbooks rather than scripts.
Scenario 2: Healthcare with HIPAA and Ransomware Posture
The Business Problem
A 600-bed regional health system runs Epic EHR, PACS imaging, lab systems, and clinical research workloads across two on-premises data centers. The CISO has been briefed on three healthcare ransomware incidents in the past 18 months and now requires: PHI must never leave the covered entity’s controlled environment (no SaaS vaulting); WORM-immutable backups that cannot be deleted by a compromised admin; encryption with customer-managed keys via KMIP; quorum approval for any retention change; ransomware anomaly detection on backup ingestion; a clean-room recovery capability validated quarterly; and a Helios control plane that does not require outbound SaaS connectivity.
Stack Selection: DataLock + FortKnox Self-Managed + DataHawk + Helios Self-Managed
Each requirement maps to a specific Cohesity component, and the integrated stack is the canonical healthcare reference design [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/][Source: https://www.cohesity.com/blogs/new-self-managed-deployment-option-for-cohesity-fortknox/].
| HIPAA / Security Requirement | Cohesity Component | How It Satisfies the Requirement |
|---|---|---|
| Immutability of backups | DataLock (WORM) | Object-level write-once retention, no admin override [Source: https://www.cohesity.com/glossary/cyber-vault/] |
| Quorum approval for retention changes | DataLock + RBAC + MFA | Four-eyes approval; defends insider threat [Source: https://www.cohesity.com/trust/] |
| Customer-managed encryption | KMIP / external KMS integration | AES-256 CBC at rest, TLS in transit, customer key control [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/cohesity-platform-security-white-paper-en.pdf] |
| No PHI in third-party SaaS | FortKnox Self-Managed | Air-gapped vault stays inside the covered entity’s perimeter [Source: https://www.cohesity.com/blogs/new-self-managed-deployment-option-for-cohesity-fortknox/] |
| Ransomware anomaly detection | DataHawk anomaly detection | Flags atypical encryption / change patterns on ingest [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/] |
| Threat scanning for IOCs | DataHawk threat scan | Curated IOCs from Tenable / Qualys before recovery [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/] |
| PHI/PII discovery for compliance | DataHawk classification | Identifies sensitive data, drives retention/access policy [Source: https://www.cohesity.com/platform/data-classification/] |
| No outbound SaaS for control plane | Helios Self-Managed | On-prem Helios for dark-site / regulated environments [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/] |
Figure 15.2: Healthcare ransomware-resilient stack — DataLock + FortKnox Self-Managed + DataHawk + KMIP + Helios SM
flowchart TD
PROD[Production Workloads<br/>Epic EHR / PACS / Lab] --> DP[Cohesity DataProtect Cluster<br/>on-premises]
DP -->|WORM retention| DL[DataLock Immutability Layer]
DP -->|encrypt at rest| KMIP[KMIP / External KMS<br/>Customer-Managed Keys]
DP -->|ingest scan| DH[DataHawk<br/>Anomaly + Threat + Classification]
DL -->|inbound-only<br/>transfer window| FK[(FortKnox Self-Managed<br/>Air-Gapped Vault)]
DH -->|alerts| SOC[SOC / SIEM Splunk]
HSM{{Helios Self-Managed<br/>Dark-Site Control Plane}} -.manages.-> DP
HSM -.manages.-> FK
HSM -.manages.-> DH
QUORUM[/Quorum + MFA + RBAC/] -.governs.-> DL
QUORUM -.governs.-> FK
The 3-2-1 Defense in Depth
The design layers three copies of every protected dataset on two media with one offsite/immutable copy — the classic 3-2-1 pattern hardened with Cohesity’s modern controls [Source: https://www.cohesity.com/resources/solution-brief/cohesity-fortknox-modern-cyber-vaulting-for-confident-recovery-en/].
- Primary copy — production storage (Epic ODB, PACS arrays, file shares).
- Secondary copy — Cohesity DataProtect cluster on-premises with DataLock WORM retention. Daily app-consistent backups, with log backups every 15 minutes for Tier-0 EHR databases to hold a 15-minute RPO. Encryption at rest with KMIP-managed keys.
- Tertiary vault copy — Cohesity FortKnox Self-Managed in a logically and physically isolated network segment, behind a separate management VLAN, with an inbound-only transfer window that opens for replication and closes immediately after [Source: https://www.cohesity.com/resources/datasheet/cohesity-fortknox/]. Production admins do not have credentials for the vault; vault admins do not have credentials for production. This segregation of duties is what defeats the credential-compromise ransomware kill chain.
A useful analogy: think of FortKnox Self-Managed as the safe-deposit vault inside a bank inside a city. The DataProtect cluster is the bank — well-guarded, but you can walk in during business hours. The vault sits behind a second locked door whose key is held by a different person, and the door opens only on a published schedule. A burglar who compromises a teller has not compromised the vault.
Detection and Clean Recovery
DataHawk performs anomaly detection on every backup ingestion, comparing change rates and entropy against historical baselines; an unusual spike in encrypted blocks fires a Helios alert and pages the SOC [Source: https://aws.amazon.com/blogs/apn/supercharge-your-cyber-resiliency-with-cohesity-datahawk/]. Threat scanning then uses curated IOCs to identify which restore points are clean. Quarterly clean-room recoveries orchestrated through DataHawk and SiteContinuity restore Epic and PACS into an isolated network bubble; the recovery is validated by application owners and the run is recorded as audit evidence for HIPAA’s contingency-plan requirement.
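The baseline comparison described above can be illustrated with a simple z-score check. This is a minimal sketch of the general idea only, not DataHawk's actual detection logic, which the source does not describe. The metric names (`change_rate`, `entropy`), the three-sigma threshold, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations from the baseline mean."""
    if len(history) < 2:
        return False  # too little history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change is unusual
    return abs(latest - mu) / sigma > threshold


def score_backup(baseline_runs, run, threshold=3.0):
    """Check each metric of one backup run against prior runs of the same object."""
    return {
        metric: is_anomalous([r[metric] for r in baseline_runs], run[metric], threshold)
        for metric in ("change_rate", "entropy")
    }


# Hypothetical history: change_rate is the changed fraction of the dataset,
# entropy is the average bits per byte of the changed blocks.
baseline = [
    {"change_rate": 0.020, "entropy": 5.1},
    {"change_rate": 0.025, "entropy": 5.0},
    {"change_rate": 0.022, "entropy": 5.2},
    {"change_rate": 0.030, "entropy": 5.3},
    {"change_rate": 0.024, "entropy": 5.1},
]
suspicious = {"change_rate": 0.40, "entropy": 7.9}  # mass rewrite with near-random content
normal = {"change_rate": 0.026, "entropy": 5.2}

print(score_backup(baseline, suspicious))
print(score_backup(baseline, normal))
```

The point for exam scenarios is the shape of the signal: a run is judged against that object's own history, so a sudden jump in both change rate and entropy stands out even when the absolute backup size looks ordinary.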
Audit, Logging, and HIPAA Alignment
All administrative actions stream to a SIEM (Splunk in this design) via the Cohesity Data Security Alliance integration [Source: https://www.cohesity.com/company/data-security-alliance/]. RBAC is built around least privilege with custom roles: a Backup Operator can run restores but cannot change retention; a Compliance Officer can place legal holds but cannot delete data; a Cluster Admin can change configuration but cannot bypass DataLock. MFA is mandatory for all admin roles. Quorum approval is enabled for retention reduction and DataLock policy changes — two distinct admins must approve before the action commits.
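The separation of roles and the two-admin quorum rule can be modeled in a few lines. The sketch below is a toy model for reasoning about the design, not Cohesity's RBAC implementation or API. The role keys, permission names, and the `QuorumRequest` class are hypothetical, and the required approval count of two is taken from the paragraph above.

```python
from dataclasses import dataclass, field

# Hypothetical permission sets mirroring the three roles described in the text.
ROLE_PERMISSIONS = {
    "backup_operator": {"run_restore"},
    "compliance_officer": {"place_legal_hold"},
    "cluster_admin": {"change_config"},
}


def can(role, action):
    """Least privilege: an action is allowed only if the role lists it explicitly."""
    return action in ROLE_PERMISSIONS.get(role, set())


@dataclass
class QuorumRequest:
    """A sensitive change that commits only after enough distinct admins approve."""
    action: str
    required_approvals: int = 2
    approvers: set = field(default_factory=set)

    def approve(self, admin):
        # A set ignores duplicates, so one admin approving twice still counts once.
        self.approvers.add(admin)

    @property
    def committed(self):
        return len(self.approvers) >= self.required_approvals


request = QuorumRequest(action="reduce_retention")
request.approve("alice")
request.approve("alice")      # repeated approval from the same admin
print(request.committed)      # still one distinct approver
request.approve("bob")
print(request.committed)      # two distinct approvers
```

Modeling approvers as a set is the design choice that matters: a single compromised account cannot satisfy the quorum by approving repeatedly.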
Key Takeaway: Healthcare ransomware-resilient design is not a single product but a layered stack: DataLock provides immutability, FortKnox Self-Managed provides the air-gapped vault inside the covered entity, DataHawk provides detection and classification, KMIP provides customer-controlled keys, Helios Self-Managed provides the dark-site control plane, and quorum + MFA + RBAC provide segregation of duties. Removing any layer leaves at least one of the CISO’s requirements unmet.
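The requirement-to-component mapping from this scenario's table can be written down as a coverage check, which is a handy way to test a candidate design against a constraint list. The sketch below encodes six rows of that mapping. The dictionary keys and the `unmet` helper are illustrative names, not part of any Cohesity tool.

```python
# Requirement -> components that must be present, taken from the scenario table.
REQUIREMENTS = {
    "immutable backups": {"DataLock"},
    "quorum approval for retention changes": {"DataLock", "Quorum + MFA + RBAC"},
    "customer-managed encryption keys": {"KMIP"},
    "no PHI in third-party SaaS": {"FortKnox Self-Managed"},
    "anomaly detection on ingest": {"DataHawk"},
    "no outbound SaaS for control plane": {"Helios Self-Managed"},
}

FULL_STACK = set().union(*REQUIREMENTS.values())


def unmet(deployed):
    """Return the requirements whose components are not all deployed."""
    deployed = set(deployed)
    return sorted(req for req, needed in REQUIREMENTS.items() if not needed <= deployed)


print(unmet(FULL_STACK))                       # full stack: nothing unmet
for component in sorted(FULL_STACK):
    print(component, "->", unmet(FULL_STACK - {component}))
```

Dropping any single component produces a non-empty list, which is the same test the exam applies to distractor options: one missing layer means one failed constraint.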
Scenario 3: Service Provider Multi-Tenant DMaaS
The Business Problem
A managed service provider (MSP) wants to launch a Backup-as-a-Service offering for mid-market customers. Requirements:
- 25 initial tenants growing to 200 within 18 months.
- Per-tenant data isolation enforced cryptographically and operationally.
- Tenants must self-serve protection policies, restores, and reports through a branded portal.
- Consumption must be metered for monthly chargeback.
- Tenant onboarding must complete in under one business day.
- Offboarding must guarantee data destruction within 30 days.
- For the first wave, the MSP wants to deliver the service as Cohesity DMaaS (Data Management as a Service) rather than building its own infrastructure.
Why DMaaS for the Initial Wave
Cohesity DMaaS delivers DataProtect as a SaaS offering with the MSP as a managed-service overlay [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/]. The MSP avoids the capital expense and operational burden of standing up DataProtect clusters per region and instead consumes Cohesity’s regional SaaS instances. The architect picks the region that satisfies tenant data-residency requirements (e.g., us-east-1 for US tenants, eu-west-1 for EU tenants).
Tenant Isolation via Organizations
The cornerstone of multi-tenant Cohesity is the Organization construct. Each tenant becomes an Organization with its own scoped View Boxes, RBAC, network segmentation, and policies, so one tenant cannot see another tenant’s data, policies, or reports [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
| Isolation Dimension | Mechanism | Tenant Impact |
|---|---|---|
| Storage | Per-tenant View Boxes with quotas | Tenant A cannot see Tenant B’s data |
| Network | VLAN / VRF per tenant; tenant-scoped VIPs | Tenant traffic is L2/L3-isolated |
| Identity | Per-tenant SAML IdP federation | Tenant uses its own Azure AD / Okta |
| Encryption | Per-tenant KMIP keys (optional) | Cryptographic separation for high-trust tenants |
| Roles | Tenant-scoped RBAC | Tenant Admin role limited to its own Organization |
| Reporting | Per-tenant SLA and capacity reports | Tenant sees only its consumption |
The MSP retains an MSP-Admin role across all Organizations for operational management but cannot access tenant data without explicit, audited break-glass procedures. This is the architect’s answer to the inevitable exam question about insider risk in multi-tenant designs.
Figure 15.3: MSP DMaaS service flow — Tenant to Helios self-service to Organization to View Box to Cluster
flowchart LR
T1[Tenant A<br/>SAML IdP] --> HS{{Helios Self-Service Portal}}
T2[Tenant B<br/>SAML IdP] --> HS
T3[Tenant N<br/>SAML IdP] --> HS
HS -->|scoped session| ORG1[Organization A]
HS -->|scoped session| ORG2[Organization B]
HS -->|scoped session| ORG3[Organization N]
ORG1 --> VB1[View Box A<br/>quota + VLAN]
ORG2 --> VB2[View Box B<br/>quota + VLAN]
ORG3 --> VB3[View Box N<br/>quota + VLAN]
VB1 --> CL[(DMaaS Regional Cluster)]
VB2 --> CL
VB3 --> CL
CL -.metering API.-> BILL[MSP Billing System]
MSP[/MSP-Admin role<br/>cross-org ops/] -.audited break-glass.-> ORG1
Self-Service via Helios
Tenants log into Helios with their own SAML IdP and see only their Organization. They can create Protection Groups, attach pre-approved policies (the MSP publishes Bronze / Silver / Gold templates), trigger restores, view SLA dashboards, and download compliance reports. The MSP does not field tickets for routine operations — the platform’s self-service surface absorbs them, which is precisely how MSPs achieve the unit economics needed for mid-market BaaS [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
| Service Tier | RPO | Retention | Cloud Archive | Monthly Price (illustrative) |
|---|---|---|---|---|
| Bronze | 24 h | 30 days | None | $0.05 / GB |
| Silver | 4 h | 90 days | Quarterly to S3 IA | $0.10 / GB |
| Gold | 15 min | 7 years | Monthly to S3 Glacier | $0.18 / GB |
Chargeback and Metering
Helios exposes consumption metrics — protected front-end TB, change rate, archived TB, restore activity — that the MSP pulls via REST API into its billing system [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/]. A monthly cron job calls the Helios API, joins the per-Organization metrics with the tier price, and emits invoices. The Cohesity Terraform provider lets the MSP version-control tenant configurations alongside the rest of its infrastructure-as-code [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
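The join between per-Organization usage and tier price is simple enough to sketch. The example below uses the illustrative per-GB prices from the service tier table and hard-coded usage rows in place of a real API response. The field names (`organization`, `tier`, `protected_gb`) are hypothetical, and the Helios API call itself is deliberately left out because its request and response shapes are not given in the source.

```python
from decimal import Decimal

# Illustrative monthly prices per GB, copied from the service tier table.
TIER_PRICE_PER_GB = {
    "Bronze": Decimal("0.05"),
    "Silver": Decimal("0.10"),
    "Gold": Decimal("0.18"),
}


def monthly_invoices(usage_rows):
    """Sum per-Organization charges from rows of (organization, tier, protected_gb)."""
    invoices = {}
    for row in usage_rows:
        price = TIER_PRICE_PER_GB[row["tier"]]
        amount = (Decimal(str(row["protected_gb"])) * price).quantize(Decimal("0.01"))
        org = row["organization"]
        invoices[org] = invoices.get(org, Decimal("0.00")) + amount
    return invoices


# Hypothetical usage rows standing in for metrics pulled from the metering API.
usage = [
    {"organization": "org-a", "tier": "Bronze", "protected_gb": 12000},
    {"organization": "org-b", "tier": "Gold", "protected_gb": 2500},
]

for org, total in monthly_invoices(usage).items():
    print(org, total)
```

`Decimal` is used instead of floats so that currency amounts round predictably. A real billing job would also need to decide which metric is billable (front-end protected capacity, archived capacity, or both), which the tier table leaves open.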
Onboarding and Offboarding Automation
A new-tenant onboarding workflow runs as a Terraform plan plus an Ansible playbook: Terraform creates the Organization, View Boxes, network bindings, default policies, and SAML federation; Ansible registers the tenant’s sources (vCenter, M365, NAS) and applies the chosen tier policy. Total elapsed time: approximately two hours of automation plus tenant-side IdP configuration. Offboarding inverts the workflow: the tenant’s View Boxes are placed into a 30-day grace period with restore-only access, then cryptographically erased by destroying the per-tenant key. Crypto-erase works only if the tenant was provisioned with its own key, so the per-tenant KMIP option in the isolation table becomes mandatory for any tenant that needs this guarantee. Audit logs from the entire lifecycle stream to the MSP’s SIEM for compliance evidence.
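The offboarding lifecycle described above is a small state machine driven by dates. The sketch below shows one way to express it. The state names and the `offboarding_state` function are hypothetical, and the 30-day grace period comes from the scenario. It models only the timing, not the Terraform or Ansible steps.

```python
from datetime import date, timedelta

GRACE_DAYS = 30  # restore-only window from the scenario


def offboarding_state(offboard_date, today):
    """Return the lifecycle state of a tenant's View Boxes on a given day."""
    if today < offboard_date:
        return "active"
    if today < offboard_date + timedelta(days=GRACE_DAYS):
        return "restore_only"
    return "crypto_erase_due"


start = date(2025, 1, 1)
for day in (date(2024, 12, 31), date(2025, 1, 15), date(2025, 1, 30), date(2025, 1, 31)):
    print(day, offboarding_state(start, day))
```

Driving the state from the offboarding date rather than from a manual flag gives the audit trail a deterministic answer to when the key had to be destroyed.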
Key Takeaway: A Cohesity DMaaS MSP design rests on three pillars: Organizations for tenant isolation, Helios for self-service, and APIs (REST / Terraform / Ansible) for metering and lifecycle automation. Service tiers, not custom configurations, are the unit of sale; the platform absorbs Day-2 operations so the MSP can scale tenant count without scaling headcount.
CCAE Exam Blueprint and Scenario Question Pattern
The Numbers You Must Internalize
| Exam Element | Value | Source |
|---|---|---|
| Exam code | COH500 | [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf] |
| Duration | 90 minutes | Same |
| Cost | $200 USD | Same |
| Passing score | 60% (approx. 36 of 60 correct) | Same |
| Question count | ~60 scenario MCQs | Same |
| Question format | 4-option scenario MCQ | Same |
| Validity | 2 years | Same |
| Retake window | 14 days | Same |
| Prerequisites | None formal; ~1 year hands-on recommended | Same |
Ninety minutes for sixty scenario questions yields a 90-second per-question budget. That is enough time to read a paragraph carefully, reject two distractors, and pick between the remaining two — but it is not enough time to puzzle over feature trivia. Memorize the platform vocabulary so reading time is short and decision time is long.
Domain Weights and Per-Domain Question Counts
| # | Domain | Weight | Questions (of 60) |
|---|---|---|---|
| 1 | Cohesity Data Cloud Data Management Platform Architecture | 22% | ~13 |
| 2 | Cohesity Architecture Solution Discovery and Design | 35% | ~21 |
| 3 | Design Security-focused Solutions | 18% | ~11 |
| 4 | Integrate Third-party Solutions with Cohesity | 13% | ~8 |
| 5 | Gap Analysis and Troubleshooting | 12% | ~7 |
Domain 2 alone is more than a third of the exam — it is the one place you cannot afford to be weak. Domains 1 and 3 together are 40 percent. Spending equal time on every domain is a strategic mistake; the 30-day plan below allocates study hours in proportion to the weights.
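The per-domain question counts and the pass mark follow directly from the published percentages. The sketch below reproduces them with largest-remainder rounding so the counts always sum to the total. The `apportion` helper is an illustrative name, and the rounding method is an assumption about how the approximate counts in the table were derived.

```python
import math

DOMAIN_WEIGHTS = {1: 22, 2: 35, 3: 18, 4: 13, 5: 12}  # percent, from the table
QUESTIONS = 60
PASS_PERCENT = 60


def apportion(weights, total):
    """Split `total` items by percentage weight using largest-remainder rounding."""
    parts = {k: divmod(total * w, 100) for k, w in weights.items()}
    counts = {k: quotient for k, (quotient, _) in parts.items()}
    leftover = total - sum(counts.values())
    # Hand the leftover items to the domains with the largest fractional parts.
    by_remainder = sorted(parts, key=lambda k: parts[k][1], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


counts = apportion(DOMAIN_WEIGHTS, QUESTIONS)
pass_mark = math.ceil(QUESTIONS * PASS_PERCENT / 100)

print(counts)
print("pass mark:", pass_mark)
```

Integer arithmetic with `divmod` avoids floating-point surprises, and the same helper can be reused to split study hours or practice questions by domain.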
Figure 15.4: CCAE domain weight breakdown across the five exam domains
graph TD
EXAM[CCAE COH500<br/>60 questions / 90 minutes]
EXAM --> D1[Domain 1: Platform Architecture<br/>22% / ~13 questions]
EXAM --> D2[Domain 2: Solution Discovery & Design<br/>35% / ~21 questions]
EXAM --> D3[Domain 3: Security-Focused Solutions<br/>18% / ~11 questions]
EXAM --> D4[Domain 4: 3rd-Party Integration<br/>13% / ~8 questions]
EXAM --> D5[Domain 5: Gap Analysis & Troubleshooting<br/>12% / ~7 questions]
style D2 fill:#1f6feb,stroke:#58a6ff,color:#fff
style D1 fill:#2d5a8c,stroke:#58a6ff,color:#fff
style D3 fill:#2d5a8c,stroke:#58a6ff,color:#fff
style D4 fill:#1c3a5c,stroke:#58a6ff,color:#fff
style D5 fill:#1c3a5c,stroke:#58a6ff,color:#fff
The Scenario Question Pattern
Every CCAE item follows a recognizable shape [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf]:
- A scenario paragraph describing workload characteristics (volume, change rate, RTO/RPO), constraints (budget, bandwidth, existing infrastructure, compliance, residency), and a business challenge.
- Four candidate designs, each plausible at first glance.
- The correct answer satisfies all stated constraints.
- The distractors each fail exactly one constraint — usually the cheapest option misses RTO, the most secure option exceeds budget or operational complexity, or the most performant option ignores compliance.
The decoding strategy is to underline the constraints in the scenario before reading the options, then test each option against the constraint list and eliminate any option that fails one. The remaining option, by construction, is the answer.
| Distractor Archetype | What It Optimizes For | What It Sacrifices |
|---|---|---|
| The Cheap Option | Lowest CapEx / OpEx | RTO, RPO, or resilience |
| The Fortress | Maximum security | Operational simplicity, cost |
| The Performance Demon | Lowest RTO/RPO | Cost, retention, compliance |
| The Status Quo | Minimal change to existing estate | Future scale, modern features |
The correct answer is almost always the option that balances competing objectives while applying the right Cohesity feature for the constraint set. Cohesity explicitly states the exam rewards judgment, not feature recall [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
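The elimination strategy amounts to evaluating every option against every constraint and keeping only the options with no failures. The sketch below makes that explicit. The scenario numbers, option attributes, and constraint thresholds are invented for illustration, with option names borrowed from the distractor archetypes in the table.

```python
# Constraints extracted from a hypothetical scenario paragraph.
constraints = {
    "RTO <= 4 h": lambda o: o["rto_hours"] <= 4,
    "monthly cost <= 10000": lambda o: o["monthly_cost"] <= 10_000,
    "immutable copy": lambda o: o["immutable"],
}

# Four candidate designs, each plausible at first glance.
options = [
    {"name": "The Cheap Option", "rto_hours": 24, "monthly_cost": 4_000, "immutable": True},
    {"name": "The Fortress", "rto_hours": 2, "monthly_cost": 25_000, "immutable": True},
    {"name": "The Performance Demon", "rto_hours": 0.5, "monthly_cost": 9_000, "immutable": False},
    {"name": "The Balanced Design", "rto_hours": 3, "monthly_cost": 9_500, "immutable": True},
]


def evaluate(options, constraints):
    """Map each option name to the list of constraints it violates."""
    return {
        opt["name"]: [name for name, ok in constraints.items() if not ok(opt)]
        for opt in options
    }


def survivors(options, constraints):
    """Options that violate no constraint."""
    return [name for name, failed in evaluate(options, constraints).items() if not failed]


for name, failed in evaluate(options, constraints).items():
    print(name, "fails:", failed or "nothing")
print("answer:", survivors(options, constraints))
```

Writing the constraints down before looking at the options is what makes this work: each distractor is rejected for a named reason rather than on instinct.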
Decision Criteria Checklist
For every scenario, run through this six-point checklist before picking an option:
- Have I mapped every requirement to a measurable SLA (RPO, RTO, retention, residency, compliance)?
- Does the protection technique match the workload class — VM snapshot, application-aware, agent, cloud-native, SaaS?
- Is the hardware/edition appropriate for the performance, footprint, and refresh cycle?
- Is the design policy-driven via Helios rather than hand-configured per job?
- Are encryption, RBAC/MFA, immutability, and vaulting answered explicitly, not bolted on?
- Is the design validated by a PoC artifact or as-built vs as-used review?
A 30-Day Study Plan Keyed to Domain Weights
The plan below allocates study days roughly in proportion to the published domain weights. Adjust ratios upward for any domain where your diagnostic-test score is below 70 percent.
Figure 15.5: 30-day CCAE study plan timeline mapped to domain weights
timeline
title 30-Day CCAE Study Plan
section Week 1 Foundation
Days 1-7 : SpanFS internals
: Hardware editions
: Networking & Helios
: Domain 1 (22%)
section Week 2 Design
Days 8-18 : Sizing & workload patterns
: Hybrid / multi-cloud
: PoC architectures
: Domain 2 (35%)
section Week 3 Security
Days 19-23 : DataLock & FortKnox
: DataHawk & DSA partners
: Domain 3 (18%)
section Week 4 Integration & Practice
Days 24-26 : REST API & Terraform
: Organizations & multi-tenancy
: Domain 4 (13%)
Days 27-28 : Gap analysis & Siren
: Capacity prediction
: Domain 5 (12%)
Days 29-30 : Two timed practice exams
: Error analysis & remediation
| Phase | Days | Focus | Domain | Activities |
|---|---|---|---|---|
| Foundations | 1–7 | SpanFS, hardware editions, networking, Helios | 1 (22%) | Re-read Chapters 1–5; Cohesity SpanFS / SnapTree white paper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity-SpanFS-and-SnapTree-WP.pdf]; Optimal Network Designs reference architecture [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/reference-architecture/Optimal-Network-Designs-with-Cohesity-RA.pdf]; build a sketch of a 6-node cluster with Bond Mode 4 and MLAG. |
| Solution Design | 8–18 | Sizing, workload patterns, hybrid/multi-cloud, PoC | 2 (35%) | Re-read Chapters 3, 7, 8, 9, 10; sizing exercises with realistic dedupe ratios; review Suffolk County Council and Sky Lakes Medical case studies [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/case-study/suffolk-county-council-case-study-en.pdf][Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/case-study/sky-lakes-medical-case-study-en.pdf]; design three PoC architectures (one VM-heavy, one DB-heavy, one M365-heavy). |
| Security | 19–23 | DataLock, FortKnox, DataHawk, DSA integrations | 3 (18%) | Re-read Chapter 11; threat-defense architecture white paper [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/threat-defense-architecture-white-paper-en.pdf]; map each Data Security Alliance partner to the Cohesity capability it extends [Source: https://www.cohesity.com/company/data-security-alliance/]. |
| Integration | 24–26 | REST API, Terraform, Organizations, Hybrid Extender | 4 (13%) | Re-read Chapters 6 and 13; hands-on with the Helios “Try this Operation” interactive API console [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/]; write one Terraform module for a Protection Group; design a multi-tenant Organization layout. |
| Gap Analysis | 27–28 | Helios capacity, Siren, pre-checks, as-built vs as-used | 5 (12%) | Re-read Chapter 14; practice reading Helios capacity prediction graphs; walk through a Siren diagnostic flow [Source: https://www.cohesity.com/blogs/using-cohesitys-saas-based-helios-manage-clusters-anywhere/]. |
| Synthesis | 29–30 | Full-length practice exams under timed conditions | All | Two 60-question timed practice exams; post-exam error analysis; targeted remediation on the lowest-scoring domain. |
A useful analogy for the plan: it is a marathon training schedule, not a sprint. Domain 2 is the long-run portion of the week — you cannot skip the long run and expect to finish. Domain 5 is the cooldown — necessary, but short. The synthesis weekend is the taper; the exam is race day.
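To see how closely the plan tracks the domain weights, compare each domain's share of the 28 content days (days 29 and 30 are practice exams) with its published weight. The sketch below does that arithmetic. The day counts are read from the plan table, and the `plan_vs_weight` helper is an illustrative name.

```python
DOMAIN_WEIGHTS = {1: 22, 2: 35, 3: 18, 4: 13, 5: 12}  # percent
PLAN_DAYS = {1: 7, 2: 11, 3: 5, 4: 3, 5: 2}           # days 1-28 from the plan table


def plan_vs_weight(plan_days, weights):
    """Percentage-point gap between each domain's share of study days and its exam weight."""
    total = sum(plan_days.values())
    return {d: round(100 * plan_days[d] / total - weights[d], 1) for d in plan_days}


for domain, gap in plan_vs_weight(PLAN_DAYS, DOMAIN_WEIGHTS).items():
    print(f"Domain {domain}: {gap:+.1f} points")
```

With these day counts, the plan gives Domains 1 and 2 a somewhat larger share than their weights and Domains 4 and 5 a smaller one, which is consistent with the "roughly in proportion" wording. A candidate who is weak on integration or troubleshooting could shift a day from the foundation phase to close the gap.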
Test-Day Strategy
The Day Before
- Confirm proctoring system check is green, ID is ready, environment is private.
- Re-read your one-page distilled summary of domain weights, RPO/RTO patterns, and the FortKnox + DataLock + DataHawk stack.
- Sleep eight hours. Do not cram.
During the Exam
- Spend the first 60 seconds of each question scanning the scenario for constraints: RPO, RTO, retention, compliance, residency, budget, and existing infrastructure. Underline them mentally.
- Spend the next 30 seconds eliminating distractors: the option that violates a stated constraint is wrong, regardless of how attractive it otherwise looks.
- If two options remain, pick the one that balances rather than the one that maximizes a single axis. CCAE rewards balance.
- Flag any question that takes more than two minutes and move on. Return on the second pass with a clearer head.
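- Reserve the final 10 minutes for flagged questions and a pass through any blanks. Holding that reserve means averaging about 80 seconds per question on the first pass rather than the full 90. Never leave an answer blank, because there is no penalty for guessing.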
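The timing guidance in this list can be checked with a line of arithmetic. The sketch below computes the average first-pass time per question with and without a review reserve. The function name and default arguments are illustrative, and the exam length and question count come from the blueprint table.

```python
def first_pass_seconds(total_minutes=90, questions=60, reserve_minutes=0):
    """Average seconds available per question after holding back a review reserve."""
    return (total_minutes - reserve_minutes) * 60 / questions


print(first_pass_seconds())                    # no reserve
print(first_pass_seconds(reserve_minutes=10))  # 10-minute review reserve
```

Every question that runs to the two-minute flag limit consumes the margin of several others, which is why flagging early and moving on protects the reserve.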
After the Exam
- Pass: claim your two-year credential and start logging continuing-education hours toward renewal.
- Fail: use the 14-day retake window deliberately. Do not retake immediately. Run a domain-by-domain post-mortem against your score report and target the weakest domain for one focused week, then retake [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].
Key Takeaway: The exam is 90 minutes for 60 scenario MCQs at 60 percent passing — a budget of 90 seconds per question. Win by reading constraints first, eliminating distractors that violate any one constraint, and picking the balanced option. Allocate study time in proportion to domain weights (22 / 35 / 18 / 13 / 12), and prioritize Domain 2 because it is the largest single block of the exam.
Chapter Summary
This chapter pulled the entire CCAE curriculum into three end-to-end architectures and one exam strategy. The global enterprise scenario showed how hub-and-spoke replication inside each region combines with one-to-many replication across regions to deliver multi-region survival, with CloudArchive supplying long-term retention and Helios SaaS providing fleet-level visibility. The healthcare scenario showed how DataLock immutability, FortKnox Self-Managed cyber vaulting, DataHawk detection and classification, KMIP-managed encryption, and Helios Self-Managed combine into a HIPAA-aligned defense-in-depth stack where every layer answers a specific compliance or ransomware requirement. The MSP scenario showed how Cohesity Organizations, Helios self-service, and REST/Terraform/Ansible automation deliver a multi-tenant DMaaS offering with metered chargeback and lifecycle automation that scales tenants without scaling support headcount.
The exam blueprint section decoded COH500 — 90 minutes, 60 questions, $200, 60 percent passing — and made the domain weights actionable. The scenario question pattern is consistent: a paragraph of constraints, four plausible options, three distractors that each fail one constraint, and one balanced answer. The 30-day plan and test-day strategy translate the blueprint into daily activities and minute-by-minute exam discipline. If you can defend each design choice in the three scenarios above against alternatives, recognize the four distractor archetypes on sight, and finish a 60-question timed practice with at least 80 percent under exam conditions, you are ready to sit for the CCAE.
Key Terms
- Hub-and-spoke — Replication topology in which spoke clusters back up locally and replicate inbound to a central hub for centralized DR, retention, and reporting; the canonical pattern for global enterprises with branch sites [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/white-paper/Cohesity_Data_Protection_White_Paper.pdf].
- Cyber Vault — A logically and physically isolated tertiary copy of protected data with a virtual air gap and inbound-only transfer windows. Cohesity FortKnox is the platform’s cyber vault offering, available as Cohesity-managed SaaS or self-managed [Source: https://www.cohesity.com/glossary/cyber-vault/].
- DMaaS (Data Management as a Service) — Cohesity’s SaaS delivery of DataProtect, where the customer or MSP consumes regional SaaS instances rather than operating clusters; the foundation of modern BaaS offerings [Source: https://www.cohesity.com/blogs/architecture-matters-blueprints-for-backup-as-a-service-offerings/].
- Compliance — The set of regulatory and contractual requirements (HIPAA, PCI, FedRAMP, GDPR) that constrain architecture choices around encryption, retention, residency, and access control. The exam treats compliance as a hard, not soft, constraint.
- Chargeback — Per-tenant or per-business-unit usage metering that converts platform consumption into invoiceable units. Implemented via Helios consumption metrics retrieved through the REST API [Source: https://www.cohesity.com/blogs/automating-workflows-using-cohesity-rest-api-part-1/].
- Scenario design — The architectural discipline of mapping a paragraph of business constraints to a composed Cohesity solution that satisfies all constraints while balancing competing objectives. The unit of evaluation on the CCAE exam.
- Domain weighting — The published percentage allocation of exam questions to each of the five CCAE domains (22 / 35 / 18 / 13 / 12). The basis for proportional study-time allocation [Source: https://www.cohesity.com/content/dam/cohesity/resource-assets/datasheets/cohesity-certified-architect-expert-exam-preparation-guide-en.pdf].