Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations
Learning Objectives
Bootstrap a new physical, virtual, or cloud Cohesity cluster end-to-end using the web wizard and iris_cli.
Apply best-practice configuration (DNS, NTP, AD/SSO, licensing, VIPs) to a brownfield enterprise environment.
Perform Day-2 operations including rolling upgrades, node additions/removals, and disk/node replacement.
Use the Cohesity CLI, REST API v1/v2, Helios API, PowerShell, Ansible, and Terraform to automate cluster lifecycle.
Choose appropriately between Physical, Virtual Edition, and Cloud Edition deployment models.
A Cohesity cluster spends roughly 30 minutes being born and the rest of its life in Day-2 operations. The CCAE exam reflects that ratio: it expects you to bootstrap a cluster cleanly the first time, then operate it for years without unplanned downtime.
Section 1: Bootstrapping a New Cluster
Pre-Quiz: Bootstrapping
1. Before any IPs are assigned, how does the architect uniquely identify a specific imaged Cohesity node when running iris_cli cluster create?
A. By the MAC address of bond0
B. By the node-ID printed on the IPMI console / boot screen
C. By a temporary DHCP-assigned hostname
D. By the chassis serial number alone
2. When does the initial cluster partition get created during bootstrap?
A. Manually, via iris_cli partition create before cluster create
B. Automatically, as part of iris_cli cluster create
C. Only after Helios registration completes
D. When the first VIP is bound to bond0
3. An architect joins the cluster to AD but Kerberos ticket validation fails mysteriously. What is the most likely missing prerequisite?
A. A Helios subscription
B. An Eager Zeroed Thick VMDK
C. Reverse DNS (PTR) records for the cluster IPs
D. A Terraform state file
1.1 Out-of-the-Box Experience and IPMI
Each Cohesity-branded appliance, ReadyNode, or certified partner platform ships pre-imaged but unconfigured. The bootstrap entry point is IPMI (the out-of-band management controller). The architect connects to each node's IPMI IP, powers it on, and verifies the imaging splash screen. Each node carries a unique node-ID (e.g., 181140266786854) printed on the IPMI console.
Analogy: The node-ID is like a hospital wristband on a newborn. Before the cluster gives the node an IP "address" to live at, the node-ID is the only handle that uniquely identifies which physical box you mean.
1.2 IP Allocation
Console into the first node and run the network configuration script:
cd /home/cohesity/bin/network
./configure_network.sh
The script prompts for bond selection (bond0 default for node-to-node), IP/prefix/gateway, MTU (1500 default; 9000 for jumbo), and DNS servers. As an alternative, iris_cli interface list and iris_cli node status enumerate physical interfaces and bond memberships before pushing the configuration.
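The values typed into those prompts can be sanity-checked before they are applied. The following is a minimal Python sketch, not part of any Cohesity tooling: the helper name check_bond_config and the sample addresses in the usage note are hypothetical, and restricting MTU to the two values named above is an assumption made for illustration.

```python
import ipaddress

# MTU values named in the text: 1500 (default) and 9000 (jumbo frames).
ALLOWED_MTUS = {1500, 9000}


def check_bond_config(ip_with_prefix, gateway, mtu, dns_servers):
    """Return a list of problems found in a planned bond0 configuration.

    Hypothetical helper for illustration; it only validates the values an
    operator would type into the prompts and does not touch the node.
    """
    problems = []
    iface = ipaddress.ip_interface(ip_with_prefix)  # e.g. "10.20.30.11/24"
    gw = ipaddress.ip_address(gateway)

    if gw not in iface.network:
        problems.append(f"gateway {gw} is outside {iface.network}")
    if gw == iface.ip:
        problems.append("gateway must differ from the node IP")
    if mtu not in ALLOWED_MTUS:
        problems.append(f"MTU {mtu} is not one of {sorted(ALLOWED_MTUS)}")
    if not dns_servers:
        problems.append("at least one DNS server is required")
    for server in dns_servers:
        ipaddress.ip_address(server)  # raises ValueError if malformed
    return problems
```

Calling check_bond_config("10.20.30.11/24", "10.20.30.1", 1500, ["10.20.0.53"]) returns an empty list, while a gateway in a different subnet or an empty DNS list each add one entry to the result.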
1.3 Discovery and cluster create
Once the first node has an IP, it discovers its peers on the same L2 domain:
iris_cli discover free-nodes
With the free-node list in hand, the cluster is created with a single iris_cli cluster create invocation.
The command bundles every prerequisite into one atomic operation: domain, NTP, DNS, gateway, mask, node IPs, node IDs, VIPs, and encryption posture. Quorum forms once a majority of nodes acknowledge the create.
Exam tip: Multicast is frequently disabled in enterprise networks. Cohesity supports a unicast-discovery path; the architect must supply equal numbers of nodes, VIPs, and IPMI addresses and confirm that ARP-based peer discovery works on the bootstrap subnet.
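The count-matching rule and the majority-quorum rule can be checked before the create command is ever run. This is a rough Python sketch under stated assumptions: check_create_inputs and quorum_size are hypothetical helpers, the argument names do not mirror real iris_cli flags, and IPMI addresses are deliberately not checked against the data subnet because they often live on a separate management network.

```python
import ipaddress


def quorum_size(node_count):
    """Smallest number of nodes that forms a strict majority."""
    return node_count // 2 + 1


def check_create_inputs(node_ips, node_ids, vips, ipmi_ips, subnet):
    """Pre-flight checks on the lists that would be passed to cluster create.

    Hypothetical helper: argument names are illustrative and do not mirror
    real iris_cli flags.
    """
    errors = []
    net = ipaddress.ip_network(subnet)

    # The exam tip: node, VIP, and IPMI counts must all agree.
    counts = {
        "node IPs": len(node_ips),
        "node IDs": len(node_ids),
        "VIPs": len(vips),
        "IPMI IPs": len(ipmi_ips),
    }
    if len(set(counts.values())) != 1:
        errors.append(f"count mismatch: {counts}")

    if len(set(node_ids)) != len(node_ids):
        errors.append("duplicate node-ID")

    all_ips = list(node_ips) + list(vips)
    if len(set(all_ips)) != len(all_ips):
        errors.append("a node IP or VIP is used twice")

    # Node IPs and VIPs must sit inside the bootstrap subnet.
    for ip in all_ips:
        if ipaddress.ip_address(ip) not in net:
            errors.append(f"{ip} is outside {net}")
    return errors
```

For a three-node create, quorum_size(3) is 2, so the cluster forms once two of the three nodes acknowledge.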
1.4 Post-Create Tasks
Verify health with iris_cli cluster status.
DNS registration: A records that round-robin across all VIPs, plus a matching PTR record for each VIP.
License application: capacity-based or subscription entitlement.
Active Directory join: bind to the AD forest and map AD groups to roles (Admin, Operator, Viewer).
SSO/SAML: optional federation with Okta, Azure AD, or Ping.
Helios registration: required for global dashboards and SaaS-only features (DataHawk, FortKnox, SiteContinuity orchestration).
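The DNS registration task in the list above is easy to get half right: forward records exist, reverse records do not. As a minimal sketch, the Python below generates the record pairs for a set of VIPs and reports any VIP without a PTR. The function names, the FQDN, and the addresses are placeholders, and the zone-file style output is illustrative rather than tied to any particular DNS server.

```python
import ipaddress


def dns_records(cluster_fqdn, vips):
    """Build round-robin A records and matching PTR records for the VIPs.

    Returns (a_records, ptr_records) as lists of zone-file style strings.
    The FQDN and addresses passed in are placeholders chosen by the caller.
    """
    fqdn = cluster_fqdn.rstrip(".") + "."
    a_records = [f"{fqdn} IN A {vip}" for vip in vips]
    ptr_records = [
        f"{ipaddress.ip_address(vip).reverse_pointer}. IN PTR {fqdn}"
        for vip in vips
    ]
    return a_records, ptr_records


def missing_ptrs(vips, ptr_records):
    """Return the VIPs that have no PTR line in ptr_records."""
    covered = {line.split()[0] for line in ptr_records}
    return [
        vip for vip in vips
        if ipaddress.ip_address(vip).reverse_pointer + "." not in covered
    ]
```

Running missing_ptrs against the PTR lines actually present in the reverse zone gives a quick list of the addresses most likely to cause AD join or Kerberos trouble.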
Bootstrap Workflow (Mermaid)
flowchart TD
    A[Power on imaged nodes via IPMI] --> B[Console to first node]
    B --> C[Run configure_network.sh to set bond0 IP, mask, gateway, DNS, MTU]
    C --> D[iris_cli discover free-nodes]
    D --> E{Select nodes to enroll}
    E --> F[iris_cli cluster create with domain, NTP, DNS, node-ips, node-ids, VIPs]
    F --> G[Quorum forms, initial partition auto-created]
    G --> H[iris_cli cluster status to verify health]
    H --> I[DNS A/PTR records for VIPs]
    I --> J[Apply license]
    J --> K[Active Directory join, map AD groups to roles]
    K --> L[Optional SAML/SSO federation]
    L --> M[Helios registration]
Key Points - Bootstrapping
node-ID uniquely identifies an imaged-but-unconfigured node before any IP is assigned; it is required for iris_cli cluster create.
The initial partition is auto-created as part of cluster create — there is no separate partition step.
iris_cli discover free-nodes enumerates imaged-but-uncommitted peers on the same L2 domain (or unicast in multicast-disabled networks).
Reverse DNS (PTR) is mandatory for AD join and Kerberos — missing PTRs cause mysterious AD failures.
Helios registration is the final step and unlocks SaaS-only features (DataHawk, FortKnox, SiteContinuity).
Post-Quiz: Bootstrapping
1. Before any IPs are assigned, how does the architect uniquely identify a specific imaged Cohesity node when running iris_cli cluster create?
A. By the MAC address of bond0
B. By the node-ID printed on the IPMI console / boot screen
C. By a temporary DHCP-assigned hostname
D. By the chassis serial number alone
2. When does the initial cluster partition get created during bootstrap?
A. Manually, via iris_cli partition create before cluster create
B. Automatically, as part of iris_cli cluster create
C. Only after Helios registration completes
D. When the first VIP is bound to bond0
3. An architect joins the cluster to AD but Kerberos ticket validation fails mysteriously. What is the most likely missing prerequisite?
A. A Helios subscription
B. An Eager Zeroed Thick VMDK
C. Reverse DNS (PTR) records for the cluster IPs
D. A Terraform state file
Section 2: Virtual and Cloud Edition Deployment
Pre-Quiz: VE and CE Deployment
1. How does an architect add capacity to a single-node Cohesity Virtual Edition appliance?
A. Add a second VE node to form a 2-node cluster
B. Stop the cluster, extend the capacity-tier disk in the hypervisor, run iris_cli node list disk extend, restart
C. Resize the boot disk in vSphere
D. Migrate to Cloud Edition automatically via Helios
2. Why must Cohesity VE nodes have a vSphere DRS anti-affinity rule configured?
A. To improve dedupe ratios across hosts
B. To reduce VMware licensing cost
C. So a single ESXi host failure does not take down multiple cluster nodes simultaneously
D. To enable BGP route advertisement
3. A 3-node Azure Cloud Edition cluster on Standard_DS5_v2 requires how many vCPU cores in total, and what is the architect's first sizing prerequisite?
A. 12 cores; ensure VPN to on-prem exists
B. 24 cores; provision premium SSD only
C. 48 cores; request a regional vCPU quota increase from Azure
D. 96 cores; deploy via Terraform only
2.1 VE on VMware/Hyper-V
Virtual Edition ships as an OVA (VMware) or VHDX (Hyper-V) and runs as a single-node or multi-node cluster. Minimum sizing is typically 8 or more vCPU and 64 GB or more of RAM, with a dedicated performance-tier disk on flash and a separate capacity-tier disk on HDD or capacity SSD. Critical constraints:
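Single-node VE expands by disk resize, not node addition. Stop the cluster, extend the capacity-tier disk in the hypervisor, then extend the disk inside Cohesity and restart:
iris_cli cluster stop
# Extend the capacity-tier disk in the hypervisor before continuing.
iris_cli node list disk extend
iris_cli disk list    # confirm the new capacity is visible
iris_cli cluster start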
Multi-node VE (3+ nodes) uses the same iris_cli discover free-nodes + cluster create flow as physical.
DRS anti-affinity must be configured so VE nodes never co-reside on the same ESXi host.
VMDK provisioning must be Eager Zeroed Thick on the performance tier — thin-provisioned performance disks introduce write-stall behavior under load.
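The anti-affinity constraint in the list above reduces to a simple invariant: no ESXi host may run more than one VE node. A minimal Python sketch of that check follows; the function name, VM names, and host names are hypothetical, and in practice the placement mapping would come from vCenter inventory rather than a hand-written dictionary.

```python
from collections import defaultdict


def anti_affinity_violations(placement):
    """Return {host: [vm, ...]} for hosts running more than one VE node.

    `placement` maps a VE node VM name to the ESXi host it currently runs on.
    Names are hypothetical; in practice the mapping would come from vCenter
    inventory.
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
```

An empty result means a single host failure can take out at most one cluster node, which is exactly what the DRS rule is meant to guarantee.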
2.2 Cloud Edition on AWS, Azure, GCP
Cloud Edition runs as a minimum 3-node production cluster. Single-node CE is supported only for lab use.
Azure: Marketplace image with Standard_DS5_v2 (16 vCPU per node). 3 nodes = 48 vCPU; new subscriptions often default to a 10-core regional quota, so request quota increases first. Resource-group caps allow up to 64 VMs per RG (effective ceiling on cluster size). Azure Managed Disks must be sized in multiples of 1 MB.
AWS: DataPlatform Cloud Edition AMI from AWS Marketplace, m5/m6i large-instance family. CE on AWS supports S3 tiering for overflow capacity.
Networking: dedicated subnet, security-group inter-node rules, VPN/Direct Connect/ExpressRoute back on-prem, IAM roles for S3/Blob (CloudArchive, CloudTier).
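The Azure quota arithmetic above is worth making explicit, since it is the first blocker on a new subscription. This short Python sketch uses the 16 vCPU per node figure from the text; the function name is hypothetical, and it assumes nothing else in the region is consuming the quota.

```python
VCPUS_PER_DS5_V2 = 16  # per the text: Standard_DS5_v2 has 16 vCPU per node


def vcpu_quota_gap(node_count, regional_quota, vcpus_per_node=VCPUS_PER_DS5_V2):
    """Return how many extra vCPUs must be requested (0 if quota suffices).

    Assumes the regional quota is otherwise unused.
    """
    required = node_count * vcpus_per_node
    return max(0, required - regional_quota)
```

With a 3-node cluster and a 10-core default quota, the gap is 48 - 10 = 38 vCPUs, so the quota increase request has to be filed before deployment begins.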
2.3 Robo Edition
A thin VE variant tuned for remote/branch offices. Typically a single-node VE on existing branch hypervisor capacity, replicating to a central CE or physical hub. It trades clustered resiliency for a tiny footprint and manages risk by replicating critical backups upstream within minutes.
Form Factor Comparison
| Attribute | Physical | VE | CE |
| --- | --- | --- | --- |
| Form factor | 1U/2U appliance | OVA/VHDX | Cloud Marketplace AMI/Image |
| Min nodes (prod) | 3 | 3 (single = lab/Robo) | 3 |
| Bootstrap entry | IPMI + console | Hypervisor console | Cloud console + SSH |
| Capacity expand | Add nodes / disks | Resize VMDK + disk extend | Add VMs / resize managed disks |
| External overflow | Local only | Local only | S3 / Blob tier |
| Helios | Optional | Optional | Strongly recommended |
Form Factor Mermaid Diagram
flowchart LR
    subgraph Physical[Physical / ReadyNode]
        P1[1U or 2U appliance]
        P2[IPMI bootstrap]
        P3[NVMe perf + HDD/SSD cap]
    end
    subgraph VE[Virtual Edition]
        V1[OVA / VHDX]
        V2[VMware, Hyper-V, AHV, KVM]
        V3[Eager Zeroed Thick VMDKs, DRS anti-affinity]
    end
    subgraph CE[Cloud Edition]
        C1[AWS AMI / Azure Marketplace]
        C2[3-node minimum production]
        C3[S3 / Blob tiering]
    end
    subgraph Robo[Robo Edition]
        R1[Single-node VE variant]
        R2[Branch office hypervisor]
        R3[Replicates to hub cluster]
    end
    Physical -->|Primary on-prem fabric| Hub[Helios Fleet Management]
    VE -->|Lab, dev/test| Hub
    CE -->|Cloud DR + cloud-native| Hub
    Robo -->|Edge protection| Hub
Key Points - VE / CE Deployment
Single-node VE expands by disk resize + iris_cli node list disk extend — never by adding a second node.
VE on VMware requires DRS anti-affinity and Eager Zeroed Thick performance VMDKs.
Azure CE uses Standard_DS5_v2 — a 3-node cluster needs 48 vCPU, often blocked by default subscription quotas.
AWS CE supports S3 tiering as overflow capacity; Azure CE uses Blob.
Robo Edition is single-node VE for branch offices; resiliency comes from upstream replication, not local quorum.
Post-Quiz: VE and CE Deployment
1. How does an architect add capacity to a single-node Cohesity Virtual Edition appliance?
A. Add a second VE node to form a 2-node cluster
B. Stop the cluster, extend the capacity-tier disk in the hypervisor, run iris_cli node list disk extend, restart
C. Resize the boot disk in vSphere
D. Migrate to Cloud Edition automatically via Helios
2. Why must Cohesity VE nodes have a vSphere DRS anti-affinity rule configured?
A. To improve dedupe ratios across hosts
B. To reduce VMware licensing cost
C. So a single ESXi host failure does not take down multiple cluster nodes simultaneously
D. To enable BGP route advertisement
3. A 3-node Azure Cloud Edition cluster on Standard_DS5_v2 requires how many vCPU cores in total, and what is the architect's first sizing prerequisite?
A. 12 cores; ensure VPN to on-prem exists
B. 24 cores; provision premium SSD only
C. 48 cores; request a regional vCPU quota increase from Azure
D. 96 cores; deploy via Terraform only
Section 3: Day-2 Operations
Pre-Quiz: Day-2 Operations
1. During a Cohesity rolling upgrade, how does the cluster keep client traffic flowing while a node is rebooting?
A. All clients enter a 30-minute pause
B. A distributed lock token serializes one node at a time and VIPs migrate to surviving peers
C. The cluster fails over to a passive standby cluster
D. A separate Helios cluster takes over the VIPs
2. Why does Cohesity use an active/passive root partition swap during upgrades?
A. To save storage capacity
B. To enable atomic, fast rollback at the boot level if the new image fails to come up
C. To support encryption at rest
D. Because SpanFS requires it
3. What is "rebuild headroom" and why does it matter for node removal?
A. Free RAM reserved for the SpanFS cache
B. Spare capacity equal to a node's data footprint, kept free so SpanFS can re-protect blocks after node loss without degraded resiliency
C. Network bandwidth reserved for replication
D. A Helios SaaS feature
4. The daily Cohesity Heartbeat log bundle is used for what?
A. Replication keepalive between nodes
B. Proactive triage by Cohesity Support: configuration, health, alerts, and capacity uploaded daily over HTTPS
C. Quorum heartbeat between cluster partitions
D. An audit log for compliance only
3.1 Cluster Upgrades and Rolling Reboots
Cohesity ships a one-click rolling upgrade. The cluster pulls candidate releases automatically, with no manual download. A distributed lock manager hands a token from node to node, and each token-holder runs the same sequence:
Token-holder pauses local services and migrates its VIPs to peers.
Active client connections continue against surviving VIPs (UI, SMB, NFS, ongoing backups).
The node atomically swaps active and passive root partitions and reboots into the new image.
Once healthy, it releases the token to the next node.
The atomic root-partition swap also enables fast rollback at the boot level: if a node fails to come up on the new image, it boots back into the previous root.
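The token-passing sequence and the boot-level rollback can be modelled in a few lines. The Python below is a toy simulation, not Cohesity's implementation: the function name, the round-robin VIP placement, and the boots_ok callback are all assumptions made for illustration, and the sketch simply records a rollback and moves on, whereas a real upgrade may well halt at that point.

```python
def rolling_upgrade(nodes, vips_by_node, boots_ok=lambda node: True):
    """Simulate a token-passing rolling upgrade.

    nodes: ordered list of node names (the order the token travels).
    vips_by_node: dict node -> list of VIPs it hosts before the upgrade.
    boots_ok: callable deciding whether a node comes up on the new image.
    Returns (image, vips): the image each node ends on ("new" or "old")
    and the final VIP placement.
    """
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least one surviving peer")

    vips = {n: list(vips_by_node.get(n, [])) for n in nodes}
    image = {n: "old" for n in nodes}
    all_vips = sorted(v for hosted in vips.values() for v in hosted)

    for holder in nodes:  # the token passes from node to node
        peers = [n for n in nodes if n != holder]

        # Step 1: the token-holder migrates its VIPs to the peers.
        moved, vips[holder] = vips[holder], []
        for i, vip in enumerate(moved):
            vips[peers[i % len(peers)]].append(vip)

        # Step 2: while the holder is down, every VIP must live on a peer.
        reachable = sorted(v for p in peers for v in vips[p])
        if reachable != all_vips:
            raise RuntimeError(f"VIPs lost while {holder} was down")

        # Step 3: swap active/passive root; boot the old root on failure.
        image[holder] = "new" if boots_ok(holder) else "old"
        # Step 4: the token is released and the loop moves to the next node.
    return image, vips
```

Passing boots_ok=lambda n: n != "n2" makes node n2 fall back to its previous root while the other nodes finish on the new image, which is the rollback path the partition swap exists to provide.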
Analogy: A Cohesity rolling upgrade is like a crew of Roombas in a house full of guests. Only one returns to its dock at a time, and the rest keep cleaning (serving I/O) while it is away.
The 7.x UI introduces explicit pre-upgrade checks under the Upgrade tab and supports uploading a CRL when needed. Architects should always:
Open Platform > Cluster, confirm green health.
Run pre-upgrade checks; fix flagged issues (clock drift, certificate expiry, disk warnings).
Initiate upgrade from Platform > Admin > Upgrade Cluster.
Verify with iris_cli cluster get-version and spot-check backups + a restore.
Historical note: Cohesity 6.8.2_u1 migrated the underlying OS from CentOS 7.9 to RHEL 7.9 because of CentOS's June 30, 2024 EOL.
3.2 Adding and Removing Nodes
Expansion mirrors bootstrap: image, rack, and cable the new node, confirm it appears in iris_cli discover free-nodes, then add it to the cluster. Removal is a multi-stage drain: the node is marked for removal, SpanFS migrates chunk replicas off it to maintain the RF/EC protection level, and the node then leaves quorum. Architects must size for rebuild headroom: free capacity on the remaining nodes at least equal to the footprint of the node being removed or lost.
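The headroom rule is a capacity comparison that can be checked before a node is marked for removal. The sketch below is a simplified Python model under stated assumptions: the function name and the TiB figures are hypothetical, and it ignores RF/EC overhead, deduplication effects, and any fill-level watermarks the cluster enforces.

```python
def has_rebuild_headroom(nodes, leaving):
    """Check whether the remaining nodes can absorb the leaving node's data.

    nodes: dict name -> (used_tib, capacity_tib)
    leaving: name of the node being removed (or assumed failed)
    Simplified model: ignores RF/EC overhead and fill watermarks.
    """
    footprint = nodes[leaving][0]
    free_elsewhere = sum(
        cap - used for name, (used, cap) in nodes.items() if name != leaving
    )
    return free_elsewhere >= footprint
```

Four 100 TiB nodes at 60 TiB used each leave 120 TiB free on the three survivors, enough to absorb a 60 TiB footprint; three nodes at 85 TiB used leave only 30 TiB free, so the drain would run out of space.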
3.3 Disk and Node Replacement
Disk replacement: a Helios alert or iris_cli cluster status flags the failed disk (and its LED); a field engineer hot-swaps the drive; the cluster automatically formats the replacement and rebuilds chunk replicas.
Node replacement: drain the failed node, physically swap it, and re-add the replacement via the UI or iris_cli; SpanFS then rebuilds replicas as fast as the network allows.
Cohesity Hardware Refresh Service handles 5-7 year tech-refresh cycles via overlapping clusters during data migration.
3.4 Health Checks and Heartbeat
The daily Heartbeat log bundle uploads to Cohesity Support over HTTPS, including cluster configuration, service health, recent alerts, and capacity metrics — enough for proactive triage without requesting fresh logs. Helios surfaces alerts globally:
| Severity | Examples | Action |
| --- | --- | --- |
| Critical | Quorum loss, full capacity, multiple disk failures | Page on-call, engage support |
| Major | Single disk failure, replication lag, license expiry | Same-day investigation |
| Warning | Job failures, certificate expiry < 30 days | Plan resolution |
| Info | Successful upgrade, scheduled maintenance | Acknowledge |
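The severity-to-action mapping above lends itself to a small triage routine in a runbook or ticketing hook. This Python sketch encodes the four levels and their actions exactly as listed; the function name and the (severity, message) tuple shape are assumptions, not a Helios data format.

```python
# Most urgent first, matching the severity levels listed above.
SEVERITY_ORDER = ["Critical", "Major", "Warning", "Info"]

ACTIONS = {
    "Critical": "Page on-call, engage support",
    "Major": "Same-day investigation",
    "Warning": "Plan resolution",
    "Info": "Acknowledge",
}


def triage(alerts):
    """Sort (severity, message) pairs most-urgent first and attach the action.

    Raises KeyError for a severity that is not one of the four known levels.
    """
    rank = {name: i for i, name in enumerate(SEVERITY_ORDER)}
    ordered = sorted(alerts, key=lambda alert: rank[alert[0]])
    return [(sev, msg, ACTIONS[sev]) for sev, msg in ordered]
```

Because Python's sort is stable, alerts of equal severity keep the order in which they arrived.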
Rolling Upgrade Sequence (Mermaid)
sequenceDiagram
participant Helios
participant Cluster as Cluster Coordinator
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
Helios->>Cluster: Initiate upgrade to target version
Cluster->>Cluster: Run pre-upgrade checks
Cluster->>N1: Hand upgrade token
N1->>N2: Migrate VIPs to peers
N1->>N1: Pause services, swap root partition, reboot
N1->>Cluster: Healthy on new image, release token
Cluster->>N2: Hand upgrade token
N2->>N3: Migrate VIPs to peers
N2->>N2: Pause, swap root partition, reboot
N2->>Cluster: Healthy, release token
Cluster->>N3: Hand upgrade token
N3->>N1: Migrate VIPs to peers
N3->>N3: Pause, swap root partition, reboot
N3->>Cluster: Healthy, release token
Cluster->>Helios: Upgrade complete, all nodes on new version
Key Points - Day-2 Operations
Rolling upgrade serializes via a distributed lock token; one node reboots at a time while VIPs migrate — backups, replication, and indexing keep running.
The active/passive root partition swap enables fast boot-level rollback if the new image fails.
Node removal is a multi-stage drain; rebuild headroom equal to one node's footprint is required to avoid degraded resiliency.
The daily Heartbeat bundle sends config/health/alerts/capacity to Cohesity Support over HTTPS for proactive triage.
Pre-upgrade checks must be run from the 7.x Upgrade tab before initiating; verify with iris_cli cluster get-version after.
Post-Quiz: Day-2 Operations
1. During a Cohesity rolling upgrade, how does the cluster keep client traffic flowing while a node is rebooting?
A. All clients enter a 30-minute pause
B. A distributed lock token serializes one node at a time and VIPs migrate to surviving peers
C. The cluster fails over to a passive standby cluster
D. A separate Helios cluster takes over the VIPs
2. Why does Cohesity use an active/passive root partition swap during upgrades?
A. To save storage capacity
B. To enable atomic, fast rollback at the boot level if the new image fails to come up
C. To support encryption at rest
D. Because SpanFS requires it
3. What is "rebuild headroom" and why does it matter for node removal?
A. Free RAM reserved for the SpanFS cache
B. Spare capacity equal to a node's data footprint, kept free so SpanFS can re-protect blocks after node loss without degraded resiliency
C. Network bandwidth reserved for replication
D. A Helios SaaS feature
4. The daily Cohesity Heartbeat log bundle is used for what?
A. Replication keepalive between nodes
B. Proactive triage by Cohesity Support: configuration, health, alerts, and capacity uploaded daily over HTTPS
C. Quorum heartbeat between cluster partitions
D. An audit log for compliance only
Section 4: Automation and APIs
Pre-Quiz: Automation
1. Which API surface is required to push a single protection policy across 50 registered clusters in one operation?
A. REST API v1, per-cluster
B. REST API v2, per-cluster
C. Helios API (SaaS, tenant-scoped)
D. iris_cli SSH'd to each node
2. A team wants reproducible Cohesity protection-policy state across dev/test/prod with a Git audit trail. Which tool fits best?
A. PowerShell module: ad-hoc cmdlets
B. Terraform provider: declarative state in Git
C. iris_cli shell scripts on each node
D. Helios UI clicks
3. New Cohesity automation should target which REST surface area first?
A. REST API v1 only
B. REST API v2, falling back to v1 only when an endpoint is missing in v2
C. Helios API only, even for cluster-local operations
D. A custom JSON-RPC endpoint
4.1 iris_cli Command Groups
iris_cli is the on-cluster shell. Major groups:
| Group | Purpose | Example |
| --- | --- | --- |
| cluster | Bootstrap, status, version | iris_cli cluster get-version |
| node | Per-node ops | iris_cli node list disk extend |
| disk | Inventory, extend | iris_cli disk list |
| interface | Network config | iris_cli interface list |
| vlan | VLAN/VIP mgmt | iris_cli vlan list |
| partition | Cluster partitions | iris_cli partition list |
| protection-job | Backup jobs | iris_cli protection-job list |
| recovery | Restores | iris_cli recovery list |
4.2 REST API v1, v2, Helios
REST API v1: cluster-local, legacy resources (jobs, runs, sources). Still in heavy use; some endpoints have no v2 equivalent.
REST API v2: cluster-local, redesigned around object-oriented resources (protection groups, recoveries, sources). Preferred for new automation.
Helios API: SaaS-side, tenant-scoped, aggregated across all registered clusters. Required for fleet operations and Helios-only features (DataHawk, FortKnox, SiteContinuity).
Authentication: cluster APIs accept username/password or API key against the cluster; Helios uses a per-tenant API key with a clusterId filter for fan-out.
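The selection rule implied by the list above (Helios for fleet or Helios-only work, v2 by default, v1 as fallback) can be written down as a small decision function. The Python below is only a sketch of that rule: pick_api and the operation labels are hypothetical, and the set of operations available in v2 is something the caller would have to confirm against the API reference for their release.

```python
def pick_api(operation, cluster_count, v2_operations, helios_only=()):
    """Choose "helios", "v2", or "v1" for an automation task.

    operation: caller-defined label such as "create-policy" (hypothetical).
    cluster_count: how many clusters one call must reach.
    v2_operations: operations the caller has confirmed exist in REST v2.
    helios_only: operations tied to Helios-only features.
    """
    # Fleet-wide fan-out and Helios-only features need the SaaS-side API.
    if cluster_count > 1 or operation in helios_only:
        return "helios"
    # Prefer v2 for cluster-local work; fall back to v1 only when needed.
    if operation in v2_operations:
        return "v2"
    return "v1"
```

Pushing one policy to 50 clusters resolves to "helios", the same policy on a single cluster resolves to "v2", and an operation missing from the v2 set falls back to "v1".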
4.3 PowerShell, Ansible, Terraform
The PowerShell module (e.g., Get-CohesityProtectionJob) suits Windows-centric teams. The Ansible collection fits Linux-heavy IaC shops.
The Terraform provider treats Cohesity resources as declarative state — the right tool when backups must be reproducible across environments and when an audit trail of "who changed what when" is required:
resource "cohesity_protection_policy" "gold" {
  name = "Gold"

  retention {
    unit     = "Days"
    duration = 30
  }

  incremental_schedule {
    unit      = "Hours"
    frequency = 4
  }

  full_schedule {
    unit = "Weeks"
    day  = "Sunday"
  }
}
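What "declarative state" buys in practice is a diff between what the code says and what the cluster has. The Python below is a conceptual sketch of that comparison using the Gold policy values from the resource above; it is not how the Terraform provider is implemented, and the plan function and the dictionary layout are assumptions chosen for illustration.

```python
def plan(desired, actual, path=""):
    """Return a list of (path, actual_value, desired_value) differences.

    Nested dicts are compared key by key; a key missing on one side shows up
    with None for that side.
    """
    changes = []
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}.{key}" if path else key
        want, have = desired.get(key), actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            changes.extend(plan(want, have, here))
        elif want != have:
            changes.append((here, have, want))
    return changes


# Desired state, mirroring the Gold policy declared above.
gold = {
    "name": "Gold",
    "retention": {"unit": "Days", "duration": 30},
    "incremental_schedule": {"unit": "Hours", "frequency": 4},
    "full_schedule": {"unit": "Weeks", "day": "Sunday"},
}
```

If someone shortens retention to 14 days in the UI, plan(gold, live) reports a single change at retention.duration, and the Git history of the declared state shows who set it to 30 and when.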
Analogy: Pick the API like you pick a kitchen tool. iris_cli is the chef's knife — sharp, fast, on-cluster. REST is the food processor — bulk operations from outside. PowerShell is the rice cooker for Windows shops. Ansible and Terraform are the meal-prep system for the whole week.
Key Points - Automation
Helios API is the only path to cross-cluster fleet operations and to Helios-only features (DataHawk, FortKnox, SiteContinuity orchestration).
Choose REST v2 for new code, v1 only when v2 lacks the endpoint.
Terraform = declarative state with Git audit trail; Ansible = playbook orchestration; PowerShell = Windows-team idiom; iris_cli = on-cluster shell.
The CLI Reference Guide is published per release; iris_cli command groups include cluster, node, disk, interface, vlan, partition, protection-job, recovery.
Match the tool to the team's existing operating model rather than imposing a new one.
Post-Quiz: Automation
1. Which API surface is required to push a single protection policy across 50 registered clusters in one operation?
A. REST API v1, per-cluster
B. REST API v2, per-cluster
C. Helios API (SaaS, tenant-scoped)
D. iris_cli SSH'd to each node
2. A team wants reproducible Cohesity protection-policy state across dev/test/prod with a Git audit trail. Which tool fits best?
A. PowerShell module: ad-hoc cmdlets
B. Terraform provider: declarative state in Git
C. iris_cli shell scripts on each node
D. Helios UI clicks
3. New Cohesity automation should target which REST surface area first?
A. REST API v1 only
B. REST API v2, falling back to v1 only when an endpoint is missing in v2
C. Helios API only, even for cluster-local operations
D. A custom JSON-RPC endpoint