Chapter 5: Cluster Deployment, Bootstrap, and Day-2 Operations

Learning Objectives

A Cohesity cluster spends roughly 30 minutes being born and the rest of its life in day-2 operations. The CCAE exam reflects that ratio: it expects you to bootstrap a cluster cleanly the first time, then operate it for years without unplanned downtime.

Section 1: Bootstrapping a New Cluster

Pre-Quiz: Bootstrapping

1. Before any IPs are assigned, how does the architect uniquely identify a specific imaged Cohesity node when running iris_cli cluster create?

   A. By the MAC address of bond0
   B. By the node-ID printed on the IPMI console / boot screen
   C. By a temporary DHCP-assigned hostname
   D. By the chassis serial number alone

2. When does the initial cluster partition get created during bootstrap?

   A. Manually, via iris_cli partition create before cluster create
   B. Automatically, as part of iris_cli cluster create
   C. Only after Helios registration completes
   D. When the first VIP is bound to bond0

3. An architect joins the cluster to AD but Kerberos ticket validation fails mysteriously. What is the most likely missing prerequisite?

   A. A Helios subscription
   B. An Eager Zeroed Thick VMDK
   C. Reverse DNS (PTR) records for the cluster IPs
   D. A Terraform state file

1.1 Out-of-the-Box Experience and IPMI

Each Cohesity-branded appliance, ReadyNode, or certified partner platform ships pre-imaged but unconfigured. The bootstrap entry point is IPMI (the out-of-band management controller). The architect connects to each node's IPMI IP, powers it on, and verifies the imaging splash screen. Each node carries a unique node-ID (e.g., 181140266786854) printed on the IPMI console.

Analogy: The node-ID is like a hospital wristband on a newborn. Before the cluster gives the node an IP "address" to live at, the node-ID is the only handle that uniquely identifies which physical box you mean.

1.2 IP Allocation

Console into the first node and run the network configuration script:

cd /home/cohesity/bin/network
./configure_network.sh

The script prompts for bond selection (bond0 is the default for node-to-node traffic), IP/prefix/gateway, MTU (1500 default; 9000 for jumbo frames), and DNS servers. Before pushing the configuration, iris_cli interface list and iris_cli node status can be used to enumerate the physical interfaces and bond memberships.
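
The values the script asks for can be sanity-checked before anyone types them at a console. The following is a minimal sketch, assuming the plan is held as plain strings; check_node_network is a hypothetical helper, not part of any Cohesity tooling, and the only facts it takes from the text are the two MTU values.

```python
import ipaddress

# 1500 is the default and 9000 is used for jumbo frames (from the text above).
ALLOWED_MTUS = {1500, 9000}

def check_node_network(ip_with_prefix, gateway, mtu, dns_servers):
    """Return a list of problems; an empty list means the plan looks sane."""
    problems = []
    iface = ipaddress.ip_interface(ip_with_prefix)
    gw = ipaddress.ip_address(gateway)

    if gw not in iface.network:
        problems.append(f"gateway {gw} is outside {iface.network}")
    if gw == iface.ip:
        problems.append("gateway and node IP are identical")
    if iface.ip in (iface.network.network_address, iface.network.broadcast_address):
        problems.append(f"{iface.ip} is the network or broadcast address")
    if mtu not in ALLOWED_MTUS:
        problems.append(f"MTU {mtu} is not one of {sorted(ALLOWED_MTUS)}")
    if not dns_servers:
        problems.append("no DNS servers supplied")
    for server in dns_servers:
        ipaddress.ip_address(server)  # raises ValueError on a malformed address
    return problems

print(check_node_network("10.1.4.16/20", "10.1.0.1", 9000, ["10.2.0.1"]))  # []
print(len(check_node_network("10.1.4.16/20", "10.9.0.1", 1400, [])))       # 3
```

Catching a gateway that sits outside the subnet at this stage is cheaper than discovering it after the node has dropped off the network.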

1.3 Discovery and cluster create

Once the first node has an IP, it discovers its peers on the same L2 domain:

iris_cli discover free-nodes

With the free-node list in hand, the cluster is created in a single atomic invocation:

iris_cli cluster create \
  domain-names=eng.cohesity.com \
  ntp-servers=pool.ntp.org \
  name="haswell2" \
  hostname=haswell2.eng.cohesity.com \
  subnet-gateway=10.1.0.1 \
  subnet-mask=255.255.240.0 \
  dns-server-ips=10.2.0.1 \
  node-ips=10.1.4.16,10.1.4.17,10.1.4.18 \
  node-ids=181140266786854,181140264583348,181140264822986 \
  vips=10.1.4.20,10.1.4.21,10.1.4.22 \
  enable-encryption=true

The command bundles every prerequisite into one atomic operation: domain, NTP, DNS, gateway, mask, node IPs, node IDs, VIPs, and encryption posture. Quorum forms once a majority of nodes acknowledge the create.
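
Because the create is atomic, a typo in any one argument fails the whole invocation, so it can help to lint the argument set first. The sketch below is illustrative only: check_cluster_create and quorum_size are hypothetical helpers, the parameter names simply mirror the command above, and the majority rule is the usual "more than half" reading of quorum.

```python
import ipaddress

def check_cluster_create(params):
    """Return a list of problems found in a cluster-create parameter set."""
    problems = []
    node_ips = params["node-ips"].split(",")
    node_ids = params["node-ids"].split(",")
    vips = params["vips"].split(",")

    # Each node IP must pair with exactly one node-ID.
    if len(node_ips) != len(node_ids):
        problems.append(f"{len(node_ips)} node IPs but {len(node_ids)} node IDs")
    if len(set(node_ids)) != len(node_ids):
        problems.append("duplicate node IDs")
    if len(set(node_ips + vips)) != len(node_ips) + len(vips):
        problems.append("an address is reused between node IPs and VIPs")

    # Derive the subnet from gateway + mask; strict=False tolerates host bits.
    subnet = ipaddress.ip_network(
        f"{params['subnet-gateway']}/{params['subnet-mask']}", strict=False
    )
    for address in node_ips + vips:
        if ipaddress.ip_address(address) not in subnet:
            problems.append(f"{address} is outside {subnet}")
    return problems

def quorum_size(node_count):
    """Smallest strict majority of node_count nodes."""
    return node_count // 2 + 1

params = {
    "subnet-gateway": "10.1.0.1",
    "subnet-mask": "255.255.240.0",
    "node-ips": "10.1.4.16,10.1.4.17,10.1.4.18",
    "node-ids": "181140266786854,181140264583348,181140264822986",
    "vips": "10.1.4.20,10.1.4.21,10.1.4.22",
}
print(check_cluster_create(params))  # []
print(quorum_size(3))                # 2
```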

Exam tip: Multicast is frequently disabled in enterprise networks. Cohesity supports a unicast-discovery path; when using it, the architect must supply matching counts of nodes, VIPs, and IPMI addresses, and confirm that ARP-based peer discovery works on the bootstrap subnet.

1.4 Post-Create Tasks

  1. Verify health with iris_cli cluster status.
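  2. DNS registration: A records for the cluster hostname that round-robin across all VIPs, plus matching PTR (reverse) records; Kerberos validation during the AD join depends on reverse DNS.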
  3. License application: capacity-based or subscription entitlement.
  4. Active Directory join: bind to the AD forest and map AD groups to roles (Admin, Operator, Viewer).
  5. SSO/SAML: optional federation with Okta, Azure AD, or Ping.
  6. Helios registration: required for global dashboards and SaaS-only features (DataHawk, FortKnox, SiteContinuity orchestration).
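
The DNS step in the list above is the one most often left half-done, so a quick forward/reverse consistency check is worth scripting. This is a rough sketch using only Python's socket module; check_forward_and_reverse is a hypothetical helper, and the expectation that every VIP's PTR points back at the cluster hostname is an assumption, not a documented Cohesity rule. The resolvers are injectable so the example runs without touching real DNS.

```python
import socket

def check_forward_and_reverse(hostname, vips, forward=None, reverse=None):
    """Compare A and PTR records for a cluster hostname and its VIPs."""
    # Default to the system resolver; tests can pass in fakes instead.
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    reverse = reverse or (lambda ip: socket.gethostbyaddr(ip)[0])

    problems = []
    try:
        resolved = set(forward(hostname))
    except OSError:
        return [f"{hostname} does not resolve"]

    for vip in vips:
        if vip not in resolved:
            problems.append(f"no A record maps {hostname} to {vip}")
        try:
            ptr_name = reverse(vip)
        except OSError:
            problems.append(f"no PTR record for {vip}")
            continue
        if ptr_name.rstrip(".").lower() != hostname.lower():
            problems.append(f"PTR for {vip} points at {ptr_name}, not {hostname}")
    return problems

# Fake zone data: the third VIP is missing its PTR record.
a_records = {"haswell2.eng.cohesity.com": ["10.1.4.20", "10.1.4.21", "10.1.4.22"]}
ptr_records = {"10.1.4.20": "haswell2.eng.cohesity.com",
               "10.1.4.21": "haswell2.eng.cohesity.com"}

def fake_reverse(ip):
    if ip not in ptr_records:
        raise socket.herror(1, "Unknown host")
    return ptr_records[ip]

dns_problems = check_forward_and_reverse(
    "haswell2.eng.cohesity.com",
    ["10.1.4.20", "10.1.4.21", "10.1.4.22"],
    forward=lambda name: a_records.get(name, []),
    reverse=fake_reverse,
)
print(dns_problems)  # ['no PTR record for 10.1.4.22']
```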

Animation: Bootstrap Workflow Timeline

Stages: IPMI (power on, node-IDs) -> Network (configure_network.sh) -> Discover (iris_cli discover free-nodes) -> Cluster Create (iris_cli cluster create + quorum) -> Partitions (auto-created at create time) -> AD/SSO (join, roles, SAML) -> Helios (SaaS register). Timeline: t=0 power-on; ~10 min quorum + initial partition; ~30 min cluster ready. Bootstrap is one-shot and atomic.

Bootstrap Workflow (Mermaid)

flowchart TD
    A[Power on imaged nodes via IPMI] --> B[Console to first node]
    B --> C["Run configure_network.sh<br/>set bond0 IP, mask, gateway, DNS, MTU"]
    C --> D[iris_cli discover free-nodes]
    D --> E{"Select nodes<br/>to enroll"}
    E --> F["iris_cli cluster create<br/>domain, NTP, DNS, node-ips, node-ids, VIPs"]
    F --> G["Quorum forms<br/>initial partition auto-created"]
    G --> H["iris_cli cluster status<br/>verify health"]
    H --> I[DNS A/PTR records for VIPs]
    I --> J[Apply license]
    J --> K["Active Directory join<br/>map AD groups to roles"]
    K --> L[Optional SAML/SSO federation]
    L --> M[Helios registration]

Key Points - Bootstrapping

Section 2: Virtual and Cloud Edition Deployment

Pre-Quiz: VE and CE Deployment

1. How does an architect add capacity to a single-node Cohesity Virtual Edition appliance?

   A. Add a second VE node to form a 2-node cluster
   B. Stop the cluster, extend the capacity-tier disk in the hypervisor, run iris_cli node list disk extend, restart
   C. Resize the boot disk in vSphere
   D. Migrate to Cloud Edition automatically via Helios

2. Why must Cohesity VE nodes have a vSphere DRS anti-affinity rule configured?

   A. To improve dedupe ratios across hosts
   B. To reduce VMware licensing cost
   C. So a single ESXi host failure does not take down multiple cluster nodes simultaneously
   D. To enable BGP route advertisement

3. A 3-node Azure Cloud Edition cluster on Standard_DS5_v2 requires how many vCPU cores in total, and what is the architect's first sizing prerequisite?

   A. 12 cores; ensure VPN to on-prem exists
   B. 24 cores; provision premium SSD only
   C. 48 cores; request a regional vCPU quota increase from Azure
   D. 96 cores; deploy via Terraform only

2.1 VE on VMware/Hyper-V

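Virtual Edition ships as an OVA (VMware) or VHDX (Hyper-V) and runs as single-node or multi-node clusters. Hard floors are typically 8+ vCPU and 64+ GB RAM, with a dedicated performance-tier disk on flash and a separate capacity-tier disk on HDD or capacity SSD. Critical constraints: VMDKs are provisioned Eager Zeroed Thick; multi-node VE clusters need a vSphere DRS anti-affinity rule so that a single ESXi host failure does not take down multiple cluster nodes at once; and a single-node VE grows by stopping the cluster, extending the capacity-tier disk in the hypervisor, running iris_cli node list disk extend, and restarting.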
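
The resource floors and the DRS anti-affinity requirement for VE nodes are both easy to verify from an inventory export before deployment. A minimal sketch follows, assuming the inventory is a list of dicts with hypothetical keys (name, vcpu, ram_gb, esxi_host); it does not call vSphere and only applies the 8 vCPU / 64 GB floors quoted above.

```python
from collections import defaultdict

MIN_VCPU = 8      # typical floor quoted above
MIN_RAM_GB = 64   # typical floor quoted above

def check_ve_nodes(nodes):
    """Flag undersized VE nodes and hosts carrying more than one node."""
    problems = []
    nodes_by_host = defaultdict(list)
    for node in nodes:
        if node["vcpu"] < MIN_VCPU:
            problems.append(f"{node['name']}: {node['vcpu']} vCPU is below the {MIN_VCPU} vCPU floor")
        if node["ram_gb"] < MIN_RAM_GB:
            problems.append(f"{node['name']}: {node['ram_gb']} GB RAM is below the {MIN_RAM_GB} GB floor")
        nodes_by_host[node["esxi_host"]].append(node["name"])

    # Anti-affinity: no ESXi host should run two nodes of the same cluster.
    for host, names in sorted(nodes_by_host.items()):
        if len(names) > 1:
            problems.append(f"{host} hosts {len(names)} cluster nodes ({', '.join(names)})")
    return problems

inventory = [
    {"name": "ve1", "vcpu": 8, "ram_gb": 64, "esxi_host": "esx-a"},
    {"name": "ve2", "vcpu": 8, "ram_gb": 64, "esxi_host": "esx-a"},
    {"name": "ve3", "vcpu": 8, "ram_gb": 32, "esxi_host": "esx-b"},
]
for problem in check_ve_nodes(inventory):
    print(problem)
```

Here ve3 is flagged for RAM and esx-a for carrying two nodes, which is exactly the failure domain the anti-affinity rule exists to prevent.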

2.2 Cloud Edition on AWS, Azure, GCP

Cloud Edition runs as a minimum 3-node production cluster. Single-node CE is supported only for lab use.
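
The three-node minimum turns directly into a cloud quota question. As a worked example, an Azure Standard_DS5_v2 VM has 16 vCPUs, so three nodes need 48 cores of regional quota. The sketch below does that arithmetic; quota_shortfall is a hypothetical helper, and the quota of 20 in the usage line is an illustrative number, not a statement about any particular subscription.

```python
# vCPUs per VM size; Standard_DS5_v2 has 16 vCPUs.
VCPUS_PER_VM = {"Standard_DS5_v2": 16}

def quota_shortfall(vm_size, node_count, regional_quota, already_used=0):
    """Return (cores needed, cores of quota still to be requested)."""
    needed = VCPUS_PER_VM[vm_size] * node_count
    available = regional_quota - already_used
    return needed, max(0, needed - available)

needed, shortfall = quota_shortfall("Standard_DS5_v2", node_count=3, regional_quota=20)
print(needed, shortfall)  # 48 28
```

A non-zero shortfall means the quota increase has to be requested before the deployment is attempted, which is why it is the first sizing prerequisite.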

2.3 Robo Edition

A thin VE variant tuned for remote/branch offices. Typically a single-node VE on existing branch hypervisor capacity, replicating upstream to a central CE or physical hub cluster. It trades clustered resiliency for a tiny footprint and manages risk by replicating critical backups to the hub within minutes.

Animation: Form Factor Selection -> Helios

Four form factors feed Helios Fleet Management: Physical / ReadyNode (1U/2U appliance, IPMI bootstrap); Virtual Edition (OVA / VHDX on VMware, Hyper-V, or AHV); Cloud Edition (AWS AMI / Azure Marketplace, 3-node minimum); Robo Edition (single-node VE variant for branch / edge). All four share the same iris_cli bootstrap mechanics and converge at Helios for fleet management.

Form Factor Comparison

Attribute          Physical           VE                         CE
Form factor        1U/2U appliance    OVA/VHDX                   Cloud Marketplace AMI/Image
Min nodes (prod)   3                  3 (single = lab/Robo)      3
Bootstrap entry    IPMI + console     Hypervisor console         Cloud console + SSH
Capacity expand    Add nodes / disks  Resize VMDK + disk extend  Add VMs / resize managed disks
External overflow  Local only         Local only                 S3 / Blob tier
Helios             Optional           Optional                   Strongly recommended

Form Factor Mermaid Diagram

flowchart LR
    subgraph Physical[Physical / ReadyNode]
        P1[1U or 2U appliance]
        P2[IPMI bootstrap]
        P3[NVMe perf + HDD/SSD cap]
    end
    subgraph VE[Virtual Edition]
        V1[OVA / VHDX]
        V2[VMware, Hyper-V, AHV, KVM]
        V3["Eager Zeroed Thick VMDKs<br/>DRS anti-affinity"]
    end
    subgraph CE[Cloud Edition]
        C1[AWS AMI / Azure Marketplace]
        C2[3-node minimum production]
        C3[S3 / Blob tiering]
    end
    subgraph Robo[Robo Edition]
        R1[Single-node VE variant]
        R2[Branch office hypervisor]
        R3[Replicates to hub cluster]
    end
    Physical -->|Primary on-prem fabric| Hub[Helios Fleet Management]
    VE -->|Lab, dev/test| Hub
    CE -->|Cloud DR + cloud-native| Hub
    Robo -->|Edge protection| Hub

Key Points - VE / CE Deployment

Section 3: Day-2 Operations

Pre-Quiz: Day-2 Operations

1. During a Cohesity rolling upgrade, how does the cluster keep client traffic flowing while a node is rebooting?

   A. All clients enter a 30-minute pause
   B. A distributed lock token serializes one node at a time and VIPs migrate to surviving peers
   C. The cluster fails over to a passive standby cluster
   D. A separate Helios cluster takes over the VIPs

2. Why does Cohesity use an active/passive root partition swap during upgrades?

   A. To save storage capacity
   B. To enable atomic, fast rollback at the boot level if the new image fails to come up
   C. To support encryption at rest
   D. Because SpanFS requires it

3. What is "rebuild headroom" and why does it matter for node removal?

   A. Free RAM reserved for the SpanFS cache
   B. Spare capacity equal to a node's data footprint, kept free so SpanFS can re-protect blocks after node loss without degraded resiliency
   C. Network bandwidth reserved for replication
   D. A Helios SaaS feature

4. The daily Cohesity Heartbeat log bundle is used for what?

   A. Replication keepalive between nodes
   B. Proactive triage by Cohesity Support: configuration, health, alerts, capacity uploaded daily over HTTPS
   C. Quorum heartbeat between cluster partitions
   D. An audit log for compliance only

3.1 Cluster Upgrades and Rolling Reboots

Cohesity ships a one-click rolling upgrade. The cluster pulls candidate releases automatically, so no manual download is needed. A distributed lock manager hands a token from node to node, and each token-holder runs the same steps:

  1. Token-holder pauses local services and migrates its VIPs to peers.
  2. Active client connections continue against surviving VIPs (UI, SMB, NFS, ongoing backups).
  3. The node atomically swaps active and passive root partitions and reboots into the new image.
  4. Once healthy, it releases the token to the next node.

The atomic root-partition swap also enables fast rollback at the boot level: if a node fails to come up on the new image, it boots back into the previous root.
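
The token-passing steps and the boot-level rollback can be illustrated with a small simulation. This is a toy model under stated assumptions, not Cohesity's implementation: rolling_upgrade is a hypothetical function, VIPs are redistributed round-robin and handed back after the reboot, and the run halts at the first node that fails to boot. None of those three details is specified in the text.

```python
def rolling_upgrade(nodes, vips, new_version, fails_to_boot=frozenset()):
    """Simulate a one-node-at-a-time upgrade.

    nodes: dict of node name -> running version
    vips:  dict of VIP name -> node currently hosting it
    """
    nodes, vips, log = dict(nodes), dict(vips), []
    for token_holder in list(nodes):
        peers = [n for n in nodes if n != token_holder]
        if not peers:
            raise ValueError("a rolling upgrade needs at least two nodes")

        # Step 1: the token-holder migrates its VIPs to peers (round-robin).
        moved = [vip for vip, owner in vips.items() if owner == token_holder]
        for i, vip in enumerate(moved):
            vips[vip] = peers[i % len(peers)]

        # Step 2: every VIP must now sit on a node that is still serving.
        if token_holder in vips.values():
            raise RuntimeError("a VIP was left on the rebooting node")

        # Step 3: swap root partitions and reboot; fall back if boot fails.
        previous = nodes[token_holder]
        booted = token_holder not in fails_to_boot
        if booted:
            nodes[token_holder] = new_version
            log.append(f"{token_holder}: {previous} -> {new_version}")
        else:
            log.append(f"{token_holder}: boot failed, back on {previous}")

        # Step 4: hand the VIPs back and release the token.
        for vip in moved:
            vips[vip] = token_holder
        if not booted:
            break  # assumption: stop the run after a rollback
    return nodes, vips, log

cluster = {"node1": "7.0", "node2": "7.0", "node3": "7.0"}
vip_map = {"V1": "node1", "V2": "node2", "V3": "node3"}
final_nodes, final_vips, events = rolling_upgrade(cluster, vip_map, "7.1")
print(final_nodes)
print(events)
```

Passing fails_to_boot={"node2"} shows the rollback path: node1 ends on the new version while node2 and node3 stay on the old one.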

Analogy: A Cohesity rolling upgrade is like a Roomba in a house full of guests. Only one node steps out of the rotation at a time, and the rest keep cleaning (serving I/O) while it's away.

The 7.x UI introduces explicit pre-upgrade checks under the Upgrade tab and supports uploading a CRL when needed. Architects should always:

  1. Open Platform > Cluster, confirm green health.
  2. Run pre-upgrade checks; fix flagged issues (clock drift, certificate expiry, disk warnings).
  3. Initiate upgrade from Platform > Admin > Upgrade Cluster.
  4. Verify with iris_cli cluster get-version and spot-check backups + a restore.
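
The checklist above amounts to a gate: the upgrade starts only when nothing is flagged. A rough sketch of such a gate follows. The health dictionary, its keys, and both thresholds are hypothetical; the 30-day certificate window simply reuses the warning threshold from the alert table later in this section, and real pre-upgrade checks are run by the cluster itself.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_DRIFT_SECONDS = 1.0  # hypothetical threshold
MIN_CERT_DAYS_LEFT = 30        # hypothetical; mirrors the "< 30 days" warning

def pre_upgrade_blockers(health, now):
    """Return the issues that should be fixed before starting an upgrade."""
    blockers = []
    if health["cluster_health"] != "green":
        blockers.append(f"cluster health is {health['cluster_health']}")
    if abs(health["clock_drift_seconds"]) > MAX_CLOCK_DRIFT_SECONDS:
        blockers.append(f"clock drift of {health['clock_drift_seconds']} s")
    days_left = (health["cert_expiry"] - now).days
    if days_left < MIN_CERT_DAYS_LEFT:
        blockers.append(f"certificate expires in {days_left} days")
    for warning in health["disk_warnings"]:
        blockers.append(f"disk warning: {warning}")
    return blockers

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
health = {
    "cluster_health": "green",
    "clock_drift_seconds": 0.2,
    "cert_expiry": now + timedelta(days=12),
    "disk_warnings": [],
}
print(pre_upgrade_blockers(health, now))  # ['certificate expires in 12 days']
```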

Historical note: Cohesity 6.8.2_u1 migrated the underlying OS from CentOS 7.9 to RHEL 7.9 because of CentOS's June 30, 2024 EOL.

3.2 Adding and Removing Nodes

Expansion is symmetric to bootstrap: image, rack, cable, present via iris_cli discover free-nodes, then append. Removal is a multi-stage drain: the node is marked for removal, SpanFS migrates chunk replicas off it to maintain RF/EC parity, and only then does the node leave quorum. Architects must size for rebuild headroom: free capacity at least equal to the data footprint of the departing (or failed) node, so that SpanFS can re-protect its blocks on the surviving nodes without degraded resiliency.
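
Rebuild headroom is a simple inequality: the free space on the remaining nodes must be at least the data held by the node that is leaving. The sketch below checks it for a hypothetical four-node cluster; can_absorb_node_loss is an illustrative helper that ignores per-node balancing, replication overhead, and any system reserve, so treat its answer as an upper bound on what is safe.

```python
def can_absorb_node_loss(nodes, leaving):
    """nodes maps name -> (used_tb, capacity_tb). Returns (ok, margin_tb)."""
    used_on_leaving = nodes[leaving][0]
    free_elsewhere = sum(
        capacity - used
        for name, (used, capacity) in nodes.items()
        if name != leaving
    )
    margin = free_elsewhere - used_on_leaving
    return margin >= 0, margin

# Four 100 TB nodes at 70% full: 3 x 30 TB free vs 70 TB to re-protect.
at_70 = {f"node{i}": (70, 100) for i in range(1, 5)}
print(can_absorb_node_loss(at_70, "node4"))  # (True, 20)

# The same cluster at 80% full: 3 x 20 TB free vs 80 TB to re-protect.
at_80 = {f"node{i}": (80, 100) for i in range(1, 5)}
print(can_absorb_node_loss(at_80, "node4"))  # (False, -20)
```

The second case is the one to avoid: the drain cannot complete, so the cluster would sit at reduced resiliency until capacity is added.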

3.3 Disk and Node Replacement

3.4 Health Checks and Heartbeat

The daily Heartbeat log bundle uploads to Cohesity Support over HTTPS, including cluster configuration, service health, recent alerts, and capacity metrics — enough for proactive triage without requesting fresh logs. Helios surfaces alerts globally:

Severity  Examples                                              Action
Critical  Quorum loss, full capacity, multiple disk failures    Page on-call, engage support
Major     Single disk failure, replication lag, license expiry  Same-day investigation
Warning   Job failures, certificate expiry < 30 days            Plan resolution
Info      Successful upgrade, scheduled maintenance             Acknowledge
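
The severity table maps naturally onto a small routing function for whatever tool consumes the alerts. This is a sketch only: the Severity enum and triage function are hypothetical names, and the action strings are copied from the table rather than from any Helios API.

```python
from enum import IntEnum

class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3

# Actions copied from the severity table above.
ACTIONS = {
    Severity.CRITICAL: "Page on-call, engage support",
    Severity.MAJOR: "Same-day investigation",
    Severity.WARNING: "Plan resolution",
    Severity.INFO: "Acknowledge",
}

def triage(alerts):
    """Order alerts most severe first and attach the matching action."""
    ordered = sorted(alerts, key=lambda alert: alert[0], reverse=True)
    return [(sev.name.title(), text, ACTIONS[sev]) for sev, text in ordered]

queue = triage([
    (Severity.WARNING, "Job failure"),
    (Severity.CRITICAL, "Quorum loss"),
    (Severity.MAJOR, "Single disk failure"),
])
for row in queue:
    print(row)
```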

Animation: Rolling Upgrade Token + VIP Migration

Initial frame: a 4-node cluster (Node 1 to Node 4, all healthy on v7.0) hosting VIPs V1 to V4, with the upgrade token ready to be handed out.

Rolling Upgrade Sequence (Mermaid)

sequenceDiagram
    participant Helios
    participant Cluster as Cluster Coordinator
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    Helios->>Cluster: Initiate upgrade to target version
    Cluster->>Cluster: Run pre-upgrade checks
    Cluster->>N1: Hand upgrade token
    N1->>N2: Migrate VIPs to peers
    N1->>N1: Pause services, swap root partition, reboot
    N1->>Cluster: Healthy on new image, release token
    Cluster->>N2: Hand upgrade token
    N2->>N3: Migrate VIPs to peers
    N2->>N2: Pause, swap root partition, reboot
    N2->>Cluster: Healthy, release token
    Cluster->>N3: Hand upgrade token
    N3->>N1: Migrate VIPs to peers
    N3->>N3: Pause, swap root partition, reboot
    N3->>Cluster: Healthy, release token
    Cluster->>Helios: Upgrade complete, all nodes on new version

Key Points - Day-2 Operations

Section 4: Automation and APIs

Pre-Quiz: Automation

1. Which API surface is required to push a single protection policy across 50 registered clusters in one operation?

   A. REST API v1, per-cluster
   B. REST API v2, per-cluster
   C. Helios API (SaaS, tenant-scoped)
   D. iris_cli SSH'd to each node

2. A team wants reproducible Cohesity protection-policy state across dev/test/prod with a Git audit trail. Which tool fits best?

   A. PowerShell module: ad-hoc cmdlets
   B. Terraform provider: declarative state in Git
   C. iris_cli shell scripts on each node
   D. Helios UI clicks

3. New Cohesity automation should target which REST surface area first?

   A. REST API v1 only
   B. REST API v2, falling back to v1 only when an endpoint is missing in v2
   C. Helios API only, even for cluster-local operations
   D. A custom JSON-RPC endpoint

4.1 iris_cli Command Groups

iris_cli is the on-cluster shell. Major groups:

Group           Purpose                     Example
cluster         Bootstrap, status, version  iris_cli cluster get-version
node            Per-node ops                iris_cli node list disk extend
disk            Inventory, extend           iris_cli disk list
interface       Network config              iris_cli interface list
vlan            VLAN/VIP mgmt               iris_cli vlan list
partition       Cluster partitions          iris_cli partition list
protection-job  Backup jobs                 iris_cli protection-job list
recovery        Restores                    iris_cli recovery list

4.2 REST API v1, v2, Helios

Authentication: cluster APIs accept username/password or API key against the cluster; Helios uses a per-tenant API key with a clusterId filter for fan-out.
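
Since the credential depends on the surface, automation usually has to decide which surface a call targets before it authenticates. The sketch below encodes one plausible rule of thumb: fleet-wide work goes to Helios, cluster-local work prefers v2 and falls back to v1 only when v2 lacks the operation. choose_surface and the operation names are hypothetical placeholders, not real endpoint names.

```python
def choose_surface(operation, cluster_count, v2_ops, v1_ops):
    """Return (surface, credential) for an operation across N clusters."""
    if cluster_count < 1:
        raise ValueError("cluster_count must be at least 1")
    if cluster_count > 1:
        # Fan-out across registered clusters is a Helios job.
        return "helios", "per-tenant Helios API key"
    if operation in v2_ops:
        return "v2", "cluster username/password or API key"
    if operation in v1_ops:
        # Fall back to v1 only when v2 does not expose the operation.
        return "v1", "cluster username/password or API key"
    raise LookupError(f"{operation} is not exposed by v2 or v1")

# Hypothetical operation catalogs, for illustration only.
V2_OPS = {"create-policy", "list-protection-groups"}
V1_OPS = {"create-policy", "legacy-report"}

print(choose_surface("create-policy", 50, V2_OPS, V1_OPS)[0])  # helios
print(choose_surface("create-policy", 1, V2_OPS, V1_OPS)[0])   # v2
print(choose_surface("legacy-report", 1, V2_OPS, V1_OPS)[0])   # v1
```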

4.3 PowerShell, Ansible, Terraform

The PowerShell module (e.g., Get-CohesityProtectionJob) suits Windows-centric teams. The Ansible collection fits Linux-heavy IaC shops:

- name: Create VMware protection group
  cohesity.dataprotect.cohesity_protection_group:
    cluster: "{{ cohesity_vip }}"
    username: "{{ cohesity_user }}"
    password: "{{ cohesity_pass }}"
    name: "tier1-vms-daily"
    environment: "VMware"
    policy: "Gold"
    sources:
      - vcenter01
    vm_tags:
      - "tier:1"
    state: present

The Terraform provider treats Cohesity resources as declarative state — the right tool when backups must be reproducible across environments and when an audit trail of "who changed what when" is required:

resource "cohesity_protection_policy" "gold" {
  name = "Gold"
  retention {
    unit     = "Days"
    duration = 30
  }
  incremental_schedule {
    unit      = "Hours"
    frequency = 4
  }
  full_schedule {
    unit = "Weeks"
    day  = "Sunday"
  }
}

Analogy: Pick the API like you pick a kitchen tool. iris_cli is the chef's knife: sharp, fast, on-cluster. REST is the food processor: bulk operations from outside. PowerShell is the rice cooker for Windows shops. Ansible and Terraform are the meal-prep system for the whole week.

Automation Stack Layered Above APIs (Mermaid)

graph TD
    subgraph Tools[Operator-Facing Automation]
        TF["Terraform Provider<br/>declarative state"]
        ANS["Ansible Collection<br/>playbook tasks"]
        PS["PowerShell Module<br/>Windows-centric"]
        IRIS["iris_cli<br/>on-cluster shell"]
    end
    subgraph APIs[REST Surface Areas]
        V1["REST API v1<br/>legacy resources"]
        V2["REST API v2<br/>object-oriented"]
        HEL["Helios API<br/>fleet-wide tenant scope"]
    end
    subgraph Targets[Cluster Targets]
        CL1[Cluster A]
        CL2[Cluster B]
        CL3[Cluster C]
    end
    TF --> V2
    TF --> HEL
    ANS --> V2
    ANS --> V1
    PS --> V2
    PS --> V1
    IRIS --> CL1
    V1 --> CL1
    V1 --> CL2
    V2 --> CL1
    V2 --> CL2
    V2 --> CL3
    HEL --> CL1
    HEL --> CL2
    HEL --> CL3

Key Points - Automation
