Study Guide: Chapter 5 - Cluster Deployment, Bootstrap, and Day-2 Operations

A Cohesity cluster spends roughly 30 minutes being born and the rest of its life in maintenance mode. The CCAE exam reflects that ratio: it expects you to bootstrap a cluster cleanly the first time, then operate it for years without unplanned downtime.

Section 1: Bootstrapping a New Cluster

Pre-Quiz: Bootstrapping

1. Before any IPs are assigned, how does the architect uniquely identify a specific imaged Cohesity node when running iris_cli cluster create?

By the MAC address of bond0
By the node-ID printed on the IPMI console / boot screen
By a temporary DHCP-assigned hostname
By the chassis serial number alone

2. When does the initial cluster partition get created during bootstrap?

Manually, via iris_cli partition create before cluster create
Automatically, as part of iris_cli cluster create
Only after Helios registration completes
When the first VIP is bound to bond0

3. An architect joins the cluster to AD but Kerberos ticket validation fails mysteriously. What is the most likely missing prerequisite?

A Helios subscription
An Eager Zeroed Thick VMDK
Reverse DNS (PTR) records for the cluster IPs
A Terraform state file

1.1 Out-of-the-Box Experience and IPMI

Each Cohesity-branded appliance, ReadyNode, or certified partner platform ships pre-imaged but unconfigured. The bootstrap entry point is IPMI (the out-of-band management controller). The architect connects to each node's IPMI IP, powers it on, and verifies the imaging splash screen. Each node carries a unique node-ID (e.g., 181140266786854) printed on the IPMI console.

Analogy: The node-ID is like a hospital wristband on a newborn. Before the cluster gives the node an IP "address" to live at, the node-ID is the only handle that uniquely identifies which physical box you mean.

1.2 IP Allocation

Console into the first node and run the network configuration script:

cd /home/cohesity/bin/network
./configure_network.sh

The script prompts for bond selection (bond0 is the default for node-to-node traffic), IP address, prefix, and gateway, MTU (1500 by default; 9000 for jumbo frames), and DNS servers. To inspect physical interfaces and bond memberships before pushing the configuration, use iris_cli interface list and iris_cli node status.

1.3 Discovery and `cluster create`

Once the first node has an IP, it discovers its peers on the same L2 domain:

iris_cli discover free-nodes

With the free-node list in hand, the cluster is created in a single atomic invocation:

iris_cli cluster create \
  domain-names=eng.cohesity.com \
  ntp-servers=pool.ntp.org \
  name="haswell2" \
  hostname=haswell2.eng.cohesity.com \
  subnet-gateway=10.1.0.1 \
  subnet-mask=255.255.240.0 \
  dns-server-ips=10.2.0.1 \
  node-ips=10.1.4.16,10.1.4.17,10.1.4.18 \
  node-ids=181140266786854,181140264583348,181140264822986 \
  vips=10.1.4.20,10.1.4.21,10.1.4.22 \
  enable-encryption=true

The command bundles every prerequisite into one atomic operation: domain, NTP, DNS, gateway, mask, node IPs, node IDs, VIPs, and encryption posture. Quorum forms once a majority of nodes acknowledge the create.
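
Because the create is all-or-nothing, it can help to sanity-check the parameter set before running it. The following is a minimal sketch using Python's standard ipaddress module and the values from the example above; the helper name check_bootstrap_plan and the specific checks it performs are assumptions for illustration, not part of any Cohesity tooling:

```python
import ipaddress

def check_bootstrap_plan(gateway, netmask, node_ips, node_ids, vips):
    """Return a list of problems found in a planned cluster-create parameter set."""
    problems = []
    # Derive the subnet from the gateway and mask (strict=False tolerates host bits).
    subnet = ipaddress.ip_network(f"{gateway}/{netmask}", strict=False)
    for label, addrs in (("node IP", node_ips), ("VIP", vips)):
        for addr in addrs:
            if ipaddress.ip_address(addr) not in subnet:
                problems.append(f"{label} {addr} is outside {subnet}")
    # Each node IP must pair with exactly one node-ID.
    if len(node_ips) != len(node_ids):
        problems.append("node-ips and node-ids differ in length")
    # No address may be reused across node IPs, VIPs, and the gateway.
    all_addrs = node_ips + vips + [gateway]
    if len(set(all_addrs)) != len(all_addrs):
        problems.append("duplicate address in plan")
    return problems

plan = check_bootstrap_plan(
    gateway="10.1.0.1",
    netmask="255.255.240.0",
    node_ips=["10.1.4.16", "10.1.4.17", "10.1.4.18"],
    node_ids=["181140266786854", "181140264583348", "181140264822986"],
    vips=["10.1.4.20", "10.1.4.21", "10.1.4.22"],
)
print(plan or "plan is consistent")
```

With the example values, the mask 255.255.240.0 yields 10.1.0.0/20, which contains the gateway, all three node IPs, and all three VIPs, so the list of problems comes back empty.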

Exam tip: Multicast is frequently disabled in enterprise networks. Cohesity supports a unicast-discovery path; in that case the architect must supply matching counts of node IPs, VIPs, and IPMI IPs, and ensure that ARP-based peer discovery works on the bootstrap subnet.

1.4 Post-Create Tasks

Verify health with iris_cli cluster status.
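DNS registration: A records for the cluster hostname that round-robin across all VIPs, plus matching PTR (reverse) records for the cluster IPs. A missing PTR record is a common cause of Kerberos validation failures after the AD join.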
License application: capacity-based or subscription entitlement.
Active Directory join: bind to the AD forest and map AD groups to roles (Admin, Operator, Viewer).
SSO/SAML: optional federation with Okta, Azure AD, or Ping.
Helios registration: required for global dashboards and SaaS-only features (DataHawk, FortKnox, SiteContinuity orchestration).
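
The DNS item is the one most often left half-done, so a forward/reverse consistency check is worth scripting. A minimal sketch, assuming the zone contents have already been exported into two plain dictionaries; the function name and the sample data are hypothetical, and no live DNS lookup is performed:

```python
def missing_reverse_records(a_records, ptr_records):
    """Return IPs that an A record points to but that lack a matching PTR record.

    a_records:   hostname -> list of IPs (round-robin A records)
    ptr_records: IP -> hostname
    """
    missing = []
    for hostname, ips in a_records.items():
        for ip in ips:
            if ptr_records.get(ip) != hostname:
                missing.append(ip)
    return missing

# Hypothetical zone data: one cluster name round-robining across three VIPs.
a_records = {"haswell2.eng.cohesity.com": ["10.1.4.20", "10.1.4.21", "10.1.4.22"]}
ptr_records = {
    "10.1.4.20": "haswell2.eng.cohesity.com",
    "10.1.4.21": "haswell2.eng.cohesity.com",
    # 10.1.4.22 has no PTR record in this sample
}
print(missing_reverse_records(a_records, ptr_records))
```

Here the check reports 10.1.4.22, the one VIP whose reverse record was never created.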

Bootstrap Workflow (Mermaid)

Post-Quiz: Bootstrapping

1. Before any IPs are assigned, how does the architect uniquely identify a specific imaged Cohesity node when running iris_cli cluster create?

By the MAC address of bond0
By the node-ID printed on the IPMI console / boot screen
By a temporary DHCP-assigned hostname
By the chassis serial number alone

2. When does the initial cluster partition get created during bootstrap?

Manually, via iris_cli partition create before cluster create
Automatically, as part of iris_cli cluster create
Only after Helios registration completes
When the first VIP is bound to bond0

3. An architect joins the cluster to AD but Kerberos ticket validation fails mysteriously. What is the most likely missing prerequisite?

A Helios subscription
An Eager Zeroed Thick VMDK
Reverse DNS (PTR) records for the cluster IPs
A Terraform state file

Section 2: Virtual and Cloud Edition Deployment

Pre-Quiz: VE and CE Deployment

1. How does an architect add capacity to a single-node Cohesity Virtual Edition appliance?

Add a second VE node to form a 2-node cluster
Stop the cluster, extend the capacity-tier disk in the hypervisor, run iris_cli node list disk extend, restart
Resize the boot disk in vSphere
Migrate to Cloud Edition automatically via Helios

2. Why must Cohesity VE nodes have a vSphere DRS anti-affinity rule configured?

To improve dedupe ratios across hosts
To reduce VMware licensing cost
So a single ESXi host failure does not take down multiple cluster nodes simultaneously
To enable BGP route advertisement

3. A 3-node Azure Cloud Edition cluster on Standard_DS5_v2 requires how many vCPU cores total — and what is the architect's first sizing prerequisite?

12 cores; ensure VPN to on-prem exists
24 cores; provision premium SSD only
48 cores; request a regional vCPU quota increase from Azure
96 cores; deploy via Terraform only

2.1 VE on VMware/Hyper-V

Virtual Edition ships as an OVA (VMware) or VHDX (Hyper-V) and runs as single-node or multi-node clusters. Hard floors are typically 8+ vCPU and 64+ GB RAM, with dedicated performance-tier disk on flash and a separate capacity-tier disk on HDD or capacity SSD. Critical constraints:

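Single-node VE expands by disk resize, not node addition. Stop the cluster, extend the capacity-tier disk in the hypervisor, extend the disk inside the node, and restart:
```
iris_cli cluster stop            # stop cluster services first
# extend the capacity-tier disk in the hypervisor before continuing
iris_cli node list disk extend   # grow the capacity tier onto the new space
iris_cli disk list               # confirm the disk shows the new size
iris_cli cluster start
```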
Multi-node VE (3+ nodes) uses the same iris_cli discover free-nodes + cluster create flow as physical.
DRS anti-affinity must be configured so VE nodes never co-reside on the same ESXi host.
VMDK provisioning must be Eager Zeroed Thick on the performance tier — thin-provisioned performance disks introduce write-stall behavior under load.
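
The anti-affinity constraint is easy to audit from a VM-to-host inventory. A minimal sketch, assuming the placement has been exported as a simple mapping; the VM and host names are hypothetical, and the function does not query vCenter:

```python
from collections import defaultdict

def anti_affinity_violations(placement):
    """Return {host: [vms]} for every ESXi host running more than one VE node."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}

# Hypothetical VM-to-host placement for a 3-node VE cluster.
placement = {"ve-node-1": "esxi-a", "ve-node-2": "esxi-b", "ve-node-3": "esxi-b"}
print(anti_affinity_violations(placement))
```

In this sample, esxi-b hosts two VE nodes, so losing that one host would take two of three cluster nodes down together, which is exactly what the DRS rule exists to prevent.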

2.2 Cloud Edition on AWS, Azure, GCP

Cloud Edition runs as a minimum 3-node production cluster. Single-node CE is supported only for lab use.

Azure: Marketplace image with Standard_DS5_v2 (16 vCPU per node). 3 nodes = 48 vCPU; new subscriptions often default to a 10-core regional quota, so request quota increases first. Resource-group caps allow up to 64 VMs per RG (effective ceiling on cluster size). Azure Managed Disks must be sized in multiples of 1 MB.
AWS: DataPlatform Cloud Edition AMI from AWS Marketplace, m5/m6i large-instance family. CE on AWS supports S3 tiering for overflow capacity.
Networking: dedicated subnet, security-group inter-node rules, VPN/Direct Connect/ExpressRoute back on-prem, IAM roles for S3/Blob (CloudArchive, CloudTier).
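
The Azure quota arithmetic above can be written down as a small helper. A minimal sketch using only the figures quoted in the text (3 nodes, 16 vCPU each, a 10-core default quota); the function name and the already_used parameter are assumptions added for illustration:

```python
def vcpu_quota_gap(nodes, vcpu_per_node, regional_quota, already_used=0):
    """Return how many extra vCPUs must be requested (0 if the quota suffices)."""
    required = nodes * vcpu_per_node
    available = regional_quota - already_used
    return max(0, required - available)

# Figures from the text: 3 x Standard_DS5_v2 (16 vCPU each), 10-core default quota.
gap = vcpu_quota_gap(nodes=3, vcpu_per_node=16, regional_quota=10)
print(gap)  # 38
```

A 3-node cluster needs 48 vCPUs, so a subscription still on the 10-core default is 38 cores short before the first node can be deployed.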

2.3 Robo Edition

A thin VE variant tuned for remote/branch offices, typically deployed as a single-node VE on existing branch hypervisor capacity and replicating upstream to a central CE or physical hub. It trades clustered resiliency for a tiny footprint and manages the resulting risk by replicating critical backups upstream within minutes.

Form Factor Comparison

Form Factor Mermaid Diagram

Attribute	Physical	VE	CE
Form factor	1U/2U appliance	OVA/VHDX	Cloud Marketplace AMI/Image
Min nodes (prod)	3	3 (single = lab/Robo)	3
Bootstrap entry	IPMI + console	Hypervisor console	Cloud console + SSH
Capacity expand	Add nodes / disks	Resize VMDK + `disk extend`	Add VMs / resize managed disks
External overflow	Local only	Local only	S3 / Blob tier
Helios	Optional	Optional	Strongly recommended

Post-Quiz: VE and CE Deployment

1. How does an architect add capacity to a single-node Cohesity Virtual Edition appliance?

2. Why must Cohesity VE nodes have a vSphere DRS anti-affinity rule configured?

To improve dedupe ratios across hosts
To reduce VMware licensing cost
So a single ESXi host failure does not take down multiple cluster nodes simultaneously
To enable BGP route advertisement

3. A 3-node Azure Cloud Edition cluster on Standard_DS5_v2 requires how many vCPU cores total — and what is the architect's first sizing prerequisite?

12 cores; ensure VPN to on-prem exists
24 cores; provision premium SSD only
48 cores; request a regional vCPU quota increase from Azure
96 cores; deploy via Terraform only

Section 3: Day-2 Operations

Pre-Quiz: Day-2 Operations

1. During a Cohesity rolling upgrade, how does the cluster keep client traffic flowing while a node is rebooting?

All clients enter a 30-minute pause
A distributed lock token serializes one node at a time and VIPs migrate to surviving peers
The cluster fails over to a passive standby cluster
A separate Helios cluster takes over the VIPs

2. Why does Cohesity use an active/passive root partition swap during upgrades?

To save storage capacity
To enable atomic, fast rollback at the boot level if the new image fails to come up
To support encryption at rest
Because SpanFS requires it

3. What is "rebuild headroom" and why does it matter for node removal?

Free RAM reserved for the SpanFS cache
Spare capacity equal to a node's data footprint, kept free so SpanFS can re-protect blocks after node loss without degraded resiliency
Network bandwidth reserved for replication
A Helios SaaS feature

4. The daily Cohesity Heartbeat log bundle is used for what?

Replication keepalive between nodes
Proactive triage by Cohesity Support: configuration, health, alerts, capacity uploaded daily over HTTPS
Quorum heartbeat between cluster partitions
An audit log for compliance only

3.1 Cluster Upgrades and Rolling Reboots

Cohesity ships a one-click rolling upgrade. The cluster pulls candidate releases automatically, so no manual download is needed. A distributed lock manager hands a token from node to node, and each token holder goes through the same sequence:

Token-holder pauses local services and migrates its VIPs to peers.
Active client connections continue against surviving VIPs (UI, SMB, NFS, ongoing backups).
The node atomically swaps active and passive root partitions and reboots into the new image.
Once healthy, it releases the token to the next node.

The atomic root-partition swap also enables fast rollback at the boot level: if a node fails to come up on the new image, it boots back into the previous root.
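
The token-passing sequence and the boot-level rollback can be modelled in a few lines. This is a toy simulation under stated assumptions, not Cohesity's implementation: VIPs are spread round-robin over peers and handed back after the reboot, and the rollout halts at the first node that rolls back. Both policies, and all names used, are assumptions for illustration.

```python
def rolling_upgrade(nodes, vips, boots_ok):
    """Simulate a token-passing upgrade; return a log of (node, image, moved_vips).

    nodes:    ordered node names
    vips:     dict VIP -> node currently hosting it
    boots_ok: dict node -> False to simulate a failed boot on the new image
    """
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least one surviving peer")
    vips = dict(vips)                        # work on a copy
    log = []
    for holder in nodes:                     # the token visits one node at a time
        peers = [n for n in nodes if n != holder]
        # Migrate the holder's VIPs onto surviving peers so clients keep working.
        moved = sorted(v for v, n in vips.items() if n == holder)
        for i, vip in enumerate(moved):
            vips[vip] = peers[i % len(peers)]
        # Swap root partitions and reboot; a failed boot falls back to the old root.
        upgraded = boots_ok.get(holder, True)
        log.append((holder, "new" if upgraded else "old", moved))
        # Hand the VIPs back before releasing the token (assumed policy).
        for vip in moved:
            vips[vip] = holder
        if not upgraded:
            break                            # assumed policy: stop on rollback
    return log

nodes = ["node1", "node2", "node3"]
vips = {"10.1.4.20": "node1", "10.1.4.21": "node2", "10.1.4.22": "node3"}
for step in rolling_upgrade(nodes, vips, boots_ok={"node2": False}):
    print(step)
```

In this run node1 upgrades, node2 fails to boot the new image and returns to its previous root, and the token never reaches node3. At no point is more than one node out of service.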

Analogy: A Cohesity rolling upgrade is like a Roomba in a house full of guests. Only one node steps out of the rotation at a time, and the rest keep cleaning (serving I/O) while it's away.

The 7.x UI introduces explicit pre-upgrade checks under the Upgrade tab and supports uploading a CRL when needed. Architects should always:

Open Platform > Cluster, confirm green health.
Run pre-upgrade checks; fix flagged issues (clock drift, certificate expiry, disk warnings).
Initiate upgrade from Platform > Admin > Upgrade Cluster.
Verify with iris_cli cluster get-version and spot-check backups + a restore.

Historical note: Cohesity 6.8.2_u1 migrated the underlying OS from CentOS 7.9 to RHEL 7.9 because CentOS 7 reached end of life on June 30, 2024.

3.2 Adding and Removing Nodes

Expansion is symmetric to bootstrap: image, rack, and cable the new node, surface it via iris_cli discover free-nodes, then append it to the cluster. Removal is a multi-stage drain: the node is marked for removal, SpanFS migrates chunk replicas off it to maintain RF/EC parity, and the node then leaves quorum. Architects must size for rebuild headroom: free capacity on the remaining nodes at least equal to the data footprint of the node that leaves or fails.
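
Rebuild headroom reduces to a simple comparison between the departing node's data and the free space left on the survivors. A minimal sketch with hypothetical capacities in TB; it deliberately ignores RF/EC overhead and any fill-level thresholds the cluster enforces, so treat it as a first-order check only:

```python
def rebuild_headroom(capacity_tb, used_tb, leaving):
    """Return (ok, spare_tb): whether survivors can absorb the leaving node's data."""
    survivors = [n for n in capacity_tb if n != leaving]
    free_on_survivors = sum(capacity_tb[n] - used_tb[n] for n in survivors)
    spare = free_on_survivors - used_tb[leaving]
    return spare >= 0, spare

# Hypothetical 4-node cluster, 48 TB usable per node.
capacity_tb = {"node1": 48, "node2": 48, "node3": 48, "node4": 48}
used_tb = {"node1": 30, "node2": 32, "node3": 28, "node4": 31}
print(rebuild_headroom(capacity_tb, used_tb, leaving="node4"))  # (True, 23)
```

The survivors have 18 + 16 + 20 = 54 TB free and node4 holds 31 TB, leaving 23 TB of spare capacity after the drain. If the same cluster were filled to around 40 TB per node, the check would fail and the removal would have to wait for capacity to be added or data to expire.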

3.3 Disk and Node Replacement

Disk replacement: a Helios alert or iris_cli cluster status reports the failed disk and its LED status; a field engineer hot-swaps the drive; the cluster auto-formats the replacement and rebuilds chunk replicas.
Node replacement: drain the failed node, physically swap, re-add via UI or iris_cli; SpanFS rebuilds replicas at network throughput.
Cohesity Hardware Refresh Service handles 5-7 year tech-refresh cycles via overlapping clusters during data migration.

3.4 Health Checks and Heartbeat

The daily Heartbeat log bundle uploads to Cohesity Support over HTTPS, including cluster configuration, service health, recent alerts, and capacity metrics — enough for proactive triage without requesting fresh logs. Helios surfaces alerts globally:

Severity	Examples	Action
Critical	Quorum loss, full capacity, multiple disk failures	Page on-call, engage support
Major	Single disk failure, replication lag, license expiry	Same-day investigation
Warning	Job failures, certificate expiry < 30 days	Plan resolution
Info	Successful upgrade, scheduled maintenance	Acknowledge
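
The severity table maps directly onto a triage routine. A minimal sketch that encodes the four rows above and sorts incoming alerts by urgency; the alert tuples are hypothetical sample data and do not reflect the actual Helios alert schema:

```python
# Severity -> action, taken from the table above.
ACTIONS = {
    "Critical": "Page on-call, engage support",
    "Major": "Same-day investigation",
    "Warning": "Plan resolution",
    "Info": "Acknowledge",
}
SEVERITY_ORDER = ["Critical", "Major", "Warning", "Info"]

def triage(alerts):
    """Sort (severity, message) alerts most-urgent first and attach the action."""
    ranked = sorted(alerts, key=lambda a: SEVERITY_ORDER.index(a[0]))
    return [(sev, msg, ACTIONS[sev]) for sev, msg in ranked]

# Hypothetical alerts as they might arrive, in no particular order.
alerts = [
    ("Warning", "certificate expires in 21 days"),
    ("Critical", "quorum loss"),
    ("Major", "single disk failure"),
]
for row in triage(alerts):
    print(row)
```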

Rolling Upgrade Sequence (Mermaid)

Post-Quiz: Day-2 Operations

1. During a Cohesity rolling upgrade, how does the cluster keep client traffic flowing while a node is rebooting?

2. Why does Cohesity use an active/passive root partition swap during upgrades?

To save storage capacity
To enable atomic, fast rollback at the boot level if the new image fails to come up
To support encryption at rest
Because SpanFS requires it

3. What is "rebuild headroom" and why does it matter for node removal?

4. The daily Cohesity Heartbeat log bundle is used for what?

Section 4: Automation and APIs

Pre-Quiz: Automation

1. Which API surface is required to push a single protection policy across 50 registered clusters in one operation?

REST API v1, per-cluster
REST API v2, per-cluster
Helios API (SaaS, tenant-scoped)
iris_cli SSH'd to each node

2. A team wants reproducible Cohesity protection-policy state across dev/test/prod with a Git audit trail. Which tool fits best?

PowerShell module: ad-hoc cmdlets
Terraform provider: declarative state in Git
iris_cli shell scripts on each node
Helios UI clicks

3. New Cohesity automation should target which REST surface area first?

REST API v1 only
REST API v2, falling back to v1 only when an endpoint is missing in v2
Helios API only, even for cluster-local operations
A custom JSON-RPC endpoint

4.1 `iris_cli` Command Groups

iris_cli is the on-cluster shell. Major groups:

Group	Purpose	Example
`cluster`	Bootstrap, status, version	`iris_cli cluster get-version`
`node`	Per-node ops	`iris_cli node list disk extend`
`disk`	Inventory, extend	`iris_cli disk list`
`interface`	Network config	`iris_cli interface list`
`vlan`	VLAN/VIP mgmt	`iris_cli vlan list`
`partition`	Cluster partitions	`iris_cli partition list`
`protection-job`	Backup jobs	`iris_cli protection-job list`
`recovery`	Restores	`iris_cli recovery list`

4.2 REST API v1, v2, Helios

REST API v1: cluster-local, legacy resources (jobs, runs, sources). Still in heavy use; some endpoints have no v2 equivalent.
REST API v2: cluster-local, redesigned around object-oriented resources (protection groups, recoveries, sources). Preferred for new automation.
Helios API: SaaS-side, tenant-scoped, aggregated across all registered clusters. Required for fleet operations and Helios-only features (DataHawk, FortKnox, SiteContinuity).
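
The "v2 first, v1 only as a fallback" rule for cluster-local automation can be captured in a small selector. A minimal sketch; the resource catalogs below are hypothetical placeholders, since actual endpoint coverage varies by release and should be taken from the API reference for the target version:

```python
def pick_api(resource, v2_resources, v1_resources):
    """Prefer REST v2; fall back to v1 only when v2 lacks the resource."""
    if resource in v2_resources:
        return "v2"
    if resource in v1_resources:
        return "v1"
    raise LookupError(f"{resource!r} is not exposed by either cluster API")

# Hypothetical resource catalogs for illustration only.
V2 = {"protection-groups", "recoveries", "sources"}
V1 = {"jobs", "runs", "sources", "legacy-report"}

print(pick_api("sources", V2, V1))        # v2
print(pick_api("legacy-report", V2, V1))  # v1
```

A resource present in both catalogs resolves to v2, so new code drifts toward the newer surface automatically as coverage grows.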

Authentication: cluster APIs accept username/password or API key against the cluster; Helios uses a per-tenant API key with a clusterId filter for fan-out.

4.3 PowerShell, Ansible, Terraform

The PowerShell module (e.g., Get-CohesityProtectionJob) suits Windows-centric teams. The Ansible collection fits Linux-heavy IaC shops:

- name: Create VMware protection group
  cohesity.dataprotect.cohesity_protection_group:
    cluster: "{{ cohesity_vip }}"
    username: "{{ cohesity_user }}"
    password: "{{ cohesity_pass }}"
    name: "tier1-vms-daily"
    environment: "VMware"
    policy: "Gold"
    sources:
      - vcenter01
    vm_tags:
      - "tier:1"
    state: present

The Terraform provider treats Cohesity resources as declarative state — the right tool when backups must be reproducible across environments and when an audit trail of "who changed what when" is required:

resource "cohesity_protection_policy" "gold" {
  name = "Gold"
  retention {
    unit     = "Days"
    duration = 30
  }
  incremental_schedule {
    unit      = "Hours"
    frequency = 4
  }
  full_schedule {
    unit = "Weeks"
    day  = "Sunday"
  }
}

Analogy: Pick the API like you pick a kitchen tool. iris_cli is the chef's knife — sharp, fast, on-cluster. REST is the food processor — bulk operations from outside. PowerShell is the rice cooker for Windows shops. Ansible and Terraform are the meal-prep system for the whole week.

Automation Stack Layered Above APIs (Mermaid)

Key Points - Automation

Helios API is the only path to cross-cluster fleet operations and to Helios-only features (DataHawk, FortKnox, SiteContinuity orchestration).
Choose REST v2 for new code, v1 only when v2 lacks the endpoint.
Terraform = declarative state with Git audit trail; Ansible = playbook orchestration; PowerShell = Windows-team idiom; iris_cli = on-cluster shell.
The CLI Reference Guide is published per release; iris_cli command groups include cluster, node, disk, interface, vlan, partition, protection-job, recovery.
Match the tool to the team's existing operating model rather than imposing a new one.

Post-Quiz: Automation

1. Which API surface is required to push a single protection policy across 50 registered clusters in one operation?

REST API v1, per-cluster
REST API v2, per-cluster
Helios API (SaaS, tenant-scoped)
iris_cli SSH'd to each node

2. A team wants reproducible Cohesity protection-policy state across dev/test/prod with a Git audit trail. Which tool fits best?

PowerShell module: ad-hoc cmdlets
Terraform provider: declarative state in Git
iris_cli shell scripts on each node
Helios UI clicks

3. New Cohesity automation should target which REST surface area first?

REST API v1 only
REST API v2, falling back to v1 only when an endpoint is missing in v2
Helios API only, even for cluster-local operations
A custom JSON-RPC endpoint
