Chapter 14: Performance, Monitoring, and Troubleshooting

Cohesity Certified Architect Expert (CCAE) — Interactive Study Guide

Learning Objectives

If the previous chapters explained how to design and operate a Cohesity estate when everything is healthy, this chapter is about what to do when it is not. The CCAE-level architect is expected to localize bottlenecks, generate the right diagnostic artifacts, integrate alerts into operational tooling, and engage Cohesity Support effectively when escalation is warranted. The discipline that ties all of this together is structured triage: walk the data path end to end, measure each segment, and find the lowest sustained throughput point.

Section 1: Performance Bottleneck Analysis

Pre-Quiz — Section 1

1. A Cohesity backup that previously ran at 190 MB/s is now running at 20 MB/s. Helios shows source read MB/s = 22, writer MB/s = 21, writer latency normal, NVRAM not saturated. Where is the bottleneck most likely to be?

A) Stage 4 NVRAM — destage backpressure
B) Stage 1 Source — the cluster is not being fed
C) Stage 5 Writer — high disk latency
D) Stage 3 Ingest — proxy concurrency exhausted

2. A Veeam SOBR backed by Cohesity drops from ~140 MB/s on a single repository to ~20-30 MB/s once SOBR is enabled. What is the architect's first remediation?

A) Add more Cohesity nodes
B) Switch SOBR placement policy from Data Locality to Performance
C) Increase NVRAM allocation
D) Disable inline deduplication

3. Why is establishing a network baseline with iperf the recommended first step before deeper triage?

A) iperf calibrates Cohesity's internal counters
B) iperf detects substrate issues (1 GbE masquerading as 10 GbE, wrong VLAN/uplink) in seconds
C) iperf is required by the SOBR Performance policy
D) iperf flushes the writer queues

The Five-Stage Data Path

A Cohesity backup is a multi-stage assembly line. Bytes are read from a source, pushed through the network, ingested by the cluster, journaled into NVRAM, and finally destaged onto SSD or HDD by the writer service. Each stage has a maximum sustainable throughput, and the slowest stage sets the throughput of the whole line.

[Source App/VM] -> [Network/Proxy] -> [Cluster Ingest] -> [NVRAM Journal] -> [Writer -> SSD/HDD]
     Stage 1            Stage 2            Stage 3              Stage 4            Stage 5
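
The assembly-line rule above can be sketched in a few lines of Python. The stage names follow the diagram; the throughput figures are made-up illustrative values, not measurements from any cluster.

```python
# Illustrative sustained throughput per stage in MB/s (hypothetical numbers).
stage_mbps = {
    "source": 220.0,
    "network": 1100.0,
    "ingest": 600.0,
    "nvram": 450.0,
    "writer": 180.0,
}


def find_bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """Return the stage with the lowest sustained throughput and its rate."""
    name = min(stages, key=stages.get)
    return name, stages[name]


limiter, rate = find_bottleneck(stage_mbps)
print(f"Pipeline rate is about {rate:.0f} MB/s, limited by the {limiter} stage")
```

Raising the throughput of any stage other than the limiter leaves the pipeline rate unchanged, which is why triage measures every segment before tuning one.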

[Animation: Bottleneck triage, a sequential check across the five-stage data path. Each stage is checked in turn (green pulse = healthy) and the bottleneck stage glows red: Source (VM / NAS / DB, read MB/s), Network (VLAN / proxy, iperf baseline), Ingest (cluster CPU, concurrency), NVRAM (journal, destage queue), Writer (SSD / HDD latency). In the example, the Writer stage is flagged, with latency and queue depth rising. The lowest sustained throughput point sets the pipeline rate.]

Symptom Patterns by Stage

Stage | Symptom Pattern | Telltale Metric
1. Source | Low source read MB/s; idle network and writers | Source read latency high; CBT/RCT slow; storage array saturated
2. Network | Source ready; cluster idle; low end-to-end MB/s | iperf below link speed; retransmits; wrong VLAN/uplink
3. Ingest | Network saturated but cluster CPU/proxy queues high | Proxy concurrency exhausted; NBD/HotAdd contention
4. NVRAM | Bursty throughput; periodic stalls; destage backpressure | NVRAM journal utilization high; destage queue growing
5. Writer | Network healthy; NVRAM filling; write latency rising | Writer latency, disk queue depth, SSD/HDD saturation

Bottleneck Triage Decision Tree

flowchart TD
    Start(["Slow backup symptom"]) --> NetCheck{"iperf >= link speed?"}
    NetCheck -->|No| Stage2["Stage 2: Network<br/>Fix VLAN / uplink / MTU"]
    NetCheck -->|Yes| SrcCheck{"Source MB/s and<br/>writer MB/s both low<br/>and balanced?"}
    SrcCheck -->|Yes| Stage1["Stage 1: Source<br/>Investigate array,<br/>hypervisor, agent"]
    SrcCheck -->|No| NvramCheck{"Writer latency high<br/>or NVRAM saturated?"}
    NvramCheck -->|Yes| Stage45["Stages 4-5: NVRAM / Writer<br/>Cluster-side limiter"]
    NvramCheck -->|No| ConcCheck{"Per-stream healthy<br/>but total throughput low?"}
    ConcCheck -->|Yes| Stage3["Stage 3: Ingest / Concurrency<br/>Add proxies, raise concurrency"]
    ConcCheck -->|No| Bundle["Generate Siren log bundle<br/>Engage Cohesity Support"]
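
The decision tree can also be read as an ordered series of checks. The following is a minimal sketch, not a Cohesity tool: the Observation fields and the numeric thresholds (90% of link speed, half the historical rate, 20% balance tolerance) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    iperf_gbps: float          # measured proxy-to-cluster throughput
    link_gbps: float           # nominal link speed
    source_mbps: float         # source read rate from job stats
    writer_mbps: float         # writer rate from job stats
    expected_mbps: float       # historical rate for this job
    writer_latency_high: bool
    nvram_saturated: bool
    per_stream_healthy: bool


def triage(o: Observation) -> str:
    """Walk the decision tree in order and return the suspected stage."""
    # Assumption: 90% of nominal link speed counts as a healthy baseline.
    if o.iperf_gbps < 0.9 * o.link_gbps:
        return "Stage 2: Network"
    # Assumption: "low" is under half the historical rate, "balanced" is within 20%.
    peak = max(o.source_mbps, o.writer_mbps)
    low = peak < 0.5 * o.expected_mbps
    balanced = abs(o.source_mbps - o.writer_mbps) <= 0.2 * max(peak, 1.0)
    if low and balanced:
        return "Stage 1: Source"
    if o.writer_latency_high or o.nvram_saturated:
        return "Stages 4-5: NVRAM / Writer"
    if o.per_stream_healthy and o.writer_mbps < 0.5 * o.expected_mbps:
        return "Stage 3: Ingest / Concurrency"
    return "Generate Siren log bundle and engage Cohesity Support"


# 9.3 Gb/s on 10 GbE, source 22 MB/s, writer 21 MB/s, history 190 MB/s.
print(triage(Observation(9.3, 10.0, 22, 21, 190, False, False, True)))
```

The order of the checks matters: the network baseline comes first because it is the cheapest test and invalidates every later reading if it fails.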

The SOBR Performance Trap (140 vs 20-30 MB/s)

When third-party tools like Veeam present Cohesity as a Scale-Out Backup Repository, throughput can collapse from ~140 MB/s on a single-node Cohesity repository to 20-30 MB/s once SOBR is enabled with the default placement policy. The remediation is to set SOBR placement to Performance rather than Data Locality, restoring parallel ingest across nodes.

SOBR Placement Policy | Behavior | Throughput
Data Locality (default) | Pin a backup chain to one extent | ~20-30 MB/s (single node serializes)
Performance | Stripe across SOBR extents in parallel | ~140 MB/s (nodes work in parallel)

Worked Example: Backup Running at 20 MB/s

  1. Network baseline. iperf3 from vSphere proxy to three node primary IPs: 9.3 Gb/s. Network healthy. Stage 2 ruled out.
  2. Job stats. Source read 22 MB/s, writer 21 MB/s, writer latency normal, NVRAM not saturated. Reads and writes balanced and low — cluster is not being fed.
  3. Source check. vCenter shows backing array queue depth 95%, read latency 40 ms.
  4. Conclusion. A SQL re-index is contending with the source array. Reschedule the protection group outside the re-index window. Do not open a Cohesity Support case — Cohesity is not the limiter.
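
Step 1 of the worked example can be scripted. This sketch assumes iperf3 is installed on the proxy and that an iperf3 server is reachable at the target address; the 90% health threshold is an illustrative choice, not a Cohesity-defined value. The `-c`, `-t`, and `-J` flags and the `end.sum_received` / `end.sum_sent` fields are part of iperf3's JSON output for TCP tests.

```python
import json
import subprocess


def run_iperf3(server: str, seconds: int = 10) -> dict:
    """Run an iperf3 client test and return the parsed JSON report.

    Assumes iperf3 is installed locally and a server is listening on `server`.
    """
    out = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)


def summarize(report: dict, link_gbps: float, healthy_fraction: float = 0.9) -> dict:
    """Extract received throughput and sender retransmits from a TCP report."""
    end = report["end"]
    gbps = end["sum_received"]["bits_per_second"] / 1e9
    retransmits = end["sum_sent"].get("retransmits", 0)
    return {
        "gbps": round(gbps, 2),
        "retransmits": retransmits,
        # healthy_fraction is an illustrative threshold (assumption)
        "healthy": gbps >= healthy_fraction * link_gbps,
    }


# Sample shaped like the worked example: about 9.3 Gb/s on a 10 GbE link.
sample = {"end": {"sum_sent": {"bits_per_second": 9.31e9, "retransmits": 3},
                  "sum_received": {"bits_per_second": 9.30e9}}}
print(summarize(sample, link_gbps=10.0))
```

A 1 GbE path masquerading as 10 GbE shows up immediately as a reading near 0.94 Gb/s with `healthy` set to False.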

Key Points — Performance Bottleneck Analysis

Key Takeaway: Bottlenecks are localized by walking the path and finding the lowest sustained throughput point. Establish iperf first, watch for SOBR placement traps, and let metric divergence — not assumptions — point to the limiter.

Section 2: Monitoring and Alerting

Pre-Quiz — Section 2

4. Which four channels does Helios fan alerts out to via notification rules?

A) Email, SNMP, syslog, webhook
B) SMS, MQTT, Kafka, syslog
C) Email, NetFlow, RADIUS, webhook
D) Slack, ICMP, SNMP, NTP

5. After updating SMTP configuration on a cluster, why is running POST /validate a critical step?

A) It applies the change to all clusters via Helios sync
B) It catches silent relay breakage (expired credentials, blocked port, cert chain issues) before an incident does
C) It's required to enable SMTPS on port 465
D) It triggers a notification rule reload

6. Which Helios artifact is the leading indicator that individual alerts can miss — chronic protection drift over weeks?

A) The audit log
B) The SLA report
C) The Heartbeat stream
D) The Bridge service log

Helios is Cohesity's primary alerting plane. It aggregates events from every registered cluster and fans them out via filterable notification rules to email, SNMP, syslog, and webhook channels. Alerts are categorized (cluster health, protection, replication, capacity, security, hardware) and tagged by severity (Critical, Warning, Info).

Severity | Examples | Recommended Routing
Critical | Node down, quorum risk, replication failure | Email + SNMP + PagerDuty webhook
Warning | Disk predictive failure, capacity > 80%, missed SLA | Email + Syslog (SIEM)
Info | Job completed, snapshot expired, config change | Syslog only (SIEM archive)
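
The tiered routing in the table can be captured as data, which makes it easy to review and to expand into one rule per channel. This is a sketch only: the channel labels and the rule dictionary layout are hypothetical and are not the Helios rule schema.

```python
# Tiered routing taken from the table; channel names are plain labels, not API values.
ROUTING = {
    "Critical": ["email", "snmp", "webhook"],
    "Warning": ["email", "syslog"],
    "Info": ["syslog"],
}


def channels_for(severity: str) -> list[str]:
    """Return the delivery channels for a severity, failing loudly on unknown values."""
    try:
        return ROUTING[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def rules_to_create(cluster_scope: str) -> list[dict]:
    """Expand the table into one rule per (severity, channel) pair."""
    return [
        {"name": f"{cluster_scope}-{sev.lower()}-{ch}", "severity": sev, "channel": ch}
        for sev, chans in ROUTING.items()
        for ch in chans
    ]


for rule in rules_to_create("prod"):
    print(rule["name"])
```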

Configuring Alert Notification Rules

  1. Navigate to Health > Notification in Helios/DataProtect.
  2. Click Create > New Alert Notification Rule.
  3. Set rule name and filters (category, severity, alert name, cluster scope).
  4. Choose delivery method: email, SNMP, syslog, or webhook.
  5. For email: specify To, Cc, Subject. Save.

Programmatically, the corresponding operation is createAlertNotificationRule. The same alert can fan out to multiple channels: create one rule per channel, or use different filters per channel for tiered routing.

[Animation: Helios alert fan-out, from cluster event to four channels. A cluster event (for example node down or replication failure) reaches the Helios aggregator, is matched against notification rules created with createAlertNotificationRule (category + severity filter), and fans out in parallel to Email / SMTP (PUT /v2/clusters/smtp), SNMP trap (NMS / Cohesity MIB), Syslog (Splunk / QRadar / Sentinel), and Webhook JSON (PagerDuty / ServiceNow / SOAR). The same alert reaches multiple channels via filterable rules.]

Email/SMTP Configuration

API | Purpose | Notes
PUT /v2/clusters/smtp | Update SMTP config | Server, port (465 SMTPS, 587 STARTTLS), credentials. Requires CLUSTER_MODIFY.
GET /v2/clusters/smtp | Retrieve SMTP config | Audit current settings
POST /validate | Test SMTP delivery | Always run after changes; catches silent relay breakage
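
As a rough illustration of the update call, the sketch below builds (but does not send) a PUT request to /v2/clusters/smtp using only the standard library. The JSON field names, the bearer-token header, and the host-relative base URL are assumptions made for the example; consult the API reference for the real request schema and authentication scheme.

```python
import json
import urllib.request


def build_smtp_update(cluster: str, token: str, server: str, port: int,
                      username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a PUT /v2/clusters/smtp request.

    Field names and the Authorization header are hypothetical placeholders.
    """
    body = {"server": server, "port": port, "username": username, "password": password}
    return urllib.request.Request(
        url=f"https://{cluster}/v2/clusters/smtp",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )


req = build_smtp_update("cluster.example.com", "TOKEN",
                        "smtp.example.com", 587, "alerts", "secret")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the payload easy to review and log (with the password redacted) before any change reaches the cluster.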

Validation Pattern (Memorize)

  1. Configure SMTP/SNMP/syslog/webhook targets.
  2. Create an alert notification rule with tight filters.
  3. Trigger or simulate a matching event.
  4. Confirm receipt at the email inbox / NMS / syslog / webhook endpoint.
  5. Run POST /validate on SMTP to catch silent relay breakage.

SLA Reports

Beyond alerts, Helios provides SLA reports showing protection compliance per protection group, source, and cluster — the percentage of protected objects that met RPO over a reporting window. These surface chronic drift no individual alert reveals: a group sliding from 99.5% to 96% over six weeks is missing windows even though every job "succeeded" eventually.
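
The drift described above is simple to detect once weekly compliance figures are exported. This sketch assumes a plain list of weekly percentages; the 2-point drop threshold and the sample readings are illustrative, not values defined by Helios.

```python
def compliance_pct(met_rpo: int, protected: int) -> float:
    """Percentage of protected objects that met RPO in one reporting window."""
    if protected <= 0:
        raise ValueError("protected must be positive")
    return 100.0 * met_rpo / protected


def drifting(weekly_pct: list[float], min_drop: float = 2.0) -> bool:
    """Flag a sustained decline: never rising week over week and down by
    at least min_drop percentage points from first to last reading."""
    if len(weekly_pct) < 2:
        return False
    non_increasing = all(b <= a for a, b in zip(weekly_pct, weekly_pct[1:]))
    return non_increasing and (weekly_pct[0] - weekly_pct[-1]) >= min_drop


# Hypothetical six weekly readings matching the 99.5% to 96% example.
weeks = [99.5, 99.1, 98.4, 97.8, 96.9, 96.0]
print(drifting(weeks))
```

No single week in this series would trip a per-job alert, yet the trend check flags it, which is the point of reviewing SLA reports alongside alerts.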

Key Points — Monitoring and Alerting

Key Takeaway: Helios fans alerts via filterable notification rules; configure SMTP with PUT /v2/clusters/smtp, always validate, tier severity routing, and use SLA reports for systemic drift.

Section 3: Logs and Diagnostics

Pre-Quiz — Section 3

7. Where do Cohesity log bundles generated by Siren land on the cluster before upload to Support?

A) /var/log/cohesity/bundles
B) /home/cohesity/data/timecapsules
C) /opt/cohesity/support
D) /etc/cohesity/heartbeat

8. Which Cohesity service is responsible for backup orchestration — the right service to scope when a job fails to schedule?

A) Bridge
B) Apollo
C) Magneto
D) Yoda

9. What is the named, supported Cohesity command-line interface for cluster management operations?

A) cohctl
B) iris_cli
C) spanctl
D) helios-cli

iris_cli: The Supported CLI

iris_cli -server <cluster-IP> -username=admin -password=<pwd>

For the exam, iris_cli is the CLI to name when a question asks about cluster-side actions. Common groups: cluster status, cluster nodes list, protection-runs list, protection-jobs list, plus stats queries useful during triage.

Service Logs — Pick the Right Service

Service | Responsibility | When to Inspect
iris | UI / control plane | UI errors, login failures, REST API issues
Bridge | I/O data path / SpanFS front-end | Read/write latency, NFS/SMB issues
Magneto | Backup orchestration | Job failures, scheduling, source registration
Apollo | GC, replication, indexing | GC stalls, replication lag, index issues
Stats | Metrics aggregation | Missing dashboards, metric gaps
Yoda | Search / index service | Search failures, indexing slowness
Gandalf | Configuration management | Cluster config issues
Nexus | Cluster networking control | Network path / route issues
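
One way to drill the table is to treat it as a lookup from symptom to service. The mapping below restates rows of the table; the symptom keywords themselves are informal labels invented for this sketch.

```python
# Symptom keywords (informal labels) mapped to the services named in the table.
SERVICE_FOR_SYMPTOM = {
    "login failure": "iris",
    "rest api error": "iris",
    "read/write latency": "Bridge",
    "nfs/smb issue": "Bridge",
    "job failure": "Magneto",
    "scheduling": "Magneto",
    "source registration": "Magneto",
    "gc stall": "Apollo",
    "replication lag": "Apollo",
    "metric gap": "Stats",
    "search failure": "Yoda",
    "cluster config": "Gandalf",
    "network route": "Nexus",
}


def services_to_scope(symptoms: list[str]) -> list[str]:
    """Return the de-duplicated services to inspect, in first-seen order."""
    seen: list[str] = []
    for symptom in symptoms:
        svc = SERVICE_FOR_SYMPTOM.get(symptom.strip().lower())
        if svc and svc not in seen:
            seen.append(svc)
    return seen


print(services_to_scope(["Scheduling", "Replication lag", "job failure"]))
```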

Generating a Log Bundle via Siren

Reach Siren at https://<cluster-VIP>/siren and click Cluster Support Bundle. The dialog scopes the bundle by four inputs: nodes, services, time range, and hardware logs.

Bundles land at /home/cohesity/data/timecapsules. Sizes range from a few MB up to 2-3 GB. Upload via uploadFilePackage API or the Support case URL.
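
Before uploading, it helps to confirm which bundle is the newest and how large it is. The sketch below lists files in the bundle directory named above, newest first; it assumes shell access to a node and read permission on that directory, and it makes no assumption about bundle file names.

```python
from pathlib import Path

BUNDLE_DIR = Path("/home/cohesity/data/timecapsules")  # path from the text above


def list_bundles(directory: Path = BUNDLE_DIR) -> list[tuple[str, float]]:
    """Return (file name, size in MB) for each file, newest first."""
    files = [p for p in directory.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, p.stat().st_size / 1e6) for p in files]


# Only runs where the directory exists, for example on a cluster node.
if BUNDLE_DIR.is_dir():
    for name, size_mb in list_bundles():
        print(f"{size_mb:10.1f} MB  {name}")
```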

[Animation: Heartbeat + Siren log bundle, from UI trigger to Support upload. The Siren UI (Cluster Support Bundle) triggers log collection from each service, including iris (UI / control: login, REST, sessions), Bridge (I/O path: SpanFS, NFS/SMB), Magneto (orchestration: backup jobs), and Apollo (GC / replication: indexing, replication), into the time capsule at /home/cohesity/data/timecapsules, scoped by nodes / services / time / hardware. The bundle is uploaded to Cohesity Support via the uploadFilePackage API, while Heartbeat sends continuous telemetry over HTTPS / 443. Heartbeat is continuous; Siren is on-demand. Both feed Cohesity Support (proactive and reactive).]

Heartbeat: The Continuous Diagnostic

Cohesity clusters emit a Heartbeat stream — a continuous, lightweight diagnostic feed reporting cluster health, version, configuration, and key metrics back to Cohesity. Heartbeat enables Proactive Support to spot brewing issues before they cause outages. Architects must keep Heartbeat egress (HTTPS/443) open or proactive support is blind.
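
A quick way to verify that outbound HTTPS is not blocked is a plain TCP connection test. This sketch checks reachability only; it does not validate TLS, proxies, or the Heartbeat service itself, and the destination host must be supplied by the reader because the source does not name the Heartbeat endpoint.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_reach("<heartbeat-endpoint>")` from a cluster node, with the placeholder replaced by the documented destination, gives a fast yes/no answer before deeper firewall analysis.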

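Practical Bundle Hygiene

Scope each bundle to the affected nodes, services, and time range, and pair it with precise UTC timestamps for the incident when engaging Support.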

Key Points — Logs and Diagnostics

Key Takeaway: Use Siren to generate scoped log bundles into /home/cohesity/data/timecapsules; pair every bundle with precise UTC timestamps when engaging Support. Heartbeat provides the always-on diagnostic backbone.

Section 4: Common Failure Modes

Pre-Quiz — Section 4

10. Replication lag is growing on a protection group: local backups succeed but replicated snapshots are 48 hours behind. Which architectural lever most directly addresses this?

A) Increase replication factor (RF) on the source cluster
B) Design bandwidth throttle windows that yield to backup ingest by day and run replication at full bandwidth overnight
C) Disable encryption on the replication link
D) Reduce snapshot retention to 1 day

11. Which architectural design lever most directly mitigates the risk of correlated node failures dropping the cluster below quorum?

A) Erasure coding 6:2
B) Fault domain awareness (chassis, rack, site)
C) Disabling auto-protection
D) Increasing NVRAM size

12. NTP drift warnings appear alongside intra-cluster latency spikes and node-up/node-down alerts. What failure mode should the architect suspect?

A) Disk SMART failure
B) Cluster network partition (Paxos quorum risk)
C) Garbage collection stall
D) SOBR placement policy issue

13. A protection group "succeeds" every night but the SLA report shows compliance has drifted from 99.5% to 96% over six weeks. What is the architect's interpretation?

A) Jobs are missing windows but eventually completing on retry — a systemic drift problem invisible to single alerts
B) Heartbeat is broken — SLA reports cannot generate
C) The retention policy is too short
D) DataLock is rejecting the snapshots

Failure Modes Taxonomy

graph TD
    Root["Common Cohesity Failure Modes"]
    Root --> Backup["Backup Job Failures"]
    Root --> Repl["Replication Lag"]
    Root --> Disk["Disk Failure"]
    Root --> Node["Node Failure"]
    Root --> Part["Network Partition"]
    Backup --> B1["Stale credentials / CBT reset / locked files"]
    Backup --> B2["Lever: policy design, retry rules, SLA reports"]
    Repl --> R1["WAN underprovisioned / throttle misalign / target saturated"]
    Repl --> R2["Lever: bandwidth windows, change-rate sizing"]
    Disk --> D1["SpanFS RF/EC rebuild / Heartbeat opens case"]
    Disk --> D2["Lever: schedule replacement, capacity headroom"]
    Node --> N1["Reduced capacity, perf / Quorum risk if multiple"]
    Node --> N2["Lever: fault domain awareness chassis/rack/site"]
    Part --> P1["Paxos quorum split / NTP drift / latency spikes"]
    Part --> P2["Lever: dual-homed nodes, redundant ToR, cluster VLAN"]

Backup Job Failures

Failure Pattern | Likely Cause | Action
First-run failures only | Stale credentials, recently-changed VM | Refresh creds; re-discover
Random failures across many sources | Network/proxy intermittency | Check proxy health, network path
Same source fails repeatedly | CBT, agent, locked file | Reset CBT; reinstall agent
Cluster-wide failure spike | Cluster service issue, upgrade | Check cluster health, change log

Replication Lag

Replication lag is the canonical "silent" DR failure: protection succeeds locally but replication cannot keep up. Causes include WAN bandwidth insufficient for the daily change rate, throttling windows misaligned with change rates, saturated target ingest, or compressed/encrypted replication competing for CPU on undersized clusters. Recovery is not instant: the lag shrinks only during windows in which replication ingest exceeds the change rate.
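
The catch-up condition is simple arithmetic: the backlog drains at the difference between replication throughput and change rate. The sketch below uses hypothetical figures and assumes both rates are constant over the window.

```python
def catch_up_hours(backlog_gb: float, repl_gb_per_hour: float,
                   change_gb_per_hour: float) -> float:
    """Hours needed to drain a replication backlog, or infinity if it never drains."""
    surplus = repl_gb_per_hour - change_gb_per_hour
    if surplus <= 0:
        return float("inf")
    return backlog_gb / surplus


# Hypothetical: 2,000 GB behind, link sustains 150 GB/h, changes arrive at 50 GB/h.
print(catch_up_hours(2000, 150, 50))  # 20.0
```

If the surplus is zero or negative, no throttle window of any length recovers the lag, and the design needs more WAN bandwidth or a lower change rate.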

Disk and Node Failures

Failure | Cluster Impact | Time to Recover
Single disk | Background rebuild; no outage | Hours
Single node (RF2) | Reduced redundancy; rebuild begins | Hours to days
Multiple nodes (within tolerance) | Performance degraded; rebuild contention | Days
Quorum loss | Cluster halts; data unavailable | Recovery operation

Cluster Network Partition

A cluster network partition is the most dangerous failure mode. SpanFS uses Paxos-based metadata with strict consistency, so the side without quorum cannot serve writes. Detection signals: Heartbeat alerts, node-up/node-down events, intra-cluster latency spikes, and NTP drift warnings (often the first symptom).

  1. Identify partition boundary using iris_cli cluster status from multiple nodes.
  2. Check physical network — switch, uplink, VLAN.
  3. If healed quickly, cluster auto-recovers; otherwise generate Siren bundle scoped to Bridge, Apollo, Gandalf and engage Support before any manual remediation.
  4. Document and review fault domain design.
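
The quorum reasoning behind steps 1 and 4 can be sketched as follows. This is a simplified model that assumes quorum means a strict majority of all cluster nodes; the actual quorum membership in a Cohesity cluster may differ, so treat it as a design aid rather than a description of the product.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Strict-majority rule (assumption): a side needs more than half of all nodes."""
    return reachable_nodes > cluster_size // 2


def partition_report(sides: dict[str, int]) -> dict[str, bool]:
    """For each side of a partition, report whether it retains quorum."""
    total = sum(sides.values())
    return {name: has_quorum(count, total) for name, count in sides.items()}


def survives_domain_loss(nodes_per_domain: dict[str, int]) -> bool:
    """True if losing any single fault domain still leaves a majority."""
    total = sum(nodes_per_domain.values())
    return all(has_quorum(total - n, total) for n in nodes_per_domain.values())


print(partition_report({"rack-a": 3, "rack-b": 1}))
print(survives_domain_loss({"rack-1": 2, "rack-2": 2, "rack-3": 2}))
```

An even split leaves neither side with quorum under this model, which is one reason fault domain layouts avoid placing half the nodes behind a single switch or rack.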

Key Points — Common Failure Modes

Key Takeaway: Architects design fault domain awareness, throttle windows, and network redundancy before failures, not after. Each failure class has a signature, action, and lever — recognize the pattern fast.

Chapter Summary & Three Drills

For the exam, internalize three drills:

  1. Bottleneck triage drill — given a slow-backup symptom, walk Source → Network → Ingest → NVRAM → Writer and name the metric that proves the verdict.
  2. Log bundle drill — name Siren, the /home/cohesity/data/timecapsules path, and the four scoping inputs (nodes, services, time range, hardware logs).
  3. Helios alerting drill — name the four channels (email, SNMP, syslog, webhook), the SMTP API PUT /v2/clusters/smtp, the validation step POST /validate, and the rule creator createAlertNotificationRule.
