Chapter 14: Performance, Monitoring, and Troubleshooting

Cohesity Certified Architect Expert (CCAE) — Interactive Study Guide

Learning Objectives

Diagnose performance bottlenecks across the source → network → ingest → NVRAM → writer data path and identify which segment is the limiter.
Use Cohesity statistics, iris_cli, the Siren UI, and Helios alerts to triage incidents methodically.
Generate scoped log bundles and engage Cohesity Support with the artifacts they need on the first round-trip.
Configure Helios alerting across email/SMTP, SNMP, syslog, and webhook channels — including SMTP validation.
Recognize and respond to common failure modes — backup re-runs, replication lag, disk and node failures, network partitions — and design resiliency for each.

If the previous chapters explained how to design and operate a Cohesity estate when everything is healthy, this chapter is about what to do when it is not. The CCAE-level architect is expected to localize bottlenecks, generate the right diagnostic artifacts, integrate alerts into operational tooling, and engage Cohesity Support effectively when escalation is warranted. The discipline that ties all of this together is structured triage: walk the data path end to end, measure each segment, and find the lowest sustained throughput point.

Section 1: Performance Bottleneck Analysis

Pre-Quiz — Section 1

1. A Cohesity backup that previously ran at 190 MB/s is now running at 20 MB/s. Helios shows source read MB/s = 22, writer MB/s = 21, writer latency normal, NVRAM not saturated. Where is the bottleneck most likely to be?

A) Stage 4 NVRAM — destage backpressure

B) Stage 1 Source — the cluster is not being fed

C) Stage 5 Writer — high disk latency

D) Stage 3 Ingest — proxy concurrency exhausted

2. A Veeam SOBR backed by Cohesity drops from ~140 MB/s on a single repository to ~20-30 MB/s once SOBR is enabled. What is the architect's first remediation?

A) Add more Cohesity nodes

B) Switch SOBR placement policy from Data Locality to Performance

C) Increase NVRAM allocation

D) Disable inline deduplication

3. Why is establishing a network baseline with iperf the recommended first step before deeper triage?

A) iperf calibrates Cohesity's internal counters

B) iperf detects substrate issues (1 GbE masquerading as 10 GbE, wrong VLAN/uplink) in seconds

C) iperf is required by the SOBR Performance policy

D) iperf flushes the writer queues

The Five-Stage Data Path

A Cohesity backup is a multi-stage assembly line. Bytes are read from a source, pushed through the network, ingested by the cluster, journaled into NVRAM, and finally destaged onto SSD or HDD by the writer service. Each stage has a maximum sustainable throughput, and the slowest stage sets the throughput of the whole line.

[Source App/VM] -> [Network/Proxy] -> [Cluster Ingest] -> [NVRAM Journal] -> [Writer -> SSD/HDD]
     Stage 1            Stage 2            Stage 3              Stage 4            Stage 5
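
The "slowest stage sets the pace" rule can be sketched in a few lines of Python. This is a minimal illustration, not Cohesity tooling: the function name and the per-stage capacity figures are hypothetical, chosen only to show that end-to-end throughput equals the minimum across stages.

```python
from typing import Dict, Tuple


def find_bottleneck(stage_mb_s: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the lowest sustained throughput and its value."""
    if not stage_mb_s:
        raise ValueError("at least one stage measurement is required")
    stage = min(stage_mb_s, key=stage_mb_s.get)
    return stage, stage_mb_s[stage]


# Hypothetical sustained capacities (MB/s) for each stage of the data path.
capacities = {
    "source": 400.0,
    "network": 1100.0,
    "ingest": 600.0,
    "nvram": 900.0,
    "writer": 350.0,
}

stage, rate = find_bottleneck(capacities)
print(f"Limiter: {stage} at {rate:.0f} MB/s")
```

With these made-up numbers the writer is the limiter, so raising network or ingest capacity would not speed the job up.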

Animation: Bottleneck Triage — Sequential Check Across the Five-Stage Data Path

Each stage is checked in turn (green pulse = healthy). The bottleneck stage glows red.

Symptom Patterns by Stage

Stage	Symptom Pattern	Telltale Metric
1. Source	Low source read MB/s; idle network and writers	Source read latency high; CBT/RCT slow; storage array saturated
2. Network	Source ready; cluster idle; low end-to-end MB/s	iperf below link speed; retransmits; wrong VLAN/uplink
3. Ingest	Network saturated but cluster CPU/proxy queues high	Proxy concurrency exhausted; NBD/HotAdd contention
4. NVRAM	Bursty throughput; periodic stalls; destage backpressure	NVRAM journal utilization high; destage queue growing
5. Writer	Network healthy; NVRAM filling; write latency rising	Writer latency, disk queue depth, SSD/HDD saturation

Bottleneck Triage Decision Tree

flowchart TD
    Start([Slow backup symptom]) --> NetCheck{"iperf >= link speed?"}
    NetCheck -->|No| Stage2[Stage 2: Network<br/>Fix VLAN / uplink / MTU]
    NetCheck -->|Yes| SrcCheck{Source MB/s and<br/>writer MB/s both low<br/>and balanced?}
    SrcCheck -->|Yes| Stage1[Stage 1: Source<br/>Investigate array,<br/>hypervisor, agent]
    SrcCheck -->|No| NvramCheck{Writer latency high<br/>or NVRAM saturated?}
    NvramCheck -->|Yes| Stage45[Stages 4-5: NVRAM / Writer<br/>Cluster-side limiter]
    NvramCheck -->|No| ConcCheck{Per-stream healthy<br/>but total throughput low?}
    ConcCheck -->|Yes| Stage3[Stage 3: Ingest / Concurrency<br/>Add proxies, raise concurrency]
    ConcCheck -->|No| Bundle[Generate Siren log bundle<br/>Engage Cohesity Support]
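
The decision tree can also be written as a small function, which makes the order of the checks explicit. This is a sketch under stated assumptions: the `Observation` fields and the numeric thresholds (`near_link`, `low_frac`, `balance_tol`) are hypothetical and would need tuning for a real environment. The tree itself only says "at link speed", "low", and "balanced".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    iperf_gbps: float                  # measured proxy-to-node throughput
    link_gbps: float                   # nominal link speed
    source_mb_s: float                 # source read rate from job stats
    writer_mb_s: float                 # writer rate from job stats
    expected_mb_s: float               # historical throughput for this job
    writer_latency_high: bool
    nvram_saturated: bool
    per_stream_ok_but_total_low: bool


def triage(o: Observation, near_link: float = 0.9,
           low_frac: float = 0.5, balance_tol: float = 0.2) -> str:
    """Walk the checks in the same order as the decision tree."""
    # Check 1: network baseline (assumed threshold: 90% of nominal link speed).
    if o.iperf_gbps < near_link * o.link_gbps:
        return "Stage 2: Network"
    # Check 2: source and writer rates both low and roughly equal.
    peak = max(o.source_mb_s, o.writer_mb_s)
    low = peak < low_frac * o.expected_mb_s
    balanced = abs(o.source_mb_s - o.writer_mb_s) <= balance_tol * max(peak, 1e-9)
    if low and balanced:
        return "Stage 1: Source"
    # Check 3: cluster-side limiter.
    if o.writer_latency_high or o.nvram_saturated:
        return "Stages 4-5: NVRAM / Writer"
    # Check 4: concurrency-limited ingest.
    if o.per_stream_ok_but_total_low:
        return "Stage 3: Ingest / Concurrency"
    return "Generate Siren log bundle; engage Cohesity Support"


slow_job = Observation(9.3, 10.0, 22.0, 21.0, 190.0, False, False, False)
print(triage(slow_job))
```

Feeding in a 9.3 Gb/s baseline on a 10 GbE link with source and writer rates of 22 and 21 MB/s returns the Stage 1 verdict, because the cluster is simply not being fed.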

The SOBR Performance Trap (140 vs 20-30 MB/s)

When third-party tools like Veeam present Cohesity as a Scale-Out Backup Repository, throughput can collapse from ~140 MB/s on a single-node Cohesity repository to 20-30 MB/s once SOBR is enabled with the default placement policy. The remediation is to set SOBR placement to Performance rather than Data Locality, restoring parallel ingest across nodes.

SOBR Placement Policy	Behavior	Throughput
Data Locality (default)	Pin a backup chain to one extent	~20-30 MB/s — single node serializes
Performance	Stripe across SOBR extents in parallel	~140 MB/s — nodes work in parallel

Worked Example: Backup Running at 20 MB/s

Network baseline. iperf3 from vSphere proxy to three node primary IPs: 9.3 Gb/s. Network healthy. Stage 2 ruled out.
Job stats. Source read 22 MB/s, writer 21 MB/s, writer latency normal, NVRAM not saturated. Reads and writes balanced and low — cluster is not being fed.
Source check. vCenter shows backing array queue depth 95%, read latency 40 ms.
Conclusion. The source array is the limiter: in this scenario a SQL re-index is contending with the backup for the array. Reschedule the protection group outside the re-index window. Do not open a Cohesity Support case, because Cohesity is not the limiter.
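
Part of the reasoning in this example is a unit conversion: iperf reports gigabits per second, while job statistics are in megabytes per second. The sketch below converts using decimal units (1 Gb/s = 1000 Mb/s, 8 bits per byte). The helper name is illustrative only.

```python
def gbps_to_mb_s(gbps: float) -> float:
    """Convert gigabits per second to decimal megabytes per second."""
    return gbps * 1000.0 / 8.0


network_mb_s = gbps_to_mb_s(9.3)       # iperf3 baseline from the example
job_mb_s = 22.0                        # source read rate from the example
utilization = job_mb_s / network_mb_s  # fraction of the link the job uses

print(f"{network_mb_s:.1f} MB/s available, job uses {utilization:.1%}")
```

The baseline works out to roughly 1,160 MB/s, so a 22 MB/s job uses under 2% of the link, which is why Stage 2 can be ruled out with confidence.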

Key Points — Performance Bottleneck Analysis

Bottleneck = lowest sustained throughput point across Source → Network → Ingest → NVRAM → Writer.
Always start with an iperf baseline; a link running at 1 GbE instead of the expected 10 GbE explains many "slow job" tickets.
SOBR with Veeam: switch placement to Performance to lift from ~20-30 MB/s to ~140 MB/s.
Watch for AV scanning on proxies, CBT/RCT regressions, NAS incremental issues, and S3 multipart chunk size for cloud targets.
For SmartFiles, profile IOPS and latency, not MB/s — small-file workloads need more nodes/flash, not bandwidth.

Key Takeaway: Bottlenecks are localized by walking the path and finding the lowest sustained throughput point. Establish iperf first, watch for SOBR placement traps, and let metric divergence — not assumptions — point to the limiter.

Post-Quiz — Section 1

1. A Cohesity backup that previously ran at 190 MB/s is now running at 20 MB/s. Helios shows source read MB/s = 22, writer MB/s = 21, writer latency normal, NVRAM not saturated. Where is the bottleneck most likely to be?

A) Stage 4 NVRAM — destage backpressure

B) Stage 1 Source — the cluster is not being fed

C) Stage 5 Writer — high disk latency

D) Stage 3 Ingest — proxy concurrency exhausted

2. A Veeam SOBR backed by Cohesity drops from ~140 MB/s on a single repository to ~20-30 MB/s once SOBR is enabled. What is the architect's first remediation?

A) Add more Cohesity nodes

B) Switch SOBR placement policy from Data Locality to Performance

C) Increase NVRAM allocation

D) Disable inline deduplication

3. Why is establishing a network baseline with iperf the recommended first step before deeper triage?

A) iperf calibrates Cohesity's internal counters

B) iperf detects substrate issues (1 GbE masquerading as 10 GbE, wrong VLAN/uplink) in seconds

C) iperf is required by the SOBR Performance policy

D) iperf flushes the writer queues

Section 2: Monitoring and Alerting

Pre-Quiz — Section 2

4. Which four channels does Helios fan alerts out to via notification rules?

A) Email, SNMP, syslog, webhook

B) SMS, MQTT, Kafka, syslog

C) Email, NetFlow, RADIUS, webhook

D) Slack, ICMP, SNMP, NTP

5. After updating SMTP configuration on a cluster, why is running POST /validate a critical step?

A) It applies the change to all clusters via Helios sync

B) It catches silent relay breakage (expired credentials, blocked port, cert chain issues) before an incident does

C) It's required to enable SMTPS on port 465

D) It triggers a notification rule reload

6. Which Helios artifact is the leading indicator that individual alerts can miss — chronic protection drift over weeks?

A) The audit log

B) The SLA report

C) The Heartbeat stream

D) The Bridge service log

Helios is Cohesity's primary alerting plane. It aggregates events from every registered cluster and fans them out via filterable notification rules to email, SNMP, syslog, and webhook channels. Alerts are categorized (cluster health, protection, replication, capacity, security, hardware) and tagged by severity (Critical, Warning, Info).

Severity	Examples	Recommended Routing
Critical	Node down, quorum risk, replication failure	Email + SNMP + PagerDuty webhook
Warning	Disk predictive failure, capacity > 80%, missed SLA	Email + Syslog (SIEM)
Info	Job completed, snapshot expired, config change	Syslog only (SIEM archive)
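
The routing table above amounts to a lookup from severity to delivery channels. A minimal sketch follows. The dictionary and function names are hypothetical, and the PagerDuty integration is represented simply as the generic webhook channel.

```python
from typing import Dict, List

# Severity -> channels, mirroring the recommended routing table.
ROUTING: Dict[str, List[str]] = {
    "Critical": ["email", "snmp", "webhook"],  # webhook feeds the paging tool
    "Warning": ["email", "syslog"],            # syslog feeds the SIEM
    "Info": ["syslog"],                        # archive only
}


def channels_for(severity: str) -> List[str]:
    """Return the delivery channels for a severity, rejecting unknown values."""
    try:
        return ROUTING[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


for level in ("Critical", "Warning", "Info"):
    print(level, "->", ", ".join(channels_for(level)))
```

Keeping the mapping in one place makes it easy to review whether Info-level noise is reaching a human inbox, which is the usual source of alert fatigue.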

Configuring Alert Notification Rules

Navigate to Health > Notification in Helios/DataProtect.
Click Create > New Alert Notification Rule.
Set rule name and filters (category, severity, alert name, cluster scope).
Choose delivery method: email, SNMP, syslog, or webhook.
For email: specify To, Cc, Subject. Save.

Programmatically, the corresponding API operation is createAlertNotificationRule. The same alert can fan out to multiple channels: create one rule per channel, or use different filters per channel for tiered routing.

Animation: Helios Alert Fan-Out — Cluster Event to Four Channels

A cluster event reaches Helios, then fans out to all four notification channels in parallel.

Email/SMTP Configuration

API	Purpose	Notes
`PUT /v2/clusters/smtp`	Update SMTP config	Server, port (465 SMTPS, 587 STARTTLS), credentials. Requires CLUSTER_MODIFY.
`GET /v2/clusters/smtp`	Retrieve SMTP config	Audit current settings
`POST /validate`	Test SMTP delivery	Always run after changes — catches silent relay breakage
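
The two ports in the table imply different TLS handshakes: 465 expects TLS from the first byte (SMTPS), while 587 starts in plaintext and upgrades with STARTTLS. The sketch below is an out-of-band relay check using Python's standard `smtplib`, run from an admin workstation. It is not the Cohesity validation call and does not replace it. The function names are illustrative, and the host and credentials passed in are the caller's own.

```python
import smtplib
import ssl


def smtp_mode(port: int) -> str:
    """Map the two ports named in the table to their TLS mode."""
    if port == 465:
        return "smtps"
    if port == 587:
        return "starttls"
    raise ValueError(f"unexpected SMTP port for this sketch: {port}")


def check_relay(host: str, port: int, user: str, password: str,
                timeout: float = 10.0) -> None:
    """Connect, negotiate TLS, and log in.

    Raises smtplib.SMTPException, ssl.SSLError, or OSError when the relay
    is unreachable, the certificate chain fails, or the credentials are bad.
    """
    context = ssl.create_default_context()
    if smtp_mode(port) == "smtps":
        with smtplib.SMTP_SSL(host, port, timeout=timeout, context=context) as conn:
            conn.login(user, password)
    else:
        with smtplib.SMTP(host, port, timeout=timeout) as conn:
            conn.starttls(context=context)
            conn.login(user, password)
```

A call such as `check_relay("smtp.example.com", 587, "alerts", "secret")` (placeholder values) surfaces the same classes of failure the table warns about: blocked ports, expired credentials, and broken certificate chains.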

Validation Pattern (Memorize)

Configure SMTP/SNMP/syslog/webhook targets.
Create an alert notification rule with tight filters.
Trigger or simulate a matching event.
Confirm receipt at the email inbox / NMS / syslog / webhook endpoint.
Run POST /validate on SMTP to catch silent relay breakage.

SLA Reports

Beyond alerts, Helios provides SLA reports showing protection compliance per protection group, source, and cluster — the percentage of protected objects that met RPO over a reporting window. These surface chronic drift no individual alert reveals: a group sliding from 99.5% to 96% over six weeks is missing windows even though every job "succeeded" eventually.
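
The compliance figure and the drift it reveals are simple arithmetic. The sketch below assumes a weekly series of compliance percentages and a hypothetical two-point drop threshold for flagging drift. Neither the function names nor the threshold come from Helios.

```python
from typing import Sequence


def compliance_pct(objects_meeting_rpo: int, protected_objects: int) -> float:
    """Percentage of protected objects that met RPO in the window."""
    if protected_objects <= 0:
        raise ValueError("protected_objects must be positive")
    return 100.0 * objects_meeting_rpo / protected_objects


def drift_points(weekly_pct: Sequence[float]) -> float:
    """Change from the first to the last report, in percentage points."""
    return weekly_pct[-1] - weekly_pct[0]


def is_chronic_drift(weekly_pct: Sequence[float], max_drop: float = 2.0) -> bool:
    """Flag a series whose compliance fell by at least max_drop points."""
    return drift_points(weekly_pct) <= -max_drop


# Illustrative six-week series matching the 99.5% -> 96% slide in the text.
weekly = [99.5, 99.1, 98.4, 97.8, 96.9, 96.0]
print(f"drift: {drift_points(weekly):+.1f} points, chronic: {is_chronic_drift(weekly)}")
```

No single week in this series looks alarming on its own. The 3.5-point slide only becomes visible when the reports are compared across the whole window.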

Key Points — Monitoring and Alerting

Helios fans alerts to four channels: email, SNMP, syslog, webhook.
SMTP API is PUT /v2/clusters/smtp; always run POST /validate to catch silent relay breakage.
Use severity tiers to fight alert fatigue: Critical → on-call, Warning → ops queue, Info → SIEM archive.
Notification rule API: createAlertNotificationRule. One rule per channel, or per filter set.
SLA reports surface chronic drift that individual alerts miss — the artifact for quarterly reviews.

Key Takeaway: Helios fans alerts via filterable notification rules; configure SMTP with PUT /v2/clusters/smtp, always validate, tier severity routing, and use SLA reports for systemic drift.

Post-Quiz — Section 2

4. Which four channels does Helios fan alerts out to via notification rules?

A) Email, SNMP, syslog, webhook

B) SMS, MQTT, Kafka, syslog

C) Email, NetFlow, RADIUS, webhook

D) Slack, ICMP, SNMP, NTP

5. After updating SMTP configuration on a cluster, why is running POST /validate a critical step?

A) It applies the change to all clusters via Helios sync

B) It catches silent relay breakage (expired credentials, blocked port, cert chain issues) before an incident does

C) It's required to enable SMTPS on port 465

D) It triggers a notification rule reload

6. Which Helios artifact is the leading indicator that individual alerts can miss — chronic protection drift over weeks?

A) The audit log

B) The SLA report

C) The Heartbeat stream

D) The Bridge service log

Section 3: Logs and Diagnostics

Pre-Quiz — Section 3

7. Where do Cohesity log bundles generated by Siren land on the cluster before upload to Support?

A) /var/log/cohesity/bundles

B) /home/cohesity/data/timecapsules

C) /opt/cohesity/support

D) /etc/cohesity/heartbeat

8. Which Cohesity service is responsible for backup orchestration — the right service to scope when a job fails to schedule?

A) Bridge

B) Apollo

C) Magneto

D) Yoda

9. What is the named, supported Cohesity command-line interface for cluster management operations?

A) cohctl

B) iris_cli

C) spanctl

D) helios-cli

iris_cli: The Supported CLI

iris_cli -server <cluster-IP> -username=admin -password=<pwd>

For the exam, iris_cli is the CLI to name when a question asks about cluster-side actions. Common groups: cluster status, cluster nodes list, protection-runs list, protection-jobs list, plus stats queries useful during triage.

Service Logs — Pick the Right Service

Service	Responsibility	When to Inspect
iris	UI / control plane	UI errors, login failures, REST API issues
Bridge	I/O data path / SpanFS front-end	Read/write latency, NFS/SMB issues
Magneto	Backup orchestration	Job failures, scheduling, source registration
Apollo	GC, replication, indexing	GC stalls, replication lag, index issues
Stats	Metrics aggregation	Missing dashboards, metric gaps
Yoda	Search / index service	Search failures, indexing slowness
Gandalf	Configuration management	Cluster config issues
Nexus	Cluster networking control	Network path / route issues
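
When scoping a bundle, the table works as a symptom-to-service lookup. The sketch below encodes a subset of it. The symptom keys are hypothetical labels invented for this example, while the service names come from the table.

```python
from typing import Dict, Iterable, List

# Hypothetical symptom labels mapped to the services listed in the table.
SERVICE_BY_SYMPTOM: Dict[str, str] = {
    "ui_error": "iris",
    "login_failure": "iris",
    "read_write_latency": "Bridge",
    "nfs_smb_issue": "Bridge",
    "job_failure": "Magneto",
    "scheduling_failure": "Magneto",
    "gc_stall": "Apollo",
    "replication_lag": "Apollo",
    "metric_gap": "Stats",
    "search_failure": "Yoda",
    "config_issue": "Gandalf",
    "network_path_issue": "Nexus",
}


def services_for(symptoms: Iterable[str]) -> List[str]:
    """Return the de-duplicated, sorted services to scope for the symptoms."""
    return sorted({SERVICE_BY_SYMPTOM[s] for s in symptoms})


print(services_for(["job_failure", "scheduling_failure", "replication_lag"]))
```

Two Magneto symptoms plus one Apollo symptom collapse to two services, which keeps the resulting bundle small.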

Generating a Log Bundle via Siren

Reach Siren at https://<cluster-VIP>/siren and click Cluster Support Bundle. The dialog scopes:

Nodes — uncheck "Select all" to scope to specific node IPs
Services — pick only relevant services (iris, Bridge, Magneto, etc.)
Log level — verbosity threshold
Time range — defaults to last 24 hours
Include hardware logs — firmware, IPMI/BMC, SMART, chassis events

Bundles land at /home/cohesity/data/timecapsules. Sizes range from a few MB up to 2-3 GB. Upload via uploadFilePackage API or the Support case URL.
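
It can help to write the scope down before opening the dialog. The sketch below models the five scope inputs as a plain data structure with basic validation. It is a planning aid only: the class, its field names, and the log-level string are hypothetical and do not correspond to any Siren or Cohesity API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

KNOWN_SERVICES = {"iris", "Bridge", "Magneto", "Apollo",
                  "Stats", "Yoda", "Gandalf", "Nexus"}


@dataclass
class BundleScope:
    services: List[str]
    nodes: Optional[List[str]] = None    # None means "Select all"
    log_level: str = "INFO"              # placeholder value, not a documented level
    hours_back: int = 24                 # dialog default: last 24 hours
    include_hardware_logs: bool = False

    def validate(self) -> None:
        """Reject empty or unknown service lists and non-positive windows."""
        if not self.services:
            raise ValueError("pick at least one service")
        unknown = set(self.services) - KNOWN_SERVICES
        if unknown:
            raise ValueError(f"unknown services: {sorted(unknown)}")
        if self.hours_back <= 0:
            raise ValueError("hours_back must be positive")

    def window(self, now: datetime) -> Tuple[datetime, datetime]:
        """Return the (start, end) collection window ending at `now`."""
        return now - timedelta(hours=self.hours_back), now


scope = BundleScope(services=["Bridge", "Magneto"], nodes=["10.0.0.11"])
scope.validate()
start, end = scope.window(datetime(2024, 1, 16, 3, 30, tzinfo=timezone.utc))
print(start.isoformat(), "->", end.isoformat())
```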

Animation: Heartbeat + Siren Log Bundle — UI Trigger to Support Upload

Siren UI triggers log collection from each service into the time capsule, then uploads to Cohesity Support.

Heartbeat: The Continuous Diagnostic

Cohesity clusters emit a Heartbeat stream — a continuous, lightweight diagnostic feed reporting cluster health, version, configuration, and key metrics back to Cohesity. Heartbeat enables Proactive Support to spot brewing issues before they cause outages. Architects must keep Heartbeat egress (HTTPS/443) open or proactive support is blind.

Practical Bundle Hygiene

Scope tightly. Minimum services, narrowest time window.
Capture both healthy and unhealthy nodes for cluster-wide issues.
Record context separately — UTC timestamps, job IDs, change events.
Audit logs capture admin actions; forward to syslog/SIEM for HIPAA/PCI/FedRAMP.
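
Recording context in UTC avoids ambiguity when the cluster, the operator, and the Support engineer sit in different time zones. The sketch below normalizes a timezone-aware local timestamp using only the standard library. The function name and the fixed UTC-5 offset are illustrative.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(local: datetime) -> str:
    """Format a timezone-aware timestamp as ISO 8601 in UTC."""
    if local.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Example: an operator at a fixed UTC-5 offset notes a failure at 22:30 local.
utc_minus_5 = timezone(timedelta(hours=-5))
observed = datetime(2024, 1, 15, 22, 30, 0, tzinfo=utc_minus_5)
print(to_utc_iso(observed))  # 2024-01-16T03:30:00Z
```

Note that the UTC date rolls over to the next day, exactly the kind of detail that gets lost when timestamps are passed along in local time.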

Key Points — Logs and Diagnostics

The trio: Siren (UI generator) → timecapsules path → upload to Support.
Bundle path: /home/cohesity/data/timecapsules.
Bundle scope inputs: nodes, services, log level, time range, hardware logs.
iris_cli is the supported CLI; service trio to know — Bridge (I/O), Magneto (orchestration), Apollo (GC/replication).
Heartbeat is the continuous proactive telemetry stream; needs HTTPS/443 egress.

Key Takeaway: Use Siren to generate scoped log bundles into /home/cohesity/data/timecapsules; pair every bundle with precise UTC timestamps when engaging Support. Heartbeat provides the always-on diagnostic backbone.

Post-Quiz — Section 3

7. Where do Cohesity log bundles generated by Siren land on the cluster before upload to Support?

A) /var/log/cohesity/bundles

B) /home/cohesity/data/timecapsules

C) /opt/cohesity/support

D) /etc/cohesity/heartbeat

8. Which Cohesity service is responsible for backup orchestration — the right service to scope when a job fails to schedule?

A) Bridge

B) Apollo

C) Magneto

D) Yoda

9. What is the named, supported Cohesity command-line interface for cluster management operations?

A) cohctl

B) iris_cli

C) spanctl

D) helios-cli

Section 4: Common Failure Modes

Pre-Quiz — Section 4

10. Replication lag is growing on a protection group: local backups succeed but replicated snapshots are 48 hours behind. Which architectural lever most directly addresses this?

A) Increase replication factor (RF) on the source cluster

B) Design bandwidth throttle windows that yield to backup ingest by day and run replication at full bandwidth overnight

C) Disable encryption on the replication link

D) Reduce snapshot retention to 1 day

11. Which architectural design lever most directly mitigates the risk of correlated node failures dropping the cluster below quorum?

A) Erasure coding 6:2

B) Fault domain awareness (chassis, rack, site)

C) Disabling auto-protection

D) Increasing NVRAM size

12. NTP drift warnings appear alongside intra-cluster latency spikes and node-up/node-down alerts. What failure mode should the architect suspect?

A) Disk SMART failure

B) Cluster network partition (Paxos quorum risk)

C) Garbage collection stall

D) SOBR placement policy issue

13. A protection group "succeeds" every night but the SLA report shows compliance has drifted from 99.5% to 96% over six weeks. What is the architect's interpretation?

A) Jobs are missing windows but eventually completing on retry — a systemic drift problem invisible to single alerts

B) Heartbeat is broken — SLA reports cannot generate

C) The retention policy is too short

D) DataLock is rejecting the snapshots

Failure Modes Taxonomy

graph TD
    Root[Common Cohesity Failure Modes]
    Root --> Backup[Backup Job Failures]
    Root --> Repl[Replication Lag]
    Root --> Disk[Disk Failure]
    Root --> Node[Node Failure]
    Root --> Part[Network Partition]
    Backup --> B1[Stale credentials / CBT reset / locked files]
    Backup --> B2[Lever: policy design, retry rules, SLA reports]
    Repl --> R1[WAN underprovisioned / throttle misalign / target saturated]
    Repl --> R2[Lever: bandwidth windows, change-rate sizing]
    Disk --> D1[SpanFS RF/EC rebuild / Heartbeat opens case]
    Disk --> D2[Lever: schedule replacement, capacity headroom]
    Node --> N1[Reduced capacity, perf / Quorum risk if multiple]
    Node --> N2[Lever: fault domain awareness chassis/rack/site]
    Part --> P1[Paxos quorum split / NTP drift / latency spikes]
    Part --> P2[Lever: dual-homed nodes, redundant ToR, cluster VLAN]

Backup Job Failures

Failure Pattern	Likely Cause	Action
First-run failures only	Stale credentials, recently-changed VM	Refresh creds; re-discover
Random failures across many sources	Network/proxy intermittency	Check proxy health, network path
Same source fails repeatedly	CBT, agent, locked file	Reset CBT; reinstall agent
Cluster-wide failure spike	Cluster service issue, upgrade	Check cluster health, change log

Replication Lag

Replication lag is the canonical "silent" DR failure: protection succeeds locally but replication cannot keep up. Causes include WAN bandwidth insufficient for the daily change rate, throttling windows misaligned with change rates, a saturated target ingest path, or compressed/encrypted replication competing for CPU on undersized clusters. Recovery is not instant: the backlog drains only during a window in which replication ingest at the target exceeds the change rate.
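
The catch-up requirement is easy to quantify: the backlog shrinks only at the rate by which replication throughput exceeds the incoming change rate. The sketch below uses decimal units and hypothetical numbers, and the function name is illustrative.

```python
def catchup_hours(backlog_gb: float, repl_mb_s: float, change_mb_s: float) -> float:
    """Hours needed to drain a replication backlog.

    Returns infinity when replication throughput does not exceed the
    change rate, because the backlog then never shrinks.
    """
    surplus_mb_s = repl_mb_s - change_mb_s
    if surplus_mb_s <= 0:
        return float("inf")
    return backlog_gb * 1000.0 / surplus_mb_s / 3600.0


# Hypothetical case: 2 TB behind, 100 MB/s WAN throughput, 30 MB/s new change.
print(f"{catchup_hours(2000.0, 100.0, 30.0):.1f} hours")
```

With these figures the backlog clears in about eight hours, which fits an overnight full-bandwidth window. If the throttle held replication at 30 MB/s, the result would be infinite and the lag would never close.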

Disk and Node Failures

Failure	Cluster Impact	Time to Recover
Single disk	Background rebuild; no outage	Hours
Single node (RF2)	Reduced redundancy; rebuild begins	Hours to days
Multiple nodes (within tolerance)	Performance degraded; rebuild contention	Days
Quorum loss	Cluster halts; data unavailable	Recovery operation

Cluster Network Partition

The most dangerous failure mode. SpanFS uses Paxos-based metadata with strict consistency — the side without quorum cannot serve writes. Detection: Heartbeat alerts, node-up/node-down events, intra-cluster latency spikes, NTP drift warnings (often the first symptom).

Identify partition boundary using iris_cli cluster status from multiple nodes.
Check physical network — switch, uplink, VLAN.
If healed quickly, cluster auto-recovers; otherwise generate Siren bundle scoped to Bridge, Apollo, Gandalf and engage Support before any manual remediation.
Document and review fault domain design.
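
Reviewing fault domain design usually comes down to one question: does the cluster keep a majority if any single domain disappears? The sketch below assumes a simple majority quorum over all nodes, which is the textbook behavior for Paxos-style consensus. How Cohesity actually forms quorum may differ, so treat this as a reasoning aid rather than a statement about the product. All names and rack layouts are hypothetical.

```python
from typing import Dict


def majority(total_nodes: int) -> int:
    """Smallest node count that forms a strict majority."""
    return total_nodes // 2 + 1


def survives_any_single_domain_loss(nodes_per_domain: Dict[str, int]) -> bool:
    """True if losing any one fault domain still leaves a majority."""
    total = sum(nodes_per_domain.values())
    needed = majority(total)
    return all(total - count >= needed for count in nodes_per_domain.values())


lopsided = {"rack-a": 4, "rack-b": 2}              # losing rack-a leaves 2 of 6
balanced = {"rack-a": 2, "rack-b": 2, "rack-c": 2}  # losing any rack leaves 4 of 6

print("lopsided:", survives_any_single_domain_loss(lopsided))
print("balanced:", survives_any_single_domain_loss(balanced))
```

Under this assumption, the same six nodes either survive or fail a rack outage depending purely on placement, which is why the lever has to be designed in before the failure.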

Key Points — Common Failure Modes

Each failure mode has a signature, an immediate action, and a longer-term design lever.
Replication lag → design bandwidth throttle windows around the daily change rate.
Disk = non-event (SpanFS rebuilds); node = consequential (capacity/perf reduced); quorum loss = cluster halts.
Network partition: scope Siren bundle to Bridge, Apollo, Gandalf; do not manually remediate before engaging Support.
Fault domain awareness (chassis/rack/site), redundant ToR, dual-homed nodes, dedicated cluster VLAN are the architectural levers built before the failure.

Key Takeaway: Architects design fault domain awareness, throttle windows, and network redundancy before failures, not after. Each failure class has a signature, action, and lever — recognize the pattern fast.

Post-Quiz — Section 4