Chapter 14: Performance, Monitoring, and Troubleshooting
Cohesity Certified Architect Expert (CCAE) — Interactive Study Guide
Learning Objectives
Diagnose performance bottlenecks across the source → network → ingest → NVRAM → writer data path and identify which segment is the limiter.
Use Cohesity statistics, iris_cli, the Siren UI, and Helios alerts to triage incidents methodically.
Generate scoped log bundles and engage Cohesity Support with the artifacts they need on the first round-trip.
Configure Helios alerting across email/SMTP, SNMP, syslog, and webhook channels — including SMTP validation.
Recognize and respond to common failure modes — backup re-runs, replication lag, disk and node failures, network partitions — and design resiliency for each.
If the previous chapters explained how to design and operate a Cohesity estate when everything is healthy, this chapter is about what to do when it is not. The CCAE-level architect is expected to localize bottlenecks, generate the right diagnostic artifacts, integrate alerts into operational tooling, and engage Cohesity Support effectively when escalation is warranted. The discipline that ties all of this together is structured triage: walk the data path end to end, measure each segment, and find the lowest sustained throughput point.
Section 1: Performance Bottleneck Analysis
Pre-Quiz — Section 1
1. A Cohesity backup that previously ran at 190 MB/s is now running at 20 MB/s. Helios shows source read MB/s = 22, writer MB/s = 21, writer latency normal, NVRAM not saturated. Where is the bottleneck most likely to be?
A) Stage 4 NVRAM — destage backpressure
B) Stage 1 Source — the cluster is not being fed
C) Stage 5 Writer — high disk latency
D) Stage 3 Ingest — proxy concurrency exhausted
2. A Veeam SOBR backed by Cohesity drops from ~140 MB/s on a single repository to ~20-30 MB/s once SOBR is enabled. What is the architect's first remediation?
A) Add more Cohesity nodes
B) Switch SOBR placement policy from Data Locality to Performance
C) Increase NVRAM allocation
D) Disable inline deduplication
3. Why is establishing a network baseline with iperf the recommended first step before deeper triage?
A) iperf calibrates Cohesity's internal counters
B) iperf detects substrate issues (1 GbE masquerading as 10 GbE, wrong VLAN/uplink) in seconds
C) iperf is required by the SOBR Performance policy
D) iperf flushes the writer queues
The Five-Stage Data Path
A Cohesity backup is a multi-stage assembly line. Bytes are read from a source, pushed through the network, ingested by the cluster, journaled into NVRAM, and finally destaged onto SSD or HDD by the writer service. Each stage has a maximum sustainable throughput, and the slowest stage sets the throughput of the whole line.
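As a minimal sketch of the "slowest stage sets the pace" rule, the Python snippet below takes one sustained-throughput figure per stage and reports the limiter. The function name and the sample numbers are illustrative assumptions, not Cohesity counters or real measurements.

```python
# Illustrative only: stage names follow the five-stage path described above;
# the throughput figures are made-up sample values, not real cluster metrics.
STAGES = ["source", "network", "ingest", "nvram", "writer"]


def find_bottleneck(throughput_mb_s: dict[str, float]) -> tuple[str, float]:
    """Return (stage, MB/s) for the lowest sustained throughput on the path."""
    missing = [s for s in STAGES if s not in throughput_mb_s]
    if missing:
        raise ValueError(f"missing measurements for: {missing}")
    stage = min(STAGES, key=lambda s: throughput_mb_s[s])
    return stage, throughput_mb_s[stage]


sample = {"source": 22.0, "network": 1160.0, "ingest": 400.0,
          "nvram": 350.0, "writer": 300.0}
limiter, rate = find_bottleneck(sample)
print(f"Limiter: {limiter} at {rate} MB/s")
```

The point of the sketch is that the verdict only means something once every stage has a measurement; a missing stage raises an error rather than being silently skipped.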
When third-party tools like Veeam present Cohesity as a Scale-Out Backup Repository, throughput can collapse from ~140 MB/s on a single-node Cohesity repository to 20-30 MB/s once SOBR is enabled with the default placement policy. The remediation is to set SOBR placement to Performance rather than Data Locality, restoring parallel ingest across nodes.
SOBR Placement Policy | Behavior | Throughput
Data Locality (default) | Pin a backup chain to one extent | ~20-30 MB/s (single node serializes)
Performance | Stripe across SOBR extents in parallel | ~140 MB/s (nodes work in parallel)
Worked Example: Backup Running at 20 MB/s
Network baseline. iperf3 from vSphere proxy to three node primary IPs: 9.3 Gb/s. Network healthy. Stage 2 ruled out.
Job stats. Source read 22 MB/s, writer 21 MB/s, writer latency normal, NVRAM not saturated. Reads and writes balanced and low — cluster is not being fed.
Conclusion. A SQL re-index is contending with the source array. Reschedule the protection group outside the re-index window. Do not open a Cohesity Support case — Cohesity is not the limiter.
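The reasoning in the worked example can be sketched as a small decision function. This is a hedged illustration: the class, the field names, the 5 Gb/s network floor, and the 10% balance tolerance are all hypothetical choices, and the function assumes the job is already known to be slow.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    network_gbps: float          # iperf3 baseline, proxy to cluster nodes
    source_read_mb_s: float      # source read rate reported for the job
    writer_mb_s: float           # writer rate reported for the job
    writer_latency_high: bool
    nvram_saturated: bool


def triage(stats: JobStats, min_network_gbps: float = 5.0,
           balance_tolerance: float = 0.10) -> str:
    """Name the likely limiting stage for a job already known to be slow.

    Thresholds are hypothetical defaults, not Cohesity guidance.
    """
    if stats.network_gbps < min_network_gbps:
        return "network"
    if stats.writer_latency_high:
        return "writer"
    if stats.nvram_saturated:
        return "nvram"
    # Reads and writes balanced with no back-pressure: the cluster is starved.
    peak = max(stats.source_read_mb_s, stats.writer_mb_s)
    gap = abs(stats.source_read_mb_s - stats.writer_mb_s)
    if peak > 0 and gap <= balance_tolerance * peak:
        return "source"
    return "inconclusive"


worked_example = JobStats(network_gbps=9.3, source_read_mb_s=22.0,
                          writer_mb_s=21.0, writer_latency_high=False,
                          nvram_saturated=False)
print(triage(worked_example))  # prints "source"
```

Returning "inconclusive" instead of guessing keeps the sketch honest: when the metrics do not diverge in a recognizable pattern, the next step is more measurement, not a verdict.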
Key Points — Performance Bottleneck Analysis
Bottleneck = lowest sustained throughput point across Source → Network → Ingest → NVRAM → Writer.
Always start with an iperf baseline; a 1 GbE-vs-10 GbE substrate explains a large fraction of "slow job" tickets.
SOBR with Veeam: switch placement to Performance to lift from ~20-30 MB/s to ~140 MB/s.
Watch for AV scanning on proxies, CBT/RCT regressions, NAS incremental issues, and S3 multipart chunk size for cloud targets.
For SmartFiles, profile IOPS and latency, not MB/s — small-file workloads need more nodes/flash, not bandwidth.
Key Takeaway: Bottlenecks are localized by walking the path and finding the lowest sustained throughput point. Establish iperf first, watch for SOBR placement traps, and let metric divergence — not assumptions — point to the limiter.
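One way to make the iperf baseline repeatable is to parse iperf3's JSON report and flag a link that falls far short of its nominal rate. The sketch below assumes a TCP test run with iperf3's JSON output option, whose report carries end.sum_received.bits_per_second. The 10 Gb/s expectation, the 50% tolerance, and the trimmed sample report are illustrative assumptions.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gb/s from an iperf3 JSON report."""
    report = json.loads(iperf3_json)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9


def link_looks_degraded(gbps: float, expected_gbps: float = 10.0,
                        tolerance: float = 0.5) -> bool:
    """True when measured throughput is below tolerance * expected rate."""
    return gbps < expected_gbps * tolerance


# Trimmed, made-up report resembling a 1 GbE path behind a "10 GbE" label.
sample_report = '{"end": {"sum_received": {"bits_per_second": 941000000.0}}}'
measured = received_gbps(sample_report)
print(f"{measured:.3f} Gb/s, degraded: {link_looks_degraded(measured)}")
```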
Post-Quiz — Section 1
1. A Cohesity backup that previously ran at 190 MB/s is now running at 20 MB/s. Helios shows source read MB/s = 22, writer MB/s = 21, writer latency normal, NVRAM not saturated. Where is the bottleneck most likely to be?
A) Stage 4 NVRAM — destage backpressure
B) Stage 1 Source — the cluster is not being fed
C) Stage 5 Writer — high disk latency
D) Stage 3 Ingest — proxy concurrency exhausted
2. A Veeam SOBR backed by Cohesity drops from ~140 MB/s on a single repository to ~20-30 MB/s once SOBR is enabled. What is the architect's first remediation?
A) Add more Cohesity nodes
B) Switch SOBR placement policy from Data Locality to Performance
C) Increase NVRAM allocation
D) Disable inline deduplication
3. Why is establishing a network baseline with iperf the recommended first step before deeper triage?
A) iperf calibrates Cohesity's internal counters
B) iperf detects substrate issues (1 GbE masquerading as 10 GbE, wrong VLAN/uplink) in seconds
C) iperf is required by the SOBR Performance policy
D) iperf flushes the writer queues
Section 2: Monitoring and Alerting
Pre-Quiz — Section 2
4. Which four channels does Helios fan alerts out to via notification rules?
A) Email, SNMP, syslog, webhook
B) SMS, MQTT, Kafka, syslog
C) Email, NetFlow, RADIUS, webhook
D) Slack, ICMP, SNMP, NTP
5. After updating SMTP configuration on a cluster, why is running POST /validate a critical step?
A) It applies the change to all clusters via Helios sync
B) It catches silent relay breakage (expired credentials, blocked port, cert chain issues) before an incident does
C) It's required to enable SMTPS on port 465
D) It triggers a notification rule reload
6. Which Helios artifact is the leading indicator that individual alerts can miss — chronic protection drift over weeks?
A) The audit log
B) The SLA report
C) The Heartbeat stream
D) The Bridge service log
Helios is Cohesity's primary alerting plane. It aggregates events from every registered cluster and fans them out via filterable notification rules to email, SNMP, syslog, and webhook channels. Alerts are categorized (cluster health, protection, replication, capacity, security, hardware) and tagged by severity (Critical, Warning, Info).
Severity | Examples | Recommended Routing
Critical | Node down, quorum risk, replication failure | Email + SNMP + PagerDuty webhook
Warning | Disk predictive failure, capacity > 80%, missed SLA | Email + Syslog (SIEM)
Info | Job completed, snapshot expired, config change | Syslog only (SIEM archive)
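The severity routing above can be captured as data so it can be reviewed and expanded into one rule per channel. This is a minimal sketch: the dictionary layout and function names are hypothetical and do not reflect the Helios rule schema.

```python
# Severity-to-channel routing taken from the recommendations above.
# The structure is an illustration, not the Helios notification rule format.
ROUTING = {
    "Critical": ["email", "snmp", "webhook"],
    "Warning": ["email", "syslog"],
    "Info": ["syslog"],
}


def channels_for(severity: str) -> list[str]:
    """Return the delivery channels recommended for a severity tier."""
    if severity not in ROUTING:
        raise ValueError(f"unknown severity: {severity!r}")
    return ROUTING[severity]


def planned_rules() -> list[tuple[str, str]]:
    """Expand the routing into one (severity, channel) pair per rule."""
    return [(sev, ch) for sev, chans in ROUTING.items() for ch in chans]


for severity, channel in planned_rules():
    print(f"rule: severity={severity} -> {channel}")
```

Keeping the plan as a flat list of pairs makes it easy to compare what was designed against the rules that actually exist.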
Configuring Alert Notification Rules
Navigate to Health > Notification in Helios/DataProtect.
Click Create > New Alert Notification Rule.
Set rule name and filters (category, severity, alert name, cluster scope).
Choose delivery method: email, SNMP, syslog, or webhook.
For email: specify To, Cc, Subject. Save.
Programmatically, the corresponding API operation is createAlertNotificationRule. The same alert can fan out to multiple channels: create one rule per channel, or use different filters per channel for tiered routing.
[Animation: Helios Alert Fan-Out. A cluster event reaches Helios, then fans out to all four notification channels in parallel.]
Email/SMTP Configuration
API | Purpose | Notes
PUT /v2/clusters/smtp | Update SMTP config | Server, port (465 SMTPS, 587 STARTTLS), credentials. Requires CLUSTER_MODIFY.
GET /v2/clusters/smtp | Retrieve SMTP config | Audit current settings
POST /validate | Test SMTP delivery | Always run after changes; catches silent relay breakage
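The cluster-side validation call can be complemented by an independent relay probe run from an admin host on the same network path. The sketch below uses only Python's standard smtplib and is not a Cohesity API. The host, port, and credentials are placeholders you would supply. A blocked port surfaces as a connection error, a broken certificate chain as an SSL error, and expired credentials as an authentication error.

```python
import smtplib
import ssl
from typing import Optional


def transport_for(port: int) -> str:
    """Map the two ports named above to their TLS mode."""
    if port == 465:
        return "smtps"       # implicit TLS from the first byte
    if port == 587:
        return "starttls"    # plaintext greeting, then upgrade
    raise ValueError(f"unexpected SMTP submission port: {port}")


def probe_relay(host: str, port: int, username: Optional[str] = None,
                password: Optional[str] = None, timeout: float = 10.0) -> None:
    """Connect, negotiate TLS, optionally log in, and send NOOP.

    Raises on failure; returns None when the relay answers cleanly.
    """
    mode = transport_for(port)
    context = ssl.create_default_context()  # verifies the certificate chain
    if mode == "smtps":
        client = smtplib.SMTP_SSL(host, port, timeout=timeout, context=context)
    else:
        client = smtplib.SMTP(host, port, timeout=timeout)
    with client:
        if mode == "starttls":
            client.starttls(context=context)
        if username and password:
            client.login(username, password)
        client.noop()


print(transport_for(465), transport_for(587))
```

A call such as probe_relay("smtp.example.com", 587, "user", "secret") would exercise the relay; the hostname and credentials here are placeholders only. The probe does not replace the cluster's own validation, because the cluster may reach the relay over a different path than the admin host does.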
Validation Pattern (Memorize)
Configure SMTP/SNMP/syslog/webhook targets.
Create an alert notification rule with tight filters.
Trigger or simulate a matching event.
Confirm receipt at the email inbox / NMS / syslog / webhook endpoint.
Run POST /validate on SMTP to catch silent relay breakage.
SLA Reports
Beyond alerts, Helios provides SLA reports showing protection compliance per protection group, source, and cluster — the percentage of protected objects that met RPO over a reporting window. These surface chronic drift no individual alert reveals: a group sliding from 99.5% to 96% over six weeks is missing windows even though every job "succeeded" eventually.
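To make the drift idea concrete, the sketch below computes compliance as the percentage of objects that met RPO and fits a least-squares slope to weekly readings. The intermediate weekly values are invented to connect the 99.5% and 96% endpoints mentioned above, and the function names are illustrative.

```python
def compliance_pct(objects_met_rpo: int, objects_total: int) -> float:
    """Percentage of protected objects that met RPO in the window."""
    if objects_total <= 0:
        raise ValueError("objects_total must be positive")
    return 100.0 * objects_met_rpo / objects_total


def weekly_drift(series: list[float]) -> float:
    """Least-squares slope in percentage points per week."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical weekly compliance readings; only the endpoints come from the text.
weekly = [99.5, 99.0, 98.2, 97.5, 96.8, 96.0]
slope = weekly_drift(weekly)
print(f"drift: {slope:.2f} percentage points per week")
```

A persistently negative slope is the signal worth raising at a review, even when no single week crossed an alert threshold.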
Key Points — Monitoring and Alerting
Helios fans alerts to four channels: email, SNMP, syslog, webhook.
SMTP API is PUT /v2/clusters/smtp; always run POST /validate to catch silent relay breakage.
Use severity tiers to fight alert fatigue: Critical → on-call, Warning → ops queue, Info → SIEM archive.
Notification rule API: createAlertNotificationRule. One rule per channel, or per filter set.
SLA reports surface chronic drift that individual alerts miss — the artifact for quarterly reviews.
Key Takeaway: Helios fans alerts via filterable notification rules; configure SMTP with PUT /v2/clusters/smtp, always validate, tier severity routing, and use SLA reports for systemic drift.
Post-Quiz — Section 2
4. Which four channels does Helios fan alerts out to via notification rules?
A) Email, SNMP, syslog, webhook
B) SMS, MQTT, Kafka, syslog
C) Email, NetFlow, RADIUS, webhook
D) Slack, ICMP, SNMP, NTP
5. After updating SMTP configuration on a cluster, why is running POST /validate a critical step?
A) It applies the change to all clusters via Helios sync
B) It catches silent relay breakage (expired credentials, blocked port, cert chain issues) before an incident does
C) It's required to enable SMTPS on port 465
D) It triggers a notification rule reload
6. Which Helios artifact is the leading indicator that individual alerts can miss — chronic protection drift over weeks?
A) The audit log
B) The SLA report
C) The Heartbeat stream
D) The Bridge service log
Section 3: Logs and Diagnostics
Pre-Quiz — Section 3
7. Where do Cohesity log bundles generated by Siren land on the cluster before upload to Support?
A) /var/log/cohesity/bundles
B) /home/cohesity/data/timecapsules
C) /opt/cohesity/support
D) /etc/cohesity/heartbeat
8. Which Cohesity service is responsible for backup orchestration — the right service to scope when a job fails to schedule?
A) Bridge
B) Apollo
C) Magneto
D) Yoda
9. What is the named, supported Cohesity command-line interface for cluster management operations?
A) cohctl
B) iris_cli
C) spanctl
D) helios-cli
For the exam, iris_cli is the CLI to name when a question asks about cluster-side actions. Common groups: cluster status, cluster nodes list, protection-runs list, protection-jobs list, plus stats queries useful during triage.
Service Logs — Pick the Right Service
Service | Responsibility | When to Inspect
iris | UI / control plane | UI errors, login failures, REST API issues
Bridge | I/O data path / SpanFS front-end | Read/write latency, NFS/SMB issues
Magneto | Backup orchestration | Job failures, scheduling, source registration
Apollo | GC, replication, indexing | GC stalls, replication lag, index issues
Stats | Metrics aggregation | Missing dashboards, metric gaps
Yoda | Search / index service | Search failures, indexing slowness
Gandalf | Configuration management | Cluster config issues
Nexus | Cluster networking control | Network path / route issues
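The service responsibilities listed above lend themselves to a lookup that turns observed symptoms into a minimal service scope for a log bundle. The symptom keys below are hypothetical labels chosen for this sketch; only the service names and their responsibilities come from the text.

```python
# Symptom labels are invented for illustration; services follow the list above.
SERVICE_SCOPE = {
    "ui_or_login_error": ["iris"],
    "read_write_latency": ["Bridge"],
    "job_failure_or_scheduling": ["Magneto"],
    "replication_lag": ["Apollo"],
    "gc_stall": ["Apollo"],
    "metric_gaps": ["Stats"],
    "search_failure": ["Yoda"],
    "cluster_config_issue": ["Gandalf"],
    "network_path_issue": ["Nexus"],
}


def services_for(symptoms: list[str]) -> list[str]:
    """Return the de-duplicated services to scope, in first-seen order."""
    scoped: list[str] = []
    for symptom in symptoms:
        if symptom not in SERVICE_SCOPE:
            raise ValueError(f"unknown symptom: {symptom!r}")
        for service in SERVICE_SCOPE[symptom]:
            if service not in scoped:
                scoped.append(service)
    return scoped


print(services_for(["replication_lag", "gc_stall", "read_write_latency"]))
# prints ['Apollo', 'Bridge']
```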
Generating a Log Bundle via Siren
Reach Siren at https://<cluster-VIP>/siren and click Cluster Support Bundle. The dialog scopes:
Nodes — uncheck "Select all" to scope to specific node IPs
Services — pick only relevant services (iris, Bridge, Magneto, etc.)
Log level — verbosity threshold
Time range — defaults to last 24 hours
Include hardware logs — firmware, IPMI/BMC, SMART, chassis events
Bundles land at /home/cohesity/data/timecapsules. Sizes range from a few MB up to 2-3 GB. Upload via uploadFilePackage API or the Support case URL.
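Before uploading, it can help to confirm which bundle was just produced and how large it is. The sketch below lists regular files in a directory, newest first, with sizes in MB. It makes no assumption about bundle file names, which are not specified here, so it lists every regular file it finds.

```python
from pathlib import Path

BUNDLE_DIR = "/home/cohesity/data/timecapsules"  # path stated above


def list_bundles(directory: str) -> list[tuple[str, float]]:
    """Return (file name, size in MB) for regular files, newest first."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, p.stat().st_size / 1_000_000) for p in files]


# Only runs where the directory exists, for example on a cluster node.
if Path(BUNDLE_DIR).is_dir():
    for name, size_mb in list_bundles(BUNDLE_DIR):
        print(f"{size_mb:10.1f} MB  {name}")
```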
[Animation: Heartbeat + Siren Log Bundle. The Siren UI triggers log collection from each service into the time capsule, then uploads to Cohesity Support.]
Heartbeat: The Continuous Diagnostic
Cohesity clusters emit a Heartbeat stream — a continuous, lightweight diagnostic feed reporting cluster health, version, configuration, and key metrics back to Cohesity. Heartbeat enables Proactive Support to spot brewing issues before they cause outages. Architects must keep Heartbeat egress (HTTPS/443) open or proactive support is blind.
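A quick way to check the egress requirement is a plain TCP connect test on port 443 from the cluster's network segment. The sketch below is a generic reachability probe, not a Cohesity tool. The destination hostname is deliberately left as a parameter because the actual Heartbeat endpoint is not given here, and a successful TCP connect does not prove that TLS inspection or a proxy will pass the traffic.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling can_reach("<heartbeat-endpoint>") with the real destination from your firewall documentation gives a yes/no answer that can be logged alongside change records.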
Practical Bundle Hygiene
Scope tightly. Minimum services, narrowest time window.
Capture both healthy and unhealthy nodes for cluster-wide issues.
Record context separately — UTC timestamps, job IDs, change events.
Audit logs capture admin actions; forward to syslog/SIEM for HIPAA/PCI/FedRAMP.
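Recording context in UTC is easy to get wrong when the incident was observed in local time. The sketch below converts a timezone-aware timestamp to a UTC string and assembles a small context record. The record layout, the fixed UTC-5 offset, and all values are hypothetical examples.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(moment: datetime) -> str:
    """Convert a timezone-aware datetime to an ISO 8601 UTC string."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Example: an observation made at 22:15 local time in a UTC-5 zone.
local_zone = timezone(timedelta(hours=-5))
observed = datetime(2024, 3, 4, 22, 15, 0, tzinfo=local_zone)

context = {
    "first_observed_utc": to_utc_iso(observed),
    "job_ids": ["<job-id>"],                  # placeholder
    "recent_changes": ["<change-event>"],     # placeholder
}
print(context["first_observed_utc"])  # prints 2024-03-05T03:15:00Z
```

Rejecting naive timestamps is a deliberate choice: a timestamp without a zone is exactly the ambiguity that costs a round-trip with Support.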
Key Points — Logs and Diagnostics
The trio: Siren (UI generator) → timecapsules path → upload to Support.
iris_cli is the supported CLI; service trio to know — Bridge (I/O), Magneto (orchestration), Apollo (GC/replication).
Heartbeat is the continuous proactive telemetry stream; needs HTTPS/443 egress.
Key Takeaway: Use Siren to generate scoped log bundles into /home/cohesity/data/timecapsules; pair every bundle with precise UTC timestamps when engaging Support. Heartbeat provides the always-on diagnostic backbone.
Post-Quiz — Section 3
7. Where do Cohesity log bundles generated by Siren land on the cluster before upload to Support?
A) /var/log/cohesity/bundles
B) /home/cohesity/data/timecapsules
C) /opt/cohesity/support
D) /etc/cohesity/heartbeat
8. Which Cohesity service is responsible for backup orchestration — the right service to scope when a job fails to schedule?
A) Bridge
B) Apollo
C) Magneto
D) Yoda
9. What is the named, supported Cohesity command-line interface for cluster management operations?
A) cohctl
B) iris_cli
C) spanctl
D) helios-cli
Section 4: Common Failure Modes
Pre-Quiz — Section 4
10. Replication lag is growing on a protection group: local backups succeed but replicated snapshots are 48 hours behind. Which architectural lever most directly addresses this?
A) Increase replication factor (RF) on the source cluster
B) Design bandwidth throttle windows that yield to backup ingest by day and run replication at full bandwidth overnight
C) Disable encryption on the replication link
D) Reduce snapshot retention to 1 day
11. Which architectural design lever most directly mitigates the risk of correlated node failures dropping the cluster below quorum?
A) Erasure coding 6:2
B) Fault domain awareness (chassis, rack, site)
C) Disabling auto-protection
D) Increasing NVRAM size
12. NTP drift warnings appear alongside intra-cluster latency spikes and node-up/node-down alerts. What failure mode should the architect suspect?
A) Disk SMART failure
B) Cluster network partition (Paxos quorum risk)
C) Garbage collection stall
D) SOBR placement policy issue
13. A protection group "succeeds" every night but the SLA report shows compliance has drifted from 99.5% to 96% over six weeks. What is the architect's interpretation?
A) Jobs are missing windows but eventually completing on retry — a systemic drift problem invisible to single alerts
B) Heartbeat is broken — SLA reports cannot generate
C) The retention policy is too short
D) DataLock is rejecting the snapshots
Replication lag is the canonical "silent" DR failure: protection succeeds locally but replication cannot keep up. Causes include WAN bandwidth that is insufficient for the daily change rate, throttling windows misaligned with change rates, a saturated target ingest path, or compressed/encrypted replication competing for CPU on undersized clusters. Recovery is not instant: the backlog drains only during windows in which replication throughput to the target exceeds the change rate.
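The recovery arithmetic can be sketched directly: the backlog shrinks each day by whatever the replication window can move beyond that day's new changes. The figures below are hypothetical, sizes use decimal units (1 GB = 1000 MB), and the model ignores on-wire compression and deduplication.

```python
def days_to_drain(backlog_gb: float, daily_change_gb: float,
                  link_mb_s: float, window_hours_per_day: float) -> float:
    """Days needed to clear a replication backlog, or inf if it never clears."""
    daily_capacity_gb = link_mb_s * 3600 * window_hours_per_day / 1000
    surplus_gb = daily_capacity_gb - daily_change_gb
    if surplus_gb <= 0:
        return float("inf")  # the window cannot even keep up with new changes
    return backlog_gb / surplus_gb


# Hypothetical estate: 2000 GB/day change rate, 48 hours (4000 GB) behind,
# 100 MB/s effective WAN throughput, 10-hour nightly replication window.
print(days_to_drain(4000, 2000, 100, 10))  # prints 2.5
print(days_to_drain(4000, 2000, 50, 10))   # prints inf
```

The second call shows why throttle-window design matters: at 50 MB/s the window moves 1800 GB per day against 2000 GB of change, so lag grows no matter how long you wait.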
Disk and Node Failures
Failure | Cluster Impact | Time to Recover
Single disk | Background rebuild; no outage | Hours
Single node (RF2) | Reduced redundancy; rebuild begins | Hours to days
Multiple nodes (within tolerance) | Performance degraded; rebuild contention | Days
Quorum loss | Cluster halts; data unavailable | Recovery operation
Cluster Network Partition
A cluster network partition is the most dangerous failure mode. SpanFS uses Paxos-based metadata with strict consistency, so the side without quorum cannot serve writes. Detection signals: Heartbeat alerts, node-up/node-down events, intra-cluster latency spikes, and NTP drift warnings (often the first symptom).
Identify partition boundary using iris_cli cluster status from multiple nodes.
Check physical network — switch, uplink, VLAN.
If healed quickly, cluster auto-recovers; otherwise generate Siren bundle scoped to Bridge, Apollo, Gandalf and engage Support before any manual remediation.
Document and review fault domain design.
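When reviewing fault domain design, a useful check is whether losing any single domain would leave enough nodes for quorum. The sketch below assumes a simple majority quorum over all nodes, which is a simplification for illustration; the cluster's actual quorum and fault-tolerance rules depend on its configuration. Rack names and node counts are hypothetical.

```python
def majority(node_count: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return node_count // 2 + 1


def survives_domain_loss(nodes_per_domain: dict[str, int]) -> dict[str, bool]:
    """For each domain, report whether the rest still hold a majority."""
    total = sum(nodes_per_domain.values())
    needed = majority(total)
    return {domain: (total - count) >= needed
            for domain, count in nodes_per_domain.items()}


lopsided = {"rack-a": 4, "rack-b": 2, "rack-c": 2}
balanced = {"rack-a": 3, "rack-b": 3, "rack-c": 3}
print(survives_domain_loss(lopsided))
print(survives_domain_loss(balanced))
```

In the lopsided layout, losing rack-a leaves 4 of 8 nodes, one short of the 5 needed, while the balanced layout tolerates the loss of any one rack.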
Key Points — Common Failure Modes
Each failure mode has a signature, an immediate action, and a longer-term design lever.
Replication lag → design bandwidth throttle windows around the daily change rate.
Disk = non-event (SpanFS rebuilds); node = consequential (capacity/perf reduced); quorum loss = cluster halts.
Network partition: scope Siren bundle to Bridge, Apollo, Gandalf; do not manually remediate before engaging Support.
Fault domain awareness (chassis/rack/site), redundant ToR, dual-homed nodes, dedicated cluster VLAN are the architectural levers built before the failure.
Key Takeaway: Architects design fault domain awareness, throttle windows, and network redundancy before failures, not after. Each failure class has a signature, action, and lever — recognize the pattern fast.
Post-Quiz — Section 4
10. Replication lag is growing on a protection group: local backups succeed but replicated snapshots are 48 hours behind. Which architectural lever most directly addresses this?
A) Increase replication factor (RF) on the source cluster
B) Design bandwidth throttle windows that yield to backup ingest by day and run replication at full bandwidth overnight
C) Disable encryption on the replication link
D) Reduce snapshot retention to 1 day
11. Which architectural design lever most directly mitigates the risk of correlated node failures dropping the cluster below quorum?
A) Erasure coding 6:2
B) Fault domain awareness (chassis, rack, site)
C) Disabling auto-protection
D) Increasing NVRAM size
12. NTP drift warnings appear alongside intra-cluster latency spikes and node-up/node-down alerts. What failure mode should the architect suspect?
A) Disk SMART failure
B) Cluster network partition (Paxos quorum risk)
C) Garbage collection stall
D) SOBR placement policy issue
13. A protection group "succeeds" every night but the SLA report shows compliance has drifted from 99.5% to 96% over six weeks. What is the architect's interpretation?
A) Jobs are missing windows but eventually completing on retry — a systemic drift problem invisible to single alerts
B) Heartbeat is broken — SLA reports cannot generate
C) The retention policy is too short
D) DataLock is rejecting the snapshots
Chapter Summary & Three Drills
For the exam, internalize three drills:
Bottleneck triage drill: given a slow-backup symptom, walk Source → Network → Ingest → NVRAM → Writer and name the metric that proves the verdict.
Log bundle drill: name Siren, the /home/cohesity/data/timecapsules path, and the five scoping inputs (nodes, services, log level, time range, hardware logs).
Helios alerting drill: name the four channels (email, SNMP, syslog, webhook), the SMTP API PUT /v2/clusters/smtp, the validation step POST /validate, and the rule creator createAlertNotificationRule.