Study Guide: Cluster Design, Sizing, and Capacity Planning

Learning Objectives

Design Cohesity clusters that meet specific RPO, RTO, and retention SLAs derived from a customer's workload profile.
Apply Cohesity sizing tools and capacity formulas to translate FETB, change rate, and retention into raw and usable cluster capacity.
Plan node count, fault tolerance margin (RF vs. EC), and growth headroom (N+1, year-3 horizon) for production deployments.
Justify hardware family selection (C4000 hybrid, C5000/C5200 performance, C6000 dense, all-flash variants) based on workload mix and SLA priority.
Build a capacity plan that survives technology refresh cycles, tiering decisions, and growth shocks.

If Chapters 1 and 2 explained what Cohesity is and how SpanFS holds data together, this chapter answers the question every architect actually has to answer: "How big should the cluster be, and what nodes do I buy?" Sizing is part arithmetic, part workload psychology, and part risk management.

1. Workload Profiling Inputs

Pre-Quiz: Workload Profiling

1. Which three numbers form the foundational inputs for any Cohesity sizing exercise?

Node count, RAM per node, and NVRAM size
FETB, change rate, and retention
CPU clock speed, network bandwidth, and disk IOPS
Replication factor, EC stripe width, and fault domain count

2. A customer protects 200 TB FETB of homogeneous Windows VMs versus 200 TB FETB of mixed-media NAS. What should the architect expect about back-end storage consumption?

Both will consume identical back-end capacity
The VMs will consume far more BETB because they have more files
The VMs will consume far less BETB because OS dedupe is excellent across similar VMs
The NAS will always compress better than the VMs

3. Which scenario most clearly indicates a performance-driven sizing rather than capacity-driven?

7-year compliance retention on a stable 100 TB FETB
Sub-minute RTO with concurrent instant mass restore of dozens of VMs
Daily 1% change rate on long-lived archive data
A primarily NDMP NAS workload with monthly retention

FETB, Change Rate, and Retention

Three numbers dominate any backup sizing conversation:

Front-End TB (FETB) — the total source-side data footprint of protected workloads, measured before Cohesity touches it. FETB is the licensing unit.
Change Rate — percentage of FETB that mutates between backups. Typical: 1–3% daily for VM/NAS, 5–15% for transactional databases, up to 30% for log-heavy workloads.
Retention — how long each recovery point must remain. A typical GFS policy: 30 daily / 12 weekly / 12 monthly / 7 yearly.

BETB ≈ FETB × full_compression_factor + (FETB × change_rate × retention_days × incremental_factor)
Effective_BETB = BETB / (dedupe_ratio × compression_ratio)

A 1% change rate sounds small until you multiply by 365 retention days: 0.01 × 365 = 3.65 full copies' worth of incrementals, or 4.65 copies once the initial full is counted.
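As a minimal sketch of the two formulas above, the following Python computes BETB and Effective_BETB. The function names, the default factors of 1.0, and the 3.0 dedupe and 1.5 compression ratios are illustrative assumptions, not Sizer defaults.

```python
def backend_tb(fetb_tb, daily_change_rate, retention_days,
               full_compression_factor=1.0, incremental_factor=1.0):
    """Logical back-end TB: one full plus the retained daily incrementals."""
    full = fetb_tb * full_compression_factor
    incrementals = (fetb_tb * daily_change_rate * retention_days
                    * incremental_factor)
    return full + incrementals


def effective_backend_tb(betb_tb, dedupe_ratio, compression_ratio):
    """Back-end TB after dividing by the combined reduction ratios."""
    return betb_tb / (dedupe_ratio * compression_ratio)


fetb = 100.0  # assumed 100 TB front-end footprint
betb = backend_tb(fetb, daily_change_rate=0.01, retention_days=365)
incremental_copies = (betb - fetb) / fetb

print(round(incremental_copies, 2))                    # 3.65
print(round(effective_backend_tb(betb, 3.0, 1.5), 1))  # 103.3
```

With both factors left at 1.0, the incrementals alone add 3.65 times the FETB, which is where the retention multiplier in later sections comes from.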

Workload Categorization

Not all FETB is equal. The Sizer expects a breakdown by workload type because each has different reduction ratios and ingest patterns.

Performance vs. Capacity-Driven Sizing

Workload	Daily Change	Reduction Ratio	Sizing Notes
VMware / Hyper-V / AHV VMs	1–3%	6:1 to 10:1	CBT/RCT incremental forever; OS dedupe excellent
Physical (Linux/Windows agent)	1–5%	3:1 to 5:1	Lower dedupe than VMs
NAS (SMB/NFS/NDMP)	1–3%	2:1 to 3:1	Mixed media compresses poorly
Databases (Oracle/SQL/HANA)	5–15%	3:1 to 5:1	Frequent log backups; engine compression limits dedupe
Microsoft 365	1–2%	2:1 to 3:1	API rate-limited
Object/S3	varies	1.5:1 to 2:1	Often pre-compressed
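The way a workload mix dilutes the average reduction ratio can be sketched as a capacity-weighted blend. The ratios below are midpoints picked from the table purely for illustration, and `blended_reduction` is a hypothetical helper, not a Sizer function.

```python
def blended_reduction(workloads):
    """Return the blended reduction ratio for (fetb_tb, reduction_ratio) pairs.

    Stored capacity adds up per workload, so the blend is total FETB divided
    by total stored TB (a weighted harmonic mean), not a simple average.
    """
    total_fetb = sum(fetb for fetb, _ in workloads)
    total_stored = sum(fetb / ratio for fetb, ratio in workloads)
    return total_fetb / total_stored


# Assumed mix: 350 TB VMs at 8:1, 100 TB NAS at 2.5:1, 50 TB databases at 4:1.
mix = [(350.0, 8.0), (100.0, 2.5), (50.0, 4.0)]
blend = blended_reduction(mix)
print(f"blended reduction: {blend:.2f}:1")  # blended reduction: 5.19:1
```

A FETB-weighted arithmetic average of the same ratios would give 6.5:1, which overstates the reduction; the poorly reducing NAS share consumes a disproportionate part of the back end.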

flowchart TD
    A[FETB by Workload] --> E[Initial Full Backend]
    B[Daily Change Rate] --> F[Incremental Volume]
    C["Retention Policy<br/>D/W/M/Y"] --> G[Cumulative Retention Multiplier]
    D[Dedupe + Compression Ratios] --> H[Reduction Factor]
    E --> I[Effective BETB Required]
    F --> I
    G --> I
    H --> I
    I --> J[Apply Growth + N+1 Headroom]
    J --> K[Target Usable Capacity]
    K --> L{Choose EC Scheme}
    L --> M[Compute Raw Capacity]
    M --> N[Select ReadyNode Family]
    N --> O[Final Node Count]

Key Points

FETB is the licensing input; change rate is the daily multiplier; retention is the time multiplier — all three must be reasonably accurate before any capacity number can be trusted.
VM workloads achieve 6:1–10:1 reduction; NAS and pre-compressed object data may only see 2:1.
Mixing workloads dilutes the average reduction ratio — re-run the Sizer rather than extrapolating.
Capacity-driven sizings are constrained by disk; performance-driven sizings by CPU/NVRAM/flash.
Distinguish first-full ingest math from steady-state ingest math — they have very different windows.

Post-Quiz: Workload Profiling

1. Which three numbers form the foundational inputs for any Cohesity sizing exercise?

Node count, RAM per node, and NVRAM size
FETB, change rate, and retention
CPU clock speed, network bandwidth, and disk IOPS
Replication factor, EC stripe width, and fault domain count

2. A customer protects 200 TB FETB of homogeneous Windows VMs versus 200 TB FETB of mixed-media NAS. What should the architect expect about back-end storage consumption?

3. Which scenario most clearly indicates a performance-driven sizing rather than capacity-driven?

2. Sizing Tools and Calculators

Pre-Quiz: Sizing Tools

1. In the capacity transformation chain, what immediately follows "Raw Capacity"?

Effective Capacity, after dedupe and compression
Usable Capacity, after applying RF or EC overhead
Protectable FETB, after dividing by retention multiplier
Available Capacity, after subtracting metadata reserve

2. EC 6:2 versus RF 3 — how do they compare in fault tolerance and storage cost?

Both tolerate 1 failure; EC 6:2 has lower overhead
RF 3 tolerates more failures than EC 6:2
Both tolerate 2 failures; EC 6:2 has less than half the storage cost but requires 8+ nodes
EC 6:2 tolerates fewer failures than RF 3 but is faster

3. Why do SmartFiles primary workloads use a lower target utilization (~70%) than backup workloads?

SmartFiles cannot support EC 6:2
Primary workloads are less tolerant of full-cluster behavior, so more headroom is required
SmartFiles always uses RF 3 instead of EC
SmartFiles dedupes more aggressively, so less raw is needed

4. A 10-node cluster with 192 TB raw per node uses EC 6:2. What is the usable capacity (before metadata reserve)?

960 TB
1,920 TB
1,440 TB
2,560 TB

The Capacity Transformation Chain

Raw Capacity              (sum of all HDD/NVMe across all nodes)
   ↓ × (D / (D+P)) for EC, or × (1/RF) for replication
Usable Capacity           (after resiliency overhead)
   ↓ − ~5–10% reserved
Available Capacity        (after metadata, snapshots, rebuild reserve)
   ↓ × dedupe × compression
Effective Capacity        (the number on the data sheet)
   ↓ ÷ retention multiplier
Protectable FETB          (the number on the sales quote)

A 10-node cluster of C5066 nodes at 192 TB raw per node = 1,920 TB raw. EC 6:2 yields 1,920 × (6/8) = 1,440 TB usable. Subtract 7% for metadata/rebuild reserve = ~1,340 TB available. Apply 4.5x reduction for VM workloads = ~6,030 TB effective. Divide by ~12x retention multiplier = ~500 TB protectable FETB.
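The chain above can be reproduced step by step in a few lines. This is a sketch under the same inputs as the worked numbers (7% reserve, 4.5x reduction, 12x retention multiplier); `capacity_chain` and its parameter names are hypothetical and not part of any Cohesity tool.

```python
def capacity_chain(nodes, raw_tb_per_node, data_stripes, parity_stripes,
                   reserve_fraction, reduction_ratio, retention_multiplier):
    """Walk Raw -> Usable -> Available -> Effective -> Protectable FETB."""
    raw = nodes * raw_tb_per_node
    usable = raw * data_stripes / (data_stripes + parity_stripes)
    available = usable * (1 - reserve_fraction)
    effective = available * reduction_ratio
    protectable = effective / retention_multiplier
    return {
        "raw": raw,
        "usable": usable,
        "available": available,
        "effective": effective,
        "protectable_fetb": protectable,
    }


chain = capacity_chain(nodes=10, raw_tb_per_node=192,
                       data_stripes=6, parity_stripes=2,
                       reserve_fraction=0.07, reduction_ratio=4.5,
                       retention_multiplier=12)
for stage, tb in chain.items():
    print(f"{stage:>16}: {tb:8.1f} TB")
```

The unrounded results are 1,339.2 TB available, 6,026.4 TB effective, and 502.2 TB protectable FETB, which the text rounds to ~1,340, ~6,030, and ~500. Because every stage multiplies the previous one, an error in any single input carries through to the final figure.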

RF vs. EC Capacity Comparison

Scheme	Min Nodes	Failures Tolerated	Storage Overhead	Usable / Raw
RF 2	3	1	100%	50.0%
RF 3	4	2	200%	33.3%
EC 2:1	3	1	50%	66.7%
EC 4:1	5	1	25%	80.0%
EC 4:2	6	2	50%	66.7%
EC 5:2	7	2	40%	71.4%
EC 6:2	8	2	33%	75.0%

EC 6:2 delivers RF 3's fault tolerance at less than half the storage cost — but you need 8+ nodes. Cluster size unlocks EC efficiency.
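The overhead and usable columns in the table follow directly from the scheme parameters. A short sketch, with hypothetical helper names, that derives them and checks the "less than half the storage cost" claim:

```python
def ec_profile(data, parity):
    """Erasure coding D:P, where overhead is parity relative to data."""
    return {
        "failures_tolerated": parity,
        "overhead_pct": 100 * parity / data,
        "usable_pct": 100 * data / (data + parity),
    }


def rf_profile(copies):
    """Replication factor N keeps N full copies of every chunk."""
    return {
        "failures_tolerated": copies - 1,
        "overhead_pct": 100 * (copies - 1),
        "usable_pct": 100 / copies,
    }


ec62 = ec_profile(6, 2)
rf3 = rf_profile(3)

# Raw TB that must be purchased per usable TB under each scheme.
raw_per_usable_ec = 100 / ec62["usable_pct"]  # about 1.33
raw_per_usable_rf = 100 / rf3["usable_pct"]   # about 3.0
cost_ratio = raw_per_usable_ec / raw_per_usable_rf

print(ec62["failures_tolerated"], rf3["failures_tolerated"])  # 2 2
print(round(cost_ratio, 2))                                   # 0.44
```

Both schemes survive two failures, but EC 6:2 needs roughly 44% of the raw capacity RF 3 needs for the same usable TB.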

Cloud Edition and CloudArchive Sizing

Cloud Edition sizings cite specific cloud SKUs (e.g., AWS i3en.6xlarge), account for object-storage latency, and must include monthly egress estimates. CloudArchive sizings are simpler: cumulative archived data × storage class price + estimated monthly recall + ingestion bandwidth. Glacier and Azure Archive offer 80–90% cost reduction but introduce 4–12 hr rehydration latency.
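The CloudArchive cost terms described above can be sketched as a simple monthly estimate. The prices below are round placeholders chosen only to make the arithmetic visible; they are not real cloud list prices, and the function name is hypothetical. Ingestion bandwidth is left out because it is a throughput constraint rather than a cost line.

```python
def cloudarchive_monthly_cost(archived_tb, storage_price_per_tb,
                              recall_tb, recall_price_per_tb):
    """Monthly estimate: cumulative archive storage plus expected recall."""
    storage_cost = archived_tb * storage_price_per_tb
    recall_cost = recall_tb * recall_price_per_tb
    return storage_cost + recall_cost


# Placeholder inputs: 800 TB archived at 1.00 per TB-month,
# 5 TB recalled per month at 20.00 per TB.
monthly = cloudarchive_monthly_cost(archived_tb=800,
                                    storage_price_per_tb=1.0,
                                    recall_tb=5,
                                    recall_price_per_tb=20.0)
print(monthly)  # 900.0
```

Substituting the current price sheet for the chosen storage class is the only change needed to turn this into a usable estimate.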

Key Points

The Sizer is a structured wrapper around the same arithmetic an architect can do by hand — understand both.
Capacity transforms compound: Raw → Usable → Available → Effective → Protectable FETB. Mistakes amplify.
EC 6:2 matches RF 3 fault tolerance with 33% overhead instead of 200% — but requires 8+ nodes.
SmartFiles primary workloads target ~70% utilization (vs. 80% for backup) because primary I/O is less tolerant of full-cluster behavior.
Always validate dedupe ratios with a pilot — optimistic assumptions cause double-digit capacity misses.

Post-Quiz: Sizing Tools

1. In the capacity transformation chain, what immediately follows "Raw Capacity"?

2. EC 6:2 versus RF 3 — how do they compare in fault tolerance and storage cost?

3. Why do SmartFiles primary workloads use a lower target utilization (~70%) than backup workloads?

4. A 10-node cluster with 192 TB raw per node uses EC 6:2. What is the usable capacity (before metadata reserve)?

960 TB
1,920 TB
1,440 TB
2,560 TB

3. Node Selection and Cluster Topology

Pre-Quiz: Node Selection

1. A customer needs sub-minute RTO for instant mass restore of dozens of VMs. Which ReadyNode family is most appropriate?

C4000 (Entry/Edge)
C6000 dense hybrid
C5066 mainstream hybrid
C5200 / C6200 all-flash

2. What is the minimum supported cluster size for a Cohesity production deployment?

1 node
3 nodes
6 nodes
8 nodes

3. A customer's existing C5066 cluster needs more capacity at year-3, but performance demands have not grown. What is the recommended approach?

Forklift the entire cluster to all-flash C6200 nodes
Add C6000 dense nodes; SpanFS migrates cold data to denser nodes while keeping hot data on performance nodes
Replace C5066 nodes one-by-one with C4000 nodes
Rebuild on a new cluster; Cohesity does not support mixed-node clusters

4. When does brick mode provide architectural value?

For all clusters of 3 or more nodes by default
When dense C6000 nodes hold so much TB that losing a whole node would exceed rebuild capacity
Only on all-flash clusters using PCIe Gen5
For Robo Edition deployments with 1-2 nodes

ReadyNode Families

Family	Form Factor	Storage	Optimized For
C4000	2U single, 1× Xeon 8-core	8× HDD slots + NVMe metadata	Entry/edge, branch ROBO
C5066 (hybrid)	2U, 1× Xeon 16-core, 128 GB RAM	54 TB HDD + 3.2 TB NVMe / node	Mainstream backup — ~70% of designs
C5200	2U, 4-node block, 5th-Gen Xeon, PCIe Gen5	216 TB HDD + 12.8 TB flash / block (or all-NVMe)	Density + performance
C6000 (dense)	Dense 2U	168–192 TB raw HDD / node + flash	Long-term retention, archive
C6200	2U all-flash dense	All NVMe	High-perf retention, low power

Selection Heuristics

C4000 — edge/ROBO, ≤4 nodes, low concurrency.
C5000/C5066 — mainstream datacenter, 6+ nodes, mixed workloads. Default for ~70% of CCAE scenarios.
C5200 — performance density per RU; modernization/refresh deals.
C6000 — retention dominates (≥1 year), TB-per-watt matters.
All-flash (C5200/C6200) — sub-minute RTO, instant mass restore, MongoDB/Cassandra restore SLAs.

Minimum Cluster Sizes & EC Constraints

Cluster Size	Smallest Usable EC	Recommended EC
3 nodes	RF 2 only	RF 2
4–5 nodes	EC 2:1	EC 2:1 or RF 2
6–7 nodes	EC 4:2	EC 4:2 or 5:2
8+ nodes	EC 6:2	EC 6:2 (best efficiency at 2-failure tolerance)
12+ nodes	EC 6:2	EC 6:2 (more parallel rebuild domains)
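The table can be read as a simple lookup from node count to resiliency scheme. The sketch below encodes only the rows shown, returning the first scheme listed in the "Recommended EC" column; `recommended_scheme` is a hypothetical helper.

```python
def recommended_scheme(node_count):
    """Map cluster size to the first recommended scheme in the table."""
    if node_count < 3:
        raise ValueError("the table starts at 3 nodes")
    if node_count == 3:
        return "RF 2"
    if node_count <= 5:
        return "EC 2:1"
    if node_count <= 7:
        return "EC 4:2"
    return "EC 6:2"  # 8+ nodes, including the 12+ row


for n in (3, 5, 7, 8, 12):
    print(n, recommended_scheme(n))
```

The jump at 8 nodes is the one that matters for sizing: a capacity answer of 6 or 7 nodes is often rounded up to 8 so the cluster can run EC 6:2.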

Mixed-Node Clusters and Brick Mode

Cohesity supports heterogeneous clusters — different ReadyNode models in a single cluster. SpanFS auto-rebalances and routes hot data to flash, cold data to HDD. Brick mode subdivides a single dense node into multiple fault domains, useful when a C6000 holds enough TB that losing the whole node would exceed rebuild capacity.

flowchart TD
    S["Start: Workload + SLA Profile"] --> Q1{"Edge / ROBO site<br/>≤ 4 nodes?"}
    Q1 -->|Yes| C4["C4000<br/>Entry / Edge Hybrid"]
    Q1 -->|No| Q2{"Sub-minute RTO<br/>or instant mass restore?"}
    Q2 -->|Yes| Q3{"Density per RU<br/>also matters?"}
    Q3 -->|Yes| C6200["C6200<br/>All-Flash Dense"]
    Q3 -->|No| C5200AF["C5200 All-Flash<br/>PCIe Gen5 NVMe"]
    Q2 -->|No| Q4{"Retention dominates<br/>≥ 1 year on cluster?"}
    Q4 -->|Yes| C6000["C6000<br/>Dense Hybrid"]
    Q4 -->|No| Q5{"4-node-per-2U<br/>density required?"}
    Q5 -->|Yes| C5200H["C5200 Hybrid<br/>Performance Density"]
    Q5 -->|No| C5066["C5066<br/>Mainstream Hybrid"]
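The decision flow above translates into a short chain of checks evaluated in the same order. The `SiteProfile` fields and `select_family` are hypothetical names introduced for this sketch; the branch order and outcomes come from the flowchart.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    edge_site_up_to_4_nodes: bool = False
    sub_minute_rto: bool = False          # or instant mass restore
    density_per_ru_matters: bool = False
    retention_dominates: bool = False     # one year or more kept on cluster
    needs_4_node_2u_density: bool = False


def select_family(p: SiteProfile) -> str:
    """Evaluate the questions in flowchart order and return a node family."""
    if p.edge_site_up_to_4_nodes:
        return "C4000"
    if p.sub_minute_rto:
        return "C6200" if p.density_per_ru_matters else "C5200 All-Flash"
    if p.retention_dominates:
        return "C6000"
    if p.needs_4_node_2u_density:
        return "C5200 Hybrid"
    return "C5066"


print(select_family(SiteProfile()))                          # C5066
print(select_family(SiteProfile(sub_minute_rto=True)))       # C5200 All-Flash
print(select_family(SiteProfile(retention_dominates=True)))  # C6000
```

Order matters: an RTO requirement is checked before retention, so a site with both sub-minute RTO and long retention lands on all-flash rather than dense hybrid.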

Key Points

Choose nodes by SLA, not by price. C5066 is the default mainstream pick; reach for all-flash only when RTO demands it.
Minimum production cluster is 3 nodes; EC 6:2 unlocks at 8 nodes — cluster size dictates achievable efficiency.
Heterogeneous clusters allow capacity-only growth at year-3 without forklift — add C6000 dense nodes to an existing C5066 cluster.
Brick mode trades simplicity for placement flexibility on dense nodes; uncommon outside C6000 deployments.
Performance is gated by the slowest node class for chunks placed on it — avoid mixing radically different CPU classes.

Post-Quiz: Node Selection

1. A customer needs sub-minute RTO for instant mass restore of dozens of VMs. Which ReadyNode family is most appropriate?

C4000 (Entry/Edge)
C6000 dense hybrid
C5066 mainstream hybrid
C5200 / C6200 all-flash

2. What is the minimum supported cluster size for a Cohesity production deployment?

1 node
3 nodes
6 nodes
8 nodes

3. A customer's existing C5066 cluster needs more capacity at year-3, but performance demands have not grown. What is the recommended approach?

4. When does brick mode provide architectural value?

4. Capacity Planning Over Time

Pre-Quiz: Capacity Planning

1. What is the minimum recommended planning horizon when sizing a new cluster?

Day-one demand only
Year-1 demand
Year-3 protected FETB at minimum, ideally year-5
Year-10 fully amortized

2. For a 10-node cluster with EC 6:2 and 192 TB raw per node, what does a sensible practical ceiling look like (after 80% target and N+1 reserve)?

~1,440 TB: the full usable capacity
~960 TB: 80% of usable minus one node's raw for N+1
~1,920 TB: the raw sum
~500 TB: only protectable FETB

3. What is the difference between CloudTier and CloudArchive?

CloudTier is for backup; CloudArchive is for primary workloads
CloudTier is a transparent capacity extension managed by the cluster; CloudArchive is a logical retention destination for policy-driven movement
They are the same product with different names
CloudTier requires Glacier; CloudArchive only supports S3 Standard

Modeling Growth and Tech-Refresh

Apply 10–25% YoY growth depending on industry. A 5-year compounding view of a 500 TB FETB starting baseline at 15% YoY:

Year	FETB	Effective BETB (4.5x reduction, 12x retention)
0	500 TB	1,333 TB
1	575 TB	1,533 TB
2	661 TB	1,763 TB
3	760 TB	2,028 TB
4	875 TB	2,332 TB
5	1,006 TB	2,682 TB
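The growth table is plain compounding, so it can be regenerated for any starting FETB or growth rate. A sketch using the table's own assumptions (15% YoY, 4.5x reduction, 12x retention multiplier); `project_growth` is a hypothetical helper and values are rounded to the nearest TB, so the last digit may differ by one from a hand-rounded table.

```python
def project_growth(fetb_tb, yoy_growth, years,
                   reduction_ratio=4.5, retention_multiplier=12):
    """Return (year, FETB, effective BETB) rows with compound growth."""
    rows = []
    for year in range(years + 1):
        fetb = fetb_tb * (1 + yoy_growth) ** year
        betb = fetb * retention_multiplier / reduction_ratio
        rows.append((year, round(fetb), round(betb)))
    return rows


rows = project_growth(fetb_tb=500, yoy_growth=0.15, years=5)
for year, fetb, betb in rows:
    print(f"year {year}: FETB {fetb:>5} TB, effective BETB {betb:>5} TB")
```

At 15% growth the protected footprint roughly doubles in five years, which is why the planning horizon, not day-one demand, drives node count.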

Tiering Strategy

Tier	Media	Latency	Use Case
Hot	Local NVMe / flash	<1 ms	Last 7–30 days, instant recovery
Warm	Local HDD	5–10 ms	30–180 days
Cold (CloudTier)	S3 Standard / Azure Cool	50–200 ms	6–24 months
Archive (CloudArchive)	Glacier / Azure Archive	4–12 hr rehydrate	1+ year compliance

Reserve Capacity Discipline

N+1 reserve — at least one full node's worth of free space, so SpanFS can rebalance after node failure without exceeding 100%.
80% utilization ceiling — above this threshold, performance degrades and rebuild margins shrink. Helios alerts trigger at 80%.

For a 10-node, 192 TB-per-node, EC 6:2 cluster: usable = 1,440 TB → 80% = 1,152 TB → minus 192 TB N+1 = ~960 TB practical ceiling (in usable TB, before data reduction). Sizing to the full 1,440 TB usable sets the customer up for failure.
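The two reserve rules combine into a single calculation. The sketch below follows the arithmetic in the text, including its convention of subtracting one node's raw capacity for N+1; `practical_ceiling_tb` is a hypothetical name.

```python
def practical_ceiling_tb(nodes, raw_tb_per_node, data_stripes, parity_stripes,
                         utilization_ceiling=0.80):
    """Usable TB after the utilization ceiling and an N+1 node reserve."""
    raw = nodes * raw_tb_per_node
    usable = raw * data_stripes / (data_stripes + parity_stripes)
    at_ceiling = usable * utilization_ceiling
    return at_ceiling - raw_tb_per_node  # reserve one node's raw capacity


ceiling = practical_ceiling_tb(nodes=10, raw_tb_per_node=192,
                               data_stripes=6, parity_stripes=2)
print(round(ceiling))  # 960
```

The result is two thirds of the usable figure, so a plan that fills usable capacity to 100% overcommits the cluster by about 50% relative to this ceiling.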

Helios Reporting and Forecasting

Capacity trend reports with projected exhaustion dates
Per-View / Per-Protection-Group consumption to find the workload responsible for unexpected growth
Reduction-ratio trending (sharp drops indicate workload composition change)
Multi-cluster fleet view for SPs and large enterprises
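To illustrate what a projected exhaustion date means, the sketch below extrapolates a straight line through two usage samples until it reaches a capacity ceiling. It is not how Helios computes its forecasts and calls no Helios API; the sample data, function name, and linear model are all assumptions made for illustration.

```python
import math
from datetime import date, timedelta


def projected_exhaustion(samples, ceiling_tb):
    """Extrapolate a straight line through the first and last usage samples.

    samples: list of (date, used_tb) tuples, oldest first, spanning at
    least one day. Returns the date usage is projected to reach
    ceiling_tb, or None if usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    growth_per_day = (last_used - first_used) / elapsed_days
    if growth_per_day <= 0:
        return None
    remaining_tb = ceiling_tb - last_used
    if remaining_tb <= 0:
        return last_day  # already at or past the ceiling
    return last_day + timedelta(days=math.ceil(remaining_tb / growth_per_day))


# Hypothetical trend: 800 TB on 1 Jan 2024, 890 TB on 31 Mar 2024 (90 days).
history = [(date(2024, 1, 1), 800.0), (date(2024, 3, 31), 890.0)]
forecast = projected_exhaustion(history, ceiling_tb=1152.0)
print(forecast)  # 2024-12-18
```

Using the 80% ceiling (1,152 TB of 1,440 TB usable) rather than 100% as the target gives the lead time needed to order and install nodes.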

Worked Example: 500 TB FETB Sizing

Customer: 500 TB FETB (350 TB VMware + 100 TB NAS + 50 TB SQL), 3% blended daily change, 30-day retention.

A 9-node C5066 cluster is the typical exam-correct response: it satisfies the EC 6:2 minimum (8 nodes) plus N+1 (1 node), uses the mainstream node, and leaves growth headroom.
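The intermediate steps of this example are not shown in the text. One set of assumptions that lands on nine nodes is sketched below: the 4.5x reduction used elsewhere in this chapter, an 80% utilization target, EC 6:2, and 54 TB of HDD per C5066 node from the ReadyNode table. All of these inputs, and the `node_count` helper itself, are assumptions for illustration rather than the guide's own worked solution.

```python
import math


def node_count(fetb_tb, daily_change, retention_days, reduction_ratio,
               raw_tb_per_node, data_stripes=6, parity_stripes=2,
               utilization=0.80):
    """Nodes needed for capacity, raised to the EC minimum, plus N+1."""
    logical_tb = fetb_tb * (1 + daily_change * retention_days)
    stored_tb = logical_tb / reduction_ratio
    usable_needed = stored_tb / utilization
    raw_needed = usable_needed * (data_stripes + parity_stripes) / data_stripes
    capacity_nodes = math.ceil(raw_needed / raw_tb_per_node)
    ec_minimum = data_stripes + parity_stripes
    return max(capacity_nodes, ec_minimum) + 1  # +1 for the N+1 reserve


nodes = node_count(fetb_tb=500, daily_change=0.03, retention_days=30,
                   reduction_ratio=4.5, raw_tb_per_node=54)
print(nodes)  # 9
```

Under these assumptions capacity alone calls for 7 nodes, so the 8-node EC 6:2 minimum is the binding constraint and N+1 brings the total to 9.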

Key Points

Size for year-3 protected FETB minimum — ideally year-5 — with explicit YoY growth assumption.
Always reserve N+1 capacity (one full node's raw) and target 80% utilization as the operational ceiling.
Tier cold data: hot to local flash, warm to local HDD, cold to CloudTier (S3 Standard / Azure Cool), archive to CloudArchive (Glacier / Azure Archive).
CloudTier is a transparent capacity extension; CloudArchive is a policy-driven retention destination — they are not interchangeable.
Helios trend reports turn theoretical sizing into operational practice — forecast exhaustion dates and order nodes before the cliff.

Post-Quiz: Capacity Planning

1. What is the minimum recommended planning horizon when sizing a new cluster?

Day-one demand only
Year-1 demand
Year-3 protected FETB at minimum, ideally year-5
Year-10 fully amortized

2. For a 10-node cluster with EC 6:2 and 192 TB raw per node, what does a sensible practical ceiling look like (after 80% target and N+1 reserve)?

~1,440 TB: the full usable capacity
~960 TB: 80% of usable minus one node's raw for N+1
~1,920 TB: the raw sum
~500 TB: only protectable FETB

3. What is the difference between CloudTier and CloudArchive?
