Chapter 3: Cluster Design, Sizing, and Capacity Planning

Learning Objectives

If Chapters 1 and 2 explained what Cohesity is and how SpanFS holds data together, this chapter answers the question every architect actually has to answer: "How big should the cluster be, and what nodes do I buy?" Sizing is part arithmetic, part workload psychology, and part risk management.

1. Workload Profiling Inputs

Pre-Quiz: Workload Profiling

1. Which three numbers form the foundational inputs for any Cohesity sizing exercise?

   a. Node count, RAM per node, and NVRAM size
   b. FETB, change rate, and retention
   c. CPU clock speed, network bandwidth, and disk IOPS
   d. Replication factor, EC stripe width, and fault domain count

2. A customer protects 200 TB FETB of homogeneous Windows VMs versus 200 TB FETB of mixed-media NAS. What should the architect expect about back-end storage consumption?

   a. Both will consume identical back-end capacity
   b. The VMs will consume far more BETB because they have more files
   c. The VMs will consume far less BETB because OS dedupe is excellent across similar VMs
   d. The NAS will always compress better than the VMs

3. Which scenario most clearly indicates a performance-driven sizing rather than capacity-driven?

   a. 7-year compliance retention on a stable 100 TB FETB
   b. Sub-minute RTO with concurrent instant mass restore of dozens of VMs
   c. Daily 1% change rate on long-lived archive data
   d. A primarily NDMP NAS workload with monthly retention

FETB, Change Rate, and Retention

Three numbers dominate any backup sizing conversation:

  1. Front-End TB (FETB) — the total source-side data footprint of protected workloads, measured before Cohesity touches it. FETB is the licensing unit.
  2. Change Rate — percentage of FETB that mutates between backups. Typical: 1–3% daily for VM/NAS, 5–15% for transactional databases, up to 30% for log-heavy workloads.
  3. Retention — how long each recovery point must remain. A typical GFS policy: 30 daily / 12 weekly / 12 monthly / 7 yearly.

Together they give a first-order estimate of back-end TB (BETB):

BETB ≈ (FETB × full_compression_factor) + (FETB × change_rate × retention_days × incremental_factor)
Effective_BETB = BETB / (dedupe_ratio × compression_ratio)

A 1% change rate sounds small until you multiply by 365 retention days: that is 3.65 full copies' worth of incrementals before reduction, or 4.65 copies once the initial full is counted.
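
The two formulas above can be sketched in Python. This is a minimal illustration rather than the Sizer's actual model: the function names are hypothetical, and the default of 1.0 for `full_compression_factor` and `incremental_factor` is an assumption chosen so that the raw multiplication stays visible.

```python
def betb(fetb_tb, change_rate, retention_days,
         full_compression_factor=1.0, incremental_factor=1.0):
    """Back-end TB before dedupe/compression: one full plus retained incrementals."""
    full = fetb_tb * full_compression_factor
    incrementals = fetb_tb * change_rate * retention_days * incremental_factor
    return full + incrementals


def effective_betb(betb_tb, dedupe_ratio, compression_ratio):
    """Back-end TB after dividing by the combined reduction ratios."""
    return betb_tb / (dedupe_ratio * compression_ratio)


# 100 TB FETB, 1% daily change, 365 retained daily recovery points
total = betb(100, 0.01, 365)
print(round(total, 2))                # 465.0 TB in total
print(round((total - 100) / 100, 2))  # 3.65 full copies' worth of incrementals
```

Counting the initial full, the cluster holds 4.65 copies of the source footprint before any reduction is applied.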

[Animation: Sizing Inputs → Outputs Flow. Sequential reveal: inputs (FETB by workload: VM / NAS / DB / M365; change rate: 1–15% daily; retention policy: 30D / 12W / 12M / 7Y; dedupe + compression: 2:1 to 10:1) → effective BETB (+ growth headroom + N+1 reserve) → usable capacity via EC scheme → final node count (ReadyNode family + EC scheme).]

Workload Categorization

Not all FETB is equal. The Sizer expects a breakdown by workload type because each has different reduction ratios and ingest patterns.

| Workload | Daily Change | Reduction Ratio | Sizing Notes |
| --- | --- | --- | --- |
| VMware / Hyper-V / AHV VMs | 1–3% | 6:1 to 10:1 | CBT/RCT incremental forever; OS dedupe excellent |
| Physical (Linux/Windows agent) | 1–5% | 3:1 to 5:1 | Lower dedupe than VMs |
| NAS (SMB/NFS/NDMP) | 1–3% | 2:1 to 3:1 | Mixed media compresses poorly |
| Databases (Oracle/SQL/HANA) | 5–15% | 3:1 to 5:1 | Frequent log backups; engine compression limits dedupe |
| Microsoft 365 | 1–2% | 2:1 to 3:1 | API rate-limited |
| Object/S3 | varies | 1.5:1 to 2:1 | Often pre-compressed |
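
To see why the workload breakdown matters, the reduction ranges in the table can be applied to equal amounts of FETB. The sketch below is illustrative only: the dictionary keys and the `reduced_full_tb` helper are hypothetical names, and it models a single full copy, ignoring incrementals and retention.

```python
# Reduction-ratio ranges (low, high) copied from the workload table.
REDUCTION = {
    "vm": (6.0, 10.0),
    "physical": (3.0, 5.0),
    "nas": (2.0, 3.0),
    "database": (3.0, 5.0),
    "m365": (2.0, 3.0),
    "object": (1.5, 2.0),
}


def reduced_full_tb(fetb_tb, workload):
    """Return (best_case, worst_case) back-end TB for one full copy."""
    low, high = REDUCTION[workload]
    return fetb_tb / high, fetb_tb / low


for name in ("vm", "nas"):
    best, worst = reduced_full_tb(200, name)
    print(f"{name}: {best:.1f} to {worst:.1f} TB")
# vm: 20.0 to 33.3 TB
# nas: 66.7 to 100.0 TB
```

Under these ranges, 200 TB of NAS needs roughly two to five times the back-end capacity of 200 TB of similar VMs.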

Performance vs. Capacity-Driven Sizing

flowchart TD
    A[FETB by Workload] --> E[Initial Full Backend]
    B[Daily Change Rate] --> F[Incremental Volume]
    C[Retention Policy<br/>D/W/M/Y] --> G[Cumulative Retention Multiplier]
    D[Dedupe + Compression Ratios] --> H[Reduction Factor]
    E --> I[Effective BETB Required]
    F --> I
    G --> I
    H --> I
    I --> J[Apply Growth + N+1 Headroom]
    J --> K[Target Usable Capacity]
    K --> L{Choose EC Scheme}
    L --> M[Compute Raw Capacity]
    M --> N[Select ReadyNode Family]
    N --> O[Final Node Count]

Key Points

Post-Quiz: Workload Profiling

1. Which three numbers form the foundational inputs for any Cohesity sizing exercise?

   a. Node count, RAM per node, and NVRAM size
   b. FETB, change rate, and retention
   c. CPU clock speed, network bandwidth, and disk IOPS
   d. Replication factor, EC stripe width, and fault domain count

2. A customer protects 200 TB FETB of homogeneous Windows VMs versus 200 TB FETB of mixed-media NAS. What should the architect expect about back-end storage consumption?

   a. Both will consume identical back-end capacity
   b. The VMs will consume far more BETB because they have more files
   c. The VMs will consume far less BETB because OS dedupe is excellent across similar VMs
   d. The NAS will always compress better than the VMs

3. Which scenario most clearly indicates a performance-driven sizing rather than capacity-driven?

   a. 7-year compliance retention on a stable 100 TB FETB
   b. Sub-minute RTO with concurrent instant mass restore of dozens of VMs
   c. Daily 1% change rate on long-lived archive data
   d. A primarily NDMP NAS workload with monthly retention

2. Sizing Tools and Calculators

Pre-Quiz: Sizing Tools

1. In the capacity transformation chain, what immediately follows "Raw Capacity"?

   a. Effective Capacity, after dedupe and compression
   b. Usable Capacity, after applying RF or EC overhead
   c. Protectable FETB, after dividing by retention multiplier
   d. Available Capacity, after subtracting metadata reserve

2. EC 6:2 versus RF 3: how do they compare in fault tolerance and storage cost?

   a. Both tolerate 1 failure; EC 6:2 has lower overhead
   b. RF 3 tolerates more failures than EC 6:2
   c. Both tolerate 2 failures; EC 6:2 has less than half the storage cost but requires 8+ nodes
   d. EC 6:2 tolerates fewer failures than RF 3 but is faster

3. Why do SmartFiles primary workloads use a lower target utilization (~70%) than backup workloads?

   a. SmartFiles cannot support EC 6:2
   b. Primary workloads are less tolerant of full-cluster behavior, so more headroom is required
   c. SmartFiles always uses RF 3 instead of EC
   d. SmartFiles dedupes more aggressively, so less raw is needed

4. A 10-node cluster with 192 TB raw per node uses EC 6:2. What is the usable capacity (before metadata reserve)?

   a. 960 TB
   b. 1,920 TB
   c. 1,440 TB
   d. 2,560 TB

The Capacity Transformation Chain

Raw Capacity              (sum of all HDD/NVMe across all nodes)
   ↓ × (D / (D+P)) for EC, or × (1/RF) for replication
Usable Capacity           (after resiliency overhead)
   ↓ − ~5–10% reserved
Available Capacity        (after metadata, snapshots, rebuild reserve)
   ↓ × dedupe × compression
Effective Capacity        (the number on the data sheet)
   ↓ ÷ retention multiplier
Protectable FETB          (the number on the sales quote)

A 10-node cluster of C6000 dense nodes at 192 TB raw per node = 1,920 TB raw. EC 6:2 yields 1,920 × (6/8) = 1,440 TB usable. Subtract 7% for metadata/rebuild reserve = ~1,340 TB available. Apply 4.5x reduction for VM workloads = ~6,030 TB effective. Divide by ~12x retention multiplier = ~500 TB protectable FETB.
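
The chain can be expressed as one small function. This is a sketch under the same assumptions as the example above (7% reserve, 4.5x reduction, 12x retention multiplier); `capacity_chain` and its parameter names are hypothetical, not part of any Cohesity tool.

```python
def capacity_chain(nodes, raw_per_node_tb, data, parity,
                   reserve=0.07, reduction=4.5, retention_multiplier=12):
    """Walk raw -> usable -> available -> effective -> protectable FETB."""
    raw = nodes * raw_per_node_tb
    usable = raw * data / (data + parity)      # EC D:P efficiency
    available = usable * (1 - reserve)         # metadata / rebuild reserve
    effective = available * reduction          # dedupe x compression
    protectable = effective / retention_multiplier
    return {
        "raw": raw,
        "usable": usable,
        "available": available,
        "effective": effective,
        "protectable_fetb": protectable,
    }


for stage, tb in capacity_chain(10, 192, 6, 2).items():
    print(f"{stage:>16}: {tb:,.0f} TB")
# raw 1,920 / usable 1,440 / available 1,339 / effective 6,026 / protectable_fetb 502
```

The text rounds at each step (1,340, 6,030, ~500), so the unrounded values differ by a few TB.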

[Animation: Capacity Stack. Raw shrinks through each transformation: Raw 1,920 TB (100%) → Usable 1,440 TB (−25%, EC 6:2) → Available 1,340 TB (−7% metadata) → Practical ~960 TB (80% target + N+1 reserve) → Protectable ~500 TB FETB (after retention ÷ 12).]

Each step compounds: a 20% optimistic dedupe ratio combined with a 20% optimistic retention estimate yields a 44% capacity miss (1.2 × 1.2 = 1.44).

RF vs. EC Capacity Comparison

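| Scheme | Min Nodes | Failures Tolerated | Storage Overhead | Usable / Raw |
| --- | --- | --- | --- | --- |
| RF 2 | 3 | 1 | 100% | 50.0% |
| RF 3 | 4 | 2 | 200% | 33.3% |
| EC 2:1 | 3 | 1 | 50% | 66.7% |
| EC 4:1 | 5 | 1 | 25% | 80.0% |
| EC 4:2 | 6 | 2 | 50% | 66.7% |
| EC 5:2 | 7 | 2 | 40% | 71.4% |
| EC 6:2 | 8 | 2 | 33% | 75.0% |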

EC 6:2 delivers RF 3's fault tolerance at less than half the storage cost — but you need 8+ nodes. Cluster size unlocks EC efficiency.
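
The overhead and efficiency columns follow directly from the scheme parameters, as the sketch below shows. The helper names are hypothetical. For EC the minimum node count is taken as D + P, which matches the table; the RF minimum node counts are not derived here.

```python
def ec_stats(data, parity):
    """EC D:P -> (min_nodes, failures_tolerated, overhead_pct, usable_pct)."""
    return (data + parity, parity,
            100 * parity / data, 100 * data / (data + parity))


def rf_stats(copies):
    """Replication factor -> (failures_tolerated, overhead_pct, usable_pct)."""
    return copies - 1, 100 * (copies - 1), 100 / copies


for d, p in [(2, 1), (4, 1), (4, 2), (5, 2), (6, 2)]:
    nodes, fails, overhead, usable = ec_stats(d, p)
    print(f"EC {d}:{p}  nodes>={nodes}  tolerates {fails}  "
          f"overhead {overhead:.0f}%  usable {usable:.1f}%")

# Raw TB needed per usable TB: EC 6:2 versus RF 3
ec_raw = 100 / ec_stats(6, 2)[3]
rf_raw = 100 / rf_stats(3)[2]
print(round(ec_raw / rf_raw, 3))  # 0.444, i.e. less than half the raw capacity
```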

Cloud Edition and CloudArchive Sizing

Cloud Edition sizings cite specific cloud SKUs (e.g., AWS i3en.6xlarge), account for object-storage latency, and must include monthly egress estimates. CloudArchive sizings are simpler: cumulative archived data × storage class price + estimated monthly recall + ingestion bandwidth. Glacier and Azure Archive offer 80–90% cost reduction but introduce 4–12 hr rehydration latency.
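
The CloudArchive arithmetic can be sketched as follows. Everything here is an assumption for illustration: the function names are hypothetical, ingestion bandwidth is treated as a time constraint rather than a charge, and the prices in the usage lines are placeholders, not real cloud pricing.

```python
def cloudarchive_monthly_cost(archived_tb, price_per_tb_month,
                              recall_tb=0.0, recall_price_per_tb=0.0):
    """Storage charge on cumulative archived data plus expected monthly recall."""
    return archived_tb * price_per_tb_month + recall_tb * recall_price_per_tb


def ingest_days(archived_tb, link_gbps, utilization=1.0):
    """Days needed to push archived_tb through a link (decimal TB, 8e12 bits each)."""
    seconds = archived_tb * 8e12 / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


# Placeholder prices only; substitute the provider's current rate card.
print(cloudarchive_monthly_cost(500, 1.0, recall_tb=5, recall_price_per_tb=20.0))  # 600.0
print(round(ingest_days(100, 1.0), 2))  # 9.26 days for 100 TB over a saturated 1 Gbps link
```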

flowchart LR
    R[Raw Capacity<br/>Sum of all HDD/NVMe] --> U[Usable Capacity<br/>After resiliency]
    U --> A[Available Capacity]
    A --> E[Effective Capacity]
    E --> RES[Practical Ceiling]
    RES --> F[Protectable FETB]

Key Points

Post-Quiz: Sizing Tools

1. In the capacity transformation chain, what immediately follows "Raw Capacity"?

   a. Effective Capacity, after dedupe and compression
   b. Usable Capacity, after applying RF or EC overhead
   c. Protectable FETB, after dividing by retention multiplier
   d. Available Capacity, after subtracting metadata reserve

2. EC 6:2 versus RF 3: how do they compare in fault tolerance and storage cost?

   a. Both tolerate 1 failure; EC 6:2 has lower overhead
   b. RF 3 tolerates more failures than EC 6:2
   c. Both tolerate 2 failures; EC 6:2 has less than half the storage cost but requires 8+ nodes
   d. EC 6:2 tolerates fewer failures than RF 3 but is faster

3. Why do SmartFiles primary workloads use a lower target utilization (~70%) than backup workloads?

   a. SmartFiles cannot support EC 6:2
   b. Primary workloads are less tolerant of full-cluster behavior, so more headroom is required
   c. SmartFiles always uses RF 3 instead of EC
   d. SmartFiles dedupes more aggressively, so less raw is needed

4. A 10-node cluster with 192 TB raw per node uses EC 6:2. What is the usable capacity (before metadata reserve)?

   a. 960 TB
   b. 1,920 TB
   c. 1,440 TB
   d. 2,560 TB

3. Node Selection and Cluster Topology

Pre-Quiz: Node Selection

1. A customer needs sub-minute RTO for instant mass restore of dozens of VMs. Which ReadyNode family is most appropriate?

   a. C4000 (Entry/Edge)
   b. C6000 dense hybrid
   c. C5066 mainstream hybrid
   d. C5200 / C6200 all-flash

2. What is the minimum supported cluster size for a Cohesity production deployment?

   a. 1 node
   b. 3 nodes
   c. 6 nodes
   d. 8 nodes

3. A customer's existing C5066 cluster needs more capacity at year 3, but performance demands have not grown. What is the recommended approach?

   a. Forklift the entire cluster to all-flash C6200 nodes
   b. Add C6000 dense nodes; SpanFS migrates cold data to denser nodes while keeping hot data on performance nodes
   c. Replace C5066 nodes one-by-one with C4000 nodes
   d. Rebuild on a new cluster; Cohesity does not support mixed-node clusters

4. When does brick mode provide architectural value?

   a. For all clusters of 3 or more nodes by default
   b. When dense C6000 nodes hold so much TB that losing a whole node would exceed rebuild capacity
   c. Only on all-flash clusters using PCIe Gen5
   d. For Robo Edition deployments with 1-2 nodes

ReadyNode Families

| Family | Form Factor | Storage | Optimized For |
| --- | --- | --- | --- |
| C4000 | 2U single, 1× Xeon 8-core | 8× HDD slots + NVMe metadata | Entry/edge, branch ROBO |
| C5066 (hybrid) | 2U, 1× Xeon 16-core, 128 GB RAM | 54 TB HDD + 3.2 TB NVMe / node | Mainstream backup, ~70% of designs |
| C5200 | 2U, 4-node block, 5th-Gen Xeon, PCIe Gen5 | 216 TB HDD + 12.8 TB flash / block (or all-NVMe) | Density + performance |
| C6000 (dense) | Dense 2U | 168–192 TB raw HDD / node + flash | Long-term retention, archive |
| C6200 | 2U all-flash dense | All NVMe | High-perf retention, low power |

Selection Heuristics

Minimum Cluster Sizes & EC Constraints

| Cluster Size | Smallest Usable EC | Recommended EC |
| --- | --- | --- |
| 3 nodes | RF 2 only | RF 2 |
| 4–5 nodes | EC 2:1 | EC 2:1 or RF 2 |
| 6–7 nodes | EC 4:2 | EC 4:2 or 5:2 |
| 8+ nodes | EC 6:2 | EC 6:2 (best efficiency at 2-failure tolerance) |
| 12+ nodes | EC 6:2 | EC 6:2 (more parallel rebuild domains) |

[Animation: ReadyNode Selection. The workload + SLA profile (FETB / RPO / RTO / retention) routes to a family: sub-minute RTO → all-flash C5200 AF / C6200 (instant mass restore); mainstream backup → hybrid C5066 / C5200 hybrid (~70% of CCAE scenarios); long retention → dense C6000, or C4000 at the edge (retention ≥ 1 yr, TB-per-watt). Mixed clusters are OK, e.g. add C6000 to a C5066 cluster at year 3.]
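
The cluster-size table above reduces to a simple lookup. A minimal sketch, assuming the first recommended entry in each row is the default choice; `recommended_scheme` is a hypothetical helper, not a Cohesity API.

```python
def recommended_scheme(node_count):
    """Map a node count to the first recommended scheme in the table."""
    if node_count < 3:
        raise ValueError("the table starts at 3 nodes")
    if node_count == 3:
        return "RF 2"
    if node_count <= 5:
        return "EC 2:1"
    if node_count <= 7:
        return "EC 4:2"
    return "EC 6:2"  # 8+ nodes; 12+ keeps EC 6:2 with more rebuild domains


for n in (3, 4, 6, 8, 12):
    print(n, recommended_scheme(n))
```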

Mixed-Node Clusters and Brick Mode

Cohesity supports heterogeneous clusters — different ReadyNode models in a single cluster. SpanFS auto-rebalances and routes hot data to flash, cold data to HDD. Brick mode subdivides a single dense node into multiple fault domains, useful when a C6000 holds enough TB that losing the whole node would exceed rebuild capacity.

flowchart TD
    S[Start: Workload + SLA Profile] --> Q1{Edge / ROBO site<br/>≤ 4 nodes?}
    Q1 -->|Yes| C4[C4000<br/>Entry / Edge Hybrid]
    Q1 -->|No| Q2{Sub-minute RTO<br/>or instant mass restore?}
    Q2 -->|Yes| Q3{Density per RU<br/>also matters?}
    Q3 -->|Yes| C6200[C6200<br/>All-Flash Dense]
    Q3 -->|No| C5200AF[C5200 All-Flash<br/>PCIe Gen5 NVMe]
    Q2 -->|No| Q4{Retention dominates<br/>≥ 1 year on cluster?}
    Q4 -->|Yes| C6000[C6000<br/>Dense Hybrid]
    Q4 -->|No| Q5{4-node-per-2U<br/>density required?}
    Q5 -->|Yes| C5200H[C5200 Hybrid<br/>Performance Density]
    Q5 -->|No| C5066[C5066<br/>Mainstream Hybrid]
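
The decision flow above can be written as a chain of checks in the same order. This is a sketch only: the function and its boolean parameter names are hypothetical, and a real design would weigh these factors rather than treat them as strict yes/no gates.

```python
def select_readynode(*, edge_site=False, sub_minute_rto=False,
                     density_per_ru=False, long_retention=False,
                     four_node_2u=False):
    """Follow the flowchart top to bottom and return a ReadyNode family."""
    if edge_site:                      # edge / ROBO site, 4 nodes or fewer
        return "C4000"
    if sub_minute_rto:                 # or instant mass restore
        return "C6200" if density_per_ru else "C5200 All-Flash"
    if long_retention:                 # 1 year or more kept on cluster
        return "C6000"
    return "C5200 Hybrid" if four_node_2u else "C5066"


print(select_readynode(sub_minute_rto=True))   # C5200 All-Flash
print(select_readynode(long_retention=True))   # C6000
print(select_readynode())                      # C5066
```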

Key Points

Post-Quiz: Node Selection

1. A customer needs sub-minute RTO for instant mass restore of dozens of VMs. Which ReadyNode family is most appropriate?

   a. C4000 (Entry/Edge)
   b. C6000 dense hybrid
   c. C5066 mainstream hybrid
   d. C5200 / C6200 all-flash

2. What is the minimum supported cluster size for a Cohesity production deployment?

   a. 1 node
   b. 3 nodes
   c. 6 nodes
   d. 8 nodes

3. A customer's existing C5066 cluster needs more capacity at year 3, but performance demands have not grown. What is the recommended approach?

   a. Forklift the entire cluster to all-flash C6200 nodes
   b. Add C6000 dense nodes; SpanFS migrates cold data to denser nodes while keeping hot data on performance nodes
   c. Replace C5066 nodes one-by-one with C4000 nodes
   d. Rebuild on a new cluster; Cohesity does not support mixed-node clusters

4. When does brick mode provide architectural value?

   a. For all clusters of 3 or more nodes by default
   b. When dense C6000 nodes hold so much TB that losing a whole node would exceed rebuild capacity
   c. Only on all-flash clusters using PCIe Gen5
   d. For Robo Edition deployments with 1-2 nodes

4. Capacity Planning Over Time

Pre-Quiz: Capacity Planning

1. What is the minimum recommended planning horizon when sizing a new cluster?

   a. Day-one demand only
   b. Year-1 demand
   c. Year-3 protected FETB at minimum, ideally year-5
   d. Year-10 fully amortized

2. For a 10-node cluster with EC 6:2 and 192 TB raw per node, what does a sensible practical ceiling look like (after 80% target and N+1 reserve)?

   a. ~1,440 TB: the full usable capacity
   b. ~960 TB: 80% of usable minus one node's raw for N+1
   c. ~1,920 TB: the raw sum
   d. ~500 TB: only protectable FETB

3. What is the difference between CloudTier and CloudArchive?

   a. CloudTier is for backup; CloudArchive is for primary workloads
   b. CloudTier is a transparent capacity extension managed by the cluster; CloudArchive is a logical retention destination for policy-driven movement
   c. They are the same product with different names
   d. CloudTier requires Glacier; CloudArchive only supports S3 Standard

Modeling Growth and Tech-Refresh

Apply 10–25% YoY growth depending on industry. A 5-year compounding view of a 500 TB FETB starting baseline at 15% YoY:

| Year | FETB | Effective BETB (4.5x reduction, 12x retention) |
| --- | --- | --- |
| 0 | 500 TB | 1,333 TB |
| 1 | 575 TB | 1,533 TB |
| 2 | 661 TB | 1,763 TB |
| 3 | 760 TB | 2,028 TB |
| 5 | 1,005 TB | 2,681 TB |
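
The compounding view is easy to regenerate for other baselines or growth rates. A minimal sketch with a hypothetical `project` helper, using the same 4.5x reduction and 12x retention multiplier as the table; results can differ from the table by about 1 TB because of rounding, and the loop also prints year 4.

```python
def project(fetb_tb, yoy, years, reduction=4.5, retention_multiplier=12):
    """Return (year, FETB, effective BETB) rows with compound YoY growth."""
    rows = []
    for year in range(years + 1):
        fetb = fetb_tb * (1 + yoy) ** year
        rows.append((year, fetb, fetb * retention_multiplier / reduction))
    return rows


for year, fetb, betb_tb in project(500, 0.15, 5):
    print(f"year {year}: FETB {fetb:,.0f} TB  effective BETB {betb_tb:,.0f} TB")
```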

Tiering Strategy

| Tier | Media | Latency | Use Case |
| --- | --- | --- | --- |
| Hot | Local NVMe / flash | <1 ms | Last 7–30 days, instant recovery |
| Warm | Local HDD | 5–10 ms | 30–180 days |
| Cold (CloudTier) | S3 Standard / Azure Cool | 50–200 ms | 6–24 months |
| Archive (CloudArchive) | Glacier / Azure Archive | 4–12 hr rehydrate | 1+ year compliance |

Reserve Capacity Discipline

For a 10-node, 192 TB-per-node, EC 6:2 cluster: usable = 1,440 TB → 80% target = 1,152 TB → minus 192 TB (one node, N+1) = ~960 TB practical ceiling. Sizing to the full 1,440 TB usable sets the customer up to fail.
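
The same rule as a function, so the target utilization can be varied. `practical_ceiling` is a hypothetical helper; like the text, it subtracts one node's raw capacity as the N+1 reserve.

```python
def practical_ceiling(nodes, raw_per_node_tb, data, parity,
                      target_utilization=0.80):
    """Usable capacity x target utilization, minus one node's raw for N+1."""
    usable = nodes * raw_per_node_tb * data / (data + parity)
    return usable * target_utilization - raw_per_node_tb


print(round(practical_ceiling(10, 192, 6, 2)))        # 960
print(round(practical_ceiling(10, 192, 6, 2, 0.70)))  # 816 at the ~70% SmartFiles target
```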

Helios Reporting and Forecasting

Worked Example: 500 TB FETB Sizing

Customer: 500 TB FETB (350 TB VMware + 100 TB NAS + 50 TB SQL), 3% blended daily change, 30-day retention.

  1. Daily incremental = 500 × 3% = 15 TB/day — 30-day cumulative = 450 TB pre-reduction.
  2. Apply blended reduction: VMware 7:1 = 50 TB; NAS 2.5:1 = 40 TB; SQL 4:1 = 12.5 TB → initial-full ~102.5 TB. Steady-state ~193 TB once the reduced incrementals are added (about 90 TB, consistent with roughly 5:1 on the 450 TB).
  3. Add growth + headroom: year-3 at 15% YoY → ~293 TB; size to an 80% utilization target (293 ÷ 0.8) = 366 TB target available.
  4. Choose EC 6:2 (8+ nodes, 75% efficiency) → raw needed = 366 / 0.75 ≈ 488 TB.
  5. Pick nodes: 8× C5066 for EC 6:2 + 1× C5066 for N+1 = 9-node cluster with 486 TB raw (9 × 54 TB), within about 2 TB of the 488 TB target.

The 9-node C5066 answer is the typical exam-correct response: satisfies EC 6:2 (8 nodes) + N+1 (1), uses the mainstream node, leaves growth headroom.
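
The five steps can be replayed in code to check the arithmetic. This is a sketch, not the Sizer: `worked_example` is a hypothetical function, and the 5:1 reduction on incrementals is an assumption chosen because it reproduces the ~193 TB steady-state figure. The 366 TB in step 3 corresponds to dividing the year-3 figure by an 80% utilization target.

```python
def worked_example(node_raw_tb=54, incremental_reduction=5.0):
    # (FETB in TB, reduction ratio) per workload, from the customer profile
    mix = {"vmware": (350, 7.0), "nas": (100, 2.5), "sql": (50, 4.0)}
    fetb = sum(tb for tb, _ in mix.values())
    incrementals = fetb * 0.03 * 30                    # 3% daily change, 30 days
    initial_full = sum(tb / ratio for tb, ratio in mix.values())
    steady_state = initial_full + incrementals / incremental_reduction
    year3 = steady_state * 1.15 ** 3                   # 15% YoY growth
    target_available = year3 / 0.80                    # 80% utilization target
    raw_needed = target_available / (6 / (6 + 2))      # EC 6:2 efficiency
    nodes = (6 + 2) + 1                                # EC 6:2 minimum plus N+1
    return {
        "incrementals": incrementals,
        "initial_full": initial_full,
        "steady_state": steady_state,
        "year3": year3,
        "target_available": target_available,
        "raw_needed": raw_needed,
        "nodes": nodes,
        "raw_provided": nodes * node_raw_tb,
    }


for key, value in worked_example().items():
    print(f"{key:>17}: {value:,.1f}")
```

The output shows roughly 488 TB of raw capacity needed against 486 TB provided by nine 54 TB nodes, so the nine-node answer meets the target only within rounding.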

Key Points

Post-Quiz: Capacity Planning

1. What is the minimum recommended planning horizon when sizing a new cluster?

   a. Day-one demand only
   b. Year-1 demand
   c. Year-3 protected FETB at minimum, ideally year-5
   d. Year-10 fully amortized

2. For a 10-node cluster with EC 6:2 and 192 TB raw per node, what does a sensible practical ceiling look like (after 80% target and N+1 reserve)?

   a. ~1,440 TB: the full usable capacity
   b. ~960 TB: 80% of usable minus one node's raw for N+1
   c. ~1,920 TB: the raw sum
   d. ~500 TB: only protectable FETB

3. What is the difference between CloudTier and CloudArchive?

   a. CloudTier is for backup; CloudArchive is for primary workloads
   b. CloudTier is a transparent capacity extension managed by the cluster; CloudArchive is a logical retention destination for policy-driven movement
   c. They are the same product with different names
   d. CloudTier requires Glacier; CloudArchive only supports S3 Standard
