Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics

Learning Objectives

Section 1: SpanFS Data Path

Pre-Quiz: SpanFS Data Path

1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?

To bypass deduplication on the hot path To absorb random small writes and ACK clients with single-digit-ms latency Because HDDs cannot store user data, only metadata To force every chunk through compression twice

2. What role does SnapTree play in SpanFS?

It is the network fabric between nodes A copy-on-write B+ tree mapping logical offsets to chunk file references A dedicated SSD partition for the dedup hash table A backup orchestration scheduler

3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?

Chunk file Blob file Chunk group SnapTree leaf

4. When does a deleted snapshot's capacity actually return to the pool?

Instantly upon delete Only after garbage collection drops chunk reference counts to zero (often hours later) After the next full backup cycle When the cluster is rebooted

5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?

Because no single node holds any data Chunk group fragments span the cluster, so parallel fetch maximizes throughput To avoid the dedup hash table on cold reads To force the cluster into degraded mode

SpanFS stores user data as chunks, not files. The IO Engine runs a Rabin rolling hash across the byte stream and slices it at content-defined boundaries. Each resulting chunk is fingerprinted with SHA-1 and looked up in the cluster-wide deduplication hash table.

Three on-disk constructs hold these chunks together:

The relationship is captured in SnapTree, a distributed B+ tree atop the cluster-wide key-value store. SnapTree supports copy-on-write, so a snapshot or clone is just a new root pointer.

Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy

flowchart LR Client[Client object
VM disk / file] --> Blob[Blob file
logical container] Blob --> ST[SnapTree root
B+ tree index] ST --> CF1[Chunk file
deduped + compressed] ST --> CF2[Chunk file
deduped + compressed] ST --> CF3[Chunk file
deduped + compressed] CF1 --> CG[Chunk group
resiliency unit] CF2 --> CG CF3 --> CG CG --> RF[RF2/RF3 copies] CG --> EC[EC stripe
across fault domains]

NVRAM journaling and write coalescing

Every node includes an NVRAM region (battery- or flash-backed SSD) that survives power loss. The mechanism follows a journal-then-checkpoint pattern: append to NVRAM log, mirror to peer NVRAM (matching RF), ACK client, then a background destage coalesces small NVRAM entries into large sequential disk writes.

Animated Write Path: Client to Durable Disk

SpanFS Write I/O Path (sequential)
Client NFS/SMB/S3 VIP Routes to node Bridge Auth + route IO Engine Chunk + dedup NVRAM L Local journal NVRAM Peer Mirror (RF2) Distrib KV Quorum write Disk (HDD/SSD) Coalesced chunk files (async destage)

Read path and locality optimizations

Reads exploit locality. Cohesity prefers to ingest, dedup, and persist data on the same node that received it, so subsequent reads of recent data are local. Read cache (DRAM and SSD) is per-node and warmed by access patterns. For sequential restore workloads, SpanFS issues parallel reads to multiple nodes simultaneously since the chunk group's fragments span the cluster.

Garbage collection and chunk reclamation

Because chunks are deduplicated, deleting a backup or expiring a snapshot does not immediately free space. A chunk is reclaimable only when its reference count in the distributed KV store drops to zero. GC walks the SnapTree, decrements ref counts, compacts blob files, and re-runs erasure coding on cold blob files. GC is throttled, so capacity reclaim typically lags deletion events by 24-72 hours.

Key Points

Post-Quiz: SpanFS Data Path

1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?

To bypass deduplication on the hot path To absorb random small writes and ACK clients with single-digit-ms latency Because HDDs cannot store user data, only metadata To force every chunk through compression twice

2. What role does SnapTree play in SpanFS?

It is the network fabric between nodes A copy-on-write B+ tree mapping logical offsets to chunk file references A dedicated SSD partition for the dedup hash table A backup orchestration scheduler

3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?

Chunk file Blob file Chunk group SnapTree leaf

4. When does a deleted snapshot's capacity actually return to the pool?

Instantly upon delete Only after garbage collection drops chunk reference counts to zero (often hours later) After the next full backup cycle When the cluster is rebooted

5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?

Because no single node holds any data Chunk group fragments span the cluster, so parallel fetch maximizes throughput To avoid the dedup hash table on cold reads To force the cluster into degraded mode

Section 2: Resiliency — RF and Erasure Coding

Pre-Quiz: RF and Erasure Coding

1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?

EC 4:2 needs only 3 nodes, RF3 needs 8 EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPU EC 4:2 is faster on writes than RF3 EC 4:2 only works for SSD-only clusters

2. Why does Cohesity use a two-stage RF2-then-EC pipeline?

Because EC cannot survive any failure RF2 minimizes write latency; background EC re-encodes cold data for capacity efficiency RF2 is required for compliance and EC is optional EC encoding is too dangerous to run synchronously, ever

3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?

Nothing — chassis fault domain is the default EC stripes still place normally because the cluster auto-falls-back to node level The cluster cannot satisfy stripe placement — fault-domain count is one All data is automatically replicated off-site

4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?

EC never actually runs in Cohesity Freshly ingested chunks live at RF2 until background EC conversion completes RF2 is cheaper than EC at all times EC 4:2 always requires double the capacity of RF2

5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?

~67 TB ~50 TB ~33 TB ~75 TB

A SpanFS cluster keeps data safe through two complementary mechanisms: Replication Factor (RF) and Erasure Coding (EC). Both can coexist within a single cluster — even within a single View Box.

RF2 vs RF3 trade-offs

Erasure Coding schemes

SchemeDataParityMin FDsFailures ToleratedUsableOverhead
RF211 mirror3 nodes1~50%100%
RF312 mirrors4 nodes2~33%200%
EC 2:1214 nodes1~67%50%
EC 4:2426 nodes2~67%50%
EC 6:2628 nodes2~75%33%

Headline: EC 4:2 and RF3 tolerate the same number of failures, but EC 4:2 uses half the overhead. The cost is encoding CPU and a more complex rebuild path.

Inline RF2 -> Post-process EC pipeline

Cohesity rarely writes EC-protected chunks directly on the hot ingest path. Instead:

  1. Stage 1 (Inline RF2): Hot chunks land in NVRAM and SSD with RF2 protection. Mirroring is cheap, latency is low.
  2. Stage 2 (Post-process EC): A background task identifies cold chunk groups, re-encodes under the View Box's EC policy (e.g., 4:2), persists the new EC stripe, and frees the RF2 copies.

Animated RF2 to EC Conversion

Stage 1 RF2 ingest -> Stage 2 background EC 4:2 re-encoding
Stage 1: RF2 Ingest (hot) 2 mirror copies on 2 nodes Stage 2: EC 4:2 (cold, post-process) 4 data + 2 parity across 6 nodes Node A D1 D2 D3 D4 Node B (mirror) D1' D2' D3' D4' re-encode Reed-Solomon Node 1 D1 Node 2 D2 Node 3 D3 Node 4 D4 Node 5 P1 Node 6 P2 Capacity used: 100% overhead (2x) Capacity used: 50% overhead (~67% usable) Tolerates: 1 failure survives loss of Node A or Node B Tolerates: 2 simultaneous failures any 2 of 6 fragments can be lost & rebuilt from remaining 4 Background GC triggers EC re-encode when chunk group becomes cold; RF2 copies released after EC stripe is durable

Figure 2.3: Erasure coding stripe placement decision flow

flowchart TD Ingest[New write arrives at node] --> RF2Land[Stage 1: Land at RF2
NVRAM mirror + SSD] RF2Land --> Ack[Client ACK
low latency] Ack --> Cold{Chunk group
cold?} Cold -- No --> Stay[Remain at RF2
hot tier] Stay --> Cold Cold -- Yes --> ECCheck{Enough fault
domains for EC?} ECCheck -- No --> StayRF[Keep at RF2/RF3
per View Box policy] ECCheck -- Yes --> Encode[Stage 2: Reed-Solomon encode
EC 2:1 / 4:2 / 6:2] Encode --> Place[Distribute fragments
across fault domains] Place --> Free[Release original
RF2 copies] Free --> Reclaim[GC reclaims capacity]

Key Points

Post-Quiz: RF and Erasure Coding

1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?

EC 4:2 needs only 3 nodes, RF3 needs 8 EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPU EC 4:2 is faster on writes than RF3 EC 4:2 only works for SSD-only clusters

2. Why does Cohesity use a two-stage RF2-then-EC pipeline?

Because EC cannot survive any failure RF2 minimizes write latency; background EC re-encodes cold data for capacity efficiency RF2 is required for compliance and EC is optional EC encoding is too dangerous to run synchronously, ever

3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?

Nothing — chassis fault domain is the default EC stripes still place normally because the cluster auto-falls-back to node level The cluster cannot satisfy stripe placement — fault-domain count is one All data is automatically replicated off-site

4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?

EC never actually runs in Cohesity Freshly ingested chunks live at RF2 until background EC conversion completes RF2 is cheaper than EC at all times EC 4:2 always requires double the capacity of RF2

5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?

~67 TB ~50 TB ~33 TB ~75 TB

Section 3: Deduplication and Compression

Pre-Quiz: Deduplication and Compression

1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?

It uses smaller blocks, so more matches It is insertion-resilient: a small header prepend creates one new chunk, not all-new fingerprints It runs only on SSD media It compresses every chunk twice for higher ratios

2. What is the dedup scope in Cohesity?

Per cluster — everything dedupes against everything Per View Box — separate hash tables across View Boxes Per node — each node has its own table Per Helios tenant globally across clouds

3. Per-tenant encryption keys break dedup across tenants. Why?

Cohesity disables dedup whenever encryption is on Identical plaintext encrypts to different ciphertext under different keys, so fingerprints differ Encryption and dedup share the same hash and conflict Encryption multiplies chunk sizes by 10x

4. Which workload typically yields ~1x dedup and compression combined?

VMware Windows VM backups Microsoft 365 mailboxes Already-compressed media (video, ZIP) or encrypted source data Oracle full backups

5. When is post-process compression preferable to inline?

When you must absolutely minimize write-path CPU and accept transient capacity When dedup is disabled When the cluster is in quorum loss Never — inline always wins

Cohesity's headline storage-efficiency multipliers come from global, variable-length deduplication combined with inline compression.

Variable-length sliding-window dedup

  1. The IO Engine slides a Rabin rolling hash across the byte stream.
  2. When the rolling hash matches a content-defined breakpoint pattern, a chunk boundary is declared.
  3. Each resulting chunk gets a SHA-1 fingerprint.
  4. The fingerprint is queried against the global hash table; if found, only a metadata reference is stored.

The advantage is insertion-resilience: a 1 KB header prepend produces one new chunk, not all new chunks like fixed-block schemes. This can be a 5-10x dedup ratio difference for incremental and synthetic-full workloads.

Global vs local dedup domains

The dedup hash table lives in the distributed KV store, but its scope is the View Box. Two View Boxes maintain separate hash tables. Architectural consequences:

Inline vs post-process compression

Planning ratios for sizing

WorkloadDedupCompressionCombined
VMware mixed Windows VMs4-6x1.5-2x6-12x
Oracle / SQL fulls3-5x1.5-2x4-10x
M365 / SharePoint3-5x1.3-1.5x4-8x
File shares / SmartFiles1.5-3x1.3-2x2-6x
Pre-compressed media / encrypted source~1x~1x~1x

Key Points

Post-Quiz: Deduplication and Compression

1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?

It uses smaller blocks, so more matches It is insertion-resilient: a small header prepend creates one new chunk, not all-new fingerprints It runs only on SSD media It compresses every chunk twice for higher ratios

2. What is the dedup scope in Cohesity?

Per cluster — everything dedupes against everything Per View Box — separate hash tables across View Boxes Per node — each node has its own table Per Helios tenant globally across clouds

3. Per-tenant encryption keys break dedup across tenants. Why?

Cohesity disables dedup whenever encryption is on Identical plaintext encrypts to different ciphertext under different keys, so fingerprints differ Encryption and dedup share the same hash and conflict Encryption multiplies chunk sizes by 10x

4. Which workload typically yields ~1x dedup and compression combined?

VMware Windows VM backups Microsoft 365 mailboxes Already-compressed media (video, ZIP) or encrypted source data Oracle full backups

5. When is post-process compression preferable to inline?

When you must absolutely minimize write-path CPU and accept transient capacity When dedup is disabled When the cluster is in quorum loss Never — inline always wins

Section 4: Cluster Mechanics and Quorum

Pre-Quiz: Cluster Mechanics and Quorum

1. SpanFS guarantees strict consistency for metadata. How is that implemented?

A leader node makes all decisions A Paxos-style consensus over the distributed KV store with quorum acks Eventual consistency with periodic reconciliation DNS-based serialization

2. A 3-node cluster loses 2 nodes simultaneously. What happens?

Cluster keeps writing as normal Quorum is lost — cluster goes read-only or offline; manual recovery required EC 4:2 still tolerates the loss The third node automatically clones itself

3. Why is a 4-node cluster splitting 2/2 a worst case?

EC 4:2 is unavailable on 4 nodes Neither side has majority quorum — both sides go offline to prevent split-brain 4 is not a supported cluster size It triggers an automatic failover to the cloud

4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?

It improves because GC runs faster It temporarily drops to zero — one node already in maintenance, no further failure tolerated It is unaffected It permanently increases by one

5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?

Improved resiliency at no cost The cluster cannot place RF/EC stripes — only one chassis exists as a fault domain Dedup ratio doubles All workloads automatically migrate to RF3

A SpanFS cluster is a peer-to-peer system: every node runs the same services. Coordination uses explicit consensus, fault-domain awareness, and disciplined upgrade procedures.

Strict consistency via Paxos

SpanFS guarantees strict consistency for both data and metadata. Reads always observe the most recent durably committed write. With three KV replicas per record, a metadata write requires at least two acks. Two clients writing the same offset never see split-brain.

Fault domain levels

Fault DomainDescriptionCluster Shape
DiskTwo copies/fragments may share a nodeDefault; smallest clusters
NodeEach copy/fragment on a different nodeRecommended baseline
ChassisEach copy/fragment on a different chassisMulti-node-per-chassis
RackEach copy/fragment on a different rackStretched / large clusters

Quorum scenarios

Animated Quorum State Machine

Cluster consistency states: Healthy -> Degraded -> Quorum Loss -> Recovery
Healthy All nodes online Full quorum RF/EC at policy Degraded Node/disk down Quorum maintained Rebuild in progress Quorum Loss Majority unreachable Read-only / offline Mgmt plane down Recovery Manual action Partition resolved Quorum restored failure in tolerance further failure exceeds tolerance operator intervention quorum restored -> Degraded (rebuild) or Healthy (all back) Note: Rolling upgrade temporarily transitions Healthy -> Degraded by design (1 node in maintenance) -- RF2 clusters lose all spare tolerance

Rolling upgrades

SpanFS upgrades are rolling: one node at a time enters maintenance, drains leadership and VIPs, upgrades, reboots, and rejoins. During maintenance: quorum tightens, effective resiliency drops by one, in-flight client connections migrate via VIP failover. For RF2 clusters this means upgrade temporarily reduces fault tolerance to zero. RF3 or EC 4:2 is preferred for mission-critical systems so that maintenance plus an unplanned failure does not equal data loss.

Key Points

Post-Quiz: Cluster Mechanics and Quorum

1. SpanFS guarantees strict consistency for metadata. How is that implemented?

A leader node makes all decisions A Paxos-style consensus over the distributed KV store with quorum acks Eventual consistency with periodic reconciliation DNS-based serialization

2. A 3-node cluster loses 2 nodes simultaneously. What happens?

Cluster keeps writing as normal Quorum is lost — cluster goes read-only or offline; manual recovery required EC 4:2 still tolerates the loss The third node automatically clones itself

3. Why is a 4-node cluster splitting 2/2 a worst case?

EC 4:2 is unavailable on 4 nodes Neither side has majority quorum — both sides go offline to prevent split-brain 4 is not a supported cluster size It triggers an automatic failover to the cloud

4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?

It improves because GC runs faster It temporarily drops to zero — one node already in maintenance, no further failure tolerated It is unaffected It permanently increases by one

5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?

Improved resiliency at no cost The cluster cannot place RF/EC stripes — only one chassis exists as a fault domain Dedup ratio doubles All workloads automatically migrate to RF3

Figure 2.4: Cluster consistency states and quorum transitions (Mermaid)

stateDiagram-v2 [*] --> Healthy Healthy: Healthy
All nodes online
full quorum + RF/EC Degraded: Degraded
Node/disk down
quorum maintained Rebuilding: Rebuilding
Background reconstruction
fragments restored QuorumLoss: Quorum Loss
Majority unreachable
read-only / offline Recovery: Recovery
Manual intervention
partition resolved Healthy --> Degraded: failure within tolerance Degraded --> Rebuilding: replacement / spare available Rebuilding --> Healthy: rebuild completes Degraded --> QuorumLoss: additional failure exceeds tolerance Healthy --> QuorumLoss: network partition splits cluster QuorumLoss --> Recovery: operator action Recovery --> Degraded: quorum restored Recovery --> Healthy: all nodes rejoined

Your Progress

Answer Explanations