Study Guide: SpanFS Internals, Distributed Storage, and Cluster Mechanics

Learning Objectives

Explain how SpanFS chunks, fingerprints, and stores data across nodes using chunk files, blob files, and the SnapTree metadata index.
Describe Replication Factor (RF) and Erasure Coding (EC) trade-offs, including stripe placement, minimum cluster sizes, and when each scheme is appropriate.
Trace a write I/O end-to-end from client through the Bridge service, NVRAM journal, IO Engine, and into chunk files on persistent media.
Predict how a cluster behaves under disk failure, node failure, chassis failure, and quorum loss, including rebuild times and management-plane availability.
Choose deduplication and compression policies (inline vs. post-process, variable-length scope) appropriate for VM, NAS, and database workloads.

Section 1: SpanFS Data Path

Pre-Quiz: SpanFS Data Path

1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?

To bypass deduplication on the hot path To absorb random small writes and ACK clients with single-digit-ms latency Because HDDs cannot store user data, only metadata To force every chunk through compression twice

2. What role does SnapTree play in SpanFS?

It is the network fabric between nodes A copy-on-write B+ tree mapping logical offsets to chunk file references A dedicated SSD partition for the dedup hash table A backup orchestration scheduler

3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?

Chunk file Blob file Chunk group SnapTree leaf

4. When does a deleted snapshot's capacity actually return to the pool?

Instantly upon delete Only after garbage collection drops chunk reference counts to zero (often hours later) After the next full backup cycle When the cluster is rebooted

5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?

Because no single node holds any data Chunk group fragments span the cluster, so parallel fetch maximizes throughput To avoid the dedup hash table on cold reads To force the cluster into degraded mode

SpanFS stores user data as chunks, not files. The IO Engine runs a Rabin rolling hash across the byte stream and slices it at content-defined boundaries. Each resulting chunk is fingerprinted with SHA-1 and looked up in the cluster-wide deduplication hash table.

Three on-disk constructs hold these chunks together:

Chunk file — smallest unit of user data persisted to disk after dedup and compression.
Blob file — container that aggregates many chunk files belonging to a logical object.
Chunk group — the resiliency unit that is RF-replicated or EC-striped as a whole.

The relationship is captured in SnapTree, a distributed B+ tree atop the cluster-wide key-value store. SnapTree supports copy-on-write, so a snapshot or clone is just a new root pointer.

Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy

flowchart LR Client[Client object
VM disk / file] --> Blob[Blob file
logical container] Blob --> ST[SnapTree root
B+ tree index] ST --> CF1[Chunk file
deduped + compressed] ST --> CF2[Chunk file
deduped + compressed] ST --> CF3[Chunk file
deduped + compressed] CF1 --> CG[Chunk group
resiliency unit] CF2 --> CG CF3 --> CG CG --> RF[RF2/RF3 copies] CG --> EC[EC stripe
across fault domains]

NVRAM journaling and write coalescing

Every node includes an NVRAM region (battery- or flash-backed SSD) that survives power loss. The mechanism follows a journal-then-checkpoint pattern: append to NVRAM log, mirror to peer NVRAM (matching RF), ACK client, then a background destage coalesces small NVRAM entries into large sequential disk writes.

Animated Write Path: Client to Durable Disk

SpanFS Write I/O Path (sequential)

Read path and locality optimizations

Reads exploit locality. Cohesity prefers to ingest, dedup, and persist data on the same node that received it, so subsequent reads of recent data are local. Read cache (DRAM and SSD) is per-node and warmed by access patterns. For sequential restore workloads, SpanFS issues parallel reads to multiple nodes simultaneously since the chunk group's fragments span the cluster.

Garbage collection and chunk reclamation

Because chunks are deduplicated, deleting a backup or expiring a snapshot does not immediately free space. A chunk is reclaimable only when its reference count in the distributed KV store drops to zero. GC walks the SnapTree, decrements ref counts, compacts blob files, and re-runs erasure coding on cold blob files. GC is throttled, so capacity reclaim typically lags deletion events by 24-72 hours.

Post-Quiz: SpanFS Data Path

1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?

2. What role does SnapTree play in SpanFS?

It is the network fabric between nodes A copy-on-write B+ tree mapping logical offsets to chunk file references A dedicated SSD partition for the dedup hash table A backup orchestration scheduler

3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?

Chunk file Blob file Chunk group SnapTree leaf

4. When does a deleted snapshot's capacity actually return to the pool?

Instantly upon delete Only after garbage collection drops chunk reference counts to zero (often hours later) After the next full backup cycle When the cluster is rebooted

5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?

Section 2: Resiliency — RF and Erasure Coding

Pre-Quiz: RF and Erasure Coding

1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?

EC 4:2 needs only 3 nodes, RF3 needs 8 EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPU EC 4:2 is faster on writes than RF3 EC 4:2 only works for SSD-only clusters

2. Why does Cohesity use a two-stage RF2-then-EC pipeline?

Because EC cannot survive any failure RF2 minimizes write latency; background EC re-encodes cold data for capacity efficiency RF2 is required for compliance and EC is optional EC encoding is too dangerous to run synchronously, ever

3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?

Nothing — chassis fault domain is the default EC stripes still place normally because the cluster auto-falls-back to node level The cluster cannot satisfy stripe placement — fault-domain count is one All data is automatically replicated off-site

4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?

EC never actually runs in Cohesity Freshly ingested chunks live at RF2 until background EC conversion completes RF2 is cheaper than EC at all times EC 4:2 always requires double the capacity of RF2

5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?

~67 TB ~50 TB ~33 TB ~75 TB

A SpanFS cluster keeps data safe through two complementary mechanisms: Replication Factor (RF) and Erasure Coding (EC). Both can coexist within a single cluster — even within a single View Box.

RF2 vs RF3 trade-offs

RF2: tolerates 1 simultaneous failure, ~50% usable, 100% overhead. Fast block-level rebuild. Best for hot data.
RF3: tolerates 2 simultaneous failures, ~33% usable, 200% overhead. Required for strict SLAs / small clusters.

Erasure Coding schemes

Scheme	Data	Parity	Min FDs	Failures Tolerated	Usable	Overhead
RF2	1	1 mirror	3 nodes	1	~50%	100%
RF3	1	2 mirrors	4 nodes	2	~33%	200%
EC 2:1	2	1	4 nodes	1	~67%	50%
EC 4:2	4	2	6 nodes	2	~67%	50%
EC 6:2	6	2	8 nodes	2	~75%	33%

Headline: EC 4:2 and RF3 tolerate the same number of failures, but EC 4:2 uses half the overhead. The cost is encoding CPU and a more complex rebuild path.

Inline RF2 -> Post-process EC pipeline

Cohesity rarely writes EC-protected chunks directly on the hot ingest path. Instead:

Stage 1 (Inline RF2): Hot chunks land in NVRAM and SSD with RF2 protection. Mirroring is cheap, latency is low.
Stage 2 (Post-process EC): A background task identifies cold chunk groups, re-encodes under the View Box's EC policy (e.g., 4:2), persists the new EC stripe, and frees the RF2 copies.

Animated RF2 to EC Conversion

Stage 1 RF2 ingest -> Stage 2 background EC 4:2 re-encoding

Figure 2.3: Erasure coding stripe placement decision flow

flowchart TD Ingest[New write arrives at node] --> RF2Land[Stage 1: Land at RF2
NVRAM mirror + SSD] RF2Land --> Ack[Client ACK
low latency] Ack --> Cold{Chunk group
cold?} Cold -- No --> Stay[Remain at RF2
hot tier] Stay --> Cold Cold -- Yes --> ECCheck{Enough fault
domains for EC?} ECCheck -- No --> StayRF[Keep at RF2/RF3
per View Box policy] ECCheck -- Yes --> Encode[Stage 2: Reed-Solomon encode
EC 2:1 / 4:2 / 6:2] Encode --> Place[Distribute fragments
across fault domains] Place --> Free[Release original
RF2 copies] Free --> Reclaim[GC reclaims capacity]

Post-Quiz: RF and Erasure Coding

1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?

EC 4:2 needs only 3 nodes, RF3 needs 8 EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPU EC 4:2 is faster on writes than RF3 EC 4:2 only works for SSD-only clusters

2. Why does Cohesity use a two-stage RF2-then-EC pipeline?

3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?

4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?

EC never actually runs in Cohesity Freshly ingested chunks live at RF2 until background EC conversion completes RF2 is cheaper than EC at all times EC 4:2 always requires double the capacity of RF2

5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?

~67 TB ~50 TB ~33 TB ~75 TB

Section 3: Deduplication and Compression

Pre-Quiz: Deduplication and Compression

1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?

It uses smaller blocks, so more matches It is insertion-resilient: a small header prepend creates one new chunk, not all-new fingerprints It runs only on SSD media It compresses every chunk twice for higher ratios

2. What is the dedup scope in Cohesity?

Per cluster — everything dedupes against everything Per View Box — separate hash tables across View Boxes Per node — each node has its own table Per Helios tenant globally across clouds

3. Per-tenant encryption keys break dedup across tenants. Why?

Cohesity disables dedup whenever encryption is on Identical plaintext encrypts to different ciphertext under different keys, so fingerprints differ Encryption and dedup share the same hash and conflict Encryption multiplies chunk sizes by 10x

4. Which workload typically yields ~1x dedup and compression combined?

VMware Windows VM backups Microsoft 365 mailboxes Already-compressed media (video, ZIP) or encrypted source data Oracle full backups

5. When is post-process compression preferable to inline?

When you must absolutely minimize write-path CPU and accept transient capacity When dedup is disabled When the cluster is in quorum loss Never — inline always wins

Cohesity's headline storage-efficiency multipliers come from global, variable-length deduplication combined with inline compression.

Variable-length sliding-window dedup

The IO Engine slides a Rabin rolling hash across the byte stream.
When the rolling hash matches a content-defined breakpoint pattern, a chunk boundary is declared.
Each resulting chunk gets a SHA-1 fingerprint.
The fingerprint is queried against the global hash table; if found, only a metadata reference is stored.

The advantage is insertion-resilience: a 1 KB header prepend produces one new chunk, not all new chunks like fixed-block schemes. This can be a 5-10x dedup ratio difference for incremental and synthetic-full workloads.

Global vs local dedup domains

The dedup hash table lives in the distributed KV store, but its scope is the View Box. Two View Boxes maintain separate hash tables. Architectural consequences:

Per-tenant View Boxes give isolation but lose cross-tenant dedup.
Group similar workloads (all VMware backups) in the same View Box to maximize dedup.
Per-tenant encryption keys break dedup across tenants regardless of View Box settings — identical plaintext encrypts differently.

Inline vs post-process compression

Inline compression (LZ4 / zstd-class): each unique chunk compressed before disk write. Lower disk usage on first write; small CPU cost on ingest.
Post-process compression: chunks written uncompressed first (faster ACK), then compressed by background task. Better latency at the cost of transient capacity.

Planning ratios for sizing

Workload	Dedup	Compression	Combined
VMware mixed Windows VMs	4-6x	1.5-2x	6-12x
Oracle / SQL fulls	3-5x	1.5-2x	4-10x
M365 / SharePoint	3-5x	1.3-1.5x	4-8x
File shares / SmartFiles	1.5-3x	1.3-2x	2-6x
Pre-compressed media / encrypted source	~1x	~1x	~1x

Post-Quiz: Deduplication and Compression

1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?

2. What is the dedup scope in Cohesity?

Per cluster — everything dedupes against everything Per View Box — separate hash tables across View Boxes Per node — each node has its own table Per Helios tenant globally across clouds

3. Per-tenant encryption keys break dedup across tenants. Why?

4. Which workload typically yields ~1x dedup and compression combined?

VMware Windows VM backups Microsoft 365 mailboxes Already-compressed media (video, ZIP) or encrypted source data Oracle full backups

5. When is post-process compression preferable to inline?

When you must absolutely minimize write-path CPU and accept transient capacity When dedup is disabled When the cluster is in quorum loss Never — inline always wins

Section 4: Cluster Mechanics and Quorum

Pre-Quiz: Cluster Mechanics and Quorum

1. SpanFS guarantees strict consistency for metadata. How is that implemented?

A leader node makes all decisions A Paxos-style consensus over the distributed KV store with quorum acks Eventual consistency with periodic reconciliation DNS-based serialization

2. A 3-node cluster loses 2 nodes simultaneously. What happens?

Cluster keeps writing as normal Quorum is lost — cluster goes read-only or offline; manual recovery required EC 4:2 still tolerates the loss The third node automatically clones itself

3. Why is a 4-node cluster splitting 2/2 a worst case?

EC 4:2 is unavailable on 4 nodes Neither side has majority quorum — both sides go offline to prevent split-brain 4 is not a supported cluster size It triggers an automatic failover to the cloud

4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?

It improves because GC runs faster It temporarily drops to zero — one node already in maintenance, no further failure tolerated It is unaffected It permanently increases by one

5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?

Improved resiliency at no cost The cluster cannot place RF/EC stripes — only one chassis exists as a fault domain Dedup ratio doubles All workloads automatically migrate to RF3

A SpanFS cluster is a peer-to-peer system: every node runs the same services. Coordination uses explicit consensus, fault-domain awareness, and disciplined upgrade procedures.

Strict consistency via Paxos

SpanFS guarantees strict consistency for both data and metadata. Reads always observe the most recent durably committed write. With three KV replicas per record, a metadata write requires at least two acks. Two clients writing the same offset never see split-brain.

Fault domain levels

Fault Domain	Description	Cluster Shape
Disk	Two copies/fragments may share a node	Default; smallest clusters
Node	Each copy/fragment on a different node	Recommended baseline
Chassis	Each copy/fragment on a different chassis	Multi-node-per-chassis
Rack	Each copy/fragment on a different rack	Stretched / large clusters

Quorum scenarios

3-node, 1 down: 2/3 online, quorum holds, rebuild proceeds.
3-node, 2 down: quorum lost, cluster goes read-only/offline.
6-node, 2 down: 4/6 online, EC 4:2 still tolerates the second failure.
Network partition 3/3: neither side has majority — both go offline to prevent split-brain.

Animated Quorum State Machine

Cluster consistency states: Healthy -> Degraded -> Quorum Loss -> Recovery

Rolling upgrades

SpanFS upgrades are rolling: one node at a time enters maintenance, drains leadership and VIPs, upgrades, reboots, and rejoins. During maintenance: quorum tightens, effective resiliency drops by one, in-flight client connections migrate via VIP failover. For RF2 clusters this means upgrade temporarily reduces fault tolerance to zero. RF3 or EC 4:2 is preferred for mission-critical systems so that maintenance plus an unplanned failure does not equal data loss.

Post-Quiz: Cluster Mechanics and Quorum

1. SpanFS guarantees strict consistency for metadata. How is that implemented?

A leader node makes all decisions A Paxos-style consensus over the distributed KV store with quorum acks Eventual consistency with periodic reconciliation DNS-based serialization

2. A 3-node cluster loses 2 nodes simultaneously. What happens?

Cluster keeps writing as normal Quorum is lost — cluster goes read-only or offline; manual recovery required EC 4:2 still tolerates the loss The third node automatically clones itself

3. Why is a 4-node cluster splitting 2/2 a worst case?

EC 4:2 is unavailable on 4 nodes Neither side has majority quorum — both sides go offline to prevent split-brain 4 is not a supported cluster size It triggers an automatic failover to the cloud

4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?

It improves because GC runs faster It temporarily drops to zero — one node already in maintenance, no further failure tolerated It is unaffected It permanently increases by one

5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?

Improved resiliency at no cost The cluster cannot place RF/EC stripes — only one chassis exists as a fault domain Dedup ratio doubles All workloads automatically migrate to RF3

Figure 2.4: Cluster consistency states and quorum transitions (Mermaid)

stateDiagram-v2 [*] --> Healthy Healthy: Healthy
All nodes online
full quorum + RF/EC Degraded: Degraded
Node/disk down
quorum maintained Rebuilding: Rebuilding
Background reconstruction
fragments restored QuorumLoss: Quorum Loss
Majority unreachable
read-only / offline Recovery: Recovery
Manual intervention
partition resolved Healthy --> Degraded: failure within tolerance Degraded --> Rebuilding: replacement / spare available Rebuilding --> Healthy: rebuild completes Degraded --> QuorumLoss: additional failure exceeds tolerance Healthy --> QuorumLoss: network partition splits cluster QuorumLoss --> Recovery: operator action Recovery --> Degraded: quorum restored Recovery --> Healthy: all nodes rejoined

Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics

Learning Objectives

Section 1: SpanFS Data Path

Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy

NVRAM journaling and write coalescing

Animated Write Path: Client to Durable Disk

Read path and locality optimizations

Garbage collection and chunk reclamation

Key Points

Section 2: Resiliency — RF and Erasure Coding

RF2 vs RF3 trade-offs

Erasure Coding schemes

Inline RF2 -> Post-process EC pipeline

Animated RF2 to EC Conversion

Figure 2.3: Erasure coding stripe placement decision flow

Key Points

Section 3: Deduplication and Compression

Variable-length sliding-window dedup

Global vs local dedup domains

Inline vs post-process compression

Planning ratios for sizing

Key Points

Section 4: Cluster Mechanics and Quorum

Strict consistency via Paxos

Fault domain levels

Quorum scenarios

Animated Quorum State Machine

Rolling upgrades

Key Points

Figure 2.4: Cluster consistency states and quorum transitions (Mermaid)

Your Progress

Answer Explanations