Chapter 2: SpanFS Internals, Distributed Storage, and Cluster Mechanics
Learning Objectives
Explain how SpanFS chunks, fingerprints, and stores data across nodes using chunk files, blob files, and the SnapTree metadata index.
Describe Replication Factor (RF) and Erasure Coding (EC) trade-offs, including stripe placement, minimum cluster sizes, and when each scheme is appropriate.
Trace a write I/O end-to-end from client through the Bridge service, NVRAM journal, IO Engine, and into chunk files on persistent media.
Predict how a cluster behaves under disk failure, node failure, chassis failure, and quorum loss, including rebuild times and management-plane availability.
Choose deduplication and compression policies (inline vs. post-process, variable-length scope) appropriate for VM, NAS, and database workloads.
Section 1: SpanFS Data Path
Pre-Quiz: SpanFS Data Path
1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?
To bypass deduplication on the hot pathTo absorb random small writes and ACK clients with single-digit-ms latencyBecause HDDs cannot store user data, only metadataTo force every chunk through compression twice
2. What role does SnapTree play in SpanFS?
It is the network fabric between nodesA copy-on-write B+ tree mapping logical offsets to chunk file referencesA dedicated SSD partition for the dedup hash tableA backup orchestration scheduler
3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?
Chunk fileBlob fileChunk groupSnapTree leaf
4. When does a deleted snapshot's capacity actually return to the pool?
Instantly upon deleteOnly after garbage collection drops chunk reference counts to zero (often hours later)After the next full backup cycleWhen the cluster is rebooted
5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?
Because no single node holds any dataChunk group fragments span the cluster, so parallel fetch maximizes throughputTo avoid the dedup hash table on cold readsTo force the cluster into degraded mode
SpanFS stores user data as chunks, not files. The IO Engine runs a Rabin rolling hash across the byte stream and slices it at content-defined boundaries. Each resulting chunk is fingerprinted with SHA-1 and looked up in the cluster-wide deduplication hash table.
Three on-disk constructs hold these chunks together:
Chunk file — smallest unit of user data persisted to disk after dedup and compression.
Blob file — container that aggregates many chunk files belonging to a logical object.
Chunk group — the resiliency unit that is RF-replicated or EC-striped as a whole.
The relationship is captured in SnapTree, a distributed B+ tree atop the cluster-wide key-value store. SnapTree supports copy-on-write, so a snapshot or clone is just a new root pointer.
Figure 2.1: Chunk file, Blob file, and Chunk group hierarchy
flowchart LR
Client[Client object VM disk / file] --> Blob[Blob file logical container]
Blob --> ST[SnapTree root B+ tree index]
ST --> CF1[Chunk file deduped + compressed]
ST --> CF2[Chunk file deduped + compressed]
ST --> CF3[Chunk file deduped + compressed]
CF1 --> CG[Chunk group resiliency unit]
CF2 --> CG
CF3 --> CG
CG --> RF[RF2/RF3 copies]
CG --> EC[EC stripe across fault domains]
NVRAM journaling and write coalescing
Every node includes an NVRAM region (battery- or flash-backed SSD) that survives power loss. The mechanism follows a journal-then-checkpoint pattern: append to NVRAM log, mirror to peer NVRAM (matching RF), ACK client, then a background destage coalesces small NVRAM entries into large sequential disk writes.
Animated Write Path: Client to Durable Disk
SpanFS Write I/O Path (sequential)
Read path and locality optimizations
Reads exploit locality. Cohesity prefers to ingest, dedup, and persist data on the same node that received it, so subsequent reads of recent data are local. Read cache (DRAM and SSD) is per-node and warmed by access patterns. For sequential restore workloads, SpanFS issues parallel reads to multiple nodes simultaneously since the chunk group's fragments span the cluster.
Garbage collection and chunk reclamation
Because chunks are deduplicated, deleting a backup or expiring a snapshot does not immediately free space. A chunk is reclaimable only when its reference count in the distributed KV store drops to zero. GC walks the SnapTree, decrements ref counts, compacts blob files, and re-runs erasure coding on cold blob files. GC is throttled, so capacity reclaim typically lags deletion events by 24-72 hours.
Key Points
SpanFS is journal-then-checkpoint: NVRAM absorbs random writes; destage coalesces them into sequential chunk file writes.
Chunk file = smallest unit; Blob file = logical container; Chunk group = the resiliency unit (RF/EC stripe).
SnapTree is a copy-on-write B+ tree, so snapshots and clones cost zero capacity until divergence.
Reads exploit locality (per-node cache) and use parallel fetch across the cluster for large restores.
GC is asynchronous and throttled — plan capacity headroom assuming 24-72h reclaim lag after deletes.
Post-Quiz: SpanFS Data Path
1. Why does SpanFS journal incoming writes to NVRAM before destaging them to HDD?
To bypass deduplication on the hot pathTo absorb random small writes and ACK clients with single-digit-ms latencyBecause HDDs cannot store user data, only metadataTo force every chunk through compression twice
2. What role does SnapTree play in SpanFS?
It is the network fabric between nodesA copy-on-write B+ tree mapping logical offsets to chunk file referencesA dedicated SSD partition for the dedup hash tableA backup orchestration scheduler
3. Which on-disk construct is the resiliency unit that gets RF-replicated or EC-striped?
Chunk fileBlob fileChunk groupSnapTree leaf
4. When does a deleted snapshot's capacity actually return to the pool?
Instantly upon deleteOnly after garbage collection drops chunk reference counts to zero (often hours later)After the next full backup cycleWhen the cluster is rebooted
5. For an Instant Mass Restore of a 1 TB VM, why does SpanFS issue parallel reads to multiple nodes?
Because no single node holds any dataChunk group fragments span the cluster, so parallel fetch maximizes throughputTo avoid the dedup hash table on cold readsTo force the cluster into degraded mode
Section 2: Resiliency — RF and Erasure Coding
Pre-Quiz: RF and Erasure Coding
1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?
EC 4:2 needs only 3 nodes, RF3 needs 8EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPUEC 4:2 is faster on writes than RF3EC 4:2 only works for SSD-only clusters
2. Why does Cohesity use a two-stage RF2-then-EC pipeline?
Because EC cannot survive any failureRF2 minimizes write latency; background EC re-encodes cold data for capacity efficiencyRF2 is required for compliance and EC is optionalEC encoding is too dangerous to run synchronously, ever
3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?
Nothing — chassis fault domain is the defaultEC stripes still place normally because the cluster auto-falls-back to node levelThe cluster cannot satisfy stripe placement — fault-domain count is oneAll data is automatically replicated off-site
4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?
EC never actually runs in CohesityFreshly ingested chunks live at RF2 until background EC conversion completesRF2 is cheaper than EC at all timesEC 4:2 always requires double the capacity of RF2
5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?
~67 TB~50 TB~33 TB~75 TB
A SpanFS cluster keeps data safe through two complementary mechanisms: Replication Factor (RF) and Erasure Coding (EC). Both can coexist within a single cluster — even within a single View Box.
RF2 vs RF3 trade-offs
RF2: tolerates 1 simultaneous failure, ~50% usable, 100% overhead. Fast block-level rebuild. Best for hot data.
RF3: tolerates 2 simultaneous failures, ~33% usable, 200% overhead. Required for strict SLAs / small clusters.
Erasure Coding schemes
Scheme
Data
Parity
Min FDs
Failures Tolerated
Usable
Overhead
RF2
1
1 mirror
3 nodes
1
~50%
100%
RF3
1
2 mirrors
4 nodes
2
~33%
200%
EC 2:1
2
1
4 nodes
1
~67%
50%
EC 4:2
4
2
6 nodes
2
~67%
50%
EC 6:2
6
2
8 nodes
2
~75%
33%
Headline: EC 4:2 and RF3 tolerate the same number of failures, but EC 4:2 uses half the overhead. The cost is encoding CPU and a more complex rebuild path.
Inline RF2 -> Post-process EC pipeline
Cohesity rarely writes EC-protected chunks directly on the hot ingest path. Instead:
Stage 1 (Inline RF2): Hot chunks land in NVRAM and SSD with RF2 protection. Mirroring is cheap, latency is low.
Stage 2 (Post-process EC): A background task identifies cold chunk groups, re-encodes under the View Box's EC policy (e.g., 4:2), persists the new EC stripe, and frees the RF2 copies.
EC 4:2 matches RF3 fault tolerance with only 50% overhead, but needs 6 fault domains and CPU for encoding.
Cohesity blends both: RF2 inline for low write latency, EC in background for cold-data efficiency.
Always size for fault-domain count, not raw node count — EC schemes have hard FD minimums.
Day-one capacity must assume RF2 because freshly ingested data lives at RF2 until conversion.
Post-Quiz: RF and Erasure Coding
1. EC 4:2 matches RF3's failure tolerance. What is the headline trade-off?
EC 4:2 needs only 3 nodes, RF3 needs 8EC 4:2 uses ~50% overhead vs RF3's 200%, but requires 6 fault domains and more CPUEC 4:2 is faster on writes than RF3EC 4:2 only works for SSD-only clusters
2. Why does Cohesity use a two-stage RF2-then-EC pipeline?
Because EC cannot survive any failureRF2 minimizes write latency; background EC re-encodes cold data for capacity efficiencyRF2 is required for compliance and EC is optionalEC encoding is too dangerous to run synchronously, ever
3. A 4-node ReadyNode chassis is configured at chassis-level fault domain. What happens?
Nothing — chassis fault domain is the defaultEC stripes still place normally because the cluster auto-falls-back to node levelThe cluster cannot satisfy stripe placement — fault-domain count is oneAll data is automatically replicated off-site
4. Why must architects size capacity assuming RF2 for hot data even when the View Box policy is EC 4:2?
EC never actually runs in CohesityFreshly ingested chunks live at RF2 until background EC conversion completesRF2 is cheaper than EC at all timesEC 4:2 always requires double the capacity of RF2
5. RF3 yields what approximate usable capacity from a 100 TB raw cluster?
~67 TB~50 TB~33 TB~75 TB
Section 3: Deduplication and Compression
Pre-Quiz: Deduplication and Compression
1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?
It uses smaller blocks, so more matchesIt is insertion-resilient: a small header prepend creates one new chunk, not all-new fingerprintsIt runs only on SSD mediaIt compresses every chunk twice for higher ratios
2. What is the dedup scope in Cohesity?
Per cluster — everything dedupes against everythingPer View Box — separate hash tables across View BoxesPer node — each node has its own tablePer Helios tenant globally across clouds
3. Per-tenant encryption keys break dedup across tenants. Why?
Cohesity disables dedup whenever encryption is onIdentical plaintext encrypts to different ciphertext under different keys, so fingerprints differEncryption and dedup share the same hash and conflictEncryption multiplies chunk sizes by 10x
4. Which workload typically yields ~1x dedup and compression combined?
VMware Windows VM backupsMicrosoft 365 mailboxesAlready-compressed media (video, ZIP) or encrypted source dataOracle full backups
5. When is post-process compression preferable to inline?
When you must absolutely minimize write-path CPU and accept transient capacityWhen dedup is disabledWhen the cluster is in quorum lossNever — inline always wins
Cohesity's headline storage-efficiency multipliers come from global, variable-length deduplication combined with inline compression.
Variable-length sliding-window dedup
The IO Engine slides a Rabin rolling hash across the byte stream.
When the rolling hash matches a content-defined breakpoint pattern, a chunk boundary is declared.
Each resulting chunk gets a SHA-1 fingerprint.
The fingerprint is queried against the global hash table; if found, only a metadata reference is stored.
The advantage is insertion-resilience: a 1 KB header prepend produces one new chunk, not all new chunks like fixed-block schemes. This can be a 5-10x dedup ratio difference for incremental and synthetic-full workloads.
Global vs local dedup domains
The dedup hash table lives in the distributed KV store, but its scope is the View Box. Two View Boxes maintain separate hash tables. Architectural consequences:
Per-tenant View Boxes give isolation but lose cross-tenant dedup.
Group similar workloads (all VMware backups) in the same View Box to maximize dedup.
Per-tenant encryption keys break dedup across tenants regardless of View Box settings — identical plaintext encrypts differently.
Inline vs post-process compression
Inline compression (LZ4 / zstd-class): each unique chunk compressed before disk write. Lower disk usage on first write; small CPU cost on ingest.
Post-process compression: chunks written uncompressed first (faster ACK), then compressed by background task. Better latency at the cost of transient capacity.
Planning ratios for sizing
Workload
Dedup
Compression
Combined
VMware mixed Windows VMs
4-6x
1.5-2x
6-12x
Oracle / SQL fulls
3-5x
1.5-2x
4-10x
M365 / SharePoint
3-5x
1.3-1.5x
4-8x
File shares / SmartFiles
1.5-3x
1.3-2x
2-6x
Pre-compressed media / encrypted source
~1x
~1x
~1x
Key Points
SpanFS dedup is variable-length (Rabin), SHA-1 fingerprinted, and content-defined — insertion-resilient.
Dedup scope is the View Box. Multi-tenant View Box separation trades dedup for isolation.
Per-tenant encryption keys break dedup across tenants because ciphertext differs.
Group similar workloads in one View Box; mixing dilutes ratios.
Pre-compressed and encrypted source data dedup at ~1x — never apply VM ratios to those.
Post-Quiz: Deduplication and Compression
1. Why does variable-length (Rabin) chunking outperform fixed-block dedup on incremental backups?
It uses smaller blocks, so more matchesIt is insertion-resilient: a small header prepend creates one new chunk, not all-new fingerprintsIt runs only on SSD mediaIt compresses every chunk twice for higher ratios
2. What is the dedup scope in Cohesity?
Per cluster — everything dedupes against everythingPer View Box — separate hash tables across View BoxesPer node — each node has its own tablePer Helios tenant globally across clouds
3. Per-tenant encryption keys break dedup across tenants. Why?
Cohesity disables dedup whenever encryption is onIdentical plaintext encrypts to different ciphertext under different keys, so fingerprints differEncryption and dedup share the same hash and conflictEncryption multiplies chunk sizes by 10x
4. Which workload typically yields ~1x dedup and compression combined?
VMware Windows VM backupsMicrosoft 365 mailboxesAlready-compressed media (video, ZIP) or encrypted source dataOracle full backups
5. When is post-process compression preferable to inline?
When you must absolutely minimize write-path CPU and accept transient capacityWhen dedup is disabledWhen the cluster is in quorum lossNever — inline always wins
Section 4: Cluster Mechanics and Quorum
Pre-Quiz: Cluster Mechanics and Quorum
1. SpanFS guarantees strict consistency for metadata. How is that implemented?
A leader node makes all decisionsA Paxos-style consensus over the distributed KV store with quorum acksEventual consistency with periodic reconciliationDNS-based serialization
2. A 3-node cluster loses 2 nodes simultaneously. What happens?
Cluster keeps writing as normalQuorum is lost — cluster goes read-only or offline; manual recovery requiredEC 4:2 still tolerates the lossThe third node automatically clones itself
3. Why is a 4-node cluster splitting 2/2 a worst case?
EC 4:2 is unavailable on 4 nodesNeither side has majority quorum — both sides go offline to prevent split-brain4 is not a supported cluster sizeIt triggers an automatic failover to the cloud
4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?
It improves because GC runs fasterIt temporarily drops to zero — one node already in maintenance, no further failure toleratedIt is unaffectedIt permanently increases by one
5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?
Improved resiliency at no costThe cluster cannot place RF/EC stripes — only one chassis exists as a fault domainDedup ratio doublesAll workloads automatically migrate to RF3
A SpanFS cluster is a peer-to-peer system: every node runs the same services. Coordination uses explicit consensus, fault-domain awareness, and disciplined upgrade procedures.
Strict consistency via Paxos
SpanFS guarantees strict consistency for both data and metadata. Reads always observe the most recent durably committed write. With three KV replicas per record, a metadata write requires at least two acks. Two clients writing the same offset never see split-brain.
Fault domain levels
Fault Domain
Description
Cluster Shape
Disk
Two copies/fragments may share a node
Default; smallest clusters
Node
Each copy/fragment on a different node
Recommended baseline
Chassis
Each copy/fragment on a different chassis
Multi-node-per-chassis
Rack
Each copy/fragment on a different rack
Stretched / large clusters
Quorum scenarios
3-node, 1 down: 2/3 online, quorum holds, rebuild proceeds.
3-node, 2 down: quorum lost, cluster goes read-only/offline.
6-node, 2 down: 4/6 online, EC 4:2 still tolerates the second failure.
Network partition 3/3: neither side has majority — both go offline to prevent split-brain.
Animated Quorum State Machine
Cluster consistency states: Healthy -> Degraded -> Quorum Loss -> Recovery
Rolling upgrades
SpanFS upgrades are rolling: one node at a time enters maintenance, drains leadership and VIPs, upgrades, reboots, and rejoins. During maintenance: quorum tightens, effective resiliency drops by one, in-flight client connections migrate via VIP failover. For RF2 clusters this means upgrade temporarily reduces fault tolerance to zero. RF3 or EC 4:2 is preferred for mission-critical systems so that maintenance plus an unplanned failure does not equal data loss.
Key Points
SpanFS uses Paxos-style consensus over the distributed KV store; metadata writes need quorum acks.
Quorum = majority of nodes/replicas. Lose majority -> read-only/offline; partition splits also halt both sides.
Configure fault domains thoughtfully: chassis-level FD on a 1-chassis cluster makes EC unplaceable.
Even-numbered cluster sizes (esp. 4 nodes) risk 2/2 partition tie — avoid where possible.
Rolling upgrade lowers effective fault tolerance by one; RF2 clusters become single-failure-vulnerable.
Post-Quiz: Cluster Mechanics and Quorum
1. SpanFS guarantees strict consistency for metadata. How is that implemented?
A leader node makes all decisionsA Paxos-style consensus over the distributed KV store with quorum acksEventual consistency with periodic reconciliationDNS-based serialization
2. A 3-node cluster loses 2 nodes simultaneously. What happens?
Cluster keeps writing as normalQuorum is lost — cluster goes read-only or offline; manual recovery requiredEC 4:2 still tolerates the lossThe third node automatically clones itself
3. Why is a 4-node cluster splitting 2/2 a worst case?
EC 4:2 is unavailable on 4 nodesNeither side has majority quorum — both sides go offline to prevent split-brain4 is not a supported cluster sizeIt triggers an automatic failover to the cloud
4. During a rolling upgrade of an RF2 cluster, what happens to fault tolerance?
It improves because GC runs fasterIt temporarily drops to zero — one node already in maintenance, no further failure toleratedIt is unaffectedIt permanently increases by one
5. An architect promotes fault domain from node to chassis on a 4-node single-chassis cluster. Result?
Improved resiliency at no costThe cluster cannot place RF/EC stripes — only one chassis exists as a fault domainDedup ratio doublesAll workloads automatically migrate to RF3
Figure 2.4: Cluster consistency states and quorum transitions (Mermaid)
stateDiagram-v2
[*] --> Healthy
Healthy: Healthy All nodes online full quorum + RF/EC
Degraded: Degraded Node/disk down quorum maintained
Rebuilding: Rebuilding Background reconstruction fragments restored
QuorumLoss: Quorum Loss Majority unreachable read-only / offline
Recovery: Recovery Manual intervention partition resolved
Healthy --> Degraded: failure within tolerance
Degraded --> Rebuilding: replacement / spare available
Rebuilding --> Healthy: rebuild completes
Degraded --> QuorumLoss: additional failure exceeds tolerance
Healthy --> QuorumLoss: network partition splits cluster
QuorumLoss --> Recovery: operator action
Recovery --> Degraded: quorum restored
Recovery --> Healthy: all nodes rejoined