Chapter 3: Multi-Node Networking, Scaling & Performance Optimization

Learning Objectives

Section 1: ConnectX-7 SmartNIC & Network Architecture

Pre-Quiz: ConnectX-7 Capabilities & Direct Connections

1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?

A) RDMA encrypts data in transit, improving security B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency C) RDMA compresses data before transmission, reducing bandwidth usage D) RDMA allows wireless communication between GPUs

2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?

A) 100 Gb/s B) 400 Gb/s C) 200 Gb/s D) 50 Gb/s

3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?

A) Through a managed Ethernet switch with VLANs B) Via a direct QSFP cable between ConnectX-7 ports C) Using InfiniBand Host Channel Adapters D) Through a Wi-Fi mesh network

4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?

A) It allows GPUs to share their compute cores across nodes B) It transfers data directly between GPU memory on different nodes without CPU staging C) It doubles the clock speed of the GPU during network transfers D) It compresses model weights before sending them across the network

5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?

A) DHCP with automatic IP assignment B) IPv6 link-local addresses C) Static IP addresses on the ConnectX-7 interfaces D) mDNS/Bonjour for automatic discovery

ConnectX-7 200 Gb/s Network Interface and RDMA Support

Each DGX Spark ships with an NVIDIA ConnectX-7 SmartNIC, a dedicated networking processor that handles 200 Gb/s Ethernet traffic. The ConnectX-7 supports RDMA over Converged Ethernet (RoCE), which allows one machine's GPU to read from or write to another machine's memory directly, bypassing the operating system's network stack entirely.

In a traditional network transfer, data passes through multiple copies: application to kernel buffer, kernel buffer to NIC send buffer, across the wire, then the reverse on the receiving side. Each copy adds latency. RDMA eliminates these intermediate copies -- the NIC reads directly from GPU memory on one node and writes directly into GPU memory on the other. For AI workloads where nodes synchronize tensor data millions of times during inference, this difference is transformative.

Each DGX Spark node exposes two QSFP ports through the ConnectX-7, providing four RoCE interfaces total across the two physical ports.

sequenceDiagram
  participant App1 as Application (Node 1)
  participant K1 as OS Kernel (Node 1)
  participant NIC1 as ConnectX-7 NIC (Node 1)
  participant NIC2 as ConnectX-7 NIC (Node 2)
  participant K2 as OS Kernel (Node 2)
  participant App2 as Application (Node 2)
  Note over App1,App2: Traditional Network Transfer (multiple copies)
  App1->>K1: Copy data to kernel buffer
  K1->>NIC1: Copy to NIC send buffer
  NIC1->>NIC2: Wire transfer
  NIC2->>K2: Copy to kernel buffer
  K2->>App2: Copy to application memory
  Note over App1,App2: RDMA Transfer (zero-copy)
  App1->>NIC1: NIC reads directly from GPU memory
  NIC1->>NIC2: Wire transfer (200 Gb/s RoCE)
  NIC2->>App2: NIC writes directly to GPU memory

Direct Two-Node Scaling

The simplest multi-node configuration connects two DGX Sparks with a direct QSFP cable -- no switch required. A 0.5-meter QSFP cable between the ConnectX-7 ports creates a point-to-point 200 Gb/s link.

Node   | Interface     | IP Address
Node 1 | enP2p1s0f1np1 | 192.168.100.10/24
Node 2 | enP2p1s0f1np1 | 192.168.100.11/24
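The addressing above can be applied with standard Linux `ip` commands. The sketch below is for Node 1, using the interface name and IP from the table and the 9000-byte jumbo-frame MTU recommended later in this chapter; names may differ on your hardware:

```shell
# Bring up the ConnectX-7 interface with jumbo frames and a static IP
# (Node 1; mirror with 192.168.100.11/24 on Node 2).
sudo ip link set enP2p1s0f1np1 up mtu 9000
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1

# Verify the point-to-point link once both nodes are configured
ping -c 3 192.168.100.11
```

Note these settings do not persist across reboots; make them permanent with your distribution's network configuration (e.g., netplan on Ubuntu).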

Key Points: ConnectX-7 & Direct Connections

Pre-Quiz: Four-Node Topologies & Network Configuration

1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?

A) DGX Spark only has one network port B) Each node needs a path to every other node, and a switch provides that star topology C) The QSFP cables are too short for daisy-chaining D) The ConnectX-7 firmware doesn't support more than one connection

2. How much unified GPU memory does a four-node DGX Spark cluster provide?

A) 256 GB B) 128 GB C) 512 GB D) 1 TB

3. What is the recommended MTU setting for DGX Spark cluster network interfaces?

A) 1500 bytes (standard Ethernet) B) 4096 bytes C) 9000 bytes (jumbo frames) D) 65535 bytes (maximum IP packet)

4. What parallelism configuration does the three-node switchless mesh topology use?

A) TP=3, PP=1 B) PP=3, TP=1 C) TP=2, PP=2 D) TP=1, PP=1 with data parallelism

5. What is the minimum NCCL version required for DGX Spark multi-node operation?

A) v2.20.0 B) v2.28.3 C) v3.0.0 D) v2.25.1

Four-Node Cluster Topologies

Scaling beyond two nodes requires an Ethernet switch. Each node needs a path to every other node, and a switch provides that star topology. Community-tested switches include the MikroTik CRS804-4DDQ and CRS812, both supporting 200 GbE QSFP connections.

A four-node cluster provides 512 GB of unified memory (4 x 128 GB), enough to host models up to the 700-billion-parameter class, including the 397-billion-parameter Qwen3.5-397B benchmarked later in this chapter. For three-node clusters an alternative exists: a switchless mesh topology using PP=3/TP=1, in which each node connects directly to the other two.

Interactive: Multi-Node Cluster Topology with Data Flow

[Interactive diagram: two-node direct-cable topology versus four-node switch-based topology. In the two-node layout, two DGX Sparks (128 GB unified memory and a ConnectX-7 SmartNIC each) are linked by a single 200 Gb/s QSFP cable, for 256 GB total. In the four-node layout, a 200 GbE Ethernet switch (e.g., MikroTik CRS804-4DDQ) forms a star topology connecting four nodes, for 512 GB total, supporting 700B+ parameter models.]

Network Configuration, MTU Tuning, and GPUDirect RDMA

Configuration Step | Recommendation | Purpose
MTU                | 9000 bytes (jumbo frames) | Reduces per-packet overhead for large tensor transfers
IP addressing      | Static IPs on CX-7 interfaces | Eliminates DHCP latency; deterministic routing
GPUDirect RDMA     | Enable via NVIDIA drivers | Allows the NIC to access GPU memory directly without CPU staging
NCCL version       | v2.28.3 or later | Required collective communication library for multi-node GPU ops
OS requirements    | Ubuntu 24.04+ with current NVIDIA drivers | Baseline software environment for DGX Spark clustering
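For multi-node NCCL jobs on top of this configuration, a few environment variables steer traffic onto the ConnectX-7 link. The variable names below are standard NCCL settings, but the interface and device names are illustrative and must match your own system:

```shell
# Pin NCCL to the ConnectX-7 interface and RoCE device
export NCCL_SOCKET_IFNAME=enP2p1s0f1np1  # CX-7 interface from the table above
export NCCL_IB_HCA=mlx5                  # ConnectX-7 RoCE device prefix
export NCCL_DEBUG=INFO                   # log transport selection; confirm RoCE/GPUDirect is in use
```

With `NCCL_DEBUG=INFO`, the startup log shows which transport NCCL selected, a quick way to catch a cluster silently falling back to plain TCP.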

Key Points: Four-Node Topologies & Configuration

Post-Quiz: ConnectX-7 Capabilities & Direct Connections

1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?

A) RDMA encrypts data in transit, improving security B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency C) RDMA compresses data before transmission, reducing bandwidth usage D) RDMA allows wireless communication between GPUs

2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?

A) 100 Gb/s B) 400 Gb/s C) 200 Gb/s D) 50 Gb/s

3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?

A) Through a managed Ethernet switch with VLANs B) Via a direct QSFP cable between ConnectX-7 ports C) Using InfiniBand Host Channel Adapters D) Through a Wi-Fi mesh network

4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?

A) It allows GPUs to share their compute cores across nodes B) It transfers data directly between GPU memory on different nodes without CPU staging C) It doubles the clock speed of the GPU during network transfers D) It compresses model weights before sending them across the network

5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?

A) DHCP with automatic IP assignment B) IPv6 link-local addresses C) Static IP addresses on the ConnectX-7 interfaces D) mDNS/Bonjour for automatic discovery
Post-Quiz: Four-Node Topologies & Network Configuration

1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?

A) DGX Spark only has one network port B) Each node needs a path to every other node, and a switch provides that star topology C) The QSFP cables are too short for daisy-chaining D) The ConnectX-7 firmware doesn't support more than one connection

2. How much unified GPU memory does a four-node DGX Spark cluster provide?

A) 256 GB B) 128 GB C) 512 GB D) 1 TB

3. What is the recommended MTU setting for DGX Spark cluster network interfaces?

A) 1500 bytes (standard Ethernet) B) 4096 bytes C) 9000 bytes (jumbo frames) D) 65535 bytes (maximum IP packet)

4. What parallelism configuration does the three-node switchless mesh topology use?

A) TP=3, PP=1 B) PP=3, TP=1 C) TP=2, PP=2 D) TP=1, PP=1 with data parallelism

5. What is the minimum NCCL version required for DGX Spark multi-node operation?

A) v2.20.0 B) v2.28.3 C) v3.0.0 D) v2.25.1

Section 2: Distributed AI — Tensor & Pipeline Parallelism

Pre-Quiz: Tensor & Pipeline Parallelism

1. How does tensor parallelism distribute a neural network across multiple GPUs?

A) It assigns entire layers to different GPUs sequentially B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication C) It replicates the full model on every GPU and averages outputs D) It distributes different training data batches to each GPU

2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?

A) The decode phase is compute-bound, and TP2 doubles compute capacity B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth C) The 200 Gb/s link has zero latency overhead D) TP2 eliminates the need for NCCL communication entirely

3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?

A) Pipeline parallelism requires more network bandwidth B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages C) Pipeline parallelism cannot run on DGX Spark hardware D) Pipeline parallelism doubles memory usage on each node

4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?

A) Broadcast B) Scatter C) All-reduce D) Gather

5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?

A) 37 tok/s total B) 200 tok/s total C) 103 tok/s total D) 500 tok/s total

Tensor Parallelism (TP)

Tensor parallelism splits individual layers of a neural network across multiple GPUs. Each GPU computes a portion of every matrix multiplication, then partial results are combined via an all-reduce communication step. On DGX Spark, TP2 means the model is split across two nodes (256 GB total); TP4 means four nodes (512 GB total).

A practical TP2 deployment with vLLM:

vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 32768
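Once the server is up, it exposes vLLM's OpenAI-compatible API on the chosen port. A quick smoke test (the prompt and token limit are arbitrary):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": "Write a hello-world in C."}],
        "max_tokens": 128
      }'
```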

Interactive: Tensor Parallelism All-Reduce Data Flow (TP2)

[Interactive diagram: input token embeddings are broadcast to both nodes. Node 1 (128 GB) holds weight shard A, the first half of each layer's weight matrices; Node 2 (128 GB) holds shard B, the other half. Each node computes a partial matmul, then an NCCL all-reduce over the 200 Gb/s RoCE link -- every GPU sends its partial results to every other GPU and receives the combined sum -- produces the layer output that feeds the next transformer layer.]

Pipeline Parallelism (PP)

Pipeline parallelism assigns entire groups of layers to different GPUs. Node 1 handles layers 1-20, Node 2 handles layers 21-40, and so on. It communicates less frequently than tensor parallelism (only between stages, not at every layer), but introduces pipeline bubbles -- idle time when stages wait for data. PP is better suited for training workloads where micro-batching can fill the bubbles.

flowchart LR
  Input["Input Tokens"] --> N1
  subgraph N1["Node 1: Layers 1-20"]
    L1["Process Layers 1-20"]
  end
  subgraph N2["Node 2: Layers 21-40"]
    L2["Process Layers 21-40"]
  end
  subgraph N3["Node 3: Layers 41-60"]
    L3["Process Layers 41-60"]
  end
  N1 -->|"Send activations"| N2
  N2 -->|"Send activations"| N3
  N3 --> Output["Output Tokens"]

Scaling Benchmarks

Llama 3.3 70B NVFP4 (32K input, 1K output, batch=1):

Metric                       | 1-Node (TP1) | 2-Node (TP2) | Speedup
Time to First Token (TTFT)   | 33,415 ms    | 21,384 ms    | 1.56x
Time Per Output Token (TPOT) | 269 ms       | 133 ms       | 2.02x

Four-node TP4 results for Qwen3.5-397B-INT4:

Scenario                       | Throughput
Single user, 4-node TP4        | 37 tok/s
4 concurrent users, 4-node TP4 | 103 tok/s (total)

Key Points: Tensor & Pipeline Parallelism

Post-Quiz: Tensor & Pipeline Parallelism

1. How does tensor parallelism distribute a neural network across multiple GPUs?

A) It assigns entire layers to different GPUs sequentially B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication C) It replicates the full model on every GPU and averages outputs D) It distributes different training data batches to each GPU

2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?

A) The decode phase is compute-bound, and TP2 doubles compute capacity B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth C) The 200 Gb/s link has zero latency overhead D) TP2 eliminates the need for NCCL communication entirely

3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?

A) Pipeline parallelism requires more network bandwidth B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages C) Pipeline parallelism cannot run on DGX Spark hardware D) Pipeline parallelism doubles memory usage on each node

4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?

A) Broadcast B) Scatter C) All-reduce D) Gather

5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?

A) 37 tok/s total B) 200 tok/s total C) 103 tok/s total D) 500 tok/s total

Section 3: Inference Optimization Techniques

Pre-Quiz: Speculative Decoding & Attention Optimization

1. How does speculative decoding accelerate LLM token generation?

A) It uses a larger model to generate tokens faster B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass C) It skips attention computation for common tokens D) It compresses the vocabulary to reduce computation

2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?

A) 1.1-1.2x B) 2-3x C) 5-10x D) No speedup; it only improves quality

3. What is FlashAttention's key innovation compared to standard attention?

A) It uses a different mathematical formula for attention B) It processes attention in tiles without materializing the full N-by-N attention matrix C) It replaces softmax with a linear approximation D) It skips attention for tokens beyond position 1024

4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?

A) 1.1x B) 2.25x C) 4x D) 10x

5. What does kernel fusion accomplish in the inference pipeline?

A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead B) It merges multiple GPU cores into a single super-core C) It fuses CPU and GPU execution into a single pipeline D) It combines multiple models into a single architecture

Speculative Decoding

Speculative decoding uses a small, fast "draft" model to propose multiple token candidates (typically 3-12 tokens ahead), which the larger "target" model then verifies in a single forward pass. The key insight: verifying multiple tokens simultaneously is nearly as fast as generating a single token because verification can be parallelized within one forward pass.

On Blackwell GPUs, speculative decoding achieves 2-3x speedups (vs ~1.5x on Hopper). The most dramatic result: DFlash speculative decoding on Blackwell 6000 Pro reached ~429.69 tokens/s -- a 4.8x increase over the baseline 90.20 tokens/s.
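The "verify K tokens in one pass" arithmetic can be made concrete. Under the common simplifying assumption that each drafted token is accepted independently with probability p, the expected number of tokens emitted per target forward pass with draft length K is (1 - p^(K+1)) / (1 - p). The p = 0.8 below is illustrative, not a measured value:

```shell
# Expected tokens per target-model forward pass for draft length K=5
# and per-token acceptance probability p=0.8 (illustrative numbers).
awk 'BEGIN { p = 0.8; K = 5; print (1 - p^(K+1)) / (1 - p) }'
```

Roughly 3.7 tokens per forward pass under these assumptions, which is the mechanism behind the 2-3x speedups quoted above.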

Interactive: Speculative Decoding Draft-Verify-Accept Pipeline

[Interactive diagram: the draft-verify-accept pipeline. Step 1 (Draft): a small, fast draft model proposes K=5 candidate tokens. Step 2 (Verify): the large target model verifies all five candidates in a single forward pass. Step 3 (Accept/Correct): three tokens are accepted, one is rejected and corrected, one is skipped, yielding four output tokens from roughly one pass. Speedup comparison: without speculative decoding, one token per forward pass (~90 tok/s); with DFlash speculative decoding on Blackwell, ~4 tokens per pass (~430 tok/s, a 4.8x speedup).]

FlashAttention and Kernel Fusion

FlashAttention processes attention in tiles without materializing the full N-by-N attention matrix, reducing memory usage from O(N^2) to O(N). On Blackwell, optimized FlashInfer kernels deliver up to 2.25x higher decode throughput, achieving 85-90% tensor core utilization.

Kernel fusion combines LayerNorm, matrix multiplications, activations, and bias additions into single CUDA kernels, eliminating intermediate tensor writes. TensorRT-LLM applies these fusions automatically, achieving 4x throughput over native PyTorch.

Key Points: Speculative Decoding & Attention Optimization

Pre-Quiz: Quantization & Prefill/Decode Trade-offs

1. How much memory does NVFP4 quantization save compared to FP16?

A) 2x reduction (50% savings) B) 4x reduction (75% savings) C) 8x reduction (87.5% savings) D) No savings; NVFP4 only speeds up computation

2. What is the fundamental difference between the prefill and decode phases of LLM inference?

A) Prefill uses GPU while decode uses CPU B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time) C) Prefill handles text while decode handles images D) There is no difference; both phases have the same bottleneck

3. Why is quantization especially impactful for the decode phase on DGX Spark?

A) Decode is compute-bound, and quantization reduces computation B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token C) Quantization allows the decode phase to run on CPU instead of GPU D) Quantization eliminates the decode phase entirely

4. What is continuous batching?

A) Processing all requests in a single large batch that runs to completion B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode C) Batching requests by their input length for uniform processing D) Running multiple copies of the model in parallel

5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?

A) 1 TB/s; roughly half of HBM3e B) 273 GB/s; roughly 1/12 of HBM3e (~8 TB/s) C) 8 TB/s; equal to HBM3e D) 50 GB/s; roughly 1/100 of HBM3e

Quantization: FP4/FP8/INT8

Format     | Bits | Memory vs FP16 | Throughput Impact | Accuracy Impact
FP16/BF16  | 16   | 1x (baseline)  | Baseline          | Full precision
FP8        | 8    | 0.5x           | ~1.5-2x speedup   | Minimal
INT8 (AWQ) | 8    | 0.5x           | ~1.5-2x speedup   | Small; calibration-dependent
NVFP4      | 4    | 0.25x          | ~2.5x speedup     | Noticeable on edge cases
INT4       | 4    | 0.25x          | ~2.5x speedup     | Moderate; task-dependent

Quantization is often a necessity on DGX Spark. A 200B-parameter model in BF16 requires ~400 GB, far exceeding the 128 GB per node. At 4-bit precision, the same model fits in ~100 GB, making single-node inference possible. NVFP4 is specifically optimized for Blackwell's tensor cores.
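The footprint numbers in this paragraph follow directly from parameter count times bytes per weight:

```shell
# Memory needed for 200B parameters at each precision (weights only,
# ignoring KV-cache and activation overhead).
awk 'BEGIN {
  params = 200e9
  printf "BF16: %.0f GB\n", params * 2   / 1e9
  printf "FP8 : %.0f GB\n", params * 1   / 1e9
  printf "FP4 : %.0f GB\n", params * 0.5 / 1e9
}'
```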

Prefill vs Decode Phases

Phase   | Bottleneck       | Best Optimizations                                | DGX Spark Behavior
Prefill | Compute (FLOPS)  | Larger batches, kernel fusion, FlashAttention     | Fast; GPU cores well-utilized
Decode  | Memory bandwidth | Quantization, speculative decoding, multi-node TP | Slow; limited by 273 GB/s LPDDR5X

Continuous batching allows new requests to enter the batch as soon as a slot opens, overlapping prefill for new requests with decode for existing ones. DGX Spark's 273 GB/s LPDDR5X bandwidth (far below HBM3e's ~8 TB/s) makes quantization and multi-node TP the highest-impact interventions for decode performance.

Key Points: Quantization & Prefill/Decode

Post-Quiz: Speculative Decoding & Attention Optimization

1. How does speculative decoding accelerate LLM token generation?

A) It uses a larger model to generate tokens faster B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass C) It skips attention computation for common tokens D) It compresses the vocabulary to reduce computation

2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?

A) 1.1-1.2x B) 2-3x C) 5-10x D) No speedup; it only improves quality

3. What is FlashAttention's key innovation compared to standard attention?

A) It uses a different mathematical formula for attention B) It processes attention in tiles without materializing the full N-by-N attention matrix C) It replaces softmax with a linear approximation D) It skips attention for tokens beyond position 1024

4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?

A) 1.1x B) 2.25x C) 4x D) 10x

5. What does kernel fusion accomplish in the inference pipeline?

A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead B) It merges multiple GPU cores into a single super-core C) It fuses CPU and GPU execution into a single pipeline D) It combines multiple models into a single architecture
Post-Quiz: Quantization & Prefill/Decode Trade-offs

1. How much memory does NVFP4 quantization save compared to FP16?

A) 2x reduction (50% savings) B) 4x reduction (75% savings) C) 8x reduction (87.5% savings) D) No savings; NVFP4 only speeds up computation

2. What is the fundamental difference between the prefill and decode phases of LLM inference?

A) Prefill uses GPU while decode uses CPU B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time) C) Prefill handles text while decode handles images D) There is no difference; both phases have the same bottleneck

3. Why is quantization especially impactful for the decode phase on DGX Spark?

A) Decode is compute-bound, and quantization reduces computation B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token C) Quantization allows the decode phase to run on CPU instead of GPU D) Quantization eliminates the decode phase entirely

4. What is continuous batching?

A) Processing all requests in a single large batch that runs to completion B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode C) Batching requests by their input length for uniform processing D) Running multiple copies of the model in parallel

5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?

A) 1 TB/s; roughly half of HBM3e B) 273 GB/s; roughly 1/12 of HBM3e (~8 TB/s) C) 8 TB/s; equal to HBM3e D) 50 GB/s; roughly 1/100 of HBM3e

Section 4: Profiling, Bottleneck Analysis & Memory Management

Pre-Quiz: Profiling & Memory Management

1. What is the primary difference between Nsight Systems and Nsight Compute?

A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail C) Nsight Systems is free; Nsight Compute requires a license D) They are the same tool with different names

2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?

A) 95-100% B) 55-60%, with a contention floor around 80-90 GB/s C) 10-20% D) Bandwidth is not measurable during decode

3. What does the vLLM parameter --gpu-memory-utilization control?

A) The GPU clock speed during inference B) The fraction of GPU memory available for model weights and KV-cache combined C) The number of GPU cores used for computation D) The power consumption limit of the GPU

4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?

A) Smaller tiles use less memory bandwidth B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism C) Smaller tiles are compatible with LPDDR5X but larger tiles are not D) NVIDIA artificially limits tile sizes on desktop GPUs

5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?

A) Upgrading to faster memory modules B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x) C) Overclocking the GPU D) Using CPU offloading for model weights

Nsight Systems and Nsight Compute

Nsight Systems provides a system-wide view of GPU activity, CPU activity, memory transfers, and kernel execution timelines. Nsight Compute provides kernel-level detail: occupancy, memory throughput, instruction throughput, and roofline proximity.

On DGX Spark, profiling reveals that during decode, bandwidth utilization drops to 55-60% of the 273 GB/s peak, with a contention floor around 80-90 GB/s due to concurrent memory accesses (weights, KV-cache, activations).
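A typical capture workflow, sketched with the standard Nsight CLIs (the benchmark script name is a placeholder for your own serving or benchmark entry point):

```shell
# System-wide timeline: CUDA kernels, NVTX ranges, OS runtime calls
nsys profile --trace=cuda,nvtx,osrt -o decode_timeline \
    python benchmark_decode.py

# Kernel-level drill-down on the hot kernels found in the timeline
ncu --set full -o decode_kernels \
    python benchmark_decode.py
```

Start with the Nsight Systems timeline to find the longest-running kernels, then scope Nsight Compute to just those; `--set full` collects every metric section and slows execution considerably.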

Memory Bandwidth: The Fundamental Bottleneck

DGX Spark's GB10 SoC uses LPDDR5X at 8533 MT/s, delivering approximately 273 GB/s peak bandwidth. For a 70B-parameter model in FP16, reading all weights (~140 GB) takes ~0.51 seconds at peak bandwidth -- this sets a hard floor on per-token latency.
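The 0.51-second figure is simple division: every decoded token must stream the full weight set from memory at least once, so bytes per token divided by bandwidth sets a hard latency floor:

```shell
# Per-token decode floor for a 70B FP16 model (~140 GB of weights)
# at DGX Spark's 273 GB/s peak bandwidth.
awk 'BEGIN { printf "%.2f s per token\n", 140 / 273 }'
```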

Workload                     | Measured Bandwidth      | Tokens/sec | Notes
35B-A3B MoE (BF16, TP1)      | 178 GB/s (weight reads) | 30.3       | MoE routing creates bursty patterns
Llama 3B (BF16, FlashAttn-2) | Near peak               | 14-20      | ~25W power draw at 95% GPU util
General 200B (4-bit)         | ~273 GB/s limit         | 34-38      | Capacity-for-latency trade-off

Mitigation strategies, ranked by impact:

  1. Quantize aggressively: FP16 to NVFP4 = 4x fewer bytes per weight
  2. Scale to multiple nodes: Each node adds 273 GB/s bandwidth
  3. Use sparse MoE models: 35B MoE with 3B active reads only ~6 GB per step
  4. Fuse kernels: Eliminate intermediate tensor writes

KV-Cache Management

The KV-cache stores key and value tensors from previous tokens. It grows proportionally with sequence length and competes directly with model weight storage in the 128 GB unified memory. vLLM manages this via --gpu-memory-utilization (default 0.9; reduce to 0.7 for multi-node long-context serving).
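To see why the KV-cache competes seriously with weights for the 128 GB, estimate its size for one long sequence: 2 (keys and values) x layers x KV heads x head dimension x bytes per value x sequence length. The model dimensions below are illustrative (roughly a 70B-class model with grouped-query attention), not taken from the text:

```shell
# KV-cache for one 32K-token sequence at FP16 (illustrative dims).
awk 'BEGIN {
  layers = 80; kv_heads = 8; head_dim = 128; bytes = 2; seq = 32768
  printf "%.1f GB per sequence\n", 2 * layers * kv_heads * head_dim * bytes * seq / 1e9
}'
```

Serve a handful of such sequences concurrently and tens of gigabytes go to cache alone, which is why cache quantization and a lower --gpu-memory-utilization matter for long-context workloads.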

Technique                        | Memory Savings         | Trade-off
KV-cache quantization (INT8/FP8) | 50%                    | Marginal accuracy impact
Prefix caching                   | Variable               | Effective only with repeated system prompts
Sliding window attention         | Proportional to window | Limits effective context length
Sparse MoE model selection       | Indirect               | Architecture-dependent

Data Locality and NUMA-Aware Scheduling

Despite the unified memory architecture (Grace CPU + Blackwell GPU on NVLink-C2C), data locality still matters: memory pages closer to the GPU's memory controllers are accessed with lower latency, so NUMA-aware scheduling keeps hot buffers on the pages nearest the engine that uses them.

Optimal tile sizes for DGX Spark: 64x64 with occupancy 2 for the GB10's 48 SMs at SM 12.1 compute capability. Larger tiles that perform well on datacenter GPUs underperform on DGX Spark.

flowchart TD
  A["Run vLLM with representative workload"] --> B["Capture Nsight Systems trace"]
  B --> C["Identify longest-running CUDA kernels"]
  C --> D{"Kernel bottleneck type?"}
  D -->|"Compute-bound (prefill)"| E["Increase arithmetic intensity"]
  D -->|"Memory-bound (decode)"| F["Reduce bytes per operation"]
  E --> G["Optimize batch size and occupancy"]
  F --> H["Apply quantization (NVFP4)"]
  F --> I["Fuse kernels to eliminate writes"]
  G --> J["Re-profile and validate"]
  H --> J
  I --> J
  J --> K{"Target throughput met?"}
  K -->|No| C
  K -->|Yes| L["Deploy optimized configuration"]

Key Points: Profiling & Memory Management

Post-Quiz: Profiling & Memory Management

1. What is the primary difference between Nsight Systems and Nsight Compute?

A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail C) Nsight Systems is free; Nsight Compute requires a license D) They are the same tool with different names

2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?

A) 95-100% B) 55-60%, with a contention floor around 80-90 GB/s C) 10-20% D) Bandwidth is not measurable during decode

3. What does the vLLM parameter --gpu-memory-utilization control?

A) The GPU clock speed during inference B) The fraction of GPU memory available for model weights and KV-cache combined C) The number of GPU cores used for computation D) The power consumption limit of the GPU

4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?

A) Smaller tiles use less memory bandwidth B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism C) Smaller tiles are compatible with LPDDR5X but larger tiles are not D) NVIDIA artificially limits tile sizes on desktop GPUs

5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?

A) Upgrading to faster memory modules B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x) C) Overclocking the GPU D) Using CPU offloading for model weights


Answer Explanations