Pre-Quiz: ConnectX-7 Capabilities & Direct Connections
1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?
A) RDMA encrypts data in transit, improving security
B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency
C) RDMA compresses data before transmission, reducing bandwidth usage
D) RDMA allows wireless communication between GPUs
2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?
A) 100 Gb/s
B) 400 Gb/s
C) 200 Gb/s
D) 50 Gb/s
3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?
A) Through a managed Ethernet switch with VLANs
B) Via a direct QSFP cable between ConnectX-7 ports
C) Using InfiniBand Host Channel Adapters
D) Through a Wi-Fi mesh network
4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?
A) It allows GPUs to share their compute cores across nodes
B) It transfers data directly between GPU memory on different nodes without CPU staging
C) It doubles the clock speed of the GPU during network transfers
D) It compresses model weights before sending them across the network
5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?
A) DHCP with automatic IP assignment
B) IPv6 link-local addresses
C) Static IP addresses on the ConnectX-7 interfaces
D) mDNS/Bonjour for automatic discovery
ConnectX-7 200 Gb/s Network Interface and RDMA Support
Each DGX Spark ships with an NVIDIA ConnectX-7 SmartNIC, a dedicated networking processor that handles 200 Gb/s Ethernet traffic. The ConnectX-7 supports RDMA over Converged Ethernet (RoCE), which allows one machine's GPU to read from or write to another machine's memory directly, bypassing the operating system's network stack entirely.
In a traditional network transfer, data passes through multiple copies: application to kernel buffer, kernel buffer to NIC send buffer, across the wire, then the reverse on the receiving side. Each copy adds latency. RDMA eliminates these intermediate copies -- the NIC reads directly from GPU memory on one node and writes directly into GPU memory on the other. For AI workloads where nodes synchronize tensor data millions of times during inference, this difference is transformative.
Each DGX Spark node exposes two QSFP ports through the ConnectX-7, providing four RoCE interfaces total across the two physical ports.
sequenceDiagram
participant App1 as Application (Node 1)
participant K1 as OS Kernel (Node 1)
participant NIC1 as ConnectX-7 NIC (Node 1)
participant NIC2 as ConnectX-7 NIC (Node 2)
participant K2 as OS Kernel (Node 2)
participant App2 as Application (Node 2)
Note over App1,App2: Traditional Network Transfer (multiple copies)
App1->>K1: Copy data to kernel buffer
K1->>NIC1: Copy to NIC send buffer
NIC1->>NIC2: Wire transfer
NIC2->>K2: Copy to kernel buffer
K2->>App2: Copy to application memory
Note over App1,App2: RDMA Transfer (zero-copy)
App1->>NIC1: NIC reads directly from GPU memory
NIC1->>NIC2: Wire transfer (200 Gb/s RoCE)
NIC2->>App2: NIC writes directly to GPU memory
Direct Two-Node Scaling
The simplest multi-node configuration connects two DGX Sparks with a direct QSFP cable -- no switch required. A 0.5-meter QSFP cable between the ConnectX-7 ports creates a point-to-point 200 Gb/s link.
| Node | Interface | IP Address |
| --- | --- | --- |
| Node 1 | enP2p1s0f1np1 | 192.168.100.10/24 |
| Node 2 | enP2p1s0f1np1 | 192.168.100.11/24 |
Key Points: ConnectX-7 & Direct Connections
ConnectX-7 SmartNIC provides 200 Gb/s Ethernet with RoCE (RDMA over Converged Ethernet)
RDMA bypasses the OS kernel, enabling zero-copy GPU-to-GPU data transfers across nodes
Two DGX Sparks connect directly via a single QSFP cable -- no switch needed
Static IP addresses on ConnectX-7 interfaces eliminate DHCP latency and ensure deterministic routing
GPUDirect RDMA enables the NIC to access GPU memory directly without CPU staging
Pre-Quiz: Four-Node Topologies & Configuration
1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?
A) DGX Spark only has one network port
B) Each node needs a path to every other node, and a switch provides that star topology
C) The QSFP cables are too short for daisy-chaining
D) The ConnectX-7 firmware doesn't support more than one connection
2. How much unified GPU memory does a four-node DGX Spark cluster provide?
A) 256 GB
B) 128 GB
C) 512 GB
D) 1 TB
3. What is the recommended MTU setting for DGX Spark cluster network interfaces?
A) 1500 bytes (standard Ethernet)
B) 4096 bytes
C) 9000 bytes (jumbo frames)
D) 65535 bytes (maximum IP packet)
4. What parallelism configuration does the three-node switchless mesh topology use?
A) TP=3, PP=1
B) PP=3, TP=1
C) TP=2, PP=2
D) TP=1, PP=1 with data parallelism
5. What is the minimum NCCL version required for DGX Spark multi-node operation?
A) v2.20.0
B) v2.28.3
C) v3.0.0
D) v2.25.1
Four-Node Cluster Topologies
Scaling beyond two nodes requires an Ethernet switch. Each node needs a path to every other node, and a switch provides that star topology. Community-tested switches include the MikroTik CRS804-4DDQ and CRS812, both supporting 200 GbE QSFP connections.
A four-node cluster provides 512 GB of unified memory (4 x 128 GB), enough to host quantized models up to the 700-billion-parameter class, including Qwen3.5-397B. An alternative exists for three-node clusters: a switchless mesh topology using PP=3/TP=1, where each node connects directly to the other two.
Interactive: Multi-Node Cluster Topology with Data Flow
Network Configuration, MTU Tuning, and GPUDirect RDMA
| Configuration Step | Recommendation | Purpose |
| --- | --- | --- |
| MTU | 9000 bytes (jumbo frames) | Reduces per-packet overhead for large tensor transfers |
| IP addressing | Static IPs on CX-7 interfaces | Eliminates DHCP latency, deterministic routing |
| GPUDirect RDMA | Enable via NVIDIA drivers | Allows NIC to access GPU memory directly without CPU staging |
| NCCL version | v2.28.3 or later | Required collective communication library for multi-node GPU ops |
| OS requirements | Ubuntu 24.04+ with current NVIDIA drivers | Baseline software environment for DGX Spark clustering |
Key Points: Four-Node Topologies & Configuration
Four-node clusters require a 200 GbE switch (e.g., MikroTik CRS804-4DDQ) for star topology
Four nodes provide 512 GB aggregate memory, enabling 700B-class models
Three-node switchless mesh uses PP=3/TP=1 (pipeline parallelism, no tensor parallelism)
Jumbo frames (MTU 9000) reduce per-packet overhead for large tensor transfers
NCCL v2.28.3+ is required, with optimizations for Grace Blackwell and RoCE transport
Post-Quiz: ConnectX-7 Capabilities & Direct Connections
1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?
A) RDMA encrypts data in transit, improving security
B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency
C) RDMA compresses data before transmission, reducing bandwidth usage
D) RDMA allows wireless communication between GPUs
2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?
A) 100 Gb/s
B) 400 Gb/s
C) 200 Gb/s
D) 50 Gb/s
3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?
A) Through a managed Ethernet switch with VLANs
B) Via a direct QSFP cable between ConnectX-7 ports
C) Using InfiniBand Host Channel Adapters
D) Through a Wi-Fi mesh network
4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?
A) It allows GPUs to share their compute cores across nodes
B) It transfers data directly between GPU memory on different nodes without CPU staging
C) It doubles the clock speed of the GPU during network transfers
D) It compresses model weights before sending them across the network
5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?
A) DHCP with automatic IP assignment
B) IPv6 link-local addresses
C) Static IP addresses on the ConnectX-7 interfaces
D) mDNS/Bonjour for automatic discovery
Post-Quiz: Four-Node Topologies & Configuration
1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?
A) DGX Spark only has one network port
B) Each node needs a path to every other node, and a switch provides that star topology
C) The QSFP cables are too short for daisy-chaining
D) The ConnectX-7 firmware doesn't support more than one connection
2. How much unified GPU memory does a four-node DGX Spark cluster provide?
A) 256 GB
B) 128 GB
C) 512 GB
D) 1 TB
3. What is the recommended MTU setting for DGX Spark cluster network interfaces?
A) 1500 bytes (standard Ethernet)
B) 4096 bytes
C) 9000 bytes (jumbo frames)
D) 65535 bytes (maximum IP packet)
4. What parallelism configuration does the three-node switchless mesh topology use?
A) TP=3, PP=1
B) PP=3, TP=1
C) TP=2, PP=2
D) TP=1, PP=1 with data parallelism
5. What is the minimum NCCL version required for DGX Spark multi-node operation?
A) v2.20.0
B) v2.28.3
C) v3.0.0
D) v2.25.1
Section 2: Distributed AI — Tensor & Pipeline Parallelism
Pre-Quiz: Tensor & Pipeline Parallelism
1. How does tensor parallelism distribute a neural network across multiple GPUs?
A) It assigns entire layers to different GPUs sequentially
B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication
C) It replicates the full model on every GPU and averages outputs
D) It distributes different training data batches to each GPU
2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?
A) The decode phase is compute-bound, and TP2 doubles compute capacity
B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth
C) The 200 Gb/s link has zero latency overhead
D) TP2 eliminates the need for NCCL communication entirely
3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?
A) Pipeline parallelism requires more network bandwidth
B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages
C) Pipeline parallelism cannot run on DGX Spark hardware
D) Pipeline parallelism doubles memory usage on each node
4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?
A) Broadcast
B) Scatter
C) All-reduce
D) Gather
5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?
A) 37 tok/s total
B) 200 tok/s total
C) 103 tok/s total
D) 500 tok/s total
Tensor Parallelism (TP)
Tensor parallelism splits individual layers of a neural network across multiple GPUs. Each GPU computes a portion of every matrix multiplication, then partial results are combined via an all-reduce communication step. On DGX Spark, TP2 means the model is split across two nodes (256 GB total); TP4 means four nodes (512 GB total).
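The arithmetic behind the split-then-all-reduce pattern can be seen in a toy example. The sketch below is plain Python with no GPU or NCCL involved: two hypothetical "devices" each hold half of one linear layer's weights (row-parallel style), each computes a partial product, and summing the partials -- the role the all-reduce plays -- recovers the full result exactly.

```python
def matmul(A, B):
    """Naive dense matmul for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def allreduce_sum(A, B):
    """Elementwise sum of two partials -- the role NCCL all-reduce plays."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# One linear layer: Y = X @ W, X is 1x4, W is 4x2
X = [[1.0, 2.0, 3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
Y_full = matmul(X, W)

# TP2, row-parallel: device 0 holds W rows 0-1 (and sees X columns 0-1),
# device 1 holds W rows 2-3 (and X columns 2-3). Each computes a partial
# product; summing the partials (the all-reduce) recovers the exact result.
partial0 = matmul([[1.0, 2.0]], W[:2])
partial1 = matmul([[3.0, 4.0]], W[2:])
Y_tp2 = allreduce_sum(partial0, partial1)

print(Y_full == Y_tp2, Y_full)  # True [[12.0, 1.0]]
```

Because this sum must happen at every layer, TP's communication volume scales with layer count -- which is why fast RoCE links matter so much for TP2/TP4.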
Interactive: Tensor Parallelism All-Reduce Data Flow (TP2)
Pipeline Parallelism (PP)
Pipeline parallelism assigns entire groups of layers to different GPUs. Node 1 handles layers 1-20, Node 2 handles layers 21-40, and so on. It communicates less frequently than tensor parallelism (only between stages, not at every layer), but introduces pipeline bubbles -- idle time when stages wait for data. PP is better suited for training workloads where micro-batching can fill the bubbles.
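The bubble cost has a standard idealized formula: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The sketch below applies that textbook approximation (not a DGX Spark measurement) to show why single-request inference suffers while micro-batched training does not:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of an ideal pipeline: (p - 1) / (m + p - 1),
    where p = number of stages and m = micro-batches per batch."""
    return (stages - 1) / (microbatches + stages - 1)

# Inference with a single sequence (m = 1): a 3-stage pipeline idles ~67% of the time
print(round(bubble_fraction(3, 1), 2))   # 0.67
# Training with micro-batching (m = 16) amortizes the bubble to ~11%
print(round(bubble_fraction(3, 16), 2))  # 0.11
```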
Key Points: Tensor & Pipeline Parallelism
Tensor parallelism (TP) splits individual layers across GPUs; requires all-reduce at every layer
Pipeline parallelism (PP) assigns groups of layers to different GPUs; communicates only between stages
TP2 delivers near-perfect 2x TPOT speedup because decode is memory-bandwidth-bound
TP4 enables 700B-class models (512 GB aggregate) with 103 tok/s for 4 concurrent users
NCCL all-reduce over RoCE keeps communication overhead low for TP configurations
Pipeline parallelism suits training or switchless 3-node topologies (PP=3/TP=1)
Post-Quiz: Tensor & Pipeline Parallelism
1. How does tensor parallelism distribute a neural network across multiple GPUs?
A) It assigns entire layers to different GPUs sequentially
B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication
C) It replicates the full model on every GPU and averages outputs
D) It distributes different training data batches to each GPU
2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?
A) The decode phase is compute-bound, and TP2 doubles compute capacity
B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth
C) The 200 Gb/s link has zero latency overhead
D) TP2 eliminates the need for NCCL communication entirely
3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?
A) Pipeline parallelism requires more network bandwidth
B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages
C) Pipeline parallelism cannot run on DGX Spark hardware
D) Pipeline parallelism doubles memory usage on each node
4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?
A) Broadcast
B) Scatter
C) All-reduce
D) Gather
5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?
A) 37 tok/s total
B) 200 tok/s total
C) 103 tok/s total
D) 500 tok/s total
Pre-Quiz: Speculative Decoding, FlashAttention & Kernel Fusion
1. How does speculative decoding accelerate LLM token generation?
A) It uses a larger model to generate tokens faster
B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass
C) It skips attention computation for common tokens
D) It compresses the vocabulary to reduce computation
2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?
A) 1.1-1.2x
B) 2-3x
C) 5-10x
D) No speedup; it only improves quality
3. What is FlashAttention's key innovation compared to standard attention?
A) It uses a different mathematical formula for attention
B) It processes attention in tiles without materializing the full N-by-N attention matrix
C) It replaces softmax with a linear approximation
D) It skips attention for tokens beyond position 1024
4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?
A) 1.1x
B) 2.25x
C) 4x
D) 10x
5. What does kernel fusion accomplish in the inference pipeline?
A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead
B) It merges multiple GPU cores into a single super-core
C) It fuses CPU and GPU execution into a single pipeline
D) It combines multiple models into a single architecture
Speculative Decoding
Speculative decoding uses a small, fast "draft" model to propose multiple token candidates (typically 3-12 tokens ahead), which the larger "target" model then verifies in a single forward pass. The key insight: verifying multiple tokens simultaneously is nearly as fast as generating a single token because verification can be parallelized within one forward pass.
On Blackwell GPUs, speculative decoding achieves 2-3x speedups (vs ~1.5x on Hopper). The most dramatic result: DFlash speculative decoding on Blackwell 6000 Pro reached ~429.69 tokens/s -- a 4.8x increase over the baseline 90.20 tokens/s.
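The propose-and-verify loop can be sketched with toy stand-ins for the two models (deterministic functions here, not real LLMs -- `target_next` and the mostly-agreeing `draft_next` below are illustrative assumptions). Because the target checks every drafted token against its own prediction, the output is identical to plain greedy decoding, just produced in fewer target passes:

```python
def target_next(ctx):
    """Stand-in for the large target model: a deterministic next-token rule."""
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):
    """Stand-in for the small draft model: agrees with the target except
    when the target token is divisible by 5."""
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 50

def speculative_step(ctx, k=4):
    """One round: the draft proposes k tokens autoregressively; the target
    verifies them (one forward pass in a real system) and keeps the longest
    agreeing prefix, appending its own correction on a mismatch."""
    c, proposal = list(ctx), []
    for _ in range(k):
        proposal.append(draft_next(c))
        c.append(proposal[-1])
    c, accepted = list(ctx), []
    for tok in proposal:
        expect = target_next(c)
        if tok != expect:
            accepted.append(expect)  # correction; discard the rest of the draft
            break
        accepted.append(tok)
        c.append(tok)
    return accepted  # always >= 1 token per target pass

out, passes = [], 0
while len(out) < 12:
    out.extend(speculative_step([1, 2, 3] + out))
    passes += 1
print(len(out), passes)  # 12 tokens in 3 target passes -- same output as greedy decoding
```

In a real deployment the acceptance rate depends on how well the draft model tracks the target; the speedup is roughly (accepted tokens per round) discounted by the draft's own cost.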
FlashAttention
FlashAttention processes attention in tiles without materializing the full N-by-N attention matrix, reducing memory usage from O(N^2) to O(N). On Blackwell, optimized FlashInfer kernels deliver up to 2.25x higher decode throughput, achieving 85-90% tensor core utilization.
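The tiling trick can be demonstrated for a single query in plain Python: the tiled version visits K/V one tile at a time and keeps only a running max, normalizer, and weighted sum -- never the full length-N score vector -- yet matches the reference computation. This is an illustrative sketch of the online-softmax idea, not the actual CUDA kernel:

```python
import math

def attention_full(q, K, V):
    """Reference attention for one query: softmax over all scores,
    with the full length-N score vector materialized."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def attention_tiled(q, K, V, tile=2):
    """FlashAttention-style online softmax: carry only a running max m,
    normalizer z, and weighted accumulator acc across tiles."""
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        for k, v in zip(K[start:start + tile], V[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            scale = math.exp(m - m_new)  # 0.0 on the first element (m = -inf)
            w = math.exp(s - m_new)
            z = z * scale + w
            acc = [a * scale + w * vj for a, vj in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]

q = [0.1, 0.4]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.9]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
full, tiled = attention_full(q, K, V), attention_tiled(q, K, V)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```

The rescale-by-`exp(m - m_new)` step is what lets each tile be processed with only O(1) extra state per output element, which is why the N-by-N matrix never needs to exist in memory.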
Kernel Fusion
Kernel fusion combines LayerNorm, matrix multiplications, activations, and bias additions into single CUDA kernels, eliminating intermediate tensor writes and launch overhead. TensorRT-LLM applies these fusions automatically, achieving 4x throughput over native PyTorch.
Pre-Quiz: Quantization & Prefill/Decode
1. How much memory does NVFP4 quantization save compared to FP16?
A) 2x reduction (50% savings)
B) 4x reduction (75% savings)
C) 8x reduction (87.5% savings)
D) No savings; NVFP4 only speeds up computation
2. What is the fundamental difference between the prefill and decode phases of LLM inference?
A) Prefill uses GPU while decode uses CPU
B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time)
C) Prefill handles text while decode handles images
D) There is no difference; both phases have the same bottleneck
3. Why is quantization especially impactful for the decode phase on DGX Spark?
A) Decode is compute-bound, and quantization reduces computation
B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token
C) Quantization allows the decode phase to run on CPU instead of GPU
D) Quantization eliminates the decode phase entirely
4. What is continuous batching?
A) Processing all requests in a single large batch that runs to completion
B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode
C) Batching requests by their input length for uniform processing
D) Running multiple copies of the model in parallel
5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?
A) 1 TB/s; roughly half of HBM3e
B) 273 GB/s; roughly 1/30 of HBM3e (~8 TB/s)
C) 8 TB/s; equal to HBM3e
D) 50 GB/s; roughly 1/100 of HBM3e
Quantization: FP4/FP8/INT8
| Format | Bits | Memory vs FP16 | Throughput Impact | Accuracy Impact |
| --- | --- | --- | --- | --- |
| FP16/BF16 | 16 | 1x (baseline) | Baseline | Full precision |
| FP8 | 8 | 0.5x | ~1.5-2x speedup | Minimal |
| INT8 (AWQ) | 8 | 0.5x | ~1.5-2x speedup | Small; calibration-dependent |
| NVFP4 | 4 | 0.25x | ~2.5x speedup | Noticeable on edge cases |
| INT4 | 4 | 0.25x | ~2.5x speedup | Moderate; task-dependent |
Quantization is often a necessity on DGX Spark. A 200B-parameter model in BF16 requires ~400 GB, far exceeding the 128 GB per node. At 4-bit precision, the same model fits in ~100 GB, making single-node inference possible. NVFP4 is specifically optimized for Blackwell's tensor cores.
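That arithmetic is worth making explicit: with parameters counted in billions, gigabytes fall out directly. A back-of-envelope sketch (weights only; KV-cache and activation overhead are ignored):

```python
def weight_gb(params_billion, bits):
    """Weight footprint in GB: billions of parameters x bytes per parameter."""
    return params_billion * bits / 8

print(weight_gb(200, 16))            # 400.0 -- a 200B model in BF16 overflows one 128 GB node
print(weight_gb(200, 4))             # 100.0 -- at 4-bit the same model fits on a single node
print(weight_gb(397, 4) <= 4 * 128)  # True -- Qwen3.5-397B at 4-bit fits in a 4-node 512 GB cluster
```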
Prefill vs Decode Phases
| Phase | Bottleneck | Best Optimizations | DGX Spark Behavior |
| --- | --- | --- | --- |
| Prefill | Compute (FLOPS) | Larger batches, kernel fusion, FlashAttention | Fast; GPU cores well-utilized |
| Decode | Memory bandwidth | Quantization, speculative decoding, multi-node TP | Slow; limited by 273 GB/s LPDDR5X |
Continuous batching allows new requests to enter the batch as soon as a slot opens, overlapping prefill for new requests with decode for existing ones. DGX Spark's 273 GB/s LPDDR5X bandwidth (far below HBM3e's ~8 TB/s) makes quantization and multi-node TP the highest-impact interventions for decode performance.
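A toy simulation makes the benefit concrete. The sketch below uses hypothetical request lengths and counts decode steps only (prefill is ignored), comparing static batching, where a batch runs until its longest request finishes, against continuous batching, where a freed slot is refilled on the very next step:

```python
def static_batch_steps(lengths, slots):
    """Static batching: requests are grouped in arrival order, and each
    batch occupies the GPU until its longest member finishes."""
    steps, queue = 0, list(lengths)
    while queue:
        batch, queue = queue[:slots], queue[slots:]
        steps += max(batch)
    return steps

def continuous_batch_steps(lengths, slots):
    """Continuous batching: one decode step per iteration; a finished
    request frees its slot, which the next waiting request takes."""
    queue = list(lengths)
    active = [queue.pop(0) for _ in range(min(slots, len(queue)))]
    steps = 0
    while active:
        steps += 1
        active = [r - 1 for r in active if r > 1]  # finished requests drop out
        while queue and len(active) < slots:
            active.append(queue.pop(0))
    return steps

# Six requests with varying generation lengths, two batch slots
reqs = [8, 2, 2, 2, 2, 8]
print(static_batch_steps(reqs, 2), continuous_batch_steps(reqs, 2))  # 18 16
```

The gap widens as length variance and concurrency grow; real schedulers like vLLM's also interleave prefill work into freed slots, which this sketch omits.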
Key Points: Quantization & Prefill/Decode
NVFP4 reduces memory 4x vs FP16 and delivers ~2.5x throughput, optimized for Blackwell tensor cores
Prefill is compute-bound; decode is memory-bandwidth-bound at 273 GB/s LPDDR5X
Decode benefits most from quantization (fewer bytes per weight) and multi-node TP (additive bandwidth)
Continuous batching overlaps prefill and decode to keep the GPU busy
All three optimizations (speculative decoding, FlashAttention, quantization) are multiplicative
Post-Quiz: Speculative Decoding, FlashAttention & Kernel Fusion
1. How does speculative decoding accelerate LLM token generation?
A) It uses a larger model to generate tokens faster
B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass
C) It skips attention computation for common tokens
D) It compresses the vocabulary to reduce computation
2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?
A) 1.1-1.2x
B) 2-3x
C) 5-10x
D) No speedup; it only improves quality
3. What is FlashAttention's key innovation compared to standard attention?
A) It uses a different mathematical formula for attention
B) It processes attention in tiles without materializing the full N-by-N attention matrix
C) It replaces softmax with a linear approximation
D) It skips attention for tokens beyond position 1024
4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?
A) 1.1x
B) 2.25x
C) 4x
D) 10x
5. What does kernel fusion accomplish in the inference pipeline?
A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead
B) It merges multiple GPU cores into a single super-core
C) It fuses CPU and GPU execution into a single pipeline
D) It combines multiple models into a single architecture
Post-Quiz: Quantization & Prefill/Decode
1. How much memory does NVFP4 quantization save compared to FP16?
A) 2x reduction (50% savings)
B) 4x reduction (75% savings)
C) 8x reduction (87.5% savings)
D) No savings; NVFP4 only speeds up computation
2. What is the fundamental difference between the prefill and decode phases of LLM inference?
A) Prefill uses GPU while decode uses CPU
B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time)
C) Prefill handles text while decode handles images
D) There is no difference; both phases have the same bottleneck
3. Why is quantization especially impactful for the decode phase on DGX Spark?
A) Decode is compute-bound, and quantization reduces computation
B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token
C) Quantization allows the decode phase to run on CPU instead of GPU
D) Quantization eliminates the decode phase entirely
4. What is continuous batching?
A) Processing all requests in a single large batch that runs to completion
B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode
C) Batching requests by their input length for uniform processing
D) Running multiple copies of the model in parallel
5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?
A) 1 TB/s; roughly half of HBM3e
B) 273 GB/s; roughly 1/30 of HBM3e (~8 TB/s)
C) 8 TB/s; equal to HBM3e
D) 50 GB/s; roughly 1/100 of HBM3e
Pre-Quiz: Profiling & Memory Management
1. What is the primary difference between Nsight Systems and Nsight Compute?
A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only
B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail
C) Nsight Systems is free; Nsight Compute requires a license
D) They are the same tool with different names
2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?
A) 95-100%
B) 55-60%, with a contention floor around 80-90 GB/s
C) 10-20%
D) Bandwidth is not measurable during decode
3. What does the vLLM parameter --gpu-memory-utilization control?
A) The GPU clock speed during inference
B) The fraction of GPU memory available for model weights and KV-cache combined
C) The number of GPU cores used for computation
D) The power consumption limit of the GPU
4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?
A) Smaller tiles use less memory bandwidth
B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism
C) Smaller tiles are compatible with LPDDR5X but larger tiles are not
D) NVIDIA artificially limits tile sizes on desktop GPUs
5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?
A) Upgrading to faster memory modules
B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x)
C) Overclocking the GPU
D) Using CPU offloading for model weights
Nsight Systems and Nsight Compute
Nsight Systems provides a system-wide view of GPU activity, CPU activity, memory transfers, and kernel execution timelines. Nsight Compute provides kernel-level detail: occupancy, memory throughput, instruction throughput, and roofline proximity.
On DGX Spark, profiling reveals that during decode, bandwidth utilization drops to 55-60% of the 273 GB/s peak, with a contention floor around 80-90 GB/s due to concurrent memory accesses (weights, KV-cache, activations).
Memory Bandwidth: The Fundamental Bottleneck
DGX Spark's GB10 SoC uses LPDDR5X at 8533 MT/s, delivering approximately 273 GB/s peak bandwidth. For a 70B-parameter model in FP16, reading all weights (~140 GB) takes ~0.51 seconds at peak bandwidth -- this sets a hard floor on per-token latency.
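That floor follows from a one-line calculation: each generated token must stream every weight byte from memory at least once, so bandwidth divided by weight bytes bounds tokens per second. A quick sketch of the 70B/FP16 case from above:

```python
def decode_floor(params_billion, bits, bandwidth_gbs=273.0):
    """Bandwidth-bound decode floor: seconds per token >= weight_GB / bandwidth,
    since every token generation streams all weights from memory."""
    weight_gb = params_billion * bits / 8
    sec_per_token = weight_gb / bandwidth_gbs
    return sec_per_token, 1.0 / sec_per_token

sec, tps = decode_floor(70, 16)  # 70B in FP16: ~140 GB of weights
print(round(sec, 2))  # 0.51 s/token at peak bandwidth -- under a 2 tokens/s ceiling
```

This is an upper bound on throughput, not a prediction: measured bandwidth sits below peak (see the table below), so real single-stream numbers are lower still.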
| Workload | Measured Bandwidth | Tokens/sec | Notes |
| --- | --- | --- | --- |
| 35B-A3B MoE (BF16, TP1) | 178 GB/s (weight reads) | 30.3 | MoE routing creates bursty patterns |
| Llama 3B (BF16, FlashAttn-2) | Near peak | 14-20 | ~25W power draw at 95% GPU util |
| General 200B (4-bit) | ~273 GB/s limit | 34-38 | Capacity-for-latency trade-off |
Mitigation strategies, ranked by impact:
Quantize aggressively: FP16 to NVFP4 = 4x fewer bytes per weight
Scale to multiple nodes: Each node adds 273 GB/s bandwidth
Use sparse MoE models: 35B MoE with 3B active reads only ~6 GB per step
The KV-cache stores key and value tensors from previous tokens. It grows proportionally with sequence length and competes directly with model weight storage in the 128 GB unified memory. vLLM manages this via --gpu-memory-utilization (default 0.9; reduce to 0.7 for multi-node long-context serving).
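The growth is easy to quantify. The sketch below uses a hypothetical 70B-class shape (80 layers, 8 KV heads with GQA, head dimension 128 -- illustrative numbers, not taken from this document) to show how long contexts eat into the 128 GB budget and why INT8/FP8 KV-cache halves the cost:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """KV-cache footprint: 2 (K and V) x layers x kv_heads x head_dim
    x seq_len x bytes per element, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch / 1e9

# Hypothetical 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128
print(round(kv_cache_gb(80, 8, 128, 32768), 2))     # 10.74 GB at FP16, 32k context
print(round(kv_cache_gb(80, 8, 128, 32768, 1), 2))  # 5.37 GB with INT8/FP8 KV-cache
```

Multiply by the number of concurrent sequences and the cache can rival the quantized weights themselves, which is why long-context multi-user serving is where --gpu-memory-utilization tuning matters most.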
| Technique | Memory Savings | Trade-off |
| --- | --- | --- |
| KV-cache quantization (INT8/FP8) | 50% | Marginal accuracy impact |
| Prefix caching | Variable | Effective only with repeated system prompts |
| Sliding window attention | Proportional to window | Limits effective context length |
| Sparse MoE model selection | Indirect | Architecture-dependent |
Data Locality and NUMA-Aware Scheduling
Despite the unified memory architecture (Grace CPU + Blackwell GPU on NVLink-C2C), data locality still matters: memory pages closer to the GPU's memory controllers are accessed with lower latency. NUMA-aware scheduling therefore involves:
Pinning vLLM and NCCL processes to the correct NUMA node
Loading model weights into GPU-local memory regions
Avoiding memory migrations during inference
Optimal tile sizes for DGX Spark: 64x64 with occupancy 2 for the GB10's 48 SMs at SM 12.1 compute capability. Larger tiles that perform well on datacenter GPUs underperform on DGX Spark.
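A toy occupancy model illustrates why. Real occupancy depends on registers and shared memory, but simply counting tiles against concurrent block slots (SMs x blocks per SM, matching the "occupancy 2" above) already shows how large tiles starve a 48-SM part on decode-shaped (small-m) GEMMs. All shapes below are illustrative assumptions:

```python
import math

def tile_grid_utilization(m, n, tile, sms=48, blocks_per_sm=2):
    """Fraction of concurrent block slots (SMs x blocks per SM) that a
    tiled m x n GEMM output grid can actually fill."""
    tiles = math.ceil(m / tile) * math.ceil(n / tile)
    return min(1.0, tiles / (sms * blocks_per_sm))

# A decode-shaped GEMM output (small m, wide n) on the GB10 (48 SMs):
print(tile_grid_utilization(128, 4096, 64))   # 1.0 -- 128 tiles cover all 96 slots
print(tile_grid_utilization(128, 4096, 128))  # ~0.33 -- only 32 tiles for 96 slots
```

On datacenter GPUs with far more SMs and larger workloads per kernel, bigger tiles amortize better; on the GB10 the smaller grid is what keeps the machine busy.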
flowchart TD
A["Run vLLM with representative workload"] --> B["Capture Nsight Systems trace"]
B --> C["Identify longest-running CUDA kernels"]
C --> D{"Kernel bottleneck type?"}
D -->|"Compute-bound (prefill)"| E["Increase arithmetic intensity"]
D -->|"Memory-bound (decode)"| F["Reduce bytes per operation"]
E --> G["Optimize batch size and occupancy"]
F --> H["Apply quantization (NVFP4)"]
F --> I["Fuse kernels to eliminate writes"]
G --> J["Re-profile and validate"]
H --> J
I --> J
J --> K{"Target throughput met?"}
K -->|No| C
K -->|Yes| L["Deploy optimized configuration"]
Key Points: Profiling & Memory Management
Nsight Systems for system-wide timeline; Nsight Compute for per-kernel roofline analysis
DGX Spark achieves 55-60% of 273 GB/s peak during decode; contention floor ~80-90 GB/s
273 GB/s LPDDR5X is roughly 1/30 of HBM3e's ~8 TB/s -- the dominant bottleneck for LLM inference
KV-cache competes with model weights for the 128 GB unified memory; manage via gpu-memory-utilization
Use 64x64 tile sizes for GB10's 48 SMs; datacenter GPU tile sizes underperform
NUMA-aware scheduling and CUDA graph capture reduce overhead for production deployments
Post-Quiz: Profiling & Memory Management
1. What is the primary difference between Nsight Systems and Nsight Compute?
A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only
B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail
C) Nsight Systems is free; Nsight Compute requires a license
D) They are the same tool with different names
2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?
A) 95-100%
B) 55-60%, with a contention floor around 80-90 GB/s
C) 10-20%
D) Bandwidth is not measurable during decode
3. What does the vLLM parameter --gpu-memory-utilization control?
A) The GPU clock speed during inference
B) The fraction of GPU memory available for model weights and KV-cache combined
C) The number of GPU cores used for computation
D) The power consumption limit of the GPU
4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?
A) Smaller tiles use less memory bandwidth
B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism
C) Smaller tiles are compatible with LPDDR5X but larger tiles are not
D) NVIDIA artificially limits tile sizes on desktop GPUs
5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?
A) Upgrading to faster memory modules
B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x)
C) Overclocking the GPU
D) Using CPU offloading for model weights