Chapter 3: Multi-Node Networking, Scaling & Performance Optimization

Learning Objectives

Section 1: ConnectX-7 SmartNIC & Network Architecture

Pre-Quiz: ConnectX-7 Capabilities & Direct Connections

1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?

A) RDMA encrypts data in transit, improving security B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency C) RDMA compresses data before transmission, reducing bandwidth usage D) RDMA allows wireless communication between GPUs

2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?

A) 100 Gb/s B) 400 Gb/s C) 200 Gb/s D) 50 Gb/s

3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?

A) Through a managed Ethernet switch with VLANs B) Via a direct QSFP cable between ConnectX-7 ports C) Using InfiniBand Host Channel Adapters D) Through a Wi-Fi mesh network

4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?

A) It allows GPUs to share their compute cores across nodes B) It transfers data directly between GPU memory on different nodes without CPU staging C) It doubles the clock speed of the GPU during network transfers D) It compresses model weights before sending them across the network

5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?

A) DHCP with automatic IP assignment B) IPv6 link-local addresses C) Static IP addresses on the ConnectX-7 interfaces D) mDNS/Bonjour for automatic discovery

ConnectX-7 200 Gb/s Network Interface and RDMA Support

Each DGX Spark ships with an NVIDIA ConnectX-7 SmartNIC, a dedicated networking processor that handles 200 Gb/s Ethernet traffic. The ConnectX-7 supports RDMA over Converged Ethernet (RoCE), which allows one machine's GPU to read from or write to another machine's memory directly, bypassing the operating system's network stack entirely.

In a traditional network transfer, data passes through multiple copies: application to kernel buffer, kernel buffer to NIC send buffer, across the wire, then the reverse on the receiving side. Each copy adds latency. RDMA eliminates these intermediate copies -- the NIC reads directly from GPU memory on one node and writes directly into GPU memory on the other. For AI workloads where nodes synchronize tensor data millions of times during inference, this difference is transformative.

Each DGX Spark node exposes two QSFP ports through the ConnectX-7, providing four RoCE interfaces total across the two physical ports.

sequenceDiagram
  participant App1 as Application (Node 1)
  participant K1 as OS Kernel (Node 1)
  participant NIC1 as ConnectX-7 NIC (Node 1)
  participant NIC2 as ConnectX-7 NIC (Node 2)
  participant K2 as OS Kernel (Node 2)
  participant App2 as Application (Node 2)
  Note over App1,App2: Traditional Network Transfer (multiple copies)
  App1->>K1: Copy data to kernel buffer
  K1->>NIC1: Copy to NIC send buffer
  NIC1->>NIC2: Wire transfer
  NIC2->>K2: Copy to kernel buffer
  K2->>App2: Copy to application memory
  Note over App1,App2: RDMA Transfer (zero-copy)
  App1->>NIC1: NIC reads directly from GPU memory
  NIC1->>NIC2: Wire transfer (200 Gb/s RoCE)
  NIC2->>App2: NIC writes directly to GPU memory

Direct Two-Node Scaling

The simplest multi-node configuration connects two DGX Sparks with a direct QSFP cable -- no switch required. A 0.5-meter QSFP cable between the ConnectX-7 ports creates a point-to-point 200 Gb/s link.

Node   | Interface     | IP Address
Node 1 | enP2p1s0f1np1 | 192.168.100.10/24
Node 2 | enP2p1s0f1np1 | 192.168.100.11/24
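The addressing above can be applied with standard Linux `ip` commands. The sketch below is for Node 1, using the interface name and IP from the table and the 9000-byte jumbo-frame MTU recommended later in this chapter; names may differ on your hardware:

```shell
# Bring up the ConnectX-7 interface with jumbo frames and a static IP
# (Node 1; mirror with 192.168.100.11/24 on Node 2).
sudo ip link set enP2p1s0f1np1 up mtu 9000
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1

# Verify the point-to-point link once both nodes are configured
ping -c 3 192.168.100.11
```

Note these settings do not persist across reboots; make them permanent with your distribution's network configuration (e.g., netplan on Ubuntu).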

Key Points: ConnectX-7 & Direct Connections

Pre-Quiz: Four-Node Topologies & Network Configuration

1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?

A) DGX Spark only has one network port B) Each node needs a path to every other node, and a switch provides that star topology C) The QSFP cables are too short for daisy-chaining D) The ConnectX-7 firmware doesn't support more than one connection

2. How much unified GPU memory does a four-node DGX Spark cluster provide?

A) 256 GB B) 128 GB C) 512 GB D) 1 TB

3. What is the recommended MTU setting for DGX Spark cluster network interfaces?

A) 1500 bytes (standard Ethernet) B) 4096 bytes C) 9000 bytes (jumbo frames) D) 65535 bytes (maximum IP packet)

4. What parallelism configuration does the three-node switchless mesh topology use?

A) TP=3, PP=1 B) PP=3, TP=1 C) TP=2, PP=2 D) TP=1, PP=1 with data parallelism

5. What is the minimum NCCL version required for DGX Spark multi-node operation?

A) v2.20.0 B) v2.28.3 C) v3.0.0 D) v2.25.1

Four-Node Cluster Topologies

Scaling beyond two nodes requires an Ethernet switch. Each node needs a path to every other node, and a switch provides that star topology. Community-tested switches include the MikroTik CRS804-4DDQ and CRS812, both supporting 200 GbE QSFP connections.

A four-node cluster provides 512 GB of unified memory (4 x 128 GB), enough to host models up to the 700-billion-parameter class, including the 397-billion-parameter Qwen3.5-397B benchmarked later in this chapter. For three-node clusters an alternative exists: a switchless mesh topology using PP=3/TP=1, in which each node connects directly to the other two.

Interactive: Multi-Node Cluster Topology with Data Flow

[Interactive diagram: two-node direct-cable topology versus four-node switch-based topology. In the two-node layout, two DGX Sparks (128 GB unified memory and a ConnectX-7 SmartNIC each) are linked by a single 200 Gb/s QSFP cable, for 256 GB total. In the four-node layout, a 200 GbE Ethernet switch (e.g., MikroTik CRS804-4DDQ) forms a star topology connecting four nodes, for 512 GB total, supporting 700B+ parameter models.]

Network Configuration, MTU Tuning, and GPUDirect RDMA

Configuration Step | Recommendation | Purpose
MTU                | 9000 bytes (jumbo frames) | Reduces per-packet overhead for large tensor transfers
IP addressing      | Static IPs on CX-7 interfaces | Eliminates DHCP latency; deterministic routing
GPUDirect RDMA     | Enable via NVIDIA drivers | Allows the NIC to access GPU memory directly without CPU staging
NCCL version       | v2.28.3 or later | Required collective communication library for multi-node GPU ops
OS requirements    | Ubuntu 24.04+ with current NVIDIA drivers | Baseline software environment for DGX Spark clustering
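For multi-node NCCL jobs on top of this configuration, a few environment variables steer traffic onto the ConnectX-7 link. The variable names below are standard NCCL settings, but the interface and device names are illustrative and must match your own system:

```shell
# Pin NCCL to the ConnectX-7 interface and RoCE device
export NCCL_SOCKET_IFNAME=enP2p1s0f1np1  # CX-7 interface from the table above
export NCCL_IB_HCA=mlx5                  # ConnectX-7 RoCE device prefix
export NCCL_DEBUG=INFO                   # log transport selection; confirm RoCE/GPUDirect is in use
```

With `NCCL_DEBUG=INFO`, the startup log shows which transport NCCL selected, a quick way to catch a cluster silently falling back to plain TCP.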

Key Points: Four-Node Topologies & Configuration

Post-Quiz: ConnectX-7 Capabilities & Direct Connections

1. What is the primary advantage of RDMA over traditional network transfers for distributed GPU workloads?

A) RDMA encrypts data in transit, improving security B) RDMA eliminates intermediate memory copies and OS kernel involvement, reducing latency C) RDMA compresses data before transmission, reducing bandwidth usage D) RDMA allows wireless communication between GPUs

2. What bandwidth does each ConnectX-7 SmartNIC provide on the DGX Spark?

A) 100 Gb/s B) 400 Gb/s C) 200 Gb/s D) 50 Gb/s

3. What is the simplest way to connect two DGX Spark nodes for multi-node inference?

A) Through a managed Ethernet switch with VLANs B) Via a direct QSFP cable between ConnectX-7 ports C) Using InfiniBand Host Channel Adapters D) Through a Wi-Fi mesh network

4. Why is GPUDirect RDMA critical for tensor parallelism across DGX Spark nodes?

A) It allows GPUs to share their compute cores across nodes B) It transfers data directly between GPU memory on different nodes without CPU staging C) It doubles the clock speed of the GPU during network transfers D) It compresses model weights before sending them across the network

5. What IP addressing scheme is recommended for ConnectX-7 interfaces in a DGX Spark cluster?

A) DHCP with automatic IP assignment B) IPv6 link-local addresses C) Static IP addresses on the ConnectX-7 interfaces D) mDNS/Bonjour for automatic discovery
Post-Quiz: Four-Node Topologies & Network Configuration

1. Why can't you daisy-chain more than two DGX Spark nodes without a switch?

A) DGX Spark only has one network port B) Each node needs a path to every other node, and a switch provides that star topology C) The QSFP cables are too short for daisy-chaining D) The ConnectX-7 firmware doesn't support more than one connection

2. How much unified GPU memory does a four-node DGX Spark cluster provide?

A) 256 GB B) 128 GB C) 512 GB D) 1 TB

3. What is the recommended MTU setting for DGX Spark cluster network interfaces?

A) 1500 bytes (standard Ethernet) B) 4096 bytes C) 9000 bytes (jumbo frames) D) 65535 bytes (maximum IP packet)

4. What parallelism configuration does the three-node switchless mesh topology use?

A) TP=3, PP=1 B) PP=3, TP=1 C) TP=2, PP=2 D) TP=1, PP=1 with data parallelism

5. What is the minimum NCCL version required for DGX Spark multi-node operation?

A) v2.20.0 B) v2.28.3 C) v3.0.0 D) v2.25.1

Section 2: Distributed AI — Tensor & Pipeline Parallelism

Pre-Quiz: Tensor & Pipeline Parallelism

1. How does tensor parallelism distribute a neural network across multiple GPUs?

A) It assigns entire layers to different GPUs sequentially B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication C) It replicates the full model on every GPU and averages outputs D) It distributes different training data batches to each GPU

2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?

A) The decode phase is compute-bound, and TP2 doubles compute capacity B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth C) The 200 Gb/s link has zero latency overhead D) TP2 eliminates the need for NCCL communication entirely

3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?

A) Pipeline parallelism requires more network bandwidth B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages C) Pipeline parallelism cannot run on DGX Spark hardware D) Pipeline parallelism doubles memory usage on each node

4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?

A) Broadcast B) Scatter C) All-reduce D) Gather

5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?

A) 37 tok/s total B) 200 tok/s total C) 103 tok/s total D) 500 tok/s total

Tensor Parallelism (TP)

Tensor parallelism splits individual layers of a neural network across multiple GPUs. Each GPU computes a portion of every matrix multiplication, then partial results are combined via an all-reduce communication step. On DGX Spark, TP2 means the model is split across two nodes (256 GB total); TP4 means four nodes (512 GB total).

A practical TP2 deployment with vLLM:

vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 32768
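Once the server is up, it exposes vLLM's OpenAI-compatible API on the chosen port. A quick smoke test (the prompt and token limit are arbitrary):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": "Write a hello-world in C."}],
        "max_tokens": 128
      }'
```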

Interactive: Tensor Parallelism All-Reduce Data Flow (TP2)

[Interactive diagram: input token embeddings are broadcast to both nodes. Node 1 (128 GB) holds weight shard A, the first half of each layer's weight matrices; Node 2 (128 GB) holds shard B, the other half. Each node computes a partial matmul, then an NCCL all-reduce over the 200 Gb/s RoCE link -- every GPU sends its partial results to every other GPU and receives the combined sum -- produces the layer output that feeds the next transformer layer.]

Pipeline Parallelism (PP)

Pipeline parallelism assigns entire groups of layers to different GPUs. Node 1 handles layers 1-20, Node 2 handles layers 21-40, and so on. It communicates less frequently than tensor parallelism (only between stages, not at every layer), but introduces pipeline bubbles -- idle time when stages wait for data. PP is better suited for training workloads where micro-batching can fill the bubbles.

flowchart LR
  Input["Input Tokens"] --> N1
  subgraph N1["Node 1: Layers 1-20"]
    L1["Process Layers 1-20"]
  end
  subgraph N2["Node 2: Layers 21-40"]
    L2["Process Layers 21-40"]
  end
  subgraph N3["Node 3: Layers 41-60"]
    L3["Process Layers 41-60"]
  end
  N1 -->|"Send activations"| N2
  N2 -->|"Send activations"| N3
  N3 --> Output["Output Tokens"]

Scaling Benchmarks

Llama 3.3 70B NVFP4 (32K input, 1K output, batch=1):

Metric                       | 1-Node (TP1) | 2-Node (TP2) | Speedup
Time to First Token (TTFT)   | 33,415 ms    | 21,384 ms    | 1.56x
Time Per Output Token (TPOT) | 269 ms       | 133 ms       | 2.02x

Four-node TP4 results for Qwen3.5-397B-INT4:

Scenario                       | Throughput
Single user, 4-node TP4        | 37 tok/s
4 concurrent users, 4-node TP4 | 103 tok/s (total)

Key Points: Tensor & Pipeline Parallelism

Post-Quiz: Tensor & Pipeline Parallelism

1. How does tensor parallelism distribute a neural network across multiple GPUs?

A) It assigns entire layers to different GPUs sequentially B) It splits individual layers horizontally so each GPU computes a portion of every matrix multiplication C) It replicates the full model on every GPU and averages outputs D) It distributes different training data batches to each GPU

2. Why does TP2 achieve near-perfect 2x speedup for the decode phase on DGX Spark?

A) The decode phase is compute-bound, and TP2 doubles compute capacity B) The decode phase is memory-bandwidth-bound, and TP2 doubles available bandwidth C) The 200 Gb/s link has zero latency overhead D) TP2 eliminates the need for NCCL communication entirely

3. What is the main disadvantage of pipeline parallelism compared to tensor parallelism for inference?

A) Pipeline parallelism requires more network bandwidth B) Pipeline parallelism introduces pipeline bubbles -- idle time as data flows through stages C) Pipeline parallelism cannot run on DGX Spark hardware D) Pipeline parallelism doubles memory usage on each node

4. What NCCL operation is critical for synchronizing partial results in tensor parallelism?

A) Broadcast B) Scatter C) All-reduce D) Gather

5. What throughput did the four-node TP4 cluster achieve for Qwen3.5-397B with 4 concurrent users?

A) 37 tok/s total B) 200 tok/s total C) 103 tok/s total D) 500 tok/s total

Section 3: Inference Optimization Techniques

Pre-Quiz: Speculative Decoding & Attention Optimization

1. How does speculative decoding accelerate LLM token generation?

A) It uses a larger model to generate tokens faster B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass C) It skips attention computation for common tokens D) It compresses the vocabulary to reduce computation

2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?

A) 1.1-1.2x B) 2-3x C) 5-10x D) No speedup; it only improves quality

3. What is FlashAttention's key innovation compared to standard attention?

A) It uses a different mathematical formula for attention B) It processes attention in tiles without materializing the full N-by-N attention matrix C) It replaces softmax with a linear approximation D) It skips attention for tokens beyond position 1024

4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?

A) 1.1x B) 2.25x C) 4x D) 10x

5. What does kernel fusion accomplish in the inference pipeline?

A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead B) It merges multiple GPU cores into a single super-core C) It fuses CPU and GPU execution into a single pipeline D) It combines multiple models into a single architecture

Speculative Decoding

Speculative decoding uses a small, fast "draft" model to propose multiple token candidates (typically 3-12 tokens ahead), which the larger "target" model then verifies in a single forward pass. The key insight: verifying multiple tokens simultaneously is nearly as fast as generating a single token because verification can be parallelized within one forward pass.

On Blackwell GPUs, speculative decoding achieves 2-3x speedups (vs ~1.5x on Hopper). The most dramatic result: DFlash speculative decoding on Blackwell 6000 Pro reached ~429.69 tokens/s -- a 4.8x increase over the baseline 90.20 tokens/s.
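The "verify K tokens in one pass" arithmetic can be made concrete. Under the common simplifying assumption that each drafted token is accepted independently with probability p, the expected number of tokens emitted per target forward pass with draft length K is (1 - p^(K+1)) / (1 - p). The p = 0.8 below is illustrative, not a measured value:

```shell
# Expected tokens per target-model forward pass for draft length K=5
# and per-token acceptance probability p=0.8 (illustrative numbers).
awk 'BEGIN { p = 0.8; K = 5; print (1 - p^(K+1)) / (1 - p) }'
```

Roughly 3.7 tokens per forward pass under these assumptions, which is the mechanism behind the 2-3x speedups quoted above.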

Interactive: Speculative Decoding Draft-Verify-Accept Pipeline

[Interactive diagram: the draft-verify-accept pipeline. Step 1 (Draft): a small, fast draft model proposes K=5 candidate tokens. Step 2 (Verify): the large target model verifies all five candidates in a single forward pass. Step 3 (Accept/Correct): three tokens are accepted, one is rejected and corrected, one is skipped, yielding four output tokens from roughly one pass. Speedup comparison: without speculative decoding, one token per forward pass (~90 tok/s); with DFlash speculative decoding on Blackwell, ~4 tokens per pass (~430 tok/s, a 4.8x speedup).]

FlashAttention and Kernel Fusion

FlashAttention processes attention in tiles without materializing the full N-by-N attention matrix, reducing memory usage from O(N^2) to O(N). On Blackwell, optimized FlashInfer kernels deliver up to 2.25x higher decode throughput, achieving 85-90% tensor core utilization.

Kernel fusion combines LayerNorm, matrix multiplications, activations, and bias additions into single CUDA kernels, eliminating intermediate tensor writes. TensorRT-LLM applies these fusions automatically, achieving 4x throughput over native PyTorch.

Key Points: Speculative Decoding & Attention Optimization

Pre-Quiz: Quantization & Prefill/Decode Trade-offs

1. How much memory does NVFP4 quantization save compared to FP16?

A) 2x reduction (50% savings) B) 4x reduction (75% savings) C) 8x reduction (87.5% savings) D) No savings; NVFP4 only speeds up computation

2. What is the fundamental difference between the prefill and decode phases of LLM inference?

A) Prefill uses GPU while decode uses CPU B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time) C) Prefill handles text while decode handles images D) There is no difference; both phases have the same bottleneck

3. Why is quantization especially impactful for the decode phase on DGX Spark?

A) Decode is compute-bound, and quantization reduces computation B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token C) Quantization allows the decode phase to run on CPU instead of GPU D) Quantization eliminates the decode phase entirely

4. What is continuous batching?

A) Processing all requests in a single large batch that runs to completion B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode C) Batching requests by their input length for uniform processing D) Running multiple copies of the model in parallel

5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?

A) 1 TB/s; roughly half of HBM3e B) 273 GB/s; roughly 1/12 of HBM3e (~8 TB/s) C) 8 TB/s; equal to HBM3e D) 50 GB/s; roughly 1/100 of HBM3e

Quantization: FP4/FP8/INT8

Format     | Bits | Memory vs FP16 | Throughput Impact | Accuracy Impact
FP16/BF16  | 16   | 1x (baseline)  | Baseline          | Full precision
FP8        | 8    | 0.5x           | ~1.5-2x speedup   | Minimal
INT8 (AWQ) | 8    | 0.5x           | ~1.5-2x speedup   | Small; calibration-dependent
NVFP4      | 4    | 0.25x          | ~2.5x speedup     | Noticeable on edge cases
INT4       | 4    | 0.25x          | ~2.5x speedup     | Moderate; task-dependent

Quantization is often a necessity on DGX Spark. A 200B-parameter model in BF16 requires ~400 GB, far exceeding the 128 GB per node. At 4-bit precision, the same model fits in ~100 GB, making single-node inference possible. NVFP4 is specifically optimized for Blackwell's tensor cores.
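The footprint numbers in this paragraph follow directly from parameter count times bytes per weight:

```shell
# Memory needed for 200B parameters at each precision (weights only,
# ignoring KV-cache and activation overhead).
awk 'BEGIN {
  params = 200e9
  printf "BF16: %.0f GB\n", params * 2   / 1e9
  printf "FP8 : %.0f GB\n", params * 1   / 1e9
  printf "FP4 : %.0f GB\n", params * 0.5 / 1e9
}'
```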

Prefill vs Decode Phases

Phase   | Bottleneck       | Best Optimizations                                | DGX Spark Behavior
Prefill | Compute (FLOPS)  | Larger batches, kernel fusion, FlashAttention     | Fast; GPU cores well-utilized
Decode  | Memory bandwidth | Quantization, speculative decoding, multi-node TP | Slow; limited by 273 GB/s LPDDR5X

Continuous batching allows new requests to enter the batch as soon as a slot opens, overlapping prefill for new requests with decode for existing ones. DGX Spark's 273 GB/s LPDDR5X bandwidth (far below HBM3e's ~8 TB/s) makes quantization and multi-node TP the highest-impact interventions for decode performance.

Key Points: Quantization & Prefill/Decode

Post-Quiz: Speculative Decoding & Attention Optimization

1. How does speculative decoding accelerate LLM token generation?

A) It uses a larger model to generate tokens faster B) A small draft model proposes multiple tokens that the target model verifies in a single forward pass C) It skips attention computation for common tokens D) It compresses the vocabulary to reduce computation

2. What speedup does speculative decoding achieve on Blackwell GPUs like those in DGX Spark?

A) 1.1-1.2x B) 2-3x C) 5-10x D) No speedup; it only improves quality

3. What is FlashAttention's key innovation compared to standard attention?

A) It uses a different mathematical formula for attention B) It processes attention in tiles without materializing the full N-by-N attention matrix C) It replaces softmax with a linear approximation D) It skips attention for tokens beyond position 1024

4. How much higher decode throughput do FlashInfer kernels deliver on Blackwell compared to unoptimized implementations?

A) 1.1x B) 2.25x C) 4x D) 10x

5. What does kernel fusion accomplish in the inference pipeline?

A) It combines multiple CUDA kernels into one, eliminating intermediate tensor writes and launch overhead B) It merges multiple GPU cores into a single super-core C) It fuses CPU and GPU execution into a single pipeline D) It combines multiple models into a single architecture
Post-Quiz: Quantization & Prefill/Decode Trade-offs

1. How much memory does NVFP4 quantization save compared to FP16?

A) 2x reduction (50% savings) B) 4x reduction (75% savings) C) 8x reduction (87.5% savings) D) No savings; NVFP4 only speeds up computation

2. What is the fundamental difference between the prefill and decode phases of LLM inference?

A) Prefill uses GPU while decode uses CPU B) Prefill is compute-bound (processes all input tokens in parallel); decode is memory-bandwidth-bound (generates tokens one at a time) C) Prefill handles text while decode handles images D) There is no difference; both phases have the same bottleneck

3. Why is quantization especially impactful for the decode phase on DGX Spark?

A) Decode is compute-bound, and quantization reduces computation B) Decode is memory-bandwidth-bound, and quantization directly reduces bytes read per token C) Quantization allows the decode phase to run on CPU instead of GPU D) Quantization eliminates the decode phase entirely

4. What is continuous batching?

A) Processing all requests in a single large batch that runs to completion B) Allowing new requests to enter the batch as soon as a slot opens, overlapping prefill and decode C) Batching requests by their input length for uniform processing D) Running multiple copies of the model in parallel

5. What is the peak memory bandwidth of DGX Spark's LPDDR5X, and how does it compare to datacenter HBM3e?

A) 1 TB/s; roughly half of HBM3e B) 273 GB/s; roughly 1/12 of HBM3e (~8 TB/s) C) 8 TB/s; equal to HBM3e D) 50 GB/s; roughly 1/100 of HBM3e

Section 4: Profiling, Bottleneck Analysis & Memory Management

Pre-Quiz: Profiling & Memory Management

1. What is the primary difference between Nsight Systems and Nsight Compute?

A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail C) Nsight Systems is free; Nsight Compute requires a license D) They are the same tool with different names

2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?

A) 95-100% B) 55-60%, with a contention floor around 80-90 GB/s C) 10-20% D) Bandwidth is not measurable during decode

3. What does the vLLM parameter --gpu-memory-utilization control?

A) The GPU clock speed during inference B) The fraction of GPU memory available for model weights and KV-cache combined C) The number of GPU cores used for computation D) The power consumption limit of the GPU

4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?

A) Smaller tiles use less memory bandwidth B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism C) Smaller tiles are compatible with LPDDR5X but larger tiles are not D) NVIDIA artificially limits tile sizes on desktop GPUs

5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?

A) Upgrading to faster memory modules B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x) C) Overclocking the GPU D) Using CPU offloading for model weights

Nsight Systems and Nsight Compute

Nsight Systems provides a system-wide view of GPU activity, CPU activity, memory transfers, and kernel execution timelines. Nsight Compute provides kernel-level detail: occupancy, memory throughput, instruction throughput, and roofline proximity.

On DGX Spark, profiling reveals that during decode, bandwidth utilization drops to 55-60% of the 273 GB/s peak, with a contention floor around 80-90 GB/s due to concurrent memory accesses (weights, KV-cache, activations).
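A typical capture workflow, sketched with the standard Nsight CLIs (the benchmark script name is a placeholder for your own serving or benchmark entry point):

```shell
# System-wide timeline: CUDA kernels, NVTX ranges, OS runtime calls
nsys profile --trace=cuda,nvtx,osrt -o decode_timeline \
    python benchmark_decode.py

# Kernel-level drill-down on the hot kernels found in the timeline
ncu --set full -o decode_kernels \
    python benchmark_decode.py
```

Start with the Nsight Systems timeline to find the longest-running kernels, then scope Nsight Compute to just those; `--set full` collects every metric section and slows execution considerably.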

Memory Bandwidth: The Fundamental Bottleneck

DGX Spark's GB10 SoC uses LPDDR5X at 8533 MT/s, delivering approximately 273 GB/s peak bandwidth. For a 70B-parameter model in FP16, reading all weights (~140 GB) takes ~0.51 seconds at peak bandwidth -- this sets a hard floor on per-token latency.
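The 0.51-second figure is simple division: every decoded token must stream the full weight set from memory at least once, so bytes per token divided by bandwidth sets a hard latency floor:

```shell
# Per-token decode floor for a 70B FP16 model (~140 GB of weights)
# at DGX Spark's 273 GB/s peak bandwidth.
awk 'BEGIN { printf "%.2f s per token\n", 140 / 273 }'
```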

Workload                     | Measured Bandwidth      | Tokens/sec | Notes
35B-A3B MoE (BF16, TP1)      | 178 GB/s (weight reads) | 30.3       | MoE routing creates bursty patterns
Llama 3B (BF16, FlashAttn-2) | Near peak               | 14-20      | ~25W power draw at 95% GPU util
General 200B (4-bit)         | ~273 GB/s limit         | 34-38      | Capacity-for-latency trade-off

Mitigation strategies, ranked by impact:

  1. Quantize aggressively: FP16 to NVFP4 = 4x fewer bytes per weight
  2. Scale to multiple nodes: Each node adds 273 GB/s bandwidth
  3. Use sparse MoE models: 35B MoE with 3B active reads only ~6 GB per step
  4. Fuse kernels: Eliminate intermediate tensor writes

KV-Cache Management

The KV-cache stores key and value tensors from previous tokens. It grows proportionally with sequence length and competes directly with model weight storage in the 128 GB unified memory. vLLM manages this via --gpu-memory-utilization (default 0.9; reduce to 0.7 for multi-node long-context serving).
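To see why the KV-cache competes seriously with weights for the 128 GB, estimate its size for one long sequence: 2 (keys and values) x layers x KV heads x head dimension x bytes per value x sequence length. The model dimensions below are illustrative (roughly a 70B-class model with grouped-query attention), not taken from the text:

```shell
# KV-cache for one 32K-token sequence at FP16 (illustrative dims).
awk 'BEGIN {
  layers = 80; kv_heads = 8; head_dim = 128; bytes = 2; seq = 32768
  printf "%.1f GB per sequence\n", 2 * layers * kv_heads * head_dim * bytes * seq / 1e9
}'
```

Serve a handful of such sequences concurrently and tens of gigabytes go to cache alone, which is why cache quantization and a lower --gpu-memory-utilization matter for long-context workloads.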

Technique                        | Memory Savings         | Trade-off
KV-cache quantization (INT8/FP8) | 50%                    | Marginal accuracy impact
Prefix caching                   | Variable               | Effective only with repeated system prompts
Sliding window attention         | Proportional to window | Limits effective context length
Sparse MoE model selection       | Indirect               | Architecture-dependent

Data Locality and NUMA-Aware Scheduling

Despite the unified memory architecture (Grace CPU + Blackwell GPU on NVLink-C2C), data locality still matters: memory pages closer to the GPU's memory controllers are accessed with lower latency, so NUMA-aware scheduling keeps hot buffers on the pages nearest the engine that uses them.

Optimal tile sizes for DGX Spark: 64x64 with occupancy 2 for the GB10's 48 SMs at SM 12.1 compute capability. Larger tiles that perform well on datacenter GPUs underperform on DGX Spark.

flowchart TD
  A["Run vLLM with representative workload"] --> B["Capture Nsight Systems trace"]
  B --> C["Identify longest-running CUDA kernels"]
  C --> D{"Kernel bottleneck type?"}
  D -->|"Compute-bound (prefill)"| E["Increase arithmetic intensity"]
  D -->|"Memory-bound (decode)"| F["Reduce bytes per operation"]
  E --> G["Optimize batch size and occupancy"]
  F --> H["Apply quantization (NVFP4)"]
  F --> I["Fuse kernels to eliminate writes"]
  G --> J["Re-profile and validate"]
  H --> J
  I --> J
  J --> K{"Target throughput met?"}
  K -->|No| C
  K -->|Yes| L["Deploy optimized configuration"]

Key Points: Profiling & Memory Management

Post-Quiz: Profiling & Memory Management

1. What is the primary difference between Nsight Systems and Nsight Compute?

A) Nsight Systems profiles CPU only; Nsight Compute profiles GPU only B) Nsight Systems provides system-wide timeline views; Nsight Compute provides kernel-level detail C) Nsight Systems is free; Nsight Compute requires a license D) They are the same tool with different names

2. During decode on DGX Spark, what percentage of the 273 GB/s peak bandwidth is typically utilized?

A) 95-100% B) 55-60%, with a contention floor around 80-90 GB/s C) 10-20% D) Bandwidth is not measurable during decode

3. What does the vLLM parameter --gpu-memory-utilization control?

A) The GPU clock speed during inference B) The fraction of GPU memory available for model weights and KV-cache combined C) The number of GPU cores used for computation D) The power consumption limit of the GPU

4. Why do smaller tile sizes (64x64) outperform larger tiles on DGX Spark's GB10 SoC?

A) Smaller tiles use less memory bandwidth B) The GB10 has 48 SMs, and larger tiles exceed the available parallelism C) Smaller tiles are compatible with LPDDR5X but larger tiles are not D) NVIDIA artificially limits tile sizes on desktop GPUs

5. What is the highest-impact mitigation strategy for DGX Spark's bandwidth bottleneck?

A) Upgrading to faster memory modules B) Aggressive quantization (e.g., FP16 to NVFP4 cuts bytes-per-weight by 4x) C) Overclocking the GPU D) Using CPU offloading for model weights


Answer Explanations