Chapter 1: Grace Blackwell GB10 Superchip Architecture & Unified Memory System

Learning Objectives

Section 1: The GB10 Integrated Superchip Design

Pre-Quiz: SoC Design & Grace CPU

1. What is the primary advantage of the GB10's SoC design over traditional discrete CPU-GPU architectures?

A) It uses more transistors, enabling higher clock speeds
B) It eliminates the PCIe bottleneck by placing CPU and GPU on the same package with a high-bandwidth coherent interconnect
C) It allows the CPU and GPU to use different instruction sets for better specialization
D) It enables the GPU to operate at higher voltages due to the shared power delivery

2. Why does the Grace CPU use a heterogeneous core design with both Cortex-X925 and Cortex-A725 cores?

A) Because the two core types use different instruction sets optimized for different tasks
B) To maximize performance per watt by assigning latency-sensitive tasks to high-performance cores and background tasks to efficiency cores
C) To enable the CPU to process twice as many threads simultaneously compared to a homogeneous design
D) Because the Cortex-A725 cores handle GPU communication while Cortex-X925 cores handle memory management

3. In the context of CPU-GPU communication, what does the term "coherent" mean when describing NVLink-C2C?

A) The interconnect operates at a consistent clock frequency regardless of workload
B) Data transfers between CPU and GPU are error-corrected using ECC
C) When one processor writes to a memory address, the other immediately sees the updated value without explicit cache management
D) The CPU and GPU share the same instruction pipeline for synchronized execution

4. What engineering challenge does the Blackwell GPU's multi-die design solve?

A) It allows the GPU to run at lower voltages by distributing power across two dies
B) It circumvents the reticle size limit that constrains monolithic die scaling, enabling more transistors
C) It enables the GPU to simultaneously execute two different neural network models
D) It provides hardware redundancy so that one die can continue operating if the other fails

5. A developer on a discrete GPU system must explicitly copy data from CPU RAM to GPU VRAM. What replaces this step on the GB10?

A) DMA transfers are still required but are handled automatically by the OS driver
B) The GPU uses a dedicated high-speed SSD cache instead of VRAM
C) Nothing -- both CPU and GPU share a unified memory address space, so no copies are needed
D) The NVLink-C2C performs asynchronous background copies that are invisible to the developer

SoC Philosophy vs. Discrete CPU-GPU Architectures

The NVIDIA DGX Spark represents a turning point in AI computing: a personal AI supercomputer that delivers up to one petaflop of AI performance while sitting on a desk. At its heart is the GB10 Grace Blackwell Superchip, a system-on-a-chip (SoC) that fuses a high-performance CPU, a powerful GPU, and a unified memory system into a single integrated package.

For decades, CPUs and GPUs have lived as separate chips on a motherboard, communicating through PCIe. The GB10 takes a fundamentally different approach: instead of two separate components connected by a bus, NVIDIA engineered a single integrated package where the CPU and GPU share the same infrastructure, connected by NVLink-C2C rather than PCIe.

The result is a device measuring just 150 mm x 150 mm x 50.5 mm -- roughly the size of a Mac Mini -- yet capable of running AI models with up to 200 billion parameters.

Feature | Discrete GPU System | GB10 Superchip (SoC)
CPU-GPU connection | PCIe Gen5 (~64 GB/s) | NVLink-C2C (900 GB/s bidirectional)
Memory model | Separate CPU RAM + GPU VRAM | 128 GB unified LPDDR5X
Data transfer | Explicit copies required | Shared address space, no copies
Form factor | Tower/rack with discrete cards | 150 mm x 150 mm x 50.5 mm desktop unit
Power (typical) | 300-600 W (GPU alone) | 140 W (entire SoC)
Programming model | Manage two memory spaces | Single coherent memory space
GB10 Superchip Architecture

[Figure: GB10 Superchip (TSMC 4NP, 208B transistors) block diagram. Grace CPU: 10x Cortex-X925 high-performance cores plus 10x Cortex-A725 efficiency cores (20 cores, big.LITTLE), Armv9-A ISA with crypto and ML extensions, feeding the AI preprocessing pipeline. NVLink-C2C: 900 GB/s bidirectional, cache coherent, no explicit copies. Blackwell GPU: 48 streaming multiprocessors at 128 CUDA cores each (6,144 total, FP32/INT32), 192 fifth-generation Tensor Cores (FP4/FP8/BF16/TF32), 1 PFLOP FP4 (sparse), multi-die design with a 10 Tb/s inter-die link. Shared: 128 GB unified LPDDR5X, 273 GB/s, 16 channels, ECC.]

Grace CPU: 20-Core Heterogeneous Arm Microarchitecture

The CPU component of the GB10 employs a 20-core Arm processor with a heterogeneous design. The configuration pairs 10 high-performance Cortex-X925 cores with 10 energy-efficient Cortex-A725 cores, following the big.LITTLE principle. The OS scheduler dynamically assigns work to the appropriate core type, maximizing performance per watt.

Both core types implement the Armv9-A instruction set architecture with specialized extensions for cryptography and machine learning. For AI preprocessing pipelines -- tasks such as tokenization, data augmentation, feature extraction, and batch assembly -- the Grace CPU provides the serial and moderately parallel compute needed before data is handed to the GPU's massively parallel Tensor Cores.

Blackwell GPU: Fifth-Generation Architecture

The GPU side of the GB10 houses a Blackwell-architecture processor featuring 6,144 CUDA cores organized across 48 streaming multiprocessors (SMs). Each SM contains 128 CUDA cores, 4 fifth-generation Tensor Cores, 1 RT core, a 256 KB register file, and 128 KB configurable L1 cache/shared memory.

The headline figure -- 1 petaFLOP of AI compute at FP4 precision -- requires structured sparsity. Without sparsity, dense performance reaches approximately 500 TFLOPS at FP4.
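The relationship between the sparse and dense figures is simple doubling under 2:4 structured sparsity, sketched below with the numbers from this section:

```python
# 2:4 structured sparsity: in every group of 4 weights, at most 2 are
# non-zero, so the Tensor Cores skip half the multiply-accumulates.
dense_fp4_tflops = 500   # dense FP4 throughput stated in this section
sparsity_speedup = 2     # the 2:4 pattern doubles effective throughput

sparse_fp4_tflops = dense_fp4_tflops * sparsity_speedup
print(f"Sparse FP4 peak: {sparse_fp4_tflops} TFLOPS (~1 PFLOP)")
```

The headline 1 PFLOP figure therefore holds only for models pruned to the 2:4 pattern; dense workloads should plan around the 500 TFLOPS number.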

Die-to-Die Integration and Power Delivery

The GB10 is manufactured on TSMC's 4NP process with 208 billion transistors. The Blackwell GPU uses a multi-die design: two GPU dies connect via a 10 Tb/s internal interconnect, sharing an L2 cache and presenting as a single unified GPU to software. This circumvents the ~800 mm² reticle size limit. Advanced CoWoS packaging brings the Grace CPU and Blackwell GPU dies together within a single package.

Key Takeaway

The GB10 replaces the discrete CPU-GPU model with a single package: a 20-core heterogeneous Grace CPU and a 48-SM Blackwell GPU joined by coherent NVLink-C2C, sharing 128 GB of unified memory within a 140 W SoC envelope.

Post-Quiz: SoC Design & Grace CPU

1. What is the primary advantage of the GB10's SoC design over traditional discrete CPU-GPU architectures?

A) It uses more transistors, enabling higher clock speeds
B) It eliminates the PCIe bottleneck by placing CPU and GPU on the same package with a high-bandwidth coherent interconnect
C) It allows the CPU and GPU to use different instruction sets for better specialization
D) It enables the GPU to operate at higher voltages due to the shared power delivery

2. Why does the Grace CPU use a heterogeneous core design with both Cortex-X925 and Cortex-A725 cores?

A) Because the two core types use different instruction sets optimized for different tasks
B) To maximize performance per watt by assigning latency-sensitive tasks to high-performance cores and background tasks to efficiency cores
C) To enable the CPU to process twice as many threads simultaneously compared to a homogeneous design
D) Because the Cortex-A725 cores handle GPU communication while Cortex-X925 cores handle memory management

3. In the context of CPU-GPU communication, what does the term "coherent" mean when describing NVLink-C2C?

A) The interconnect operates at a consistent clock frequency regardless of workload
B) Data transfers between CPU and GPU are error-corrected using ECC
C) When one processor writes to a memory address, the other immediately sees the updated value without explicit cache management
D) The CPU and GPU share the same instruction pipeline for synchronized execution

4. What engineering challenge does the Blackwell GPU's multi-die design solve?

A) It allows the GPU to run at lower voltages by distributing power across two dies
B) It circumvents the reticle size limit that constrains monolithic die scaling, enabling more transistors
C) It enables the GPU to simultaneously execute two different neural network models
D) It provides hardware redundancy so that one die can continue operating if the other fails

5. A developer on a discrete GPU system must explicitly copy data from CPU RAM to GPU VRAM. What replaces this step on the GB10?

A) DMA transfers are still required but are handled automatically by the OS driver
B) The GPU uses a dedicated high-speed SSD cache instead of VRAM
C) Nothing -- both CPU and GPU share a unified memory address space, so no copies are needed
D) The NVLink-C2C performs asynchronous background copies that are invisible to the developer

Section 2: CUDA Cores, Tensor Cores & Compute Capabilities

Pre-Quiz: CUDA Cores, Tensor Cores & Compute

1. What is the fundamental difference between how CUDA cores and Tensor Cores process computations?

A) CUDA cores operate on scalar operations while Tensor Cores accelerate matrix multiply-accumulate operations
B) CUDA cores handle integer operations and Tensor Cores handle floating-point operations
C) CUDA cores run at higher clock speeds while Tensor Cores run at lower speeds with wider datapaths
D) CUDA cores are programmable while Tensor Cores are fixed-function and cannot be controlled by developers

2. How does the NVFP4 format achieve high accuracy despite using only 4 bits per element?

A) It uses lossless compression to store full FP16 values in 4 bits
B) It employs a two-level microscaling strategy with per-group FP8 scale factors and a tensor-level FP32 scale
C) It only quantizes weights that are close to zero, keeping important weights in full precision
D) It stores a 4-bit index into a shared lookup table of common weight values

3. What role does the Transformer Engine play in the Blackwell architecture?

A) It is a hardware unit that accelerates attention mechanism computations specifically
B) It automatically monitors tensor statistics per layer and dynamically selects the optimal precision format
C) It converts transformer models into optimized CUDA kernels at compile time
D) It manages the scheduling of transformer layers across multiple GPUs

4. Why does the Llama 3.1 70B model achieve dramatically lower decode throughput than the 8B model on the GB10?

A) The 70B model requires FP64 precision which runs much slower on consumer Blackwell
B) The 70B model's working set exceeds the cache hierarchy, making performance memory-bandwidth-bound
C) The 70B model cannot use Tensor Cores and must rely on CUDA cores alone
D) The GB10 lacks sufficient CUDA cores to process 70B parameters in parallel

5. What does "structured sparsity" mean in the context of the GB10's 1 PFLOP FP4 claim?

A) The hardware organizes data in sparse matrix formats to reduce memory usage
B) The GPU skips zero-valued computations in weight matrices following a defined sparsity pattern, effectively doubling throughput
C) The Tensor Cores only compute on the non-zero diagonal elements of weight matrices
D) The GPU uses compression to store only non-zero weights, freeing memory bandwidth

CUDA Core Organization and FP32/FP64 Peak Throughput

The 6,144 CUDA cores across the GB10's 48 SMs form the foundation of its general-purpose GPU computing capability. Each SM's 128 CUDA cores can execute FP32 or INT32 operations. This unified execution model provides flexibility for workloads that interleave floating-point math with integer address calculations.

Worked Example: Estimating CUDA Core Throughput
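The GPU clock is not specified in this chapter, so the sketch below assumes a hypothetical ~1.5 GHz boost clock purely for illustration; swap in the real value to refine the estimate.

```python
cuda_cores = 6144          # 48 SMs x 128 CUDA cores per SM
flops_per_core_cycle = 2   # one fused multiply-add counts as 2 FLOPs
clock_ghz = 1.5            # ASSUMED boost clock -- not given in this chapter

# Peak FP32 throughput = cores x FLOPs/cycle x clock (GHz gives TFLOPS/1000)
peak_fp32_tflops = cuda_cores * flops_per_core_cycle * clock_ghz / 1000
print(f"Estimated FP32 peak: {peak_fp32_tflops:.1f} TFLOPS")
```

Note how small this is next to the Tensor Core figures: scalar FP32 throughput sits in the tens of TFLOPS, which is why AI workloads lean on the Tensor Cores.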

Fifth-Generation Tensor Cores: Precision Modes

The 192 fifth-generation Tensor Cores (4 per SM x 48 SMs) represent the GB10's primary weapon for AI workloads. Unlike CUDA cores that process individual scalar operations, Tensor Cores accelerate matrix multiply-accumulate (MMA) operations -- the mathematical backbone of neural networks.

Precision | Bits | Primary Use Case | Relative Throughput
FP4 (NVFP4) | 4 | Inference with microscaling | ~4x
FP8 (E4M3/E5M2) | 8 | Training and inference | ~2x
BF16 | 16 | Training (wide dynamic range) | ~1x
TF32 | 32 (19 eff.) | Transparent acceleration | ~0.5x
FP32 | 32 | High-precision accumulation | ~0.25x

The NVFP4 format uses E2M1 encoding with a two-level scaling strategy: an FP8 scale factor per group of 16 values (micro-block scaling), and a coarser FP32 scale factor for the entire tensor. This halves the group size compared to MXFP4, providing twice as many opportunities to match local data distributions and significantly reducing quantization error. Models quantized to NVFP4 reduce memory footprint by approximately 3.5x versus FP16.
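The ~3.5x figure can be verified from the per-element storage cost, treating the single tensor-level FP32 scale as negligible:

```python
value_bits = 4     # E2M1 payload per element
scale_bits = 8     # one FP8 scale factor per micro-block
block_size = 16    # values per micro-block (MXFP4 uses 32)

# Amortized storage cost per value, ignoring the one-off FP32 tensor scale
bits_per_value = value_bits + scale_bits / block_size
reduction_vs_fp16 = 16 / bits_per_value
print(f"{bits_per_value} bits/value -> {reduction_vs_fp16:.2f}x smaller than FP16")
```

The halved block size costs only 0.5 bits per value in scale overhead, which is why NVFP4 lands at roughly 3.5x rather than a clean 4x reduction.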

Tensor Core Precision Scaling Pipeline

[Figure: precision scaling across the 192 fifth-generation Tensor Cores. FP4/NVFP4: 4 bits, ~4x throughput, E2M1 with FP8 micro-block scales over 16 values plus an FP32 tensor-level scale; low-latency inference with ~3.5x memory reduction vs FP16. FP8: 8 bits, ~2x throughput, E5M2 (range) / E4M3 (precision) with per-tensor scaling; mixed-precision training, auto-selected by the Transformer Engine. BF16/FP16: 16 bits, 1x baseline; BF16 offers the wider dynamic range for accuracy-critical training. All results accumulate in FP32 for numerical stability.]

Transformer Engine and Dynamic Precision Scaling

The Blackwell Transformer Engine is a hardware-software subsystem that automates precision selection during model execution. It monitors tensor statistics at each layer of a neural network and dynamically selects the optimal precision format -- choosing FP8 where accuracy permits, falling back to BF16 where it does not. This integrates with frameworks like TensorRT-LLM and NeMo.

Theoretical vs. Practical Compute Utilization

The gap between theoretical peaks and real-world throughput is one of the most important concepts in GPU computing. When a model's working set fits in the cache hierarchy, Tensor Cores stay well fed; when the model exceeds cache capacity, performance becomes gated by the 273 GB/s memory bandwidth.

Key Takeaway

CUDA cores supply general-purpose scalar FP32/INT32 throughput, while the 192 fifth-generation Tensor Cores accelerate matrix multiply-accumulate at precisions down to NVFP4; realized performance depends on whether the working set fits in cache or spills to the 273 GB/s memory system.

Post-Quiz: CUDA Cores, Tensor Cores & Compute

1. What is the fundamental difference between how CUDA cores and Tensor Cores process computations?

A) CUDA cores operate on scalar operations while Tensor Cores accelerate matrix multiply-accumulate operations
B) CUDA cores handle integer operations and Tensor Cores handle floating-point operations
C) CUDA cores run at higher clock speeds while Tensor Cores run at lower speeds with wider datapaths
D) CUDA cores are programmable while Tensor Cores are fixed-function and cannot be controlled by developers

2. How does the NVFP4 format achieve high accuracy despite using only 4 bits per element?

A) It uses lossless compression to store full FP16 values in 4 bits
B) It employs a two-level microscaling strategy with per-group FP8 scale factors and a tensor-level FP32 scale
C) It only quantizes weights that are close to zero, keeping important weights in full precision
D) It stores a 4-bit index into a shared lookup table of common weight values

3. What role does the Transformer Engine play in the Blackwell architecture?

A) It is a hardware unit that accelerates attention mechanism computations specifically
B) It automatically monitors tensor statistics per layer and dynamically selects the optimal precision format
C) It converts transformer models into optimized CUDA kernels at compile time
D) It manages the scheduling of transformer layers across multiple GPUs

4. Why does the Llama 3.1 70B model achieve dramatically lower decode throughput than the 8B model on the GB10?

A) The 70B model requires FP64 precision which runs much slower on consumer Blackwell
B) The 70B model's working set exceeds the cache hierarchy, making performance memory-bandwidth-bound
C) The 70B model cannot use Tensor Cores and must rely on CUDA cores alone
D) The GB10 lacks sufficient CUDA cores to process 70B parameters in parallel

5. What does "structured sparsity" mean in the context of the GB10's 1 PFLOP FP4 claim?

A) The hardware organizes data in sparse matrix formats to reduce memory usage
B) The GPU skips zero-valued computations in weight matrices following a defined sparsity pattern, effectively doubling throughput
C) The Tensor Cores only compute on the non-zero diagonal elements of weight matrices
D) The GPU uses compression to store only non-zero weights, freeing memory bandwidth

Section 3: NVLink-C2C Interconnect & Unified Memory Architecture

Pre-Quiz: NVLink-C2C & Unified Memory

1. What bandwidth advantage does NVLink-C2C provide over PCIe Gen5 x16 in the GB10?

A) Approximately 4x higher bandwidth
B) Approximately 14x higher bandwidth (900 GB/s vs ~64 GB/s)
C) Approximately 28x higher bandwidth
D) Approximately 2x higher bandwidth with lower latency

2. What is "arithmetic intensity" and why does it matter for predicting GB10 performance?

A) The ratio of integer to floating-point operations; it determines which core type to use
B) The ratio of compute operations to memory bytes accessed; it determines whether a workload is compute-bound or memory-bound
C) The clock frequency at which the Tensor Cores operate; it determines peak throughput
D) The density of non-zero values in weight matrices; it determines sparsity benefits

3. During LLM inference, why is the prefill phase typically compute-bound while the decode phase is memory-bound?

A) Prefill uses Tensor Cores while decode uses only CUDA cores
B) Prefill processes all input tokens in parallel with high data reuse, while decode generates one token at a time with minimal weight reuse
C) Prefill operates on cached data while decode must access main memory for every operation
D) Prefill uses FP4 precision while decode requires BF16, reducing throughput

4. How does the GB10's unified memory architecture simplify development for a 70B parameter model compared to a discrete GPU system?

A) It automatically quantizes models to fit in available memory
B) It eliminates the need to manage separate CPU and GPU memory pools and explicit data copies between them
C) It provides faster memory than HBM3e used in datacenter GPUs
D) It enables the model to be split across multiple GB10 devices automatically

5. What is the approximate memory bandwidth gap between the GB10 and datacenter GB200, and what does it imply?

A) 5x gap; the GB10 compensates with better cache utilization
B) 28x gap (273 GB/s vs ~7,700 GB/s); memory-intensive workloads will see dramatically different performance
C) 100x gap; the GB10 is unsuitable for any large model inference
D) 10x gap; the GB10 uses compression to close the effective bandwidth difference

NVLink Chip-to-Chip Coherent Interconnect at 900 GB/s

The NVLink-C2C provides 900 GB/s of bidirectional bandwidth between the Grace CPU and Blackwell GPU -- 14x the bandwidth of PCIe Gen5 x16. It achieves 25x better energy efficiency than PCIe Gen5, and cross-chip operations reach up to 93% of theoretical bandwidth.

Interconnect | Bidirectional Bandwidth | Relative to PCIe Gen5
PCIe Gen5 x16 | ~64 GB/s | 1x
PCIe Gen6 x16 (theoretical) | ~128 GB/s | 2x
NVLink-C2C (GB10) | 900 GB/s | 14x
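The 14x figure is simply the ratio of the two bandwidth numbers:

```python
pcie_gen5_x16_gbs = 64    # approximate bidirectional bandwidth
nvlink_c2c_gbs = 900

ratio = nvlink_c2c_gbs / pcie_gen5_x16_gbs
print(f"NVLink-C2C is ~{ratio:.0f}x PCIe Gen5 x16")
```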

The critical word is coherent: when the CPU writes a value to a memory address, the GPU immediately sees the updated value at that same address, and vice versa. No explicit cache flushes, no DMA transfers. The hardware handles it transparently.

128 GB Unified LPDDR5X Memory

The GB10 features 128 GB of LPDDR5X shared between CPU and GPU in a single unified address space. In a traditional discrete system, developers must manage separate memory pools and explicit copies. On the GB10, both CPU and GPU see the same address space -- if the CPU preprocesses a batch of tokens, the GPU can immediately begin matrix multiplication without any data transfer.

The memory operates at 4,266 MHz across a 256-bit interface with 16 channels, delivering 273 GB/s peak bandwidth with ECC protection.
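The 273 GB/s figure follows directly from the bus width and data rate, since LPDDR5X transfers data on both clock edges:

```python
bus_width_bits = 256       # 16 channels aggregate
clock_mhz = 4266
transfers_per_clock = 2    # double data rate -> 8,533 MT/s effective

# Bandwidth = bytes per transfer x transfers per second
bandwidth_gbs = (bus_width_bits / 8) * clock_mhz * transfers_per_clock / 1000
print(f"Peak bandwidth: {bandwidth_gbs:.0f} GB/s")
```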

Memory Bandwidth Characteristics and Bottleneck Analysis

The 273 GB/s unified memory bandwidth is the GB10's most important performance constraint for large model workloads. The key concept is arithmetic intensity: the ratio of compute operations (FLOPs) performed to memory bytes accessed. Workloads whose intensity exceeds the hardware's ridge point (peak FLOPs divided by memory bandwidth) are compute-bound; workloads below it are memory-bandwidth-bound.
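These quantities reduce to a quick calculation. The sketch below uses the dense FP4 figure from Section 1 as the compute peak (an assumption; the ridge point shifts with the precision in use):

```python
peak_dense_fp4_tflops = 500   # dense FP4 peak from Section 1
mem_bw_gbs = 273

# Ridge point: minimum arithmetic intensity needed to be compute-bound
ridge_flops_per_byte = peak_dense_fp4_tflops * 1e12 / (mem_bw_gbs * 1e9)

# Decode-style matrix-vector work: ~2 FLOPs per weight, 0.5 byte per FP4 weight
decode_intensity = 2 / 0.5
print(f"Ridge: ~{ridge_flops_per_byte:.0f} FLOPs/byte; "
      f"decode: {decode_intensity} FLOPs/byte -> memory-bound")
```

Decode sits orders of magnitude below the ridge point, which is why bandwidth, not compute, governs token generation speed.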

Worked Example: Why Large Model Decoding is Memory-Bound
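A rough bound on decode throughput follows from the bandwidth figures in this section. The sketch below assumes NVFP4 weights at roughly 0.5 bytes per parameter and ignores KV-cache and activation traffic, so it is an optimistic upper bound:

```python
params = 70e9            # Llama 3.1 70B
bytes_per_param = 0.5    # ~4-bit NVFP4 weights, scale overhead ignored
mem_bw_gbs = 273

# Every generated token must stream essentially all weights from memory
weights_gb = params * bytes_per_param / 1e9
max_tokens_per_s = mem_bw_gbs / weights_gb
print(f"{weights_gb:.0f} GB/token -> at most ~{max_tokens_per_s:.1f} tokens/s")
```

An 8B model streams roughly one-ninth the bytes per token, which is why its decode throughput is dramatically higher on the same hardware.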

Specification | GB10 (DGX Spark) | GB200 (Datacenter)
Memory capacity | 128 GB LPDDR5X | Up to 384 GB HBM3e
Memory bandwidth | 273 GB/s | ~7,700 GB/s
Bandwidth ratio | 1x | ~28x
Ideal for | Models up to ~200B params | Trillion+ parameter models

Cache Coherency and LLM Access Patterns

Two memory access patterns dominate LLM inference:

  1. Prefill phase: Processes all input tokens in parallel with high data reuse. Compute-bound. GB10 achieves 7,991 tokens/sec on Llama 3.1 8B.
  2. Decode phase: Generates one token at a time, requiring full weight access with minimal reuse. Memory-bandwidth-bound.

The cache coherency protocol allows CPU-side preprocessing and GPU-side computation to overlap without explicit synchronization -- the CPU continuously prepares data while the GPU processes the current batch.

Key Takeaway

NVLink-C2C's 900 GB/s coherent link and the 128 GB unified address space remove explicit CPU-GPU copies, but the 273 GB/s LPDDR5X bandwidth remains the binding constraint for low-arithmetic-intensity work such as LLM decode.

Post-Quiz: NVLink-C2C & Unified Memory

1. What bandwidth advantage does NVLink-C2C provide over PCIe Gen5 x16 in the GB10?

A) Approximately 4x higher bandwidth
B) Approximately 14x higher bandwidth (900 GB/s vs ~64 GB/s)
C) Approximately 28x higher bandwidth
D) Approximately 2x higher bandwidth with lower latency

2. What is "arithmetic intensity" and why does it matter for predicting GB10 performance?

A) The ratio of integer to floating-point operations; it determines which core type to use
B) The ratio of compute operations to memory bytes accessed; it determines whether a workload is compute-bound or memory-bound
C) The clock frequency at which the Tensor Cores operate; it determines peak throughput
D) The density of non-zero values in weight matrices; it determines sparsity benefits

3. During LLM inference, why is the prefill phase typically compute-bound while the decode phase is memory-bound?

A) Prefill uses Tensor Cores while decode uses only CUDA cores
B) Prefill processes all input tokens in parallel with high data reuse, while decode generates one token at a time with minimal weight reuse
C) Prefill operates on cached data while decode must access main memory for every operation
D) Prefill uses FP4 precision while decode requires BF16, reducing throughput

4. How does the GB10's unified memory architecture simplify development for a 70B parameter model compared to a discrete GPU system?

A) It automatically quantizes models to fit in available memory
B) It eliminates the need to manage separate CPU and GPU memory pools and explicit data copies between them
C) It provides faster memory than HBM3e used in datacenter GPUs
D) It enables the model to be split across multiple GB10 devices automatically

5. What is the approximate memory bandwidth gap between the GB10 and datacenter GB200, and what does it imply?

A) 5x gap; the GB10 compensates with better cache utilization
B) 28x gap (273 GB/s vs ~7,700 GB/s); memory-intensive workloads will see dramatically different performance
C) 100x gap; the GB10 is unsuitable for any large model inference
D) 10x gap; the GB10 uses compression to close the effective bandwidth difference

Section 4: Power Delivery & Thermal Management

Pre-Quiz: Power & Thermal Management

1. What is the total system power budget of the DGX Spark, and how does the SoC portion compare to a discrete RTX 4090 GPU?

A) 500 W total, with the SoC at 300 W -- similar to an RTX 4090
B) 240 W total, with the SoC at 140 W -- roughly one-third the power of an RTX 4090 GPU alone
C) 140 W total, with the SoC at 80 W -- about one-sixth of an RTX 4090
D) 350 W total, with the SoC at 200 W -- about half an RTX 4090

2. Why does the GB10's shared thermal envelope between CPU and GPU create a unique constraint?

A) The CPU must be powered down whenever the GPU is at full load
B) Heavy GPU utilization generates heat that reduces thermal headroom available for CPU tasks, and vice versa
C) The CPU and GPU must alternate execution to share the cooling capacity
D) The thermal design prevents both CPU and GPU from reaching their peak clock speeds simultaneously

3. What is the maximum recommended ambient operating temperature for the GB10, and what happens above it?

A) 45 degrees C; the system enters a low-power sleep mode
B) 30 degrees C; the system may enter thermal throttling, reducing clock frequencies and throughput
C) 50 degrees C; the fan speed increases but performance is maintained
D) 35 degrees C; the GPU disables half of its SMs to reduce heat

4. What is the GB10's approximate power efficiency in PFLOPS per kilowatt for the SoC alone?

A) ~2.5 PFLOPS/kW
B) ~7.1 PFLOPS/kW
C) ~12 PFLOPS/kW
D) ~20 PFLOPS/kW

5. How can multiple GB10 units be scaled for higher aggregate compute?

A) Via NVLink bridges that directly connect the GPUs across units
B) Via ConnectX-7 networking, with up to 4 interconnected units delivering ~4 PFLOPS aggregate
C) Via PCIe daisy-chaining of up to 8 units
D) Multiple GB10 units cannot be interconnected; they operate independently

Power Distribution Architecture and TDP Envelope

The GB10 operates within a total system power budget of 240 watts. The SoC (GPU + CPU) accounts for 140 W (58%), the ConnectX-7 NIC for ~40 W (17%), and remaining peripherals for ~60 W (25%). For context, a single discrete RTX 4090 GPU alone consumes 450 W -- more than three times the GB10's entire SoC budget.

The power delivery uses a single USB-C power input from a standard electrical outlet. No dedicated circuits, no specialized PDUs -- just a standard desk outlet.

Component | Power Budget | Percentage
Blackwell GPU + Grace CPU (SoC) | 140 W | 58%
ConnectX-7 NIC (200 Gbps) | ~40 W | ~17%
Wi-Fi 7, NVMe SSD, USB-C/HDMI | ~60 W | ~25%
Total system | 240 W | 100%

Thermal Management and Throttling

The 150 mm x 150 mm x 50.5 mm enclosure at 1.2 kg must dissipate 240 watts continuously under load. NVIDIA specifies an operating temperature range of 5-30 degrees C. The relatively narrow upper bound of 30 degrees C implies the thermal solution has limited headroom.

When internal temperatures exceed safe thresholds, the system reduces clock frequencies (thermal throttling). The SoC's integrated design means heavy GPU utilization generates heat that also affects the CPU -- unlike discrete systems with independent thermal domains.

Power Efficiency Comparison

System | AI Compute (FP4) | Total Power | Efficiency
DGX Spark (GB10) | 1 PFLOP | 240 W (140 W SoC) | ~7.1 PFLOPS/kW
DGX B200 (single GPU) | ~20 PFLOPS | ~1,000 W | ~20 PFLOPS/kW
DGX GB200 NVL72 (rack) | ~1,440 PFLOPS | ~120 kW | ~12 PFLOPS/kW

For organizations considering fleet deployments, up to 4 interconnected GB10 units via ConnectX-7 can deliver approximately 4 PFLOPS of aggregate FP4 compute with nearly linear scaling, all from standard desk outlets.
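The efficiency and fleet numbers reduce to simple arithmetic; the aggregate figures below assume the near-linear scaling stated above:

```python
fp4_pflops_per_unit = 1
soc_power_kw = 0.140      # SoC-only power, matching the ~7.1 PFLOPS/kW figure
system_power_w = 240
units = 4

efficiency_pflops_per_kw = fp4_pflops_per_unit / soc_power_kw
aggregate_pflops = units * fp4_pflops_per_unit   # assumes near-linear scaling
aggregate_system_w = units * system_power_w
print(f"{efficiency_pflops_per_kw:.1f} PFLOPS/kW; "
      f"{aggregate_pflops} PFLOPS aggregate from {aggregate_system_w} W")
```

Note that the ~7.1 PFLOPS/kW figure is computed against the 140 W SoC budget; against the full 240 W system draw it drops to about 4.2 PFLOPS/kW.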

Key Takeaway

The GB10 delivers 1 PFLOP of sparse FP4 compute from a 240 W desktop unit on a standard outlet; the shared 140 W SoC thermal envelope and 30 degrees C ambient ceiling mean sustained throughput depends on cooling, and up to 4 units can be networked for ~4 PFLOPS aggregate.

Post-Quiz: Power & Thermal Management

1. What is the total system power budget of the DGX Spark, and how does the SoC portion compare to a discrete RTX 4090 GPU?

A) 500 W total, with the SoC at 300 W -- similar to an RTX 4090
B) 240 W total, with the SoC at 140 W -- roughly one-third the power of an RTX 4090 GPU alone
C) 140 W total, with the SoC at 80 W -- about one-sixth of an RTX 4090
D) 350 W total, with the SoC at 200 W -- about half an RTX 4090

2. Why does the GB10's shared thermal envelope between CPU and GPU create a unique constraint?

A) The CPU must be powered down whenever the GPU is at full load
B) Heavy GPU utilization generates heat that reduces thermal headroom available for CPU tasks, and vice versa
C) The CPU and GPU must alternate execution to share the cooling capacity
D) The thermal design prevents both CPU and GPU from reaching their peak clock speeds simultaneously

3. What is the maximum recommended ambient operating temperature for the GB10, and what happens above it?

A) 45 degrees C; the system enters a low-power sleep mode
B) 30 degrees C; the system may enter thermal throttling, reducing clock frequencies and throughput
C) 50 degrees C; the fan speed increases but performance is maintained
D) 35 degrees C; the GPU disables half of its SMs to reduce heat

4. What is the GB10's approximate power efficiency in PFLOPS per kilowatt for the SoC alone?

A) ~2.5 PFLOPS/kW
B) ~7.1 PFLOPS/kW
C) ~12 PFLOPS/kW
D) ~20 PFLOPS/kW

5. How can multiple GB10 units be scaled for higher aggregate compute?

A) Via NVLink bridges that directly connect the GPUs across units
B) Via ConnectX-7 networking, with up to 4 interconnected units delivering ~4 PFLOPS aggregate
C) Via PCIe daisy-chaining of up to 8 units
D) Multiple GB10 units cannot be interconnected; they operate independently


Answer Explanations