Chapter 2: DGX OS Software Stack, CUDA Toolkit & Containerized AI Workflows

Learning Objectives

Section 1: DGX OS — Ubuntu Linux with NVIDIA Optimization

Pre-Quiz: DGX OS Fundamentals

1. What distinguishes DGX OS GPU driver installation from a standard Ubuntu GPU setup?

  a. DGX OS requires manual DKMS compilation after each kernel update
  b. DGX OS uses a proprietary non-Linux kernel for GPU support
  c. DGX OS ships GPU drivers pre-integrated as .deb packages without DKMS, eliminating rebuild steps
  d. DGX OS only supports GPU access through containers, not natively

2. What is the primary role of DCGM (Data Center GPU Manager) on DGX Spark?

  a. It replaces nvidia-smi as the only GPU monitoring tool
  b. It provides programmatic GPU telemetry, health checks, and integration with monitoring dashboards like Prometheus
  c. It manages Docker container networking for GPU workloads
  d. It controls GPU clock speeds and voltage for overclocking

3. Why does DGX Spark ship with a custom NVIDIA kernel rather than the stock Ubuntu kernel?

  a. The stock kernel cannot boot on ARM64 hardware
  b. The custom kernel is tuned for Grace Blackwell unified memory, NVLink, and GPU scheduling
  c. NVIDIA legally cannot distribute the standard Ubuntu kernel
  d. The custom kernel removes all networking support to improve GPU performance

4. How does DGX Spark handle multi-user GPU workload isolation?

  a. Each user receives a physically separate GPU partition at the hardware level
  b. Only one user can log in at a time to prevent conflicts
  c. GPU access is controlled at the container level using the --gpus flag for Docker containers
  d. A hypervisor creates virtual GPUs for each user session

5. Which diagnostic approach correctly represents the layered diagnostics model on DGX Spark?

  a. Run nvidia-smi only; it covers all diagnostic needs
  b. Hardware layer (nvidia-smi -q), driver layer (dmesg | grep nvidia), application layer (CUDA sample programs)
  c. Check Docker logs first, then reboot the system if errors persist
  d. Run the management dashboard exclusively; command-line tools are deprecated

DGX OS Base System: Ubuntu LTS with NVIDIA Kernel Modules

DGX OS is built on Ubuntu 24.04 LTS but is far more than a stock installation. NVIDIA ships a custom kernel (6.17.0-1014-nvidia) alongside a Hardware Enablement (HWE) kernel 6.14, both tuned for the Grace Blackwell architecture's unified memory, NVLink interconnects, and GPU scheduling requirements.

The GPU driver package — the nvidia-580-open series — is delivered as a .deb package without DKMS (Dynamic Kernel Module Support). This eliminates the fragile driver rebuild step that plagues standard Linux GPU setups. Think of it this way: if standard Ubuntu with manually installed drivers is like assembling furniture from parts, DGX OS is the factory-assembled version.

| Component | Standard Ubuntu | DGX OS |
|---|---|---|
| Kernel | Generic Linux kernel | Custom NVIDIA kernel (6.17.0-1014-nvidia) |
| GPU Drivers | Manually installed, DKMS-rebuilt | Pre-integrated, .deb packaged, no DKMS |
| AI Libraries | User-installed | Pre-configured CUDA, cuDNN, TensorRT |
| Container Runtime | Docker only | Docker + NVIDIA Container Toolkit |

DGX OS Software Stack Layers

[Diagram: DGX OS software stack layers, bottom to top]

  1. Grace Blackwell hardware: ARM64 CPU + Blackwell GPU + unified memory
  2. DGX OS (Ubuntu 24.04 LTS): custom NVIDIA kernel 6.17.0-1014-nvidia
  3. Pre-integrated GPU drivers: nvidia-580-open, .deb packaged, no DKMS
  4. CUDA Toolkit: nvcc, cuBLAS, cuFFT, cuSPARSE
  5. AI libraries: cuDNN, TensorRT, NCCL
  6. Container Toolkit (GPU passthrough runtime) and NGC Registry (optimized AI containers)
  7. AI applications: PyTorch, TensorFlow, JAX, NIM, Triton

GPU Monitoring with nvidia-smi and DCGM

The moment DGX Spark boots, the GPU driver stack is operational. The primary monitoring interface is nvidia-smi, which reports GPU utilization, memory, temperature, power draw, and running processes. For continuous monitoring, nvidia-smi dmon streams metrics at configurable intervals.
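The same query flags make nvidia-smi scriptable. As a minimal sketch, assuming the `--query-gpu`/`--format=csv,noheader,nounits` output format (the device name and numbers in the sample line are invented, not captured from real hardware), a parser might look like:

```python
# Fields requested via:
#   nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu \
#              --format=csv,noheader,nounits
FIELDS = ["name", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]

def parse_smi_csv(text):
    """Parse CSV lines (one per GPU) into dicts keyed by the queried field names."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(FIELDS, values)))
    return gpus

# Illustrative sample output (made up)
sample = "NVIDIA GB10, 87, 61440, 131072, 71"
for gpu in parse_smi_csv(sample):
    print(f"{gpu['name']}: {gpu['utilization.gpu']}% util, "
          f"{gpu['memory.used']}/{gpu['memory.total']} MiB, {gpu['temperature.gpu']} C")
```

A cron job or exporter sidecar can run the query on an interval and feed the parsed dicts to whatever alerting path the team already uses.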

Beyond nvidia-smi, DCGM (Data Center GPU Manager) provides programmatic access to GPU telemetry, health checks, and policy-based monitoring suitable for integration with Prometheus and Grafana dashboards.
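DCGM's exporter publishes telemetry in the Prometheus text format, which is what makes the Grafana integration straightforward. A sketch of scraping that format, assuming a dcgm-exporter-style endpoint (the sample scrape below is invented; the metric values are not from real hardware):

```python
def parse_prom(text):
    """Parse Prometheus exposition-format lines into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_labels, value = line.rsplit(" ", 1)
        name = name_labels.split("{", 1)[0]  # drop the {gpu="0"} label set
        metrics[name] = float(value)
    return metrics

# Illustrative scrape of a DCGM exporter endpoint (values made up)
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 71
"""
print(parse_prom(sample))
```

In practice Prometheus itself does the scraping; a hand-rolled parser like this is only useful for quick scripts and smoke tests.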

System Management and Diagnostics

Diagnostics follow a layered approach:

  1. Hardware layer: nvidia-smi -q reports ECC memory errors, PCIe link speed, thermal throttling
  2. Driver layer: dmesg | grep nvidia surfaces kernel-level driver messages
  3. Application layer: CUDA sample programs (deviceQuery, bandwidthTest) verify end-to-end functionality

Multi-User Access and GPU Isolation

DGX Spark supports teams through standard Linux multi-user capabilities enhanced for GPU workload isolation. GPU access is controlled at the container level — each user's Docker containers receive dedicated GPU resources through the --gpus flag. JupyterLab, pre-installed on DGX Spark, provides browser-based development access with per-user sessions.

Key Takeaways

Post-Quiz: DGX OS Fundamentals

1. What distinguishes DGX OS GPU driver installation from a standard Ubuntu GPU setup?

  a. DGX OS requires manual DKMS compilation after each kernel update
  b. DGX OS uses a proprietary non-Linux kernel for GPU support
  c. DGX OS ships GPU drivers pre-integrated as .deb packages without DKMS, eliminating rebuild steps
  d. DGX OS only supports GPU access through containers, not natively

2. What is the primary role of DCGM (Data Center GPU Manager) on DGX Spark?

  a. It replaces nvidia-smi as the only GPU monitoring tool
  b. It provides programmatic GPU telemetry, health checks, and integration with monitoring dashboards like Prometheus
  c. It manages Docker container networking for GPU workloads
  d. It controls GPU clock speeds and voltage for overclocking

3. Why does DGX Spark ship with a custom NVIDIA kernel rather than the stock Ubuntu kernel?

  a. The stock kernel cannot boot on ARM64 hardware
  b. The custom kernel is tuned for Grace Blackwell unified memory, NVLink, and GPU scheduling
  c. NVIDIA legally cannot distribute the standard Ubuntu kernel
  d. The custom kernel removes all networking support to improve GPU performance

4. How does DGX Spark handle multi-user GPU workload isolation?

  a. Each user receives a physically separate GPU partition at the hardware level
  b. Only one user can log in at a time to prevent conflicts
  c. GPU access is controlled at the container level using the --gpus flag for Docker containers
  d. A hypervisor creates virtual GPUs for each user session

5. Which diagnostic approach correctly represents the layered diagnostics model on DGX Spark?

  a. Run nvidia-smi only; it covers all diagnostic needs
  b. Hardware layer (nvidia-smi -q), driver layer (dmesg | grep nvidia), application layer (CUDA sample programs)
  c. Check Docker logs first, then reboot the system if errors persist
  d. Run the management dashboard exclusively; command-line tools are deprecated

Section 2: CUDA Toolkit & Core AI Development Libraries

Pre-Quiz: CUDA Toolkit & AI Libraries

1. What does cuDNN provide that the base CUDA toolkit does not?

  a. A GPU compiler for .cu source files
  b. GPU-accelerated deep learning primitives like convolutions, pooling, and normalization
  c. Container runtime hooks for GPU passthrough
  d. A web-based dashboard for monitoring training progress

2. Why must software compiled natively for DGX Spark target ARM64 rather than x86_64?

  a. The Blackwell GPU only supports ARM instruction sets
  b. DGX Spark uses the Grace CPU which is an ARM64 (AArch64) processor
  c. ARM64 is required by the NVIDIA Container Toolkit licensing terms
  d. x86_64 binaries are automatically translated, but ARM64 is faster

3. What is TensorRT's primary function in the AI deployment pipeline?

  a. Training neural networks from scratch with automatic hyperparameter tuning
  b. Converting trained models into optimized inference engines with layer fusion and precision calibration
  c. Managing CUDA toolkit version compatibility across different GPU architectures
  d. Distributing training data across multiple nodes in a cluster

4. How does NCCL improve multi-GPU training performance on DGX Spark?

  a. It compresses model weights to reduce memory usage per GPU
  b. It routes collective operations (AllReduce, Broadcast) over high-bandwidth NVLink rather than slower PCIe
  c. It automatically partitions the model across GPUs using pipeline parallelism
  d. It replaces cuDNN for communication-heavy operations like attention layers

5. When a developer runs PyTorch on DGX Spark and calls a convolution operation, what library actually performs the GPU computation?

  a. PyTorch's built-in GPU kernels written in Python
  b. cuDNN, which selects the fastest algorithm for the specific tensor dimensions and hardware
  c. TensorRT, which optimizes all operations at runtime
  d. NCCL, which handles all GPU computations including math operations

CUDA Toolkit: Installation, Versioning, and Configuration

The CUDA toolkit comes pre-installed on DGX Spark, verified for Blackwell hardware compatibility. Versions include CUDA 12.8 and CUDA 13.0.2, depending on the release batch. The toolkit includes the nvcc compiler and the core GPU math libraries: cuBLAS, cuFFT, and cuSPARSE.

Environment configuration centers on two variables: PATH must include /usr/local/cuda/bin, and LD_LIBRARY_PATH must include /usr/local/cuda/lib64. On DGX OS, these are set by default.
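If these ever need to be set by hand (for example, in a stripped-down container or CI image), a minimal sketch:

```shell
# Put the CUDA compiler and shared libraries on the search paths.
# DGX OS already configures both; shown for environments that don't.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}

# Confirm the PATH entry landed
echo "$PATH" | grep -q "/usr/local/cuda/bin" && echo "PATH ok"
```

Persist the two exports in ~/.bashrc (or an /etc/profile.d snippet) if a non-DGX environment should keep them across sessions.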

Important: DGX Spark uses the ARM64 (AArch64) architecture via the Grace CPU. Any natively compiled software must target ARM64, and the NGC CLI must be installed from the ARM64 Linux tab.

cuDNN and TensorRT

cuDNN provides GPU-accelerated deep learning primitives: convolutions, pooling, normalization, and activation functions. When PyTorch or TensorFlow execute a convolution, they call cuDNN, which selects the fastest algorithm for the specific tensor dimensions, data type, and hardware.
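For intuition, this is the arithmetic cuDNN is accelerating: a sliding-window "valid" convolution (cross-correlation, as deep learning frameworks define it). The deliberately naive pure-Python sketch below computes the same result; cuDNN's value is choosing a fast GPU algorithm for this math at scale:

```python
def conv2d(image, kernel):
    """Naive 'valid' 2D convolution over nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply-accumulate the kernel over the window at (i, j)
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
edge = [[1, -1]]            # horizontal difference filter
print(conv2d(image, edge))  # [[-1, -1], [-1, -1], [-1, -1]]
```

A real workload replaces this O(n^2 * k^2) loop with cuDNN's implicit-GEMM, Winograd, or FFT-based kernels, selected per tensor shape.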

TensorRT (v10.2 on DGX Spark) is NVIDIA's inference optimization engine. It takes a trained model and produces an optimized execution plan — fusing layers, selecting precision (FP32, FP16, INT8), and calibrating for the target GPU. The workflow: export to ONNX, optimize with trtexec, deploy the .trt engine.
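That workflow can be sketched as two commands; the file names here are placeholders, and the export step runs inside Python:

```shell
# 1. Export the trained model to ONNX (from Python):
#      torch.onnx.export(model, example_input, "model.onnx")
# 2. Build an optimized inference engine with trtexec, FP16 enabled:
trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
```

trtexec also reports layer-by-layer timing during the build, which is useful for spotting which precision choice actually pays off on the target GPU.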

flowchart LR
    A["Trained Model\nPyTorch / TensorFlow"] --> B["Export to ONNX\ntorch.onnx.export()"]
    B --> C["TensorRT Optimizer\ntrtexec"]
    C --> D{"Precision\nSelection"}
    D --> E["FP32\nFull Precision"]
    D --> F["FP16\nHalf Precision"]
    D --> G["INT8\nQuantized"]
    E --> H["Layer Fusion\n& Kernel Selection"]
    F --> H
    G --> H
    H --> I["Optimized TensorRT\nEngine (.trt)"]
    I --> J["Deploy for\nInference"]
| Library | Purpose | When You Use It |
|---|---|---|
| CUDA Toolkit | GPU computation platform | Compiling custom CUDA kernels |
| cuDNN | DL operation primitives | Automatically via PyTorch/TensorFlow |
| TensorRT | Inference optimization | Deploying models to production |
| cuBLAS | Linear algebra on GPU | Matrix ops, automatically via frameworks |
| NCCL | Multi-GPU communication | Distributed training (automatic) |

NCCL and Multi-GPU Communication

NCCL (pronounced "Nickel") handles data transfer between multiple GPUs. It orchestrates operations like AllReduce, Broadcast, and AllGather across GPUs connected via NVLink, exploiting high-bandwidth topology rather than slower PCIe. PyTorch's DistributedDataParallel and TensorFlow's tf.distribute.Strategy call NCCL automatically.
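Conceptually, AllReduce leaves every rank holding the elementwise sum of all ranks' buffers. A toy pure-Python model of that end state (NCCL's contribution is performing this over NVLink with bandwidth-optimal ring or tree schedules, not the arithmetic itself):

```python
def all_reduce_sum(buffers):
    """Toy AllReduce: every rank ends with the elementwise sum across all ranks."""
    reduced = [sum(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]  # each rank gets its own copy

# Gradients held by 4 ranks before synchronization (illustrative values)
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(all_reduce_sum(grads))  # every rank now holds [16.0, 20.0]
```

In data-parallel training this summed gradient (usually divided by world size) is what each rank applies in its optimizer step, keeping all model replicas identical.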

Framework Integration: PyTorch, TensorFlow, JAX

DGX Spark ships with pre-installed versions of major AI frameworks, each compiled against the system's CUDA, cuDNN, and NCCL versions. The pre-installed versions are matched and tested, avoiding version compatibility headaches. For different framework versions, NGC containers provide isolated environments with their own stack.

# Verify GPU availability
# PyTorch
import torch
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # NVIDIA Blackwell ...

# TensorFlow
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

# JAX
import jax
print(jax.devices())

Key Takeaways

Post-Quiz: CUDA Toolkit & AI Libraries

1. What does cuDNN provide that the base CUDA toolkit does not?

  a. A GPU compiler for .cu source files
  b. GPU-accelerated deep learning primitives like convolutions, pooling, and normalization
  c. Container runtime hooks for GPU passthrough
  d. A web-based dashboard for monitoring training progress

2. Why must software compiled natively for DGX Spark target ARM64 rather than x86_64?

  a. The Blackwell GPU only supports ARM instruction sets
  b. DGX Spark uses the Grace CPU which is an ARM64 (AArch64) processor
  c. ARM64 is required by the NVIDIA Container Toolkit licensing terms
  d. x86_64 binaries are automatically translated, but ARM64 is faster

3. What is TensorRT's primary function in the AI deployment pipeline?

  a. Training neural networks from scratch with automatic hyperparameter tuning
  b. Converting trained models into optimized inference engines with layer fusion and precision calibration
  c. Managing CUDA toolkit version compatibility across different GPU architectures
  d. Distributing training data across multiple nodes in a cluster

4. How does NCCL improve multi-GPU training performance on DGX Spark?

  a. It compresses model weights to reduce memory usage per GPU
  b. It routes collective operations (AllReduce, Broadcast) over high-bandwidth NVLink rather than slower PCIe
  c. It automatically partitions the model across GPUs using pipeline parallelism
  d. It replaces cuDNN for communication-heavy operations like attention layers

5. When a developer runs PyTorch on DGX Spark and calls a convolution operation, what library actually performs the GPU computation?

  a. PyTorch's built-in GPU kernels written in Python
  b. cuDNN, which selects the fastest algorithm for the specific tensor dimensions and hardware
  c. TensorRT, which optimizes all operations at runtime
  d. NCCL, which handles all GPU computations including math operations

Section 3: NGC Container Registry & Containerized Workflows

Pre-Quiz: NGC Containers & Workflows

1. What does the NVIDIA Container Toolkit provide that standard Docker does not?

  a. Network isolation between containers
  b. OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers
  c. Automatic container image compression for faster pulls
  d. Built-in container orchestration with load balancing

2. What is the recommended approach for building a custom AI container on DGX Spark?

  a. Start from a minimal Alpine Linux image and install CUDA from scratch
  b. Start from an NVIDIA base image (e.g., nvcr.io/nvidia/pytorch) and layer project-specific dependencies
  c. Copy the host system's /usr/local/cuda directory into the container
  d. Use a Windows container with CUDA support for maximum compatibility

3. When authenticating Docker with the NGC registry, what value is used as the username?

  a. Your NVIDIA developer account email
  b. $oauthtoken (literal string)
  c. Your NGC organization name
  d. admin

4. What is the purpose of the --gpus all flag when running a Docker container on DGX Spark?

  a. It installs GPU drivers inside the container image
  b. It tells the NVIDIA runtime to expose all available GPUs to the container
  c. It enables CPU-based GPU emulation for testing
  d. It restricts the container to use only GPU memory, not system RAM

5. In a Docker Compose file for DGX Spark, how are GPU resources specified for a service?

  a. Using the gpus: all top-level key
  b. Using deploy.resources.reservations.devices with driver nvidia and capabilities [gpu]
  c. Adding --gpus all to the command field
  d. GPU access is automatic in Docker Compose and needs no configuration

NGC Container Registry

The NGC container registry at nvcr.io hosts hundreds of pre-built, GPU-optimized container images. These include framework containers (PyTorch, TensorFlow, JAX), application containers (Triton, RAPIDS), and model containers (NIM microservices). Each image is tested on NVIDIA hardware.

Setting up NGC access:

  1. Generate an API key at ngc.nvidia.com → Setup → API Key
  2. Authenticate Docker: docker login nvcr.io (username: $oauthtoken, password: your API key)
  3. Install NGC CLI from the ARM64 Linux tab (required for Grace CPU architecture)
  4. Pull containers: docker pull nvcr.io/nvidia/pytorch:24.08-py3

NGC Container Workflow on DGX Spark

[Diagram: four-step NGC container workflow on DGX Spark]

  1. NGC Registry (nvcr.io): authenticate with an API key
  2. docker pull: download a GPU-optimized container image
  3. Container Toolkit: GPU passthrough via --gpus all
  4. Running container: full CUDA access on the Grace Blackwell GPU (nvidia-smi works inside)

NVIDIA Container Toolkit: GPU Passthrough

The NVIDIA Container Toolkit installs OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers without requiring these to be baked into the image. On DGX Spark, the toolkit is pre-installed and pre-configured.

# Run GPU-enabled container
docker run -it --gpus all nvcr.io/nvidia/pytorch:24.08-py3

# Specify individual GPUs
docker run -it --gpus '"device=0"' nvcr.io/nvidia/pytorch:24.08-py3

# Verify GPU access inside container
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi

Building Custom Containers

Start from an NVIDIA base image and layer your requirements:

FROM nvcr.io/nvidia/pytorch:24.08-py3

# Install project-specific dependencies
RUN pip install transformers datasets accelerate wandb

# Copy project code
COPY ./src /workspace/src
WORKDIR /workspace
CMD ["python", "src/train.py"]

Build and run: docker build -t my-training:v1 . then docker run --gpus all -v /data:/data my-training:v1

Container Orchestration Patterns

Two orchestration patterns are common on DGX Spark: Docker Compose, which declares multi-container stacks with GPU reservations per service, and Kubernetes, which adds scheduling, scaling, and health-based traffic routing on top of the same container images.
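For the Docker Compose pattern, GPU access is declared under deploy.resources rather than with a --gpus flag; a minimal sketch (the service name, image tag, and volume path are illustrative):

```yaml
services:
  train:
    image: nvcr.io/nvidia/pytorch:24.08-py3
    volumes:
      - /data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Running `docker compose up` then gives the service the same GPU visibility that `docker run --gpus all` provides.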

Key Takeaways

Post-Quiz: NGC Containers & Workflows

1. What does the NVIDIA Container Toolkit provide that standard Docker does not?

  a. Network isolation between containers
  b. OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers
  c. Automatic container image compression for faster pulls
  d. Built-in container orchestration with load balancing

2. What is the recommended approach for building a custom AI container on DGX Spark?

  a. Start from a minimal Alpine Linux image and install CUDA from scratch
  b. Start from an NVIDIA base image (e.g., nvcr.io/nvidia/pytorch) and layer project-specific dependencies
  c. Copy the host system's /usr/local/cuda directory into the container
  d. Use a Windows container with CUDA support for maximum compatibility

3. When authenticating Docker with the NGC registry, what value is used as the username?

  a. Your NVIDIA developer account email
  b. $oauthtoken (literal string)
  c. Your NGC organization name
  d. admin

4. What is the purpose of the --gpus all flag when running a Docker container on DGX Spark?

  a. It installs GPU drivers inside the container image
  b. It tells the NVIDIA runtime to expose all available GPUs to the container
  c. It enables CPU-based GPU emulation for testing
  d. It restricts the container to use only GPU memory, not system RAM

5. In a Docker Compose file for DGX Spark, how are GPU resources specified for a service?

  a. Using the gpus: all top-level key
  b. Using deploy.resources.reservations.devices with driver nvidia and capabilities [gpu]
  c. Adding --gpus all to the command field
  d. GPU access is automatic in Docker Compose and needs no configuration

Section 4: NVIDIA NIM Microservices & AI Enterprise Stack

Pre-Quiz: NIM Microservices & AI Enterprise

1. What is the key advantage of NIM microservices for model deployment?

  a. NIM trains models faster by using distributed computing automatically
  b. NIM converts model deployment from a systems engineering challenge into a container orchestration task
  c. NIM eliminates the need for GPUs during inference by using CPU-only optimization
  d. NIM provides a graphical interface for non-technical users to deploy models

2. How does Triton Inference Server differ from NIM in its approach to model serving?

  a. Triton only supports TensorRT models, while NIM supports all frameworks
  b. Triton provides multi-model, multi-framework serving with fine-grained scheduling, while NIM packages single models as turnkey APIs
  c. Triton is for training only, while NIM is for inference only
  d. There is no practical difference; they are the same tool with different names

3. What does Triton's dynamic batching feature accomplish?

  a. It splits large models across multiple GPUs automatically
  b. It automatically groups incoming requests to maximize GPU throughput
  c. It dynamically adjusts model precision based on available memory
  d. It batches model updates to reduce deployment downtime

4. What does the NVIDIA AI Enterprise license provide beyond the open-source stack?

  a. Access to CUDA and cuDNN, which are not available in the open-source stack
  b. Enterprise support with SLA, CVE response guarantees, full NIM catalog, and pre-built Blueprints
  c. Higher GPU clock speeds unlocked through a software license key
  d. Access to x86_64 emulation for running legacy workloads on ARM64

5. In a production observability stack for AI services on DGX Spark, what role does the /v2/health/ready endpoint serve?

  a. It triggers automatic model retraining when accuracy drops
  b. It enables load balancers and Kubernetes to route traffic away from unhealthy instances
  c. It exposes GPU temperature data for thermal throttling alerts
  d. It provides a web dashboard for real-time inference visualization

NIM Microservices: Models as API Endpoints

NVIDIA NIM provides prebuilt, optimized containers that package foundation models as API endpoints. Each NIM container includes the model weights, an inference engine (typically TensorRT-LLM), and an OpenAI-compatible API server.

# Pull and run a NIM container
docker pull nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
docker run --gpus all -p 8000:8000 nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Query the model via OpenAI-compatible API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Explain GPU memory hierarchy."}]}'

NIM handles model optimization internally — batch sizes, KV-cache memory management, and TensorRT-LLM optimizations. The key advantage: if you can run a Docker container, you can serve a production-grade LLM.
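The same request can be issued from Python. This sketch only builds the payload and extracts the reply from a sample response body (the response text is invented), so it runs without a live NIM endpoint; swap the sample for a real HTTP call in practice:

```python
import json

def chat_payload(model, prompt):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-compatible response."""
    return response_json["choices"][0]["message"]["content"]

payload = chat_payload("meta/llama-3.1-8b-instruct", "Explain GPU memory hierarchy.")
print(json.dumps(payload))

# Illustrative response body (invented content, OpenAI-compatible shape)
sample_response = {"choices": [{"message": {"role": "assistant",
                                            "content": "Registers, then shared memory, ..."}}]}
print(extract_reply(sample_response))
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can also be pointed at the local endpoint by changing only the base URL.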

NIM Microservice Request Flow

[Diagram: request flow inside a NIM container]

  1. Request: the client app sends an HTTP POST to the OpenAI-compatible API server
  2. Tokenize: the tokenizer converts the prompt into tokens
  3. Infer: the TensorRT-LLM engine runs the model weights in GPU memory, managing the KV-cache
  4. Stream: response tokens are returned to the client as JSON

NVIDIA AI Enterprise Platform

NVIDIA AI Enterprise is the commercial software layer providing enterprise support, security certifications, API stability guarantees, and validated upgrade paths. DGX Spark includes an AI Enterprise license, unlocking NIM microservices, enterprise support, and NVIDIA Blueprints.

| Capability | Open-Source Stack | AI Enterprise |
|---|---|---|
| CUDA/cuDNN | Included | Included |
| NGC Containers | Public catalog | Full catalog + enterprise images |
| NIM Microservices | Community models | Full model catalog + support |
| Security | Community patches | CVE response SLA |
| Support | Forums | Enterprise support with SLA |
| Blueprints | Not available | Pre-built reference architectures |

Triton Inference Server: Multi-Model Serving

Triton Inference Server serves multiple models simultaneously with fine-grained control over scheduling, batching, and resource allocation. It supports TensorFlow SavedModels, PyTorch TorchScript, ONNX, TensorRT engines, and Python-based models through a unified interface.

Triton serves models via HTTP (port 8000), gRPC (port 8001), and metrics (port 8002).
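Dynamic batching is enabled per model in its config.pbtxt. A minimal sketch for a hypothetical ONNX classifier (the model name, batch sizes, and queue delay are illustrative, not tuned values):

```
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

With this in place, Triton holds incoming requests for up to the queue delay and merges them into a single GPU batch, trading a small latency bound for much higher throughput.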

flowchart TD
    A["Client Requests"] --> B["Triton Inference Server"]
    B --> C["Dynamic Batching\nEngine"]
    C --> D["Text Classifier\nONNX Model"]
    C --> E["Image Encoder\nTensorRT Engine"]
    C --> F["LLM Service\nPython Backend"]
    B --> G["HTTP Port 8000"]
    B --> H["gRPC Port 8001"]
    B --> I["Prometheus Metrics\nPort 8002"]
    I --> J["Grafana\nDashboard"]

Monitoring and Observability

Production AI services require continuous monitoring across four layers:

  1. GPU metrics: nvidia-smi and DCGM export utilization, memory, temperature, error counts
  2. Inference metrics: Triton exposes latency, throughput, queue depth via Prometheus
  3. Container metrics: Docker/Kubernetes provide CPU, memory, network I/O data
  4. Application logs: Structured logging from NIM and Triton for debugging and auditing

Health check endpoints (/v2/health/ready) enable load balancers and Kubernetes to automatically route traffic away from unhealthy instances.

Key Takeaways

Post-Quiz: NIM Microservices & AI Enterprise

1. What is the key advantage of NIM microservices for model deployment?

  a. NIM trains models faster by using distributed computing automatically
  b. NIM converts model deployment from a systems engineering challenge into a container orchestration task
  c. NIM eliminates the need for GPUs during inference by using CPU-only optimization
  d. NIM provides a graphical interface for non-technical users to deploy models

2. How does Triton Inference Server differ from NIM in its approach to model serving?

  a. Triton only supports TensorRT models, while NIM supports all frameworks
  b. Triton provides multi-model, multi-framework serving with fine-grained scheduling, while NIM packages single models as turnkey APIs
  c. Triton is for training only, while NIM is for inference only
  d. There is no practical difference; they are the same tool with different names

3. What does Triton's dynamic batching feature accomplish?

  a. It splits large models across multiple GPUs automatically
  b. It automatically groups incoming requests to maximize GPU throughput
  c. It dynamically adjusts model precision based on available memory
  d. It batches model updates to reduce deployment downtime

4. What does the NVIDIA AI Enterprise license provide beyond the open-source stack?

  a. Access to CUDA and cuDNN, which are not available in the open-source stack
  b. Enterprise support with SLA, CVE response guarantees, full NIM catalog, and pre-built Blueprints
  c. Higher GPU clock speeds unlocked through a software license key
  d. Access to x86_64 emulation for running legacy workloads on ARM64

5. In a production observability stack for AI services on DGX Spark, what role does the /v2/health/ready endpoint serve?

  a. It triggers automatic model retraining when accuracy drops
  b. It enables load balancers and Kubernetes to route traffic away from unhealthy instances
  c. It exposes GPU temperature data for thermal throttling alerts
  d. It provides a web dashboard for real-time inference visualization


Answer Explanations