Chapter 2: DGX OS Software Stack, CUDA Toolkit & Containerized AI Workflows

Learning Objectives

Section 1: DGX OS — Ubuntu Linux with NVIDIA Optimization

Pre-Quiz: DGX OS Fundamentals

1. What distinguishes DGX OS GPU driver installation from a standard Ubuntu GPU setup?

  a. DGX OS requires manual DKMS compilation after each kernel update
  b. DGX OS uses a proprietary non-Linux kernel for GPU support
  c. DGX OS ships GPU drivers pre-integrated as .deb packages without DKMS, eliminating rebuild steps
  d. DGX OS only supports GPU access through containers, not natively

2. What is the primary role of DCGM (Data Center GPU Manager) on DGX Spark?

  a. It replaces nvidia-smi as the only GPU monitoring tool
  b. It provides programmatic GPU telemetry, health checks, and integration with monitoring dashboards like Prometheus
  c. It manages Docker container networking for GPU workloads
  d. It controls GPU clock speeds and voltage for overclocking

3. Why does DGX Spark ship with a custom NVIDIA kernel rather than the stock Ubuntu kernel?

  a. The stock kernel cannot boot on ARM64 hardware
  b. The custom kernel is tuned for Grace Blackwell unified memory, NVLink, and GPU scheduling
  c. NVIDIA legally cannot distribute the standard Ubuntu kernel
  d. The custom kernel removes all networking support to improve GPU performance

4. How does DGX Spark handle multi-user GPU workload isolation?

  a. Each user receives a physically separate GPU partition at the hardware level
  b. Only one user can log in at a time to prevent conflicts
  c. GPU access is controlled at the container level using the --gpus flag for Docker containers
  d. A hypervisor creates virtual GPUs for each user session

5. Which diagnostic approach correctly represents the layered diagnostics model on DGX Spark?

  a. Run nvidia-smi only; it covers all diagnostic needs
  b. Hardware layer (nvidia-smi -q), driver layer (dmesg | grep nvidia), application layer (CUDA sample programs)
  c. Check Docker logs first, then reboot the system if errors persist
  d. Run the management dashboard exclusively; command-line tools are deprecated

DGX OS Base System: Ubuntu LTS with NVIDIA Kernel Modules

DGX OS is built on Ubuntu 24.04 LTS but is far more than a stock installation. NVIDIA ships a custom kernel (6.17.0-1014-nvidia) alongside a Hardware Enablement (HWE) kernel 6.14, both tuned for the Grace Blackwell architecture's unified memory, NVLink interconnects, and GPU scheduling requirements.

The GPU driver package — the nvidia-580-open series — is delivered as a .deb package without DKMS (Dynamic Kernel Module Support). This eliminates the fragile driver rebuild step that plagues standard Linux GPU setups. Think of it this way: if standard Ubuntu with manually installed drivers is like assembling furniture from parts, DGX OS is the factory-assembled version.

| Component | Standard Ubuntu | DGX OS |
|---|---|---|
| Kernel | Generic Linux kernel | Custom NVIDIA kernel (6.17.0-1014-nvidia) |
| GPU Drivers | Manually installed, DKMS-rebuilt | Pre-integrated, .deb packaged, no DKMS |
| AI Libraries | User-installed | Pre-configured CUDA, cuDNN, TensorRT |
| Container Runtime | Docker only | Docker + NVIDIA Container Toolkit |

DGX OS Software Stack Layers

[Diagram: DGX OS software stack layers, bottom to top]

  1. Grace Blackwell hardware: ARM64 CPU + Blackwell GPU + unified memory
  2. DGX OS (Ubuntu 24.04 LTS): custom NVIDIA kernel 6.17.0-1014-nvidia
  3. Pre-integrated GPU drivers: nvidia-580-open, .deb packaged, no DKMS
  4. CUDA Toolkit: nvcc, cuBLAS, cuFFT, cuSPARSE
  5. AI libraries: cuDNN, TensorRT, NCCL
  6. Container Toolkit (GPU passthrough runtime) and NGC Registry (optimized AI containers)
  7. AI applications: PyTorch, TensorFlow, JAX, NIM, Triton

GPU Monitoring with nvidia-smi and DCGM

The moment DGX Spark boots, the GPU driver stack is operational. The primary monitoring interface is nvidia-smi, which reports GPU utilization, memory, temperature, power draw, and running processes. For continuous monitoring, nvidia-smi dmon streams metrics at configurable intervals.
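The same query flags make nvidia-smi scriptable. As a minimal sketch, assuming the `--query-gpu`/`--format=csv,noheader,nounits` output format (the device name and numbers in the sample line are invented, not captured from real hardware), a parser might look like:

```python
# Fields requested via:
#   nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu \
#              --format=csv,noheader,nounits
FIELDS = ["name", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]

def parse_smi_csv(text):
    """Parse CSV lines (one per GPU) into dicts keyed by the queried field names."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(FIELDS, values)))
    return gpus

# Illustrative sample output (made up)
sample = "NVIDIA GB10, 87, 61440, 131072, 71"
for gpu in parse_smi_csv(sample):
    print(f"{gpu['name']}: {gpu['utilization.gpu']}% util, "
          f"{gpu['memory.used']}/{gpu['memory.total']} MiB, {gpu['temperature.gpu']} C")
```

A cron job or exporter sidecar can run the query on an interval and feed the parsed dicts to whatever alerting path the team already uses.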

Beyond nvidia-smi, DCGM (Data Center GPU Manager) provides programmatic access to GPU telemetry, health checks, and policy-based monitoring suitable for integration with Prometheus and Grafana dashboards.
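DCGM's exporter publishes telemetry in the Prometheus text format, which is what makes the Grafana integration straightforward. A sketch of scraping that format, assuming a dcgm-exporter-style endpoint (the sample scrape below is invented; the metric values are not from real hardware):

```python
def parse_prom(text):
    """Parse Prometheus exposition-format lines into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_labels, value = line.rsplit(" ", 1)
        name = name_labels.split("{", 1)[0]  # drop the {gpu="0"} label set
        metrics[name] = float(value)
    return metrics

# Illustrative scrape of a DCGM exporter endpoint (values made up)
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 71
"""
print(parse_prom(sample))
```

In practice Prometheus itself does the scraping; a hand-rolled parser like this is only useful for quick scripts and smoke tests.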

System Management and Diagnostics

Diagnostics follow a layered approach:

  1. Hardware layer: nvidia-smi -q reports ECC memory errors, PCIe link speed, thermal throttling
  2. Driver layer: dmesg | grep nvidia surfaces kernel-level driver messages
  3. Application layer: CUDA sample programs (deviceQuery, bandwidthTest) verify end-to-end functionality

Multi-User Access and GPU Isolation

DGX Spark supports teams through standard Linux multi-user capabilities enhanced for GPU workload isolation. GPU access is controlled at the container level — each user's Docker containers receive dedicated GPU resources through the --gpus flag. JupyterLab, pre-installed on DGX Spark, provides browser-based development access with per-user sessions.

Key Takeaways

Post-Quiz: DGX OS Fundamentals

1. What distinguishes DGX OS GPU driver installation from a standard Ubuntu GPU setup?

  a. DGX OS requires manual DKMS compilation after each kernel update
  b. DGX OS uses a proprietary non-Linux kernel for GPU support
  c. DGX OS ships GPU drivers pre-integrated as .deb packages without DKMS, eliminating rebuild steps
  d. DGX OS only supports GPU access through containers, not natively

2. What is the primary role of DCGM (Data Center GPU Manager) on DGX Spark?

  a. It replaces nvidia-smi as the only GPU monitoring tool
  b. It provides programmatic GPU telemetry, health checks, and integration with monitoring dashboards like Prometheus
  c. It manages Docker container networking for GPU workloads
  d. It controls GPU clock speeds and voltage for overclocking

3. Why does DGX Spark ship with a custom NVIDIA kernel rather than the stock Ubuntu kernel?

  a. The stock kernel cannot boot on ARM64 hardware
  b. The custom kernel is tuned for Grace Blackwell unified memory, NVLink, and GPU scheduling
  c. NVIDIA legally cannot distribute the standard Ubuntu kernel
  d. The custom kernel removes all networking support to improve GPU performance

4. How does DGX Spark handle multi-user GPU workload isolation?

  a. Each user receives a physically separate GPU partition at the hardware level
  b. Only one user can log in at a time to prevent conflicts
  c. GPU access is controlled at the container level using the --gpus flag for Docker containers
  d. A hypervisor creates virtual GPUs for each user session

5. Which diagnostic approach correctly represents the layered diagnostics model on DGX Spark?

  a. Run nvidia-smi only; it covers all diagnostic needs
  b. Hardware layer (nvidia-smi -q), driver layer (dmesg | grep nvidia), application layer (CUDA sample programs)
  c. Check Docker logs first, then reboot the system if errors persist
  d. Run the management dashboard exclusively; command-line tools are deprecated

Section 2: CUDA Toolkit & Core AI Development Libraries

Pre-Quiz: CUDA Toolkit & AI Libraries

1. What does cuDNN provide that the base CUDA toolkit does not?

  a. A GPU compiler for .cu source files
  b. GPU-accelerated deep learning primitives like convolutions, pooling, and normalization
  c. Container runtime hooks for GPU passthrough
  d. A web-based dashboard for monitoring training progress

2. Why must software compiled natively for DGX Spark target ARM64 rather than x86_64?

  a. The Blackwell GPU only supports ARM instruction sets
  b. DGX Spark uses the Grace CPU which is an ARM64 (AArch64) processor
  c. ARM64 is required by the NVIDIA Container Toolkit licensing terms
  d. x86_64 binaries are automatically translated, but ARM64 is faster

3. What is TensorRT's primary function in the AI deployment pipeline?

  a. Training neural networks from scratch with automatic hyperparameter tuning
  b. Converting trained models into optimized inference engines with layer fusion and precision calibration
  c. Managing CUDA toolkit version compatibility across different GPU architectures
  d. Distributing training data across multiple nodes in a cluster

4. How does NCCL improve multi-GPU training performance on DGX Spark?

  a. It compresses model weights to reduce memory usage per GPU
  b. It routes collective operations (AllReduce, Broadcast) over high-bandwidth NVLink rather than slower PCIe
  c. It automatically partitions the model across GPUs using pipeline parallelism
  d. It replaces cuDNN for communication-heavy operations like attention layers

5. When a developer runs PyTorch on DGX Spark and calls a convolution operation, what library actually performs the GPU computation?

  a. PyTorch's built-in GPU kernels written in Python
  b. cuDNN, which selects the fastest algorithm for the specific tensor dimensions and hardware
  c. TensorRT, which optimizes all operations at runtime
  d. NCCL, which handles all GPU computations including math operations

CUDA Toolkit: Installation, Versioning, and Configuration

The CUDA toolkit comes pre-installed on DGX Spark, verified for Blackwell hardware compatibility. Versions include CUDA 12.8 and CUDA 13.0.2, depending on the release batch. The toolkit includes the nvcc compiler and the core GPU math libraries: cuBLAS, cuFFT, and cuSPARSE.

Environment configuration centers on two variables: PATH must include /usr/local/cuda/bin, and LD_LIBRARY_PATH must include /usr/local/cuda/lib64. On DGX OS, these are set by default.
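If these ever need to be set by hand (for example, in a stripped-down container or CI image), a minimal sketch:

```shell
# Put the CUDA compiler and shared libraries on the search paths.
# DGX OS already configures both; shown for environments that don't.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}

# Confirm the PATH entry landed
echo "$PATH" | grep -q "/usr/local/cuda/bin" && echo "PATH ok"
```

Persist the two exports in ~/.bashrc (or an /etc/profile.d snippet) if a non-DGX environment should keep them across sessions.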

Important: DGX Spark uses the ARM64 (AArch64) architecture via the Grace CPU. Any natively compiled software must target ARM64, and the NGC CLI must be installed from the ARM64 Linux tab.

cuDNN and TensorRT

cuDNN provides GPU-accelerated deep learning primitives: convolutions, pooling, normalization, and activation functions. When PyTorch or TensorFlow execute a convolution, they call cuDNN, which selects the fastest algorithm for the specific tensor dimensions, data type, and hardware.
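For intuition, this is the arithmetic cuDNN is accelerating: a sliding-window "valid" convolution (cross-correlation, as deep learning frameworks define it). The deliberately naive pure-Python sketch below computes the same result; cuDNN's value is choosing a fast GPU algorithm for this math at scale:

```python
def conv2d(image, kernel):
    """Naive 'valid' 2D convolution over nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply-accumulate the kernel over the window at (i, j)
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
edge = [[1, -1]]            # horizontal difference filter
print(conv2d(image, edge))  # [[-1, -1], [-1, -1], [-1, -1]]
```

A real workload replaces this O(n^2 * k^2) loop with cuDNN's implicit-GEMM, Winograd, or FFT-based kernels, selected per tensor shape.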

TensorRT (v10.2 on DGX Spark) is NVIDIA's inference optimization engine. It takes a trained model and produces an optimized execution plan — fusing layers, selecting precision (FP32, FP16, INT8), and calibrating for the target GPU. The workflow: export to ONNX, optimize with trtexec, deploy the .trt engine.
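That workflow can be sketched as two commands; the file names here are placeholders, and the export step runs inside Python:

```shell
# 1. Export the trained model to ONNX (from Python):
#      torch.onnx.export(model, example_input, "model.onnx")
# 2. Build an optimized inference engine with trtexec, FP16 enabled:
trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
```

trtexec also reports layer-by-layer timing during the build, which is useful for spotting which precision choice actually pays off on the target GPU.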

flowchart LR
    A["Trained Model\nPyTorch / TensorFlow"] --> B["Export to ONNX\ntorch.onnx.export()"]
    B --> C["TensorRT Optimizer\ntrtexec"]
    C --> D{"Precision\nSelection"}
    D --> E["FP32\nFull Precision"]
    D --> F["FP16\nHalf Precision"]
    D --> G["INT8\nQuantized"]
    E --> H["Layer Fusion\n& Kernel Selection"]
    F --> H
    G --> H
    H --> I["Optimized TensorRT\nEngine (.trt)"]
    I --> J["Deploy for\nInference"]
| Library | Purpose | When You Use It |
|---|---|---|
| CUDA Toolkit | GPU computation platform | Compiling custom CUDA kernels |
| cuDNN | DL operation primitives | Automatically via PyTorch/TensorFlow |
| TensorRT | Inference optimization | Deploying models to production |
| cuBLAS | Linear algebra on GPU | Matrix ops, automatically via frameworks |
| NCCL | Multi-GPU communication | Distributed training (automatic) |

NCCL and Multi-GPU Communication

NCCL (pronounced "Nickel") handles data transfer between multiple GPUs. It orchestrates operations like AllReduce, Broadcast, and AllGather across GPUs connected via NVLink, exploiting high-bandwidth topology rather than slower PCIe. PyTorch's DistributedDataParallel and TensorFlow's tf.distribute.Strategy call NCCL automatically.
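Conceptually, AllReduce leaves every rank holding the elementwise sum of all ranks' buffers. A toy pure-Python model of that end state (NCCL's contribution is performing this over NVLink with bandwidth-optimal ring or tree schedules, not the arithmetic itself):

```python
def all_reduce_sum(buffers):
    """Toy AllReduce: every rank ends with the elementwise sum across all ranks."""
    reduced = [sum(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]  # each rank gets its own copy

# Gradients held by 4 ranks before synchronization (illustrative values)
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(all_reduce_sum(grads))  # every rank now holds [16.0, 20.0]
```

In data-parallel training this summed gradient (usually divided by world size) is what each rank applies in its optimizer step, keeping all model replicas identical.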

Framework Integration: PyTorch, TensorFlow, JAX

DGX Spark ships with pre-installed versions of major AI frameworks, each compiled against the system's CUDA, cuDNN, and NCCL versions. The pre-installed versions are matched and tested, avoiding version compatibility headaches. For different framework versions, NGC containers provide isolated environments with their own stack.

# Verify GPU availability
# PyTorch
import torch
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # NVIDIA Blackwell ...

# TensorFlow
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

# JAX
import jax
print(jax.devices())

Key Takeaways

Post-Quiz: CUDA Toolkit & AI Libraries

1. What does cuDNN provide that the base CUDA toolkit does not?

  a. A GPU compiler for .cu source files
  b. GPU-accelerated deep learning primitives like convolutions, pooling, and normalization
  c. Container runtime hooks for GPU passthrough
  d. A web-based dashboard for monitoring training progress

2. Why must software compiled natively for DGX Spark target ARM64 rather than x86_64?

  a. The Blackwell GPU only supports ARM instruction sets
  b. DGX Spark uses the Grace CPU which is an ARM64 (AArch64) processor
  c. ARM64 is required by the NVIDIA Container Toolkit licensing terms
  d. x86_64 binaries are automatically translated, but ARM64 is faster

3. What is TensorRT's primary function in the AI deployment pipeline?

  a. Training neural networks from scratch with automatic hyperparameter tuning
  b. Converting trained models into optimized inference engines with layer fusion and precision calibration
  c. Managing CUDA toolkit version compatibility across different GPU architectures
  d. Distributing training data across multiple nodes in a cluster

4. How does NCCL improve multi-GPU training performance on DGX Spark?

  a. It compresses model weights to reduce memory usage per GPU
  b. It routes collective operations (AllReduce, Broadcast) over high-bandwidth NVLink rather than slower PCIe
  c. It automatically partitions the model across GPUs using pipeline parallelism
  d. It replaces cuDNN for communication-heavy operations like attention layers

5. When a developer runs PyTorch on DGX Spark and calls a convolution operation, what library actually performs the GPU computation?

  a. PyTorch's built-in GPU kernels written in Python
  b. cuDNN, which selects the fastest algorithm for the specific tensor dimensions and hardware
  c. TensorRT, which optimizes all operations at runtime
  d. NCCL, which handles all GPU computations including math operations

Section 3: NGC Container Registry & Containerized Workflows

Pre-Quiz: NGC Containers & Workflows

1. What does the NVIDIA Container Toolkit provide that standard Docker does not?

  a. Network isolation between containers
  b. OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers
  c. Automatic container image compression for faster pulls
  d. Built-in container orchestration with load balancing

2. What is the recommended approach for building a custom AI container on DGX Spark?

  a. Start from a minimal Alpine Linux image and install CUDA from scratch
  b. Start from an NVIDIA base image (e.g., nvcr.io/nvidia/pytorch) and layer project-specific dependencies
  c. Copy the host system's /usr/local/cuda directory into the container
  d. Use a Windows container with CUDA support for maximum compatibility

3. When authenticating Docker with the NGC registry, what value is used as the username?

  a. Your NVIDIA developer account email
  b. $oauthtoken (literal string)
  c. Your NGC organization name
  d. admin

4. What is the purpose of the --gpus all flag when running a Docker container on DGX Spark?

  a. It installs GPU drivers inside the container image
  b. It tells the NVIDIA runtime to expose all available GPUs to the container
  c. It enables CPU-based GPU emulation for testing
  d. It restricts the container to use only GPU memory, not system RAM

5. In a Docker Compose file for DGX Spark, how are GPU resources specified for a service?

  a. Using the gpus: all top-level key
  b. Using deploy.resources.reservations.devices with driver nvidia and capabilities [gpu]
  c. Adding --gpus all to the command field
  d. GPU access is automatic in Docker Compose and needs no configuration

NGC Container Registry

The NGC container registry at nvcr.io hosts hundreds of pre-built, GPU-optimized container images. These include framework containers (PyTorch, TensorFlow, JAX), application containers (Triton, RAPIDS), and model containers (NIM microservices). Each image is tested on NVIDIA hardware.

Setting up NGC access:

  1. Generate an API key at ngc.nvidia.com → Setup → API Key
  2. Authenticate Docker: docker login nvcr.io (username: $oauthtoken, password: your API key)
  3. Install NGC CLI from the ARM64 Linux tab (required for Grace CPU architecture)
  4. Pull containers: docker pull nvcr.io/nvidia/pytorch:24.08-py3

NGC Container Workflow on DGX Spark

[Diagram: four-step NGC container workflow on DGX Spark]

  1. NGC Registry (nvcr.io): authenticate with an API key
  2. docker pull: download a GPU-optimized container image
  3. Container Toolkit: GPU passthrough via --gpus all
  4. Running container: full CUDA access on the Grace Blackwell GPU (nvidia-smi works inside)

NVIDIA Container Toolkit: GPU Passthrough

The NVIDIA Container Toolkit installs OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers without requiring these to be baked into the image. On DGX Spark, the toolkit is pre-installed and pre-configured.

# Run GPU-enabled container
docker run -it --gpus all nvcr.io/nvidia/pytorch:24.08-py3

# Specify individual GPUs
docker run -it --gpus '"device=0"' nvcr.io/nvidia/pytorch:24.08-py3

# Verify GPU access inside container
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi

Building Custom Containers

Start from an NVIDIA base image and layer your requirements:

FROM nvcr.io/nvidia/pytorch:24.08-py3

# Install project-specific dependencies
RUN pip install transformers datasets accelerate wandb

# Copy project code
COPY ./src /workspace/src
WORKDIR /workspace
CMD ["python", "src/train.py"]

Build and run: docker build -t my-training:v1 . then docker run --gpus all -v /data:/data my-training:v1

Container Orchestration Patterns

Two orchestration patterns are common on DGX Spark: Docker Compose, which declares multi-container stacks with GPU reservations per service, and Kubernetes, which adds scheduling, scaling, and health-based traffic routing on top of the same container images.
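For the Docker Compose pattern, GPU access is declared under deploy.resources rather than with a --gpus flag; a minimal sketch (the service name, image tag, and volume path are illustrative):

```yaml
services:
  train:
    image: nvcr.io/nvidia/pytorch:24.08-py3
    volumes:
      - /data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Running `docker compose up` then gives the service the same GPU visibility that `docker run --gpus all` provides.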

Key Takeaways

Post-Quiz: NGC Containers & Workflows

1. What does the NVIDIA Container Toolkit provide that standard Docker does not?

  a. Network isolation between containers
  b. OCI runtime hooks that expose GPU drivers, CUDA libraries, and device files inside containers
  c. Automatic container image compression for faster pulls
  d. Built-in container orchestration with load balancing

2. What is the recommended approach for building a custom AI container on DGX Spark?

  a. Start from a minimal Alpine Linux image and install CUDA from scratch
  b. Start from an NVIDIA base image (e.g., nvcr.io/nvidia/pytorch) and layer project-specific dependencies
  c. Copy the host system's /usr/local/cuda directory into the container
  d. Use a Windows container with CUDA support for maximum compatibility

3. When authenticating Docker with the NGC registry, what value is used as the username?

  a. Your NVIDIA developer account email
  b. $oauthtoken (literal string)
  c. Your NGC organization name
  d. admin

4. What is the purpose of the --gpus all flag when running a Docker container on DGX Spark?

  a. It installs GPU drivers inside the container image
  b. It tells the NVIDIA runtime to expose all available GPUs to the container
  c. It enables CPU-based GPU emulation for testing
  d. It restricts the container to use only GPU memory, not system RAM

5. In a Docker Compose file for DGX Spark, how are GPU resources specified for a service?

  a. Using the gpus: all top-level key
  b. Using deploy.resources.reservations.devices with driver nvidia and capabilities [gpu]
  c. Adding --gpus all to the command field
  d. GPU access is automatic in Docker Compose and needs no configuration

Section 4: NVIDIA NIM Microservices & AI Enterprise Stack

Pre-Quiz: NIM Microservices & AI Enterprise

1. What is the key advantage of NIM microservices for model deployment?

  a. NIM trains models faster by using distributed computing automatically
  b. NIM converts model deployment from a systems engineering challenge into a container orchestration task
  c. NIM eliminates the need for GPUs during inference by using CPU-only optimization
  d. NIM provides a graphical interface for non-technical users to deploy models

2. How does Triton Inference Server differ from NIM in its approach to model serving?

  a. Triton only supports TensorRT models, while NIM supports all frameworks
  b. Triton provides multi-model, multi-framework serving with fine-grained scheduling, while NIM packages single models as turnkey APIs
  c. Triton is for training only, while NIM is for inference only
  d. There is no practical difference; they are the same tool with different names

3. What does Triton's dynamic batching feature accomplish?

  a. It splits large models across multiple GPUs automatically
  b. It automatically groups incoming requests to maximize GPU throughput
  c. It dynamically adjusts model precision based on available memory
  d. It batches model updates to reduce deployment downtime

4. What does the NVIDIA AI Enterprise license provide beyond the open-source stack?

  a. Access to CUDA and cuDNN, which are not available in the open-source stack
  b. Enterprise support with SLA, CVE response guarantees, full NIM catalog, and pre-built Blueprints
  c. Higher GPU clock speeds unlocked through a software license key
  d. Access to x86_64 emulation for running legacy workloads on ARM64

5. In a production observability stack for AI services on DGX Spark, what role does the /v2/health/ready endpoint serve?

  a. It triggers automatic model retraining when accuracy drops
  b. It enables load balancers and Kubernetes to route traffic away from unhealthy instances
  c. It exposes GPU temperature data for thermal throttling alerts
  d. It provides a web dashboard for real-time inference visualization

NIM Microservices: Models as API Endpoints

NVIDIA NIM provides prebuilt, optimized containers that package foundation models as API endpoints. Each NIM container includes the model weights, an inference engine (typically TensorRT-LLM), and an OpenAI-compatible API server.

# Pull and run a NIM container
docker pull nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
docker run --gpus all -p 8000:8000 nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Query the model via OpenAI-compatible API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Explain GPU memory hierarchy."}]}'

NIM handles model optimization internally — batch sizes, KV-cache memory management, and TensorRT-LLM optimizations. The key advantage: if you can run a Docker container, you can serve a production-grade LLM.
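The same request can be issued from Python. This sketch only builds the payload and extracts the reply from a sample response body (the response text is invented), so it runs without a live NIM endpoint; swap the sample for a real HTTP call in practice:

```python
import json

def chat_payload(model, prompt):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-compatible response."""
    return response_json["choices"][0]["message"]["content"]

payload = chat_payload("meta/llama-3.1-8b-instruct", "Explain GPU memory hierarchy.")
print(json.dumps(payload))

# Illustrative response body (invented content, OpenAI-compatible shape)
sample_response = {"choices": [{"message": {"role": "assistant",
                                            "content": "Registers, then shared memory, ..."}}]}
print(extract_reply(sample_response))
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can also be pointed at the local endpoint by changing only the base URL.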

NIM Microservice Request Flow

[Diagram: request flow inside a NIM container]

  1. Request: the client app sends an HTTP POST to the OpenAI-compatible API server
  2. Tokenize: the tokenizer converts the prompt into tokens
  3. Infer: the TensorRT-LLM engine runs the model weights in GPU memory, managing the KV-cache
  4. Stream: response tokens are returned to the client as JSON

NVIDIA AI Enterprise Platform

NVIDIA AI Enterprise is the commercial software layer providing enterprise support, security certifications, API stability guarantees, and validated upgrade paths. DGX Spark includes an AI Enterprise license, unlocking NIM microservices, enterprise support, and NVIDIA Blueprints.

| Capability | Open-Source Stack | AI Enterprise |
|---|---|---|
| CUDA/cuDNN | Included | Included |
| NGC Containers | Public catalog | Full catalog + enterprise images |
| NIM Microservices | Community models | Full model catalog + support |
| Security | Community patches | CVE response SLA |
| Support | Forums | Enterprise support with SLA |
| Blueprints | Not available | Pre-built reference architectures |

Triton Inference Server: Multi-Model Serving

Triton Inference Server serves multiple models simultaneously with fine-grained control over scheduling, batching, and resource allocation. It supports TensorFlow SavedModels, PyTorch TorchScript, ONNX, TensorRT engines, and Python-based models through a unified interface.

Triton serves models via HTTP (port 8000), gRPC (port 8001), and metrics (port 8002).
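Dynamic batching is enabled per model in its config.pbtxt. A minimal sketch for a hypothetical ONNX classifier (the model name, batch sizes, and queue delay are illustrative, not tuned values):

```
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

With this in place, Triton holds incoming requests for up to the queue delay and merges them into a single GPU batch, trading a small latency bound for much higher throughput.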

flowchart TD
    A["Client Requests"] --> B["Triton Inference Server"]
    B --> C["Dynamic Batching\nEngine"]
    C --> D["Text Classifier\nONNX Model"]
    C --> E["Image Encoder\nTensorRT Engine"]
    C --> F["LLM Service\nPython Backend"]
    B --> G["HTTP Port 8000"]
    B --> H["gRPC Port 8001"]
    B --> I["Prometheus Metrics\nPort 8002"]
    I --> J["Grafana\nDashboard"]

Monitoring and Observability

Production AI services require continuous monitoring across four layers:

  1. GPU metrics: nvidia-smi and DCGM export utilization, memory, temperature, error counts
  2. Inference metrics: Triton exposes latency, throughput, queue depth via Prometheus
  3. Container metrics: Docker/Kubernetes provide CPU, memory, network I/O data
  4. Application logs: Structured logging from NIM and Triton for debugging and auditing

Health check endpoints (/v2/health/ready) enable load balancers and Kubernetes to automatically route traffic away from unhealthy instances.

Key Takeaways

Post-Quiz: NIM Microservices & AI Enterprise

1. What is the key advantage of NIM microservices for model deployment?

  a. NIM trains models faster by using distributed computing automatically
  b. NIM converts model deployment from a systems engineering challenge into a container orchestration task
  c. NIM eliminates the need for GPUs during inference by using CPU-only optimization
  d. NIM provides a graphical interface for non-technical users to deploy models

2. How does Triton Inference Server differ from NIM in its approach to model serving?

  a. Triton only supports TensorRT models, while NIM supports all frameworks
  b. Triton provides multi-model, multi-framework serving with fine-grained scheduling, while NIM packages single models as turnkey APIs
  c. Triton is for training only, while NIM is for inference only
  d. There is no practical difference; they are the same tool with different names

3. What does Triton's dynamic batching feature accomplish?

  a. It splits large models across multiple GPUs automatically
  b. It automatically groups incoming requests to maximize GPU throughput
  c. It dynamically adjusts model precision based on available memory
  d. It batches model updates to reduce deployment downtime

4. What does the NVIDIA AI Enterprise license provide beyond the open-source stack?

  a. Access to CUDA and cuDNN, which are not available in the open-source stack
  b. Enterprise support with SLA, CVE response guarantees, full NIM catalog, and pre-built Blueprints
  c. Higher GPU clock speeds unlocked through a software license key
  d. Access to x86_64 emulation for running legacy workloads on ARM64

5. In a production observability stack for AI services on DGX Spark, what role does the /v2/health/ready endpoint serve?

  a. It triggers automatic model retraining when accuracy drops
  b. It enables load balancers and Kubernetes to route traffic away from unhealthy instances
  c. It exposes GPU temperature data for thermal throttling alerts
  d. It provides a web dashboard for real-time inference visualization


Answer Explanations