Chapter 4: Production Deployment: Inference, Fine-Tuning & Enterprise AI Workflows

Learning Objectives

By the end of this chapter, you will be able to:

  1. Configure FP4/FP8 quantized inference for 70B+ parameter models using vLLM and TensorRT-LLM
  2. Fine-tune large models within 128GB unified memory using LoRA and QLoRA
  3. Build local RAG pipelines and agentic AI workflows on DGX Spark
  4. Recognize the platform's limitations and plan migration to datacenter DGX systems

Section 1: Large-Scale Model Inference on a Single Node

Pre-Quiz: Quantization & Engine Configuration

1. A 70B-parameter model stored in FP16 requires approximately 140GB of memory. What is the primary purpose of FP4 quantization on DGX Spark?

To speed up model training by reducing gradient computation
To compress model weights so they fit within the 128GB unified memory
To convert the model from ARM to x86 architecture format
To enable the model to run on CPU instead of GPU

2. What is the key trade-off when choosing TensorRT-LLM over vLLM on DGX Spark?

TensorRT-LLM uses less memory but produces lower-quality outputs
TensorRT-LLM has faster cold starts but lower throughput
TensorRT-LLM delivers higher throughput but has much longer cold start times and higher configuration complexity
TensorRT-LLM is open-source while vLLM is proprietary

3. Which quantization formats does DGX Spark's Blackwell architecture support at 4-bit precision?

INT4 and BF4
NVFP4 and MXFP4
FP4 and INT8
GPTQ and AWQ

4. vLLM uses a memory management technique called PagedAttention. What does this achieve?

It pages model weights to disk when GPU memory is full
It efficiently handles variable-length sequences by dynamically allocating memory rather than reserving fixed blocks
It compiles models into native GPU binaries for faster execution
It splits the model across multiple GPUs automatically

5. Why does the Llama-3.3-70B model achieve only 4.51 tok/s under NVFP4 while GPT-OSS-120B achieves 34.57 tok/s under MXFP4?

The 120B model is smaller when quantized due to its architecture
The difference likely reflects model architecture optimization differences for FP4 inference rather than raw parameter count
NVFP4 is always slower than MXFP4 regardless of model
The 70B model was running on CPU while the 120B model used GPU

Loading and Serving 70B-200B+ Parameter Models with FP4/FP8 Quantization

The central challenge of local inference is fitting the model into available memory. A 70-billion-parameter model stored in FP16 requires approximately 140GB -- already exceeding DGX Spark's 128GB before accounting for activation memory, KV caches, or the OS. Quantization compresses model weights to lower numerical precision, dramatically reducing memory consumption while preserving most of the model's intelligence.

| Format | Full Name | Description | Typical Use Case |
|---|---|---|---|
| NVFP4 | NVIDIA FP4 | NVIDIA's proprietary 4-bit format optimized for Blackwell tensor cores | Single-model deployment with maximum compression |
| MXFP4 | Microscaling FP4 | Industry-standard 4-bit format with per-block scaling factors | Multi-framework compatibility and community models |
| FP8 | 8-bit Floating Point | Higher-fidelity quantization at 2x the memory cost of FP4 | Quality-sensitive tasks where memory allows |

With FP4 quantization, a 70B model shrinks to roughly 35-40GB -- well within the 128GB envelope. Even a 120B parameter model fits comfortably at approximately 65GB. However, models exceeding roughly 200B parameters at FP4 still exceed the single-node ceiling.
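The memory arithmetic above can be sketched with a small helper. This is a rough estimate of weight storage only; KV-caches, activations, and runtime overhead (plus FP4's per-block scaling factors, which is why the text quotes 35-40GB rather than a flat 35GB) also consume part of the 128GB budget:

```python
# Rough weight-memory estimate per precision. Ignores KV-cache,
# activations, framework overhead, and quantization scale factors.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(70, "fp16"))  # -> 140.0, exceeds 128GB unified memory
print(weight_memory_gb(70, "fp4"))   # -> 35.0, fits comfortably
print(weight_memory_gb(120, "fp4"))  # -> 60.0
```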

vLLM and TensorRT-LLM Engine Configuration

Two inference engines dominate the DGX Spark ecosystem:

| Feature | vLLM | TensorRT-LLM |
|---|---|---|
| Cold Start | ~62 seconds | Up to 28 minutes |
| Throughput | Good | Better (10-15% higher) |
| TTFT (p50 at 10 req) | ~120 ms | ~105 ms |
| Configuration Complexity | Low | High |
| Blackwell Compatibility | Known sm_121 issues | Optimized for Blackwell |
[Figure: LLM Inference Pipeline on DGX Spark. Input text is tokenized into token IDs; the prefill phase processes all input tokens in parallel (GPU-intensive) to build the initial KV-cache; the decode phase then generates tokens one at a time autoregressively (memory-bound), reading cached attention keys and values to avoid recomputation; finally, output token IDs are detokenized into response text. Quantization (FP16 -> FP4/FP8) shrinks a 70B model from 140GB to 35-40GB.]
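The prefill/decode split can be illustrated with a toy autoregressive loop. Everything here is a deliberately simplified stand-in (1-D "projections", a fake sampling rule); the point is only the KV-cache mechanic: prefill fills the cache for all prompt tokens at once, and each decode step appends exactly one new (key, value) pair instead of recomputing the past:

```python
import math

# Toy KV-cache sketch: prefill processes the whole prompt, decode
# attends over cached keys/values and extends the cache per token.

def project(token_id: int) -> tuple[float, float]:
    """Stand-in for the per-token key/value projections (1-D here)."""
    return math.sin(token_id), math.cos(token_id)

def attend(query: float, kv_cache: list[tuple[float, float]]) -> float:
    """Scalar dot-product attention over the cached keys/values."""
    scores = [math.exp(query * k) for k, _ in kv_cache]
    total = sum(scores)
    return sum(s / total * v for s, (_, v) in zip(scores, kv_cache))

def generate(prompt: list[int], steps: int) -> list[int]:
    kv_cache = [project(t) for t in prompt]   # prefill: parallelizable
    out = list(prompt)
    for _ in range(steps):                    # decode: one token at a time
        ctx = attend(project(out[-1])[0], kv_cache)
        nxt = int(abs(ctx) * 1000) % 50       # fake "sampling" step
        out.append(nxt)
        kv_cache.append(project(nxt))         # cache grows by one entry
    return out

print(generate([1, 2, 3], steps=4))
```

Because decode reads the entire cache for every new token, it is bandwidth-bound rather than compute-bound, which is why the figure labels it memory-bound.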

Throughput Benchmarking

Throughput varies dramatically based on model size, quantization format, and engine choice:

| Model | Params | Engine | Quantization | Decode (tok/s) | Memory (GiB) |
|---|---|---|---|---|---|
| Llama-3.3-70B-Instruct | 70B | vLLM | NVFP4 | 4.51 | 39.8 |
| GPT-OSS-120B | 120B | vLLM | MXFP4 | 34.57 | 65.9 |
| GPT-OSS-120B | 120B | vLLM (TP=2) | MXFP4 | 80.88 | N/A |
| Qwen3.5-35B-A3B (MoE) | 35B | vLLM | MXFP4 | 60-71 | N/A |
| Qwen3-30B | 30B | TRT-LLM | NVFP4 | 39.5 | N/A |

Concurrent Request Handling and Dynamic Batching

DGX Spark supports dynamic batching -- grouping incoming requests so the GPU processes them in parallel. Both vLLM and TensorRT-LLM implement continuous batching, where new requests join an active batch as earlier ones complete. Each concurrent request adds KV-cache overhead, so the 128GB unified memory must accommodate both model weights and all active request state.
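The per-request KV-cache overhead can be budgeted with the standard formula (2 copies per layer, for keys and values). The layer count and head sizes below are illustrative placeholders, not the published configuration of any specific model:

```python
# Back-of-envelope KV-cache budget per concurrent request.
# Default shapes are hypothetical, chosen to resemble a large model.

def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """2x (keys and values) per layer, per cached token, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_request = kv_cache_gb(seq_len=8192)
print(f"{per_request:.2f} GB per 8K-token request")
print(f"{10 * per_request:.1f} GB for 10 concurrent requests")
```

Under these assumptions, ten concurrent 8K-token requests consume roughly 27GB of cache on top of the model weights, which is why concurrency and weight footprint must be planned together.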

Key Takeaway

FP4 quantization is what brings 70B-120B models within DGX Spark's 128GB unified memory. Choose vLLM for fast iteration and low configuration overhead, and TensorRT-LLM when sustained throughput justifies its long engine-build times; in either case, budget KV-cache memory for every concurrent request alongside the model weights.

Post-Quiz: Quantization & Engine Configuration

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 2: Local Fine-Tuning & Model Adaptation

Pre-Quiz: Fine-Tuning & Model Adaptation

1. What is the key advantage of QLoRA over standard LoRA for fine-tuning on DGX Spark?

QLoRA trains faster because it uses fewer parameters
QLoRA quantizes the base model to 4-bit, reducing memory usage by approximately 60% compared to LoRA
QLoRA produces higher-quality results than full supervised fine-tuning
QLoRA supports tensor parallelism while LoRA does not

2. Why is gradient accumulation essential for fine-tuning large models on memory-constrained systems like DGX Spark?

It eliminates the need for a GPU by accumulating work on the CPU
It simulates larger effective batch sizes by accumulating gradients across multiple micro-batches before updating weights
It speeds up training by skipping backward passes on some batches
It reduces model size by accumulating only the most important gradients

3. In LoRA fine-tuning, what happens to the original base model weights?

They are deleted and replaced by the LoRA adapter weights
They are updated at a slower learning rate than the adapters
They are frozen (not updated), and small trainable adapter matrices are injected into each layer
They are compressed to 1-bit precision to make room for adapters

4. What is the primary limitation of QLoRA compared to LoRA on DGX Spark?

QLoRA cannot fine-tune models larger than 8B parameters
QLoRA runs 50-200% slower due to dequantization overhead and does not support tensor/sequence parallelism
QLoRA requires a separate x86 server for the quantization step
QLoRA adapters cannot be merged back into the base model

5. What is gradient checkpointing, and what trade-off does it make?

It saves gradients to disk periodically; trades disk space for GPU memory
It discards intermediate activations during forward pass and recomputes them during backprop; trades compute time for memory savings
It checkpoints the entire model to resume from crashes; trades training speed for reliability
It only computes gradients for a subset of layers; trades model quality for speed

Supervised Fine-Tuning Within 128GB Unified Memory

Supervised fine-tuning (SFT) trains a pre-existing model on labeled input-output pairs from your domain. On DGX Spark, the primary constraint is fitting model weights, optimizer states, gradients, and activations within 128GB.

Critical configuration parameters for memory-constrained training include micro-batch size, gradient accumulation steps (to simulate larger effective batch sizes), gradient checkpointing (recompute activations during backprop instead of storing them, trading compute time for memory), and training precision.
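Gradient accumulation can be sketched on a toy 1-D least-squares problem: gradients from several micro-batches are summed and the weight is updated only once per accumulation window, so the update matches a larger batch while only one micro-batch of activations is ever in memory at a time:

```python
# Minimal gradient-accumulation sketch on y ~= w * x.

def grad(w: float, batch: list[tuple[float, float]]) -> float:
    """d/dw of mean squared error over one micro-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, micro_batch=2, accum_steps=2, lr=0.05, epochs=40):
    w = 0.0
    for _ in range(epochs):
        g, seen = 0.0, 0
        for i in range(0, len(data), micro_batch):
            g += grad(w, data[i:i + micro_batch])  # accumulate, no update yet
            seen += 1
            if seen == accum_steps:                # one update per N micro-batches
                w -= lr * g / accum_steps
                g, seen = 0.0, 0
        if seen:                                   # flush a partial accumulation
            w -= lr * g / seen
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
print(train(data))  # converges toward w = 3
```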

LoRA and QLoRA Parameter-Efficient Fine-Tuning

LoRA freezes original model weights and injects small trainable matrices (adapters) into each layer -- training only 0.1-1% of total parameters. QLoRA adds 4-bit quantization of base weights, enabling 70B model fine-tuning within 128GB.
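The adapter mechanics reduce to h = Wx + (alpha/r) * BAx, which a pure-Python sketch makes concrete. Dimensions here are toy-sized (real adapters typically use rank r = 8-64 per layer), and B is zero-initialized so the adapter starts as an exact no-op, matching the usual LoRA initialization:

```python
# LoRA forward pass: frozen path W x plus scaled low-rank path B(A x).

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16, r=2):
    base = matvec(W, x)                  # frozen pretrained weights
    delta = matvec(B, matvec(A, x))      # trainable low-rank adapters
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

d = 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.0] * d for _ in range(2)]  # r x d down-projection
B = [[0.0] * 2 for _ in range(d)]  # d x r up-projection, zero-initialized
x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x, W, A, B))  # == W x while B is all zeros
```

Only A and B receive gradients during training; W never changes, which is why merged checkpoints can later be produced by folding BA back into W.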

[Figure: LoRA / QLoRA Fine-Tuning Architecture. Input activations x pass through the frozen base weights W (d x d, not updated) and, in parallel, through the LoRA adapters: A (d x r) down-projects to low rank r and B (r x d) up-projects back, with typical rank r = 8-64. The two paths are summed: h = Wx + BAx. Only the adapters -- 0.1-1% of total parameters -- are trainable. QLoRA additionally quantizes the base weights to 4-bit while training adapters at BF16, bringing a 70B model to ~40-68GB.]
| Approach | Memory for 70B Model | Training Speed | Quality vs Full SFT |
|---|---|---|---|
| Full SFT | Exceeds 128GB | Baseline | Best |
| LoRA | ~80-100GB | ~1.5x faster | Near-baseline |
| QLoRA | ~40-68GB | 50-200% slower than LoRA | Slightly lower |

NeMo Framework & Checkpointing

The NeMo AutoModel container provides pre-configured Docker-based workflows for all fine-tuning approaches. Checkpointing saves model state periodically -- for QLoRA, checkpoints are typically under 1GB (adapter weights only). Best practice: checkpoint every 50-100 training steps.
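The checkpoint-every-N-steps practice can be sketched as a plain training loop. The adapter state, file layout, and JSON serialization below are placeholders; a real run would save optimizer state as well and use a tensor format such as safetensors:

```python
import json, os, tempfile

# Periodic adapter checkpointing sketch: persist only the small
# adapter state every N steps, so a crash loses at most N steps.

def train_with_checkpoints(steps: int, every: int, out_dir: str) -> list[str]:
    adapter = {"A.weight": [0.0], "B.weight": [0.0]}  # placeholder state
    saved = []
    for step in range(1, steps + 1):
        adapter["A.weight"][0] += 0.01                # stand-in "update"
        if step % every == 0:
            path = os.path.join(out_dir, f"adapter_step{step}.json")
            with open(path, "w") as f:
                json.dump({"step": step, "state": adapter}, f)
            saved.append(path)
    return saved

with tempfile.TemporaryDirectory() as d:
    paths = train_with_checkpoints(steps=250, every=50, out_dir=d)
    print([os.path.basename(p) for p in paths])
```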

Key Takeaway

Parameter-efficient methods are what make fine-tuning feasible within 128GB: LoRA trains only 0.1-1% of parameters, and QLoRA's 4-bit base weights bring even 70B models in reach, trading 50-200% slower training for roughly 60% memory savings over LoRA. Gradient accumulation, gradient checkpointing, and frequent adapter checkpoints stretch the remaining headroom.

Post-Quiz: Fine-Tuning & Model Adaptation

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 3: RAG Pipelines & Agentic AI Systems

Pre-Quiz: RAG & Agentic AI

1. In a RAG pipeline on DGX Spark, what roles do the Grace CPU and Blackwell GPU play respectively?

Grace CPU runs the LLM while Blackwell GPU handles document storage
Grace CPU handles text embedding operations while Blackwell GPU accelerates LLM inference for response generation
Both processors share the LLM workload equally through tensor parallelism
Grace CPU manages network requests while Blackwell GPU processes embeddings

2. A practical RAG pipeline on DGX Spark using LLaMA 3.1 8B with E5-base-v2 embeddings consumes approximately how much memory?

64 GiB -- about half the available memory
~13 GiB -- a small fraction of the 128GB available
~100 GiB -- near the maximum capacity
~35 GiB -- about a quarter of available memory

3. Why is document chunking strategy the most impactful design decision in a RAG system?

Because chunk size directly determines the LLM's maximum output length
Because it affects how well retrieved passages match queries -- too large dilutes relevance, too small loses context
Because larger chunks always produce better results since they contain more information
Because chunking determines which embedding model can be used

4. What are the two critical advantages of running agentic AI workflows locally on DGX Spark?

Lower cost and faster model training
Data privacy (sensitive data never leaves the machine) and latency control (no network round-trips)
Better model quality and larger context windows
Automatic scaling and load balancing

5. In a hybrid DGX Spark + cloud architecture, what is the primary benefit of developing RAG pipelines locally first?

Local development is cheaper than cloud because DGX Spark has no electricity cost
The same NVIDIA software stack runs on both, so validated pipelines migrate without "works on my machine" issues
Cloud providers do not support RAG pipelines, so local development is required
DGX Spark's GPU is faster than cloud GPUs for embedding generation

Building Local RAG Systems

Retrieval-Augmented Generation (RAG) customizes what a model can access at query time by retrieving relevant documents from a knowledge base and injecting them into the model's context window. DGX Spark's heterogeneous architecture naturally maps to the RAG pipeline: Grace CPU handles embedding, Blackwell GPU handles generation.

| Component | Role | Example Tools |
|---|---|---|
| Embedding Model | Converts text to vectors for similarity search | E5-base-v2, NVIDIA Nemotron |
| Vector Database | Stores and indexes document embeddings | FAISS, Milvus, ElasticSearch |
| Language Model | Generates responses using retrieved context | LLaMA 3.1 8B, Qwen |
| Orchestration | Manages the query-retrieve-generate pipeline | LangChain, LlamaIndex |
[Figure: RAG Pipeline Flow on DGX Spark. Ingestion phase: source documents (PDFs, docs, code) are split into 256-1024-token chunks, embedded with E5-base-v2 on the Grace CPU, and stored in a FAISS index supporting sub-millisecond retrieval. Query phase: the user query is embedded into the same vector space, the top-K most similar chunks are retrieved by cosine similarity, and the LLM on the Blackwell GPU generates a response grounded in the retrieved context. Full RAG pipeline memory usage: ~13 GiB of 128GB available.]
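The query phase reduces to embed, rank by cosine similarity, and prompt. The word-count "embedding" below is a deliberately crude placeholder standing in for a real model such as E5-base-v2, and the final generation step is elided; the retrieval logic itself is the part being shown:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Placeholder word-count 'embedding' (a real pipeline would use
    an embedding model running on the Grace CPU)."""
    return Counter(w.strip(".,?").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "FP4 quantization shrinks a 70B model to roughly 35-40GB.",
    "Gradient accumulation simulates larger effective batch sizes.",
    "vLLM implements continuous batching for concurrent requests.",
]
top = retrieve("How small does quantization make a 70B model?", chunks)
prompt = "Context:\n" + "\n".join(top) + "\nQuestion: ..."  # fed to the LLM
print(top[0])
```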

Document Ingestion and Chunking Strategies

| Chunking Strategy | Chunk Size | Best For | Trade-off |
|---|---|---|---|
| Fixed-size | 256-512 tokens | General-purpose | Simple but may split concepts |
| Semantic | Variable | Technical docs | Better coherence, more complex |
| Recursive | 512-1024 tokens | Hierarchical docs | Preserves structure |
| Sentence-window | 1-3 sentences + context | Precision queries | High accuracy, larger index |
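The fixed-size strategy, extended with a small overlap so concepts split at a boundary still appear whole in one chunk, looks like this. Whitespace words approximate tokens here; a real pipeline would count tokens with the model's tokenizer:

```python
# Fixed-size chunking with overlap. "Tokens" are approximated by
# whitespace-separated words for illustration.

def chunk_fixed(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    words = text.split()
    step = size - overlap                # each chunk starts `step` words later
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = " ".join(f"word{i}" for i in range(600))
chunks = chunk_fixed(doc, size=256, overlap=32)
print(len(chunks), len(chunks[0].split()))  # 3 chunks; the first has 256 words
```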

Agentic AI Frameworks

Agentic AI extends RAG into multi-step reasoning systems that can plan, use tools, and iteratively refine answers. On DGX Spark, agentic workflows run entirely locally, providing data privacy (sensitive data never leaves the machine) and latency control (no network round-trips).

A typical agentic architecture layers tool use, chain-of-thought reasoning, memory/state management, and retrieval integration. DGX Spark's 128GB unified memory can hold the LLM, embedding model, vector index, tool definitions, conversation state, and intermediate results simultaneously.

Hybrid Architectures

The most pragmatic deployment pattern treats DGX Spark as the development and small-scale production tier. RAG pipelines are validated locally, then deployed to datacenter DGX systems when serving requirements exceed single-node capacity. The same NVIDIA software stack (NeMo, TensorRT-LLM, NIM containers) runs identically on both.

Key Takeaway

A complete RAG stack -- embedding model, vector index, and an 8B LLM -- fits in roughly 13 GiB, leaving ample headroom for agentic state. Chunking strategy is the highest-leverage design decision, and pipelines validated locally migrate to datacenter DGX systems on the same NVIDIA software stack.

Post-Quiz: RAG & Agentic AI

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 4: Limitations, Compatibility & Future Migration Paths

Pre-Quiz: Limitations & Migration

1. What is the primary recommendation for avoiding ARM/x86 compatibility issues when developing on DGX Spark?

Only use Python scripts and avoid any compiled code
Develop inside NVIDIA-provided multi-arch containers that abstract away architecture differences
Use a translation layer to convert ARM instructions to x86 in real-time
Maintain separate codebases for ARM and x86 targets

2. At what point should you plan migration from DGX Spark to datacenter DGX systems?

As soon as you begin using quantization, since it indicates you need more memory
When models exceed 120B parameters, concurrent users exceed 5-10, or full SFT is needed on 30B+ models
Only when NVIDIA discontinues DGX Spark support
When your dataset exceeds 1GB in size

3. Why can QLoRA not use tensor parallelism or sequence parallelism on DGX Spark?

DGX Spark hardware does not support any parallelism at all
The NeMo QLoRA implementation does not support tensor or sequence parallelism; only multi-GPU data parallelism is available
QLoRA models are too small to benefit from parallelism
Tensor parallelism is incompatible with 4-bit quantization on any hardware

4. What is the workaround for TensorRT-LLM's extremely long cold start times (up to 28 minutes)?

Switch to vLLM exclusively since TensorRT-LLM is too slow to be practical
Pre-build optimized engines and cache them to NVMe storage for reuse
Keep the GPU running at full power continuously to prevent cold starts
Use a smaller model that loads faster

5. What is the expected trajectory for future personal DGX systems based on NVIDIA's product evolution?

Future systems will likely abandon unified memory in favor of discrete HBM
Future systems will likely offer 256-512GB unified memory, closing the bandwidth gap with datacenter GPUs
NVIDIA plans to discontinue personal DGX systems in favor of cloud-only offerings
Future systems will switch from ARM to x86 to eliminate compatibility issues

ARM Architecture Software Incompatibilities

DGX Spark uses the Grace CPU (ARM/AArch64) rather than x86-64. While ARM support in ML has improved dramatically, incompatibilities persist with custom C/C++ extensions, pre-built Python wheels, and Docker images that only target x86.

The practical recommendation: develop inside NVIDIA-provided containers that abstract away architecture differences, ensuring the same multi-arch container image runs on both ARM and x86 targets.

| Category | Status on ARM/DGX Spark | Workaround |
|---|---|---|
| PyTorch / TensorFlow | Fully supported | Use official NVIDIA Docker images |
| vLLM | Known sm_121 issues | Pin to tested versions |
| Custom C/C++ Extensions | May need recompilation | Rebuild with ARM64 toolchain |
| Pre-built Python Wheels | Some x86-only | Build from source or conda-forge |
| Docker Images | Must use ARM64/multi-arch | Check image manifests before pulling |

Scalability Ceiling: When to Graduate

| Dimension | DGX Spark | DGX H100 (8-GPU) | DGX B200 (8-GPU) |
|---|---|---|---|
| GPU Memory | 128GB unified | 640GB HBM3 | 1.5TB+ HBM3e |
| Max Model (FP4) | ~120-200B | ~1T+ | ~2T+ |
| Concurrent Users | 1-5 (typical) | 50-500+ | 100-1000+ |
| Fine-Tuning Scale | QLoRA up to 70B | Full SFT up to 400B+ | Full SFT up to 1T+ |
| Use Case | Dev, prototyping | Department production | Enterprise-scale |

Plan migration when:

  1. Your model exceeds 120B parameters at required precision
  2. Concurrent user demand exceeds 5-10 simultaneous requests
  3. Full SFT is required on models larger than 30B
  4. Training time is measured in days rather than hours
  5. Continuous serving alongside training creates unacceptable memory contention
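The triggers above can be expressed as a simple checklist function. The thresholds come straight from the list; the parameter names are illustrative:

```python
# Migration checklist: returns the list of triggered reasons (empty
# list means DGX Spark still fits the workload).

def should_migrate(model_params_b: float, concurrent_users: int,
                   full_sft_params_b: float = 0, training_days: float = 0,
                   serving_while_training: bool = False) -> list[str]:
    reasons = []
    if model_params_b > 120:
        reasons.append("model exceeds 120B parameters")
    if concurrent_users > 10:
        reasons.append("concurrent demand exceeds 5-10 requests")
    if full_sft_params_b > 30:
        reasons.append("full SFT needed on a 30B+ model")
    if training_days >= 1:
        reasons.append("training time measured in days")
    if serving_while_training:
        reasons.append("serving/training memory contention")
    return reasons

print(should_migrate(model_params_b=70, concurrent_users=25))
```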

Software Stack Maturity Gaps

The stack is newer on this platform than on datacenter x86 systems: vLLM carries known sm_121 compatibility issues (pin to tested versions), TensorRT-LLM cold starts can reach 28 minutes unless engines are pre-built and cached to NVMe, and NeMo's QLoRA path supports only multi-GPU data parallelism -- tensor and sequence parallelism are unavailable.

NVIDIA Roadmap

Future personal DGX systems are expected to offer 256-512GB unified memory, improved memory bandwidth, better ARM64 ecosystem support, and higher-bandwidth multi-node interconnects. DGX Spark skills and workflows developed today will transfer directly to more capable future hardware.

Key Takeaway

DGX Spark's ceilings are clear: ~120-200B parameters at FP4, 5-10 concurrent users, and QLoRA-scale fine-tuning. Develop inside multi-arch containers, cache TensorRT-LLM engines to NVMe, and plan migration to datacenter DGX systems when those thresholds are crossed -- the skills and workflows transfer directly.

Post-Quiz: Limitations & Migration

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.
