Chapter 4: Production Deployment -- Inference, Fine-Tuning & Enterprise AI Workflows
Learning Objectives
Deploy large language models (up to 200B+ parameters) for local inference on DGX Spark using quantization and memory optimization
Execute supervised fine-tuning and LoRA/QLoRA adaptation workflows within the 128GB unified memory constraints
Build local Retrieval-Augmented Generation (RAG) pipelines and agentic AI systems on DGX Spark
Evaluate DGX Spark architectural limitations, ARM compatibility challenges, and plan migration paths to datacenter DGX systems
Section 1: Large-Scale Model Inference on a Single Node
Pre-Quiz: Quantization & Engine Configuration
1. A 70B-parameter model stored in FP16 requires approximately 140GB of memory. What is the primary purpose of FP4 quantization on DGX Spark?
To speed up model training by reducing gradient computation
To compress model weights so they fit within the 128GB unified memory
To convert the model from ARM to x86 architecture format
To enable the model to run on CPU instead of GPU
2. What is the key trade-off when choosing TensorRT-LLM over vLLM on DGX Spark?
TensorRT-LLM uses less memory but produces lower-quality outputs
TensorRT-LLM has faster cold starts but lower throughput
TensorRT-LLM delivers higher throughput but has much longer cold start times and higher configuration complexity
TensorRT-LLM is open-source while vLLM is proprietary
3. Which quantization formats does DGX Spark's Blackwell architecture support at 4-bit precision?
INT4 and BF4
NVFP4 and MXFP4
FP4 and INT8
GPTQ and AWQ
4. vLLM uses a memory management technique called PagedAttention. What does this achieve?
It pages model weights to disk when GPU memory is full
It efficiently handles variable-length sequences by dynamically allocating memory rather than reserving fixed blocks
It compiles models into native GPU binaries for faster execution
It splits the model across multiple GPUs automatically
5. Why does the Llama-3.3-70B model achieve only 4.51 tok/s under NVFP4 while GPT-OSS-120B achieves 34.57 tok/s under MXFP4?
The 120B model is smaller when quantized due to its architecture
The difference likely reflects model architecture optimization differences for FP4 inference rather than raw parameter count
NVFP4 is always slower than MXFP4 regardless of model
The 70B model was running on CPU while the 120B model used GPU
Loading and Serving 70B-200B+ Parameter Models with FP4/FP8 Quantization
The central challenge of local inference is fitting the model into available memory. A 70-billion-parameter model stored in FP16 requires approximately 140GB -- already exceeding DGX Spark's 128GB before accounting for activation memory, KV caches, or the operating system. Quantization compresses model weights to lower numerical precision, dramatically reducing memory consumption while preserving most of the model's capability.
| Format | Full Name | Description | Typical Use Case |
|--------|-----------|-------------|------------------|
| NVFP4 | NVIDIA FP4 | NVIDIA's proprietary 4-bit format optimized for Blackwell tensor cores | Single-model deployment with maximum compression |
| MXFP4 | Microscaling FP4 | Industry-standard 4-bit format with per-block scaling factors | Multi-framework compatibility and community models |
| FP8 | 8-bit Floating Point | Higher-fidelity quantization at 2x the memory cost of FP4 | Quality-sensitive tasks where memory allows |
With FP4 quantization, a 70B model shrinks to roughly 35-40GB -- well within the 128GB envelope. Even a 120B-parameter model fits comfortably at approximately 65GB. Models beyond roughly 200B parameters, however, still exceed the single-node ceiling even at FP4.
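The arithmetic behind these figures is simple: parameters times bits per weight, divided by 8 bits per byte, plus some headroom for embedding tables and quantization scales. A back-of-envelope helper (the 10% overhead factor is an assumption, not a measured value):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.10) -> float:
    """Weight-memory estimate: params * bits / 8, with ~10% headroom
    for embedding tables and per-block quantization scales (assumed)."""
    raw_bytes = params_billion * 1e9 * bits_per_weight / 8
    return raw_bytes * overhead / 1e9

print(model_memory_gb(70, 16))   # ~154 GB: FP16 70B exceeds 128GB
print(model_memory_gb(70, 4))    # ~38.5 GB: FP4 fits easily
print(model_memory_gb(120, 4))   # ~66 GB: close to the ~65GB cited above
```

The same formula shows why ~200B is the FP4 ceiling: 200B at 4 bits is ~100GB of weights alone, leaving little room for KV caches and activations.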
vLLM and TensorRT-LLM Engine Configuration
Two inference engines dominate the DGX Spark ecosystem:
vLLM: Open-source, built around PagedAttention, ~62s cold start, straightforward configuration. Known compatibility issues with Blackwell sm_121.
TensorRT-LLM: NVIDIA's proprietary engine, compiles models into optimized execution plans. Cold starts up to 28 minutes, but 10-15% higher throughput at scale.
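Serving a quantized checkpoint with vLLM is a single launch command. A hedged sketch -- the checkpoint name and quantization method are illustrative, and exact flags vary by vLLM version:

```shell
# Serve an FP4-quantized checkpoint with vLLM (illustrative values;
# check your vLLM version's docs for supported quantization methods).
vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 \
  --quantization modelopt \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```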
| Feature | vLLM | TensorRT-LLM |
|---------|------|--------------|
| Cold Start | ~62 seconds | Up to 28 minutes |
| Throughput | Good | Better (10-15% higher) |
| TTFT (p50 at 10 req) | ~120 ms | ~105 ms |
| Configuration Complexity | Low | High |
| Blackwell Compatibility | Known sm_121 issues | Optimized for Blackwell |
[Diagram: LLM Inference Pipeline on DGX Spark]
Throughput Benchmarking
Throughput varies dramatically based on model size, quantization format, and engine choice:
| Model | Params | Engine | Quantization | Decode (tok/s) | Memory (GiB) |
|-------|--------|--------|--------------|----------------|--------------|
| Llama-3.3-70B-Instruct | 70B | vLLM | NVFP4 | 4.51 | 39.8 |
| GPT-OSS-120B | 120B | vLLM | MXFP4 | 34.57 | 65.9 |
| GPT-OSS-120B | 120B | vLLM (TP=2) | MXFP4 | 80.88 | N/A |
| Qwen3.5-35B-A3B (MoE) | 35B | vLLM | MXFP4 | 60-71 | N/A |
| Qwen3-30B | 30B | TRT-LLM | NVFP4 | 39.5 | N/A |
Concurrent Request Handling and Dynamic Batching
DGX Spark supports dynamic batching -- grouping incoming requests so the GPU processes them in parallel. Both vLLM and TensorRT-LLM implement continuous batching, where new requests join an active batch as earlier ones complete. Each concurrent request adds KV-cache overhead, so the 128GB unified memory must accommodate both model weights and all active request state.
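The per-request KV-cache cost can be estimated directly from the model's attention geometry. A rough sketch, using layer and head counts typical of a 70B-class model with grouped-query attention (illustrative numbers, not DGX Spark measurements):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, requests: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len
    elements per request, times the number of concurrent requests."""
    per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return per_request * requests / 1e9

# Illustrative 70B-class geometry: 80 layers, 8 KV heads (GQA), head_dim 128.
# 16 concurrent requests at 4K context with an FP16 KV cache:
print(kv_cache_gb(80, 8, 128, 4096, 16))   # ~21.5 GB on top of the weights
```

This is why continuous batching must be memory-aware: each admitted request claims its own slice of the 128GB budget for as long as it is active.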
Key Takeaway
DGX Spark comfortably runs models of roughly 120B parameters at FP4 quantization, with a practical ceiling near 200B
Throughput ranges from 4 to 80+ tokens/second depending on model architecture, quantization, and engine
TensorRT-LLM delivers higher steady-state performance at the cost of longer cold starts
Tensor parallelism and MoE architectures offer the most effective paths to higher throughput
Post-Quiz: Quantization & Engine Configuration
1. A 70B-parameter model stored in FP16 requires approximately 140GB of memory. What is the primary purpose of FP4 quantization on DGX Spark?
To speed up model training by reducing gradient computation
To compress model weights so they fit within the 128GB unified memory
To convert the model from ARM to x86 architecture format
To enable the model to run on CPU instead of GPU
2. What is the key trade-off when choosing TensorRT-LLM over vLLM on DGX Spark?
TensorRT-LLM uses less memory but produces lower-quality outputs
TensorRT-LLM has faster cold starts but lower throughput
TensorRT-LLM delivers higher throughput but has much longer cold start times and higher configuration complexity
TensorRT-LLM is open-source while vLLM is proprietary
3. Which quantization formats does DGX Spark's Blackwell architecture support at 4-bit precision?
INT4 and BF4
NVFP4 and MXFP4
FP4 and INT8
GPTQ and AWQ
4. vLLM uses a memory management technique called PagedAttention. What does this achieve?
It pages model weights to disk when GPU memory is full
It efficiently handles variable-length sequences by dynamically allocating memory rather than reserving fixed blocks
It compiles models into native GPU binaries for faster execution
It splits the model across multiple GPUs automatically
5. Why does the Llama-3.3-70B model achieve only 4.51 tok/s under NVFP4 while GPT-OSS-120B achieves 34.57 tok/s under MXFP4?
The 120B model is smaller when quantized due to its architecture
The difference likely reflects model architecture optimization differences for FP4 inference rather than raw parameter count
NVFP4 is always slower than MXFP4 regardless of model
The 70B model was running on CPU while the 120B model used GPU
Section 2: Local Fine-Tuning & Model Adaptation
Pre-Quiz: Fine-Tuning & Model Adaptation
1. What is the key advantage of QLoRA over standard LoRA for fine-tuning on DGX Spark?
QLoRA trains faster because it uses fewer parameters
QLoRA quantizes the base model to 4-bit, reducing memory usage by approximately 60% compared to LoRA
QLoRA produces higher-quality results than full supervised fine-tuning
QLoRA supports tensor parallelism while LoRA does not
2. Why is gradient accumulation essential for fine-tuning large models on memory-constrained systems like DGX Spark?
It eliminates the need for a GPU by accumulating work on the CPU
It simulates larger effective batch sizes by accumulating gradients across multiple micro-batches before updating weights
It speeds up training by skipping backward passes on some batches
It reduces model size by accumulating only the most important gradients
3. In LoRA fine-tuning, what happens to the original base model weights?
They are deleted and replaced by the LoRA adapter weights
They are updated at a slower learning rate than the adapters
They are frozen (not updated), and small trainable adapter matrices are injected into each layer
They are compressed to 1-bit precision to make room for adapters
4. What is the primary limitation of QLoRA compared to LoRA on DGX Spark?
QLoRA cannot fine-tune models larger than 8B parameters
QLoRA runs 50-200% slower due to dequantization overhead and does not support tensor/sequence parallelism
QLoRA requires a separate x86 server for the quantization step
QLoRA adapters cannot be merged back into the base model
5. What is gradient checkpointing, and what trade-off does it make?
It saves gradients to disk periodically; trades disk space for GPU memory
It discards intermediate activations during forward pass and recomputes them during backprop; trades compute time for memory savings
It checkpoints the entire model to resume from crashes; trades training speed for reliability
It only computes gradients for a subset of layers; trades model quality for speed
Supervised Fine-Tuning Within 128GB Unified Memory
Supervised fine-tuning (SFT) trains a pre-existing model on labeled input-output pairs from your domain. On DGX Spark, the primary constraint is fitting model weights, optimizer states, gradients, and activations within 128GB.
Critical configuration parameters for memory-constrained training:
Micro-batch size 1: The smallest possible batch, typically the only feasible setting for 70B models
Packed sequences: Concatenates multiple short training examples into a single sequence, eliminating wasted padding tokens
Gradient checkpointing: Trades ~30% longer training time for 50-70% memory reduction
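With micro-batch size 1, gradient accumulation recovers a larger effective batch by summing gradients across several micro-batches before each optimizer update. A framework-free sketch of the control flow (a real run would use PyTorch tensors, not scalars):

```python
def train_with_accumulation(batches, grad_fn, accum_steps, lr=0.1):
    """Sum scaled gradients over accum_steps micro-batches, then apply
    one optimizer update -- effective batch = micro_batch * accum_steps."""
    weight, grad_sum, updates = 0.0, 0.0, 0
    for i, batch in enumerate(batches, start=1):
        grad_sum += grad_fn(weight, batch) / accum_steps  # scale so the sum averages
        if i % accum_steps == 0:
            weight -= lr * grad_sum   # the only optimizer step in the cycle
            grad_sum = 0.0
            updates += 1
    return weight, updates

# 8 micro-batches accumulated in groups of 4 -> just 2 optimizer updates.
w, n_updates = train_with_accumulation(range(8), lambda w, b: 1.0, accum_steps=4)
```

Memory stays at the micro-batch-1 level throughout; only the accumulated gradient buffer persists between micro-batches.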
LoRA and QLoRA Parameter-Efficient Fine-Tuning
LoRA freezes original model weights and injects small trainable matrices (adapters) into each layer -- training only 0.1-1% of total parameters. QLoRA adds 4-bit quantization of base weights, enabling 70B model fine-tuning within 128GB.
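The "0.1-1% of total parameters" figure follows directly from the adapter shapes: for a square d x d weight matrix, LoRA trains only a rank x d matrix A and a d x rank matrix B. A quick check:

```python
def lora_param_fraction(d_model: int, rank: int) -> float:
    """Trainable fraction for one square d_model x d_model weight:
    LoRA trains A (rank x d) and B (d x rank); the base stays frozen."""
    trainable = 2 * d_model * rank
    return trainable / (d_model * d_model)

frac = lora_param_fraction(4096, 16)   # 0.0078125 -> about 0.78% of weights
```

Raising the rank trades memory for adapter capacity linearly, which is why rank is the first knob to tune when a LoRA run underfits.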
[Diagram: LoRA / QLoRA Fine-Tuning Architecture]
| Approach | Memory for 70B Model | Training Speed | Quality vs Full SFT |
|----------|----------------------|----------------|---------------------|
| Full SFT | Exceeds 128GB | Baseline | Best |
| LoRA | ~80-100GB | ~1.5x faster | Near-baseline |
| QLoRA | ~40-68GB | 50-200% slower than LoRA | Slightly lower |
NeMo Framework & Checkpointing
The NeMo AutoModel container provides pre-configured Docker-based workflows for all fine-tuning approaches. Checkpointing saves model state periodically -- for QLoRA, checkpoints are typically under 1GB (adapter weights only). Best practice: checkpoint every 50-100 training steps.
Key Takeaway
QLoRA makes 70B-class fine-tuning feasible on DGX Spark using ~40-68GB of the 128GB unified memory
Full SFT is limited to smaller models (8B-30B range) on DGX Spark
NeMo AutoModel provides turnkey Docker-based workflow for all approaches
Typical QLoRA fine-tuning session: 45-90 minutes for 70B models
Post-Quiz: Fine-Tuning & Model Adaptation
1. What is the key advantage of QLoRA over standard LoRA for fine-tuning on DGX Spark?
QLoRA trains faster because it uses fewer parameters
QLoRA quantizes the base model to 4-bit, reducing memory usage by approximately 60% compared to LoRA
QLoRA produces higher-quality results than full supervised fine-tuning
QLoRA supports tensor parallelism while LoRA does not
2. Why is gradient accumulation essential for fine-tuning large models on memory-constrained systems like DGX Spark?
It eliminates the need for a GPU by accumulating work on the CPU
It simulates larger effective batch sizes by accumulating gradients across multiple micro-batches before updating weights
It speeds up training by skipping backward passes on some batches
It reduces model size by accumulating only the most important gradients
3. In LoRA fine-tuning, what happens to the original base model weights?
They are deleted and replaced by the LoRA adapter weights
They are updated at a slower learning rate than the adapters
They are frozen (not updated), and small trainable adapter matrices are injected into each layer
They are compressed to 1-bit precision to make room for adapters
4. What is the primary limitation of QLoRA compared to LoRA on DGX Spark?
QLoRA cannot fine-tune models larger than 8B parameters
QLoRA runs 50-200% slower due to dequantization overhead and does not support tensor/sequence parallelism
QLoRA requires a separate x86 server for the quantization step
QLoRA adapters cannot be merged back into the base model
5. What is gradient checkpointing, and what trade-off does it make?
It saves gradients to disk periodically; trades disk space for GPU memory
It discards intermediate activations during forward pass and recomputes them during backprop; trades compute time for memory savings
It checkpoints the entire model to resume from crashes; trades training speed for reliability
It only computes gradients for a subset of layers; trades model quality for speed
Section 3: RAG Pipelines & Agentic AI Systems
Pre-Quiz: RAG & Agentic AI
1. In a RAG pipeline on DGX Spark, what roles do the Grace CPU and Blackwell GPU play respectively?
Grace CPU runs the LLM while Blackwell GPU handles document storage
Grace CPU handles text embedding operations while Blackwell GPU accelerates LLM inference for response generation
Both processors share the LLM workload equally through tensor parallelism
Grace CPU manages network requests while Blackwell GPU processes embeddings
2. A practical RAG pipeline on DGX Spark using LLaMA 3.1 8B with E5-base-v2 embeddings consumes approximately how much memory?
64 GiB -- about half the available memory
~13 GiB -- a small fraction of the 128GB available
~100 GiB -- near the maximum capacity
~35 GiB -- about a quarter of available memory
3. Why is document chunking strategy the most impactful design decision in a RAG system?
Because chunk size directly determines the LLM's maximum output length
Because it affects how well retrieved passages match queries -- too large dilutes relevance, too small loses context
Because larger chunks always produce better results since they contain more information
Because chunking determines which embedding model can be used
4. What are the two critical advantages of running agentic AI workflows locally on DGX Spark?
Lower cost and faster model training
Data privacy (sensitive data never leaves the machine) and latency control (no network round-trips)
Better model quality and larger context windows
Automatic scaling and load balancing
5. In a hybrid DGX Spark + cloud architecture, what is the primary benefit of developing RAG pipelines locally first?
Local development is cheaper than cloud because DGX Spark has no electricity cost
The same NVIDIA software stack runs on both, so validated pipelines migrate without "works on my machine" issues
Cloud providers do not support RAG pipelines, so local development is required
DGX Spark's GPU is faster than cloud GPUs for embedding generation
Building Local RAG Systems
Retrieval-Augmented Generation (RAG) customizes what a model can access at query time by retrieving relevant documents from a knowledge base and injecting them into the model's context window. DGX Spark's heterogeneous architecture naturally maps to the RAG pipeline: Grace CPU handles embedding, Blackwell GPU handles generation.
| Component | Role | Example Tools |
|-----------|------|---------------|
| Embedding Model | Converts text to vectors for similarity search | E5-base-v2, NVIDIA Nemotron |
| Vector Database | Stores and indexes document embeddings | FAISS, Milvus, ElasticSearch |
| Language Model | Generates responses using retrieved context | LLaMA 3.1 8B, Qwen |
| Orchestration | Manages the query-retrieve-generate pipeline | LangChain, LlamaIndex |
[Diagram: RAG Pipeline Flow on DGX Spark]
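The query-retrieve-generate flow can be illustrated end to end with a toy bag-of-words embedder standing in for E5-base-v2 and a plain list standing in for the vector database:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would use E5-base-v2.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = ["DGX Spark has 128GB unified memory",
        "LoRA freezes base weights and trains adapters"]
context = retrieve("how much unified memory does DGX Spark have", docs)
prompt = f"Context: {context[0]}\nQuestion: how much unified memory?"
```

In production the `embed` step runs on the Grace CPU, the index lives in FAISS or Milvus, and `prompt` is handed to the LLM on the Blackwell GPU.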
Document Ingestion and Chunking Strategies
| Chunking Strategy | Chunk Size | Best For | Trade-off |
|-------------------|------------|----------|-----------|
| Fixed-size | 256-512 tokens | General-purpose | Simple but may split concepts |
| Semantic | Variable | Technical docs | Better coherence, more complex |
| Recursive | 512-1024 tokens | Hierarchical docs | Preserves structure |
| Sentence-window | 1-3 sentences + context | Precision queries | High accuracy, larger index |
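Fixed-size chunking with overlap is only a few lines; the overlap keeps a concept split at one boundary intact in the neighboring chunk. A minimal sketch (token lists stand in for real tokenizer output):

```python
def chunk_fixed(tokens: list[str], size: int = 256,
                overlap: int = 32) -> list[list[str]]:
    """Fixed-size chunks with overlap, so a concept split at one boundary
    still appears whole in the neighboring chunk. Assumes size > overlap."""
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Ten tokens in chunks of 4 with overlap 1: [0-3], [3-6], [6-9]
chunks = chunk_fixed([str(i) for i in range(10)], size=4, overlap=1)
```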
Agentic AI Frameworks
Agentic AI extends RAG into multi-step reasoning systems that can plan, use tools, and iteratively refine answers. On DGX Spark, agentic workflows run entirely locally, providing data privacy (sensitive data never leaves the machine) and latency control (no network round-trips).
A typical agentic architecture layers tool use, chain-of-thought reasoning, memory/state management, and retrieval integration. DGX Spark's 128GB unified memory can hold the LLM, embedding model, vector index, tool definitions, conversation state, and intermediate results simultaneously.
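The plan-act-observe cycle described above reduces to a small loop. A toy sketch -- the `llm` policy and `lookup` tool are stand-ins, not a real framework API:

```python
def run_agent(question, tools, llm, max_steps=5):
    """Minimal plan-act-observe loop. llm(state) returns either
    ("tool", name, args) or ("answer", text); tools maps names to callables."""
    state = [("question", question)]
    for _ in range(max_steps):
        action = llm(state)
        if action[0] == "answer":
            return action[1]
        _, name, args = action
        state.append(("observation", tools[name](*args)))
    return None   # step budget exhausted without an answer

# Toy policy: look the fact up once, then answer from the observation.
def toy_llm(state):
    if state[-1][0] == "question":
        return ("tool", "lookup", ("unified memory",))
    return ("answer", state[-1][1])

tools = {"lookup": lambda key: {"unified memory": "128GB"}[key]}
print(run_agent("How much memory?", tools, toy_llm))   # 128GB
```

The `state` list is the conversation/memory layer; on DGX Spark it lives in the same unified memory as the model and vector index, so no network hop separates reasoning from retrieval.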
Hybrid Architectures
The most pragmatic deployment pattern treats DGX Spark as the development and small-scale production tier. RAG pipelines are validated locally, then deployed to datacenter DGX systems when serving requirements exceed single-node capacity. The same NVIDIA software stack (NeMo, TensorRT-LLM, NIM containers) runs identically on both.
Key Takeaway
A practical RAG setup consumes as little as 13 GiB of the 128GB available memory
Grace CPU + Blackwell GPU architecture naturally maps to the embedding + generation pipeline
Agentic AI benefits from local execution: data privacy and deterministic latency
Hybrid develop-local, deploy-cloud is the recommended enterprise pattern
Post-Quiz: RAG & Agentic AI
1. In a RAG pipeline on DGX Spark, what roles do the Grace CPU and Blackwell GPU play respectively?
Grace CPU runs the LLM while Blackwell GPU handles document storage
Grace CPU handles text embedding operations while Blackwell GPU accelerates LLM inference for response generation
Both processors share the LLM workload equally through tensor parallelism
Grace CPU manages network requests while Blackwell GPU processes embeddings
2. A practical RAG pipeline on DGX Spark using LLaMA 3.1 8B with E5-base-v2 embeddings consumes approximately how much memory?
64 GiB -- about half the available memory
~13 GiB -- a small fraction of the 128GB available
~100 GiB -- near the maximum capacity
~35 GiB -- about a quarter of available memory
3. Why is document chunking strategy the most impactful design decision in a RAG system?
Because chunk size directly determines the LLM's maximum output length
Because it affects how well retrieved passages match queries -- too large dilutes relevance, too small loses context
Because larger chunks always produce better results since they contain more information
Because chunking determines which embedding model can be used
4. What are the two critical advantages of running agentic AI workflows locally on DGX Spark?
Lower cost and faster model training
Data privacy (sensitive data never leaves the machine) and latency control (no network round-trips)
Better model quality and larger context windows
Automatic scaling and load balancing
5. In a hybrid DGX Spark + cloud architecture, what is the primary benefit of developing RAG pipelines locally first?
Local development is cheaper than cloud because DGX Spark has no electricity cost
The same NVIDIA software stack runs on both, so validated pipelines migrate without "works on my machine" issues
Cloud providers do not support RAG pipelines, so local development is required
DGX Spark's GPU is faster than cloud GPUs for embedding generation
Section 4: Platform Limitations & Migration Planning
Pre-Quiz: Limitations & Migration
1. What is the primary recommendation for avoiding ARM/x86 compatibility issues when developing on DGX Spark?
Only use Python scripts and avoid any compiled code
Develop inside NVIDIA-provided multi-arch containers that abstract away architecture differences
Use a translation layer to convert ARM instructions to x86 in real-time
Maintain separate codebases for ARM and x86 targets
2. At what point should you plan migration from DGX Spark to datacenter DGX systems?
As soon as you begin using quantization, since it indicates you need more memory
When models exceed 120B parameters, concurrent users exceed 5-10, or full SFT is needed on 30B+ models
Only when NVIDIA discontinues DGX Spark support
When your dataset exceeds 1GB in size
3. Why can QLoRA not use tensor parallelism or sequence parallelism on DGX Spark?
DGX Spark hardware does not support any parallelism at all
The NeMo QLoRA implementation does not support tensor or sequence parallelism; only multi-GPU data parallelism is available
QLoRA models are too small to benefit from parallelism
Tensor parallelism is incompatible with 4-bit quantization on any hardware
4. What is the workaround for TensorRT-LLM's extremely long cold start times (up to 28 minutes)?
Switch to vLLM exclusively since TensorRT-LLM is too slow to be practical
Pre-build optimized engines and cache them to NVMe storage for reuse
Keep the GPU running at full power continuously to prevent cold starts
Use a smaller model that loads faster
5. What is the expected trajectory for future personal DGX systems based on NVIDIA's product evolution?
Future systems will likely abandon unified memory in favor of discrete HBM
Future systems will likely offer 256-512GB unified memory, closing the bandwidth gap with datacenter GPUs
NVIDIA plans to discontinue personal DGX systems in favor of cloud-only offerings
Future systems will switch from ARM to x86 to eliminate compatibility issues
ARM Architecture Software Incompatibilities
DGX Spark uses the Grace CPU (ARM/AArch64) rather than x86-64. While ARM support in ML has improved dramatically, incompatibilities persist with custom C/C++ extensions, pre-built Python wheels, and Docker images that only target x86.
The practical recommendation: develop inside NVIDIA-provided containers that abstract away architecture differences, ensuring the same multi-arch container image runs on both ARM and x86 targets.
| Category | Status on ARM/DGX Spark | Workaround |
|----------|-------------------------|------------|
| PyTorch / TensorFlow | Fully supported | Use official NVIDIA Docker images |
| vLLM | Known sm_121 issues | Pin to tested versions |
| Custom C/C++ Extensions | May need recompilation | Rebuild with ARM64 toolchain |
| Pre-built Python Wheels | Some x86-only | Build from source or conda-forge |
| Docker Images | Must use ARM64/multi-arch | Check image manifests before pulling |
Scalability Ceiling: When to Graduate
| Dimension | DGX Spark | DGX H100 (8-GPU) | DGX B200 (8-GPU) |
|-----------|-----------|------------------|------------------|
| GPU Memory | 128GB unified | 640GB HBM3 | 1.5TB+ HBM3e |
| Max Model (FP4) | ~120-200B | ~1T+ | ~2T+ |
| Concurrent Users | 1-5 (typical) | 50-500+ | 100-1000+ |
| Fine-Tuning Scale | QLoRA up to 70B | Full SFT up to 400B+ | Full SFT up to 1T+ |
| Use Case | Dev, prototyping | Department production | Enterprise-scale |
Plan migration when:
Your model exceeds 120B parameters at required precision
Concurrent user demand exceeds 5-10 simultaneous requests
Full SFT is required on models larger than 30B
Training time is measured in days rather than hours
Continuous serving alongside training creates unacceptable memory contention
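These criteria can be encoded as a simple checklist function; the thresholds below mirror the bullets above and are a planning heuristic, not official NVIDIA guidance:

```python
def should_migrate(model_params_b: float, concurrent_users: int,
                   needs_full_sft_over_30b: bool, training_days: float,
                   memory_contention: bool) -> list[str]:
    """Return the graduation criteria (from the list above) that apply."""
    reasons = []
    if model_params_b > 120:
        reasons.append("model exceeds ~120B at required precision")
    if concurrent_users > 10:
        reasons.append("demand exceeds 5-10 concurrent users")
    if needs_full_sft_over_30b:
        reasons.append("full SFT needed on models larger than 30B")
    if training_days >= 1:
        reasons.append("training time measured in days")
    if memory_contention:
        reasons.append("serving alongside training causes memory contention")
    return reasons

should_migrate(70, 3, False, 0.5, False)    # [] -> stay on DGX Spark
should_migrate(130, 50, True, 2, True)      # five reasons -> graduate
```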
Software Stack Maturity Gaps
vLLM Blackwell compatibility: sm_121 edge cases. Workaround: use TensorRT-LLM or pin vLLM versions.
TensorRT-LLM cold starts: 4-28 minutes. Workaround: pre-build engines and cache to NVMe.
QLoRA parallelism: No tensor/sequence parallelism. Workaround: use LoRA when parallelism is required.
Long-context inference: 32K+ tokens on 70B+ models may exhaust memory. Workaround: sliding window attention or reduced max sequence length.
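The engine-caching workaround amounts to check-then-build. A minimal sketch, with `build_fn` standing in for the actual multi-minute TensorRT-LLM engine build (which this sketch does not invoke):

```python
import tempfile
from pathlib import Path

def get_engine(cache_dir: Path, model_id: str, build_fn):
    """Reuse a prebuilt engine directory if present; otherwise pay the
    multi-minute build exactly once and cache it (on NVMe in practice)."""
    engine_dir = cache_dir / model_id.replace("/", "_")
    if not engine_dir.exists():
        engine_dir.mkdir(parents=True)
        build_fn(engine_dir)          # stand-in for the real engine build
    return engine_dir

# Demo: the stand-in build function runs only on the first request.
builds = []
with tempfile.TemporaryDirectory() as tmp:
    get_engine(Path(tmp), "example/model-70b", builds.append)
    get_engine(Path(tmp), "example/model-70b", builds.append)
```

After the first build, cold start shrinks to the time needed to load the cached engine from NVMe rather than recompiling it.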
NVIDIA Roadmap
Future personal DGX systems are expected to offer 256-512GB unified memory, improved memory bandwidth, better ARM64 ecosystem support, and higher-bandwidth multi-node interconnects. DGX Spark skills and workflows developed today will transfer directly to more capable future hardware.