Chapter 4: Production Deployment: Inference, Fine-Tuning & Enterprise AI Workflows

Learning Objectives

By the end of this chapter, you will be able to:

  1. Configure FP4/FP8 quantized inference for 70B+ parameter models using vLLM and TensorRT-LLM
  2. Fine-tune large models within 128GB unified memory using LoRA and QLoRA
  3. Build local RAG pipelines and agentic AI workflows on DGX Spark
  4. Recognize the platform's limitations and plan migration to datacenter DGX systems

Section 1: Large-Scale Model Inference on a Single Node

Pre-Quiz: Quantization & Engine Configuration

1. A 70B-parameter model stored in FP16 requires approximately 140GB of memory. What is the primary purpose of FP4 quantization on DGX Spark?

To speed up model training by reducing gradient computation
To compress model weights so they fit within the 128GB unified memory
To convert the model from ARM to x86 architecture format
To enable the model to run on CPU instead of GPU

2. What is the key trade-off when choosing TensorRT-LLM over vLLM on DGX Spark?

TensorRT-LLM uses less memory but produces lower-quality outputs
TensorRT-LLM has faster cold starts but lower throughput
TensorRT-LLM delivers higher throughput but has much longer cold start times and higher configuration complexity
TensorRT-LLM is open-source while vLLM is proprietary

3. Which quantization formats does DGX Spark's Blackwell architecture support at 4-bit precision?

INT4 and BF4
NVFP4 and MXFP4
FP4 and INT8
GPTQ and AWQ

4. vLLM uses a memory management technique called PagedAttention. What does this achieve?

It pages model weights to disk when GPU memory is full
It efficiently handles variable-length sequences by dynamically allocating memory rather than reserving fixed blocks
It compiles models into native GPU binaries for faster execution
It splits the model across multiple GPUs automatically

5. Why does the Llama-3.3-70B model achieve only 4.51 tok/s under NVFP4 while GPT-OSS-120B achieves 34.57 tok/s under MXFP4?

The 120B model is smaller when quantized due to its architecture
The difference likely reflects model architecture optimization differences for FP4 inference rather than raw parameter count
NVFP4 is always slower than MXFP4 regardless of model
The 70B model was running on CPU while the 120B model used GPU

Loading and Serving 70B-200B+ Parameter Models with FP4/FP8 Quantization

The central challenge of local inference is fitting the model into available memory. A 70-billion-parameter model stored in FP16 requires approximately 140GB -- already exceeding DGX Spark's 128GB before accounting for activation memory, KV caches, or the OS. Quantization compresses model weights to lower numerical precision, dramatically reducing memory consumption while preserving most of the model's intelligence.

| Format | Full Name | Description | Typical Use Case |
|---|---|---|---|
| NVFP4 | NVIDIA FP4 | NVIDIA's proprietary 4-bit format optimized for Blackwell tensor cores | Single-model deployment with maximum compression |
| MXFP4 | Microscaling FP4 | Industry-standard 4-bit format with per-block scaling factors | Multi-framework compatibility and community models |
| FP8 | 8-bit Floating Point | Higher-fidelity quantization at 2x the memory cost of FP4 | Quality-sensitive tasks where memory allows |

With FP4 quantization, a 70B model shrinks to roughly 35-40GB -- well within the 128GB envelope. Even a 120B parameter model fits comfortably at approximately 65GB. However, models exceeding roughly 200B parameters at FP4 still exceed the single-node ceiling.
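The memory arithmetic above can be sketched with a small helper. This is a rough estimate of weight storage only; KV-caches, activations, and runtime overhead (plus FP4's per-block scaling factors, which is why the text quotes 35-40GB rather than a flat 35GB) also consume part of the 128GB budget:

```python
# Rough weight-memory estimate per precision. Ignores KV-cache,
# activations, framework overhead, and quantization scale factors.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(70, "fp16"))  # -> 140.0, exceeds 128GB unified memory
print(weight_memory_gb(70, "fp4"))   # -> 35.0, fits comfortably
print(weight_memory_gb(120, "fp4"))  # -> 60.0
```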

vLLM and TensorRT-LLM Engine Configuration

Two inference engines dominate the DGX Spark ecosystem:

| Feature | vLLM | TensorRT-LLM |
|---|---|---|
| Cold Start | ~62 seconds | Up to 28 minutes |
| Throughput | Good | Better (10-15% higher) |
| TTFT (p50 at 10 req) | ~120 ms | ~105 ms |
| Configuration Complexity | Low | High |
| Blackwell Compatibility | Known sm_121 issues | Optimized for Blackwell |
[Figure: LLM Inference Pipeline on DGX Spark. Input text is tokenized into token IDs; the prefill phase processes all input tokens in parallel (GPU-intensive) to build the initial KV-cache; the decode phase then generates tokens one at a time autoregressively (memory-bound), reading cached attention keys and values to avoid recomputation; finally, output token IDs are detokenized into response text. Quantization (FP16 -> FP4/FP8) shrinks a 70B model from 140GB to 35-40GB.]
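The prefill/decode split can be illustrated with a toy autoregressive loop. Everything here is a deliberately simplified stand-in (1-D "projections", a fake sampling rule); the point is only the KV-cache mechanic: prefill fills the cache for all prompt tokens at once, and each decode step appends exactly one new (key, value) pair instead of recomputing the past:

```python
import math

# Toy KV-cache sketch: prefill processes the whole prompt, decode
# attends over cached keys/values and extends the cache per token.

def project(token_id: int) -> tuple[float, float]:
    """Stand-in for the per-token key/value projections (1-D here)."""
    return math.sin(token_id), math.cos(token_id)

def attend(query: float, kv_cache: list[tuple[float, float]]) -> float:
    """Scalar dot-product attention over the cached keys/values."""
    scores = [math.exp(query * k) for k, _ in kv_cache]
    total = sum(scores)
    return sum(s / total * v for s, (_, v) in zip(scores, kv_cache))

def generate(prompt: list[int], steps: int) -> list[int]:
    kv_cache = [project(t) for t in prompt]   # prefill: parallelizable
    out = list(prompt)
    for _ in range(steps):                    # decode: one token at a time
        ctx = attend(project(out[-1])[0], kv_cache)
        nxt = int(abs(ctx) * 1000) % 50       # fake "sampling" step
        out.append(nxt)
        kv_cache.append(project(nxt))         # cache grows by one entry
    return out

print(generate([1, 2, 3], steps=4))
```

Because decode reads the entire cache for every new token, it is bandwidth-bound rather than compute-bound, which is why the figure labels it memory-bound.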

Throughput Benchmarking

Throughput varies dramatically based on model size, quantization format, and engine choice:

| Model | Params | Engine | Quantization | Decode (tok/s) | Memory (GiB) |
|---|---|---|---|---|---|
| Llama-3.3-70B-Instruct | 70B | vLLM | NVFP4 | 4.51 | 39.8 |
| GPT-OSS-120B | 120B | vLLM | MXFP4 | 34.57 | 65.9 |
| GPT-OSS-120B | 120B | vLLM (TP=2) | MXFP4 | 80.88 | N/A |
| Qwen3.5-35B-A3B (MoE) | 35B | vLLM | MXFP4 | 60-71 | N/A |
| Qwen3-30B | 30B | TRT-LLM | NVFP4 | 39.5 | N/A |

Concurrent Request Handling and Dynamic Batching

DGX Spark supports dynamic batching -- grouping incoming requests so the GPU processes them in parallel. Both vLLM and TensorRT-LLM implement continuous batching, where new requests join an active batch as earlier ones complete. Each concurrent request adds KV-cache overhead, so the 128GB unified memory must accommodate both model weights and all active request state.
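The per-request KV-cache overhead can be budgeted with the standard formula (2 copies per layer, for keys and values). The layer count and head sizes below are illustrative placeholders, not the published configuration of any specific model:

```python
# Back-of-envelope KV-cache budget per concurrent request.
# Default shapes are hypothetical, chosen to resemble a large model.

def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """2x (keys and values) per layer, per cached token, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_request = kv_cache_gb(seq_len=8192)
print(f"{per_request:.2f} GB per 8K-token request")
print(f"{10 * per_request:.1f} GB for 10 concurrent requests")
```

Under these assumptions, ten concurrent 8K-token requests consume roughly 27GB of cache on top of the model weights, which is why concurrency and weight footprint must be planned together.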

Key Takeaway

FP4 quantization is what brings 70B-120B models within DGX Spark's 128GB unified memory. Choose vLLM for fast iteration and low configuration overhead, and TensorRT-LLM when sustained throughput justifies its long engine-build times; in either case, budget KV-cache memory for every concurrent request alongside the model weights.

Post-Quiz: Quantization & Engine Configuration

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 2: Local Fine-Tuning & Model Adaptation

Pre-Quiz: Fine-Tuning & Model Adaptation

1. What is the key advantage of QLoRA over standard LoRA for fine-tuning on DGX Spark?

QLoRA trains faster because it uses fewer parameters
QLoRA quantizes the base model to 4-bit, reducing memory usage by approximately 60% compared to LoRA
QLoRA produces higher-quality results than full supervised fine-tuning
QLoRA supports tensor parallelism while LoRA does not

2. Why is gradient accumulation essential for fine-tuning large models on memory-constrained systems like DGX Spark?

It eliminates the need for a GPU by accumulating work on the CPU
It simulates larger effective batch sizes by accumulating gradients across multiple micro-batches before updating weights
It speeds up training by skipping backward passes on some batches
It reduces model size by accumulating only the most important gradients

3. In LoRA fine-tuning, what happens to the original base model weights?

They are deleted and replaced by the LoRA adapter weights
They are updated at a slower learning rate than the adapters
They are frozen (not updated), and small trainable adapter matrices are injected into each layer
They are compressed to 1-bit precision to make room for adapters

4. What is the primary limitation of QLoRA compared to LoRA on DGX Spark?

QLoRA cannot fine-tune models larger than 8B parameters
QLoRA runs 50-200% slower due to dequantization overhead and does not support tensor/sequence parallelism
QLoRA requires a separate x86 server for the quantization step
QLoRA adapters cannot be merged back into the base model

5. What is gradient checkpointing, and what trade-off does it make?

It saves gradients to disk periodically; trades disk space for GPU memory
It discards intermediate activations during forward pass and recomputes them during backprop; trades compute time for memory savings
It checkpoints the entire model to resume from crashes; trades training speed for reliability
It only computes gradients for a subset of layers; trades model quality for speed

Supervised Fine-Tuning Within 128GB Unified Memory

Supervised fine-tuning (SFT) trains a pre-existing model on labeled input-output pairs from your domain. On DGX Spark, the primary constraint is fitting model weights, optimizer states, gradients, and activations within 128GB.

Critical configuration parameters for memory-constrained training include micro-batch size, gradient accumulation steps (to simulate larger effective batch sizes), gradient checkpointing (recompute activations during backprop instead of storing them, trading compute time for memory), and training precision.
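Gradient accumulation can be sketched on a toy 1-D least-squares problem: gradients from several micro-batches are summed and the weight is updated only once per accumulation window, so the update matches a larger batch while only one micro-batch of activations is ever in memory at a time:

```python
# Minimal gradient-accumulation sketch on y ~= w * x.

def grad(w: float, batch: list[tuple[float, float]]) -> float:
    """d/dw of mean squared error over one micro-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, micro_batch=2, accum_steps=2, lr=0.05, epochs=40):
    w = 0.0
    for _ in range(epochs):
        g, seen = 0.0, 0
        for i in range(0, len(data), micro_batch):
            g += grad(w, data[i:i + micro_batch])  # accumulate, no update yet
            seen += 1
            if seen == accum_steps:                # one update per N micro-batches
                w -= lr * g / accum_steps
                g, seen = 0.0, 0
        if seen:                                   # flush a partial accumulation
            w -= lr * g / seen
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
print(train(data))  # converges toward w = 3
```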

LoRA and QLoRA Parameter-Efficient Fine-Tuning

LoRA freezes original model weights and injects small trainable matrices (adapters) into each layer -- training only 0.1-1% of total parameters. QLoRA adds 4-bit quantization of base weights, enabling 70B model fine-tuning within 128GB.
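The adapter mechanics reduce to h = Wx + (alpha/r) * BAx, which a pure-Python sketch makes concrete. Dimensions here are toy-sized (real adapters typically use rank r = 8-64 per layer), and B is zero-initialized so the adapter starts as an exact no-op, matching the usual LoRA initialization:

```python
# LoRA forward pass: frozen path W x plus scaled low-rank path B(A x).

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16, r=2):
    base = matvec(W, x)                  # frozen pretrained weights
    delta = matvec(B, matvec(A, x))      # trainable low-rank adapters
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

d = 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.0] * d for _ in range(2)]  # r x d down-projection
B = [[0.0] * 2 for _ in range(d)]  # d x r up-projection, zero-initialized
x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x, W, A, B))  # == W x while B is all zeros
```

Only A and B receive gradients during training; W never changes, which is why merged checkpoints can later be produced by folding BA back into W.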

[Figure: LoRA / QLoRA Fine-Tuning Architecture. Input activations x pass through the frozen base weights W (d x d, not updated) and, in parallel, through the LoRA adapters: A (d x r) down-projects to low rank r and B (r x d) up-projects back, with typical rank r = 8-64. The two paths are summed: h = Wx + BAx. Only the adapters -- 0.1-1% of total parameters -- are trainable. QLoRA additionally quantizes the base weights to 4-bit while training adapters at BF16, bringing a 70B model to ~40-68GB.]
| Approach | Memory for 70B Model | Training Speed | Quality vs Full SFT |
|---|---|---|---|
| Full SFT | Exceeds 128GB | Baseline | Best |
| LoRA | ~80-100GB | ~1.5x faster | Near-baseline |
| QLoRA | ~40-68GB | 50-200% slower than LoRA | Slightly lower |

NeMo Framework & Checkpointing

The NeMo AutoModel container provides pre-configured Docker-based workflows for all fine-tuning approaches. Checkpointing saves model state periodically -- for QLoRA, checkpoints are typically under 1GB (adapter weights only). Best practice: checkpoint every 50-100 training steps.
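The checkpoint-every-N-steps practice can be sketched as a plain training loop. The adapter state, file layout, and JSON serialization below are placeholders; a real run would save optimizer state as well and use a tensor format such as safetensors:

```python
import json, os, tempfile

# Periodic adapter checkpointing sketch: persist only the small
# adapter state every N steps, so a crash loses at most N steps.

def train_with_checkpoints(steps: int, every: int, out_dir: str) -> list[str]:
    adapter = {"A.weight": [0.0], "B.weight": [0.0]}  # placeholder state
    saved = []
    for step in range(1, steps + 1):
        adapter["A.weight"][0] += 0.01                # stand-in "update"
        if step % every == 0:
            path = os.path.join(out_dir, f"adapter_step{step}.json")
            with open(path, "w") as f:
                json.dump({"step": step, "state": adapter}, f)
            saved.append(path)
    return saved

with tempfile.TemporaryDirectory() as d:
    paths = train_with_checkpoints(steps=250, every=50, out_dir=d)
    print([os.path.basename(p) for p in paths])
```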

Key Takeaway

Parameter-efficient methods are what make fine-tuning feasible within 128GB: LoRA trains only 0.1-1% of parameters, and QLoRA's 4-bit base weights bring even 70B models in reach, trading 50-200% slower training for roughly 60% memory savings over LoRA. Gradient accumulation, gradient checkpointing, and frequent adapter checkpoints stretch the remaining headroom.

Post-Quiz: Fine-Tuning & Model Adaptation

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 3: RAG Pipelines & Agentic AI Systems

Pre-Quiz: RAG & Agentic AI

1. In a RAG pipeline on DGX Spark, what roles do the Grace CPU and Blackwell GPU play respectively?

Grace CPU runs the LLM while Blackwell GPU handles document storage
Grace CPU handles text embedding operations while Blackwell GPU accelerates LLM inference for response generation
Both processors share the LLM workload equally through tensor parallelism
Grace CPU manages network requests while Blackwell GPU processes embeddings

2. A practical RAG pipeline on DGX Spark using LLaMA 3.1 8B with E5-base-v2 embeddings consumes approximately how much memory?

64 GiB -- about half the available memory
~13 GiB -- a small fraction of the 128GB available
~100 GiB -- near the maximum capacity
~35 GiB -- about a quarter of available memory

3. Why is document chunking strategy the most impactful design decision in a RAG system?

Because chunk size directly determines the LLM's maximum output length
Because it affects how well retrieved passages match queries -- too large dilutes relevance, too small loses context
Because larger chunks always produce better results since they contain more information
Because chunking determines which embedding model can be used

4. What are the two critical advantages of running agentic AI workflows locally on DGX Spark?

Lower cost and faster model training
Data privacy (sensitive data never leaves the machine) and latency control (no network round-trips)
Better model quality and larger context windows
Automatic scaling and load balancing

5. In a hybrid DGX Spark + cloud architecture, what is the primary benefit of developing RAG pipelines locally first?

Local development is cheaper than cloud because DGX Spark has no electricity cost
The same NVIDIA software stack runs on both, so validated pipelines migrate without "works on my machine" issues
Cloud providers do not support RAG pipelines, so local development is required
DGX Spark's GPU is faster than cloud GPUs for embedding generation

Building Local RAG Systems

Retrieval-Augmented Generation (RAG) customizes what a model can access at query time by retrieving relevant documents from a knowledge base and injecting them into the model's context window. DGX Spark's heterogeneous architecture naturally maps to the RAG pipeline: Grace CPU handles embedding, Blackwell GPU handles generation.

| Component | Role | Example Tools |
|---|---|---|
| Embedding Model | Converts text to vectors for similarity search | E5-base-v2, NVIDIA Nemotron |
| Vector Database | Stores and indexes document embeddings | FAISS, Milvus, ElasticSearch |
| Language Model | Generates responses using retrieved context | LLaMA 3.1 8B, Qwen |
| Orchestration | Manages the query-retrieve-generate pipeline | LangChain, LlamaIndex |
[Figure: RAG Pipeline Flow on DGX Spark. Ingestion phase: source documents (PDFs, docs, code) are split into 256-1024-token chunks, embedded with E5-base-v2 on the Grace CPU, and stored in a FAISS index supporting sub-millisecond retrieval. Query phase: the user query is embedded into the same vector space, the top-K most similar chunks are retrieved by cosine similarity, and the LLM on the Blackwell GPU generates a response grounded in the retrieved context. Full RAG pipeline memory usage: ~13 GiB of 128GB available.]
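The query phase reduces to embed, rank by cosine similarity, and prompt. The word-count "embedding" below is a deliberately crude placeholder standing in for a real model such as E5-base-v2, and the final generation step is elided; the retrieval logic itself is the part being shown:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Placeholder word-count 'embedding' (a real pipeline would use
    an embedding model running on the Grace CPU)."""
    return Counter(w.strip(".,?").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "FP4 quantization shrinks a 70B model to roughly 35-40GB.",
    "Gradient accumulation simulates larger effective batch sizes.",
    "vLLM implements continuous batching for concurrent requests.",
]
top = retrieve("How small does quantization make a 70B model?", chunks)
prompt = "Context:\n" + "\n".join(top) + "\nQuestion: ..."  # fed to the LLM
print(top[0])
```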

Document Ingestion and Chunking Strategies

| Chunking Strategy | Chunk Size | Best For | Trade-off |
|---|---|---|---|
| Fixed-size | 256-512 tokens | General-purpose | Simple but may split concepts |
| Semantic | Variable | Technical docs | Better coherence, more complex |
| Recursive | 512-1024 tokens | Hierarchical docs | Preserves structure |
| Sentence-window | 1-3 sentences + context | Precision queries | High accuracy, larger index |
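The fixed-size strategy, extended with a small overlap so concepts split at a boundary still appear whole in one chunk, looks like this. Whitespace words approximate tokens here; a real pipeline would count tokens with the model's tokenizer:

```python
# Fixed-size chunking with overlap. "Tokens" are approximated by
# whitespace-separated words for illustration.

def chunk_fixed(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    words = text.split()
    step = size - overlap                # each chunk starts `step` words later
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = " ".join(f"word{i}" for i in range(600))
chunks = chunk_fixed(doc, size=256, overlap=32)
print(len(chunks), len(chunks[0].split()))  # 3 chunks; the first has 256 words
```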

Agentic AI Frameworks

Agentic AI extends RAG into multi-step reasoning systems that can plan, use tools, and iteratively refine answers. On DGX Spark, agentic workflows run entirely locally, providing data privacy (sensitive data never leaves the machine) and latency control (no network round-trips).

A typical agentic architecture layers tool use, chain-of-thought reasoning, memory/state management, and retrieval integration. DGX Spark's 128GB unified memory can hold the LLM, embedding model, vector index, tool definitions, conversation state, and intermediate results simultaneously.

Hybrid Architectures

The most pragmatic deployment pattern treats DGX Spark as the development and small-scale production tier. RAG pipelines are validated locally, then deployed to datacenter DGX systems when serving requirements exceed single-node capacity. The same NVIDIA software stack (NeMo, TensorRT-LLM, NIM containers) runs identically on both.

Key Takeaway

A complete RAG stack -- embedding model, vector index, and an 8B LLM -- fits in roughly 13 GiB, leaving ample headroom for agentic state. Chunking strategy is the highest-leverage design decision, and pipelines validated locally migrate to datacenter DGX systems on the same NVIDIA software stack.

Post-Quiz: RAG & Agentic AI

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.

Section 4: Limitations, Compatibility & Future Migration Paths

Pre-Quiz: Limitations & Migration

1. What is the primary recommendation for avoiding ARM/x86 compatibility issues when developing on DGX Spark?

Only use Python scripts and avoid any compiled code
Develop inside NVIDIA-provided multi-arch containers that abstract away architecture differences
Use a translation layer to convert ARM instructions to x86 in real-time
Maintain separate codebases for ARM and x86 targets

2. At what point should you plan migration from DGX Spark to datacenter DGX systems?

As soon as you begin using quantization, since it indicates you need more memory
When models exceed 120B parameters, concurrent users exceed 5-10, or full SFT is needed on 30B+ models
Only when NVIDIA discontinues DGX Spark support
When your dataset exceeds 1GB in size

3. Why can QLoRA not use tensor parallelism or sequence parallelism on DGX Spark?

DGX Spark hardware does not support any parallelism at all
The NeMo QLoRA implementation does not support tensor or sequence parallelism; only multi-GPU data parallelism is available
QLoRA models are too small to benefit from parallelism
Tensor parallelism is incompatible with 4-bit quantization on any hardware

4. What is the workaround for TensorRT-LLM's extremely long cold start times (up to 28 minutes)?

Switch to vLLM exclusively since TensorRT-LLM is too slow to be practical
Pre-build optimized engines and cache them to NVMe storage for reuse
Keep the GPU running at full power continuously to prevent cold starts
Use a smaller model that loads faster

5. What is the expected trajectory for future personal DGX systems based on NVIDIA's product evolution?

Future systems will likely abandon unified memory in favor of discrete HBM
Future systems will likely offer 256-512GB unified memory, closing the bandwidth gap with datacenter GPUs
NVIDIA plans to discontinue personal DGX systems in favor of cloud-only offerings
Future systems will switch from ARM to x86 to eliminate compatibility issues

ARM Architecture Software Incompatibilities

DGX Spark uses the Grace CPU (ARM/AArch64) rather than x86-64. While ARM support in ML has improved dramatically, incompatibilities persist with custom C/C++ extensions, pre-built Python wheels, and Docker images that only target x86.

The practical recommendation: develop inside NVIDIA-provided containers that abstract away architecture differences, ensuring the same multi-arch container image runs on both ARM and x86 targets.

| Category | Status on ARM/DGX Spark | Workaround |
|---|---|---|
| PyTorch / TensorFlow | Fully supported | Use official NVIDIA Docker images |
| vLLM | Known sm_121 issues | Pin to tested versions |
| Custom C/C++ Extensions | May need recompilation | Rebuild with ARM64 toolchain |
| Pre-built Python Wheels | Some x86-only | Build from source or conda-forge |
| Docker Images | Must use ARM64/multi-arch | Check image manifests before pulling |

Scalability Ceiling: When to Graduate

| Dimension | DGX Spark | DGX H100 (8-GPU) | DGX B200 (8-GPU) |
|---|---|---|---|
| GPU Memory | 128GB unified | 640GB HBM3 | 1.5TB+ HBM3e |
| Max Model (FP4) | ~120-200B | ~1T+ | ~2T+ |
| Concurrent Users | 1-5 (typical) | 50-500+ | 100-1000+ |
| Fine-Tuning Scale | QLoRA up to 70B | Full SFT up to 400B+ | Full SFT up to 1T+ |
| Use Case | Dev, prototyping | Department production | Enterprise-scale |

Plan migration when:

  1. Your model exceeds 120B parameters at required precision
  2. Concurrent user demand exceeds 5-10 simultaneous requests
  3. Full SFT is required on models larger than 30B
  4. Training time is measured in days rather than hours
  5. Continuous serving alongside training creates unacceptable memory contention
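The triggers above can be expressed as a simple checklist function. The thresholds come straight from the list; the parameter names are illustrative:

```python
# Migration checklist: returns the list of triggered reasons (empty
# list means DGX Spark still fits the workload).

def should_migrate(model_params_b: float, concurrent_users: int,
                   full_sft_params_b: float = 0, training_days: float = 0,
                   serving_while_training: bool = False) -> list[str]:
    reasons = []
    if model_params_b > 120:
        reasons.append("model exceeds 120B parameters")
    if concurrent_users > 10:
        reasons.append("concurrent demand exceeds 5-10 requests")
    if full_sft_params_b > 30:
        reasons.append("full SFT needed on a 30B+ model")
    if training_days >= 1:
        reasons.append("training time measured in days")
    if serving_while_training:
        reasons.append("serving/training memory contention")
    return reasons

print(should_migrate(model_params_b=70, concurrent_users=25))
```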

Software Stack Maturity Gaps

The stack is newer on this platform than on datacenter x86 systems: vLLM carries known sm_121 compatibility issues (pin to tested versions), TensorRT-LLM cold starts can reach 28 minutes unless engines are pre-built and cached to NVMe, and NeMo's QLoRA path supports only multi-GPU data parallelism -- tensor and sequence parallelism are unavailable.

NVIDIA Roadmap

Future personal DGX systems are expected to offer 256-512GB unified memory, improved memory bandwidth, better ARM64 ecosystem support, and higher-bandwidth multi-node interconnects. DGX Spark skills and workflows developed today will transfer directly to more capable future hardware.

Key Takeaway

DGX Spark's ceilings are clear: ~120-200B parameters at FP4, 5-10 concurrent users, and QLoRA-scale fine-tuning. Develop inside multi-arch containers, cache TensorRT-LLM engines to NVMe, and plan migration to datacenter DGX systems when those thresholds are crossed -- the skills and workflows transfer directly.

Post-Quiz: Limitations & Migration

Re-attempt the five pre-quiz questions at the start of this section and compare your answers.
