Running AI Workloads on Kubernetes: From Infrastructure to Production
A comprehensive intermediate guide to deploying, scaling, and managing machine learning and AI workloads on Kubernetes clusters.
Table of Contents
- Chapter 1: Introduction to AI on Kubernetes
- Chapter 2: GPU and Accelerator Management
- Chapter 3: Storage and Data Management for AI Pipelines
- Chapter 4: Training Workloads: Jobs, Operators, and Frameworks
- Chapter 5: Model Serving and Inference
- Chapter 6: Resource Scheduling and Cluster Optimization
- Chapter 7: MLOps and ML Pipelines on Kubernetes
- Chapter 8: Networking and Security for AI Workloads
- Chapter 9: Monitoring, Observability, and Troubleshooting
- Chapter 10: Production Patterns and Scaling AI Platforms
Chapter 1: Introduction to AI on Kubernetes
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why Kubernetes is a compelling platform for AI/ML workloads
- Identify the key challenges of running AI workloads compared to traditional microservices
- Describe the AI/ML lifecycle stages and how Kubernetes supports each
Why Kubernetes for AI/ML
Resource Orchestration Needs of AI Workloads
Training a large language model or running a computer vision pipeline is nothing like serving a web application. A web server might need two CPU cores and a few hundred megabytes of memory; a distributed training job might need dozens of GPUs, terabytes of RAM, and high-bandwidth interconnects — all at the same time, all on the same job. The fundamental question for any AI team is: what infrastructure layer can manage these heterogeneous, massive, and bursty resource demands without requiring a team of specialists to babysit every job?
Kubernetes — the open-source container orchestration system originally released by Google in 2014 — has become the dominant answer to that question. Orchestration here means the automated scheduling, scaling, networking, and lifecycle management of containerized workloads. Think of Kubernetes as an air-traffic control system for your compute cluster: it knows what resources are available on every node (server), what each incoming workload needs, and it lands jobs on the right runway without collisions.
For AI workloads specifically, Kubernetes handles the hard parts of resource management that would otherwise require manual intervention. Through plugins such as the NVIDIA device plugin for Kubernetes, the scheduler gains awareness of GPU hardware on each node, allowing it to place GPU-hungry training jobs only on nodes that actually have GPUs available — and to reserve exactly the number requested, no more, no less. [Source: https://collabnix.com/kubernetes-and-ai-the-ultimate-guide-to-orchestrating-machine-learning-workloads-in-2025/]
Figure 1.1: How Kubernetes schedules a GPU training job
flowchart TD
A[User submits YAML manifest\nrequesting 4 GPUs] --> B[Kubernetes API Server\nreceives request]
B --> C[Scheduler queries\nnode resource state]
C --> D{Node has\n4+ GPUs available?}
D -- No --> E[Node skipped]
D -- Yes --> F[Pod bound to\nGPU-equipped node]
F --> G[NVIDIA Device Plugin\nexposes nvidia.com/gpu resources]
G --> H[Training job runs\non reserved GPUs]
E --> C
Worked example: Imagine your team has a cluster of 10 nodes. Three have NVIDIA A100 GPUs; the rest are CPU-only. You submit a PyTorch training job requesting 4 GPUs. Without orchestration, someone has to SSH into the right machines, check GPU availability, and manually launch the job. With Kubernetes and the NVIDIA device plugin, you declare your resource requirements in a YAML manifest, and the scheduler places your pods on GPU-equipped nodes automatically — freeing your team to focus on the model, not the plumbing.
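The placement decision in the worked example can be sketched as a toy filter over node state. This is only an illustration of the idea in Figure 1.1, not the real scheduler (which also scores feasible nodes rather than taking the first one). The node names, the 8 GPUs per GPU node, and the in-use counts are assumptions made up for the example.

```python
from dataclasses import dataclass

GPU_RESOURCE = "nvidia.com/gpu"  # resource name registered by the NVIDIA device plugin


@dataclass
class Node:
    name: str
    allocatable_gpus: int  # GPUs the node advertises
    used_gpus: int = 0     # GPUs already reserved by running pods

    @property
    def free_gpus(self) -> int:
        return self.allocatable_gpus - self.used_gpus


def feasible_nodes(nodes, requested_gpus):
    """Return the nodes with enough free GPUs for the pod (the filter step)."""
    return [n for n in nodes if n.free_gpus >= requested_gpus]


def bind(nodes, requested_gpus):
    """Reserve GPUs on the first feasible node; return its name, or None if the pod stays pending."""
    candidates = feasible_nodes(nodes, requested_gpus)
    if not candidates:
        return None
    chosen = candidates[0]
    chosen.used_gpus += requested_gpus
    return chosen.name


# Hypothetical cluster: 7 CPU-only nodes and 3 GPU nodes with 8 GPUs each (assumed).
cluster = [Node(f"cpu-{i}", 0) for i in range(7)]
cluster += [Node("gpu-0", 8, used_gpus=6), Node("gpu-1", 8, used_gpus=2), Node("gpu-2", 8)]

placed_on = bind(cluster, requested_gpus=4)
print(placed_on)  # gpu-1: gpu-0 has only 2 free GPUs, so it is skipped
```

CPU-only nodes advertise zero GPUs, so they never pass the filter for a GPU request; no one has to inspect machines by hand.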
Beyond GPU placement, Kubernetes enforces resource quotas — hard limits on how much CPU, memory, or GPU a team or namespace can consume. This is critical when multiple data science teams share a cluster: without quotas, one team running a poorly-tuned training job could starve every other team’s workloads. [Source: https://portworx.com/knowledge-hub/kubernetes-ai/]
Portability Across Cloud and On-Premises Environments
AI teams frequently need to move workloads between environments: a data scientist develops a model locally, validates it on a staging cluster, and promotes it to a production cluster running on a different cloud provider — or on bare metal in a corporate data center. Without a common abstraction layer, each environment is a bespoke snowflake requiring custom scripts and tribal knowledge.
Kubernetes solves this through containerization. A container packages an application and all its dependencies (Python version, CUDA libraries, model weights, configuration files) into a portable, self-contained image. Because Kubernetes runs on AWS (EKS), Google Cloud (GKE), Azure (AKS), and on-premises infrastructure using the same API surface, a container image that works in development works in production — without modification.
This consistency is not merely a convenience; it is a correctness guarantee. AI models are notoriously sensitive to environment variations: a different version of NumPy or a different CUDA toolkit can produce different numerical outputs, making debugging nightmarish. Containers eliminate that class of problem by making the environment a reproducible artifact alongside the code. [Source: https://www.kubermatic.com/blog/ai-and-machine-learning-integration-into-kubernetes/]
An analogy: think of a container image like a shipping container in global freight logistics. It doesn’t matter whether the container moves on a ship, a train, or a truck — the contents are sealed and consistent. Kubernetes is the port authority that knows how to unload, route, and deliver containers across all those transport modes.
Ecosystem Maturity and Community Momentum
The numbers here are striking. According to the CNCF Annual Survey 2023, more than 96% of organizations are using or evaluating Kubernetes, and 72% use it in production environments. A separate Linux Foundation study on Sovereign AI found that 82% of organizations are building custom AI solutions, and 58% use Kubernetes to support those workloads. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]
This adoption level matters because it creates a self-reinforcing ecosystem. Every major AI framework — PyTorch, TensorFlow, JAX, Hugging Face Transformers — has Kubernetes-native tooling built around it. Platforms like Kubeflow, Ray, and MLflow have all converged on Kubernetes as their deployment substrate. The CNCF launched a Certified Kubernetes AI Conformance Program in November 2025 to formalize and standardize how AI workloads run on conformant clusters. [Source: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/]
The practical consequence: when you invest in Kubernetes skills and tooling for AI, you are investing in a platform with a decade of production hardening, a vast community answering questions on GitHub and Stack Overflow, and a growing set of purpose-built AI extensions. The alternative — building bespoke cluster managers or relying on HPC schedulers like Slurm — is increasingly a harder sell when the broader industry has converged. [Source: https://blog.skypilot.co/slurm-vs-k8s/]
| Factor | Kubernetes | Traditional HPC (Slurm) | Cloud-Managed VMs |
|---|---|---|---|
| GPU scheduling | Native via device plugins | Native, mature | Manual or autoscaling groups |
| Portability | Excellent (cloud + on-prem) | Limited (usually on-prem) | Cloud-vendor specific |
| Ecosystem tooling | Very rich (Kubeflow, KServe, Kueue) | Limited ML-specific tooling | Vendor-specific services |
| Multi-tenancy | Namespaces + quotas | Fair-share queues | Separate accounts/projects |
| Learning curve | Steep | Moderate (for HPC teams) | Low initially, high at scale |
| Manifest verbosity | High (3x+ over Slurm scripts) | Low | Low |
Key Takeaway: Kubernetes provides the resource orchestration, environment consistency, and ecosystem momentum that AI teams need at scale. Its GPU-aware scheduling and resource quotas directly address the cost and fairness challenges of shared accelerator infrastructure. With more than 96% of organizations using or evaluating Kubernetes, investment in skills and tooling compounds over time.
AI/ML Workload Characteristics
Batch Training vs Real-Time Inference Patterns
Not all AI workloads look alike, and understanding the distinction between batch training and real-time inference is foundational to everything that follows in this book.
Batch training is the process of adjusting a model’s internal parameters by repeatedly feeding it large datasets and computing gradients. It is a compute marathon: a job starts, runs for hours or days consuming enormous GPU resources, and then finishes. The outcome is a saved model artifact — a file containing the learned weights. Training is not user-facing; a few extra minutes of latency are irrelevant.
Real-time inference (also called online serving) is the opposite. A trained model sits behind an API endpoint, and users or applications send requests expecting predictions in milliseconds. Latency is critical; a recommendation engine that takes two seconds to respond loses user engagement. Unlike training, inference workloads must handle unpredictable traffic spikes — a viral product launch might generate 100x normal traffic in minutes.
The contrast is stark enough to warrant a table:
| Dimension | Batch Training | Real-Time Inference |
|---|---|---|
| Duration | Hours to days | Milliseconds per request |
| GPU utilization | Continuously saturated | Bursty, often underutilized |
| Latency sensitivity | Low | Very high |
| Traffic pattern | Predictable (job-scheduled) | Unpredictable spikes |
| Cost profile | Large upfront GPU cost | Pay-per-request, ongoing |
| Scaling strategy | Scale-up (more GPUs per job) | Scale-out (more replicas) |
| Failure tolerance | Restart job or checkpoint | Must be highly available |
This distinction has direct consequences for how you configure Kubernetes. A training job runs as a Job resource (discussed in Chapter 4): it runs to completion and then stops. An inference service runs as a Deployment behind a Service, with autoscaling rules to add replicas under load. Using the wrong resource type for the workload type leads to wasted resources and operational headaches. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]
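To make the Job versus Deployment contrast concrete, here is a minimal sketch that builds both manifests as Python dictionaries and prints one as JSON (kubectl accepts JSON manifests as well as YAML). The names, image references, port, and GPU counts are hypothetical placeholders, not values from any real deployment.

```python
import json


def training_job(name, image, gpus):
    """Build a batch/v1 Job manifest: runs to completion, then stops."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry failed pods at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # Jobs allow only Never or OnFailure
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def inference_deployment(name, image, replicas):
    """Build an apps/v1 Deployment manifest: keeps N replicas running indefinitely."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},  # must match the pod template labels
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "server",
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                },
            },
        },
    }


# Hypothetical names and images, for illustration only.
job = training_job("train-demo", "registry.example.com/trainer:latest", gpus=4)
deploy = inference_deployment("recsys", "registry.example.com/server:latest", replicas=2)
print(json.dumps(job, indent=2))
```

The structural difference is small but decisive: the Job has a completion-oriented restart policy and a retry budget, while the Deployment has a replica count and a label selector that a Service or autoscaler can target.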
Figure 1.2: Batch training vs real-time inference on Kubernetes
flowchart LR
subgraph Training ["Batch Training (Job)"]
T1[Dataset\nin storage] --> T2[Training pods\nwith GPUs]
T2 --> T3[Model artifact\nsaved to storage]
end
subgraph Inference ["Real-Time Inference (Deployment)"]
I1[User / Application\nrequest] --> I2[Inference pod\nbehind Service]
I2 --> I3[Prediction\nreturned in ms]
I2 --> I4[Autoscaler adds\nreplicas on load spike]
end
T3 -. model weights .-> I2
GPU and Accelerator Requirements
A GPU (Graphics Processing Unit) was originally designed to render video game graphics by performing thousands of simple mathematical operations in parallel. It turns out this parallel architecture is ideal for the matrix multiplications that dominate neural network training and inference. An accelerator is the broader category: GPUs are the most common, but AI chips from Google (TPUs), AWS (Trainium/Inferentia), and dedicated inference ASICs are all accelerators in this sense.
Kubernetes does not understand GPUs out of the box; unlike CPU cores and RAM, which it has tracked since day one, GPUs are not built-in resources. Instead, hardware vendors publish device plugins that extend the Kubernetes API to expose their accelerators as schedulable resources. The NVIDIA device plugin, for example, registers each GPU on a node as one unit of the nvidia.com/gpu resource. A pod can then request nvidia.com/gpu: 2 in its resource spec, and the scheduler places that pod only on a node with at least 2 available NVIDIA GPUs. [Source: https://www.anantacloud.com/post/ai-ml-workloads-on-kubernetes-running-scalable-machine-learning-pipelines-with-gpu-acceleration-and]
GPU resources are expensive: a cloud A100 instance can cost $30/hour or more. Declaring GPU counts precisely in your pod spec is therefore not optional. GPUs are specified under limits; if you also set a GPU request, it must equal the limit. Without an explicit GPU count, the scheduler cannot account for the device when bin-packing workloads, pods can end up competing for the same physical device, and costs balloon. [Source: https://www.wiz.io/academy/ai-ml-kubernetes-best-practices]
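As a sketch of what a well-formed GPU stanza looks like, the helper below checks a container's resources block against the rules Kubernetes documents for GPUs: a GPU count goes under limits, a GPU request (if present) must equal the limit, and counts are whole numbers. The function name and error strings are invented for this example; this is not a Kubernetes API.

```python
GPU = "nvidia.com/gpu"


def gpu_spec_errors(resources):
    """Return a list of problems found in a container's GPU resource stanza."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    errors = []
    if GPU in requests and GPU not in limits:
        errors.append("GPU request given without a GPU limit")
    if GPU in requests and GPU in limits and requests[GPU] != limits[GPU]:
        errors.append("GPU request and limit must be equal")
    if GPU in limits and (not isinstance(limits[GPU], int) or limits[GPU] < 1):
        errors.append("GPU count must be a whole positive number")
    return errors


ok = {"limits": {GPU: 2}}                          # request defaults to the limit
mismatch = {"requests": {GPU: 1}, "limits": {GPU: 2}}
fractional = {"limits": {GPU: 0.5}}                # fractional GPUs are not allowed

print(gpu_spec_errors(ok))          # []
print(gpu_spec_errors(mismatch))    # ['GPU request and limit must be equal']
print(gpu_spec_errors(fractional))  # ['GPU count must be a whole positive number']
```

A check like this could run in CI before manifests reach the cluster, so that malformed GPU requests are caught before the API server rejects them.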
Data-Intensive I/O Patterns
Training a large model requires moving enormous volumes of data — often hundreds of gigabytes or terabytes — into GPU memory as efficiently as possible. A GPU that spends more time waiting for data from disk than computing gradients is a very expensive idle resource. This is called the I/O bottleneck, and it is one of the less-obvious challenges of AI on Kubernetes.
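A back-of-the-envelope calculation shows how quickly storage becomes the limit. The sketch below estimates the read throughput a training loop needs and the best-case share of time the GPUs can stay busy when storage delivers less. All numbers (batch size, sample size, step rate, storage throughput) are assumed for illustration, and the model ignores caching and prefetching.

```python
def required_read_mb_per_s(samples_per_step, bytes_per_sample, steps_per_second):
    """Storage throughput needed to keep the GPUs fed, in MB/s."""
    return samples_per_step * bytes_per_sample * steps_per_second / 1e6


def gpu_busy_fraction(required_mb_per_s, available_mb_per_s):
    """Upper bound on the share of time the GPUs can spend computing."""
    return min(1.0, available_mb_per_s / required_mb_per_s)


# Assumed workload: 256 images per step, 500 kB each, 10 steps per second.
need = required_read_mb_per_s(256, 500_000, 10)
print(need)                                             # 1280.0 MB/s
print(gpu_busy_fraction(need, available_mb_per_s=320))  # 0.25
```

Under these assumptions a volume that sustains 320 MB/s leaves the GPUs idle three quarters of the time, which is why storage class selection matters as much as GPU selection.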
Traditional microservices typically read modest amounts of configuration or database records. AI training pipelines read entire datasets, often in random-access patterns, requiring high-throughput distributed storage. In a Kubernetes context, this means choosing the right PersistentVolume type — fast NVMe-backed storage for hot data, object storage (S3-compatible) for cold datasets — and ensuring the storage class is available on the nodes where training pods land.
Long-Running and Ephemeral Job Types
Kubernetes was originally built to keep long-running services alive: if a web server pod crashes, Kubernetes restarts it. But AI workloads span both extremes. A multi-day distributed training run is an extreme example of a long-running job that must tolerate node failures gracefully (via checkpointing). A feature engineering step might be ephemeral — a pod spins up, processes a batch of data, writes results to storage, and exits cleanly.
The key insight: AI workloads are fundamentally batch in nature. The infrastructure must support tens to thousands of long-running batch jobs concurrently, not millions of short-lived service requests. [Source: https://connect.redhat.com/hydra/prm/v1/business/companies/bf36e6f9100044ef903614234b0f70ad/linked-resources/72b65d25acf341a8a75ac498a349c2e8/content/public/view] This is a different operational profile from what Kubernetes was originally optimized for, which is why additional tools like Kueue (covered in later chapters) are needed to handle queue management, preemption, and fair-share scheduling across competing teams.
Key Takeaway: AI workloads split into two fundamentally different patterns — batch training (resource-heavy, run-to-completion) and real-time inference (latency-sensitive, always-on) — each requiring different Kubernetes configurations. GPU and accelerator management, data I/O throughput, and long-running job handling are the three technical challenges that most distinguish AI operations from standard microservice deployments.
The AI/ML Lifecycle on Kubernetes
Data Preparation and Feature Engineering
Every AI project begins with data. Raw data — sensor logs, transaction records, text corpora, images — rarely arrives in a form that a machine learning model can directly consume. Data preparation covers ingestion, cleaning, normalization, and transformation. Feature engineering is the more ML-specific step: extracting or constructing the numerical representations (features) that a model actually trains on.
On Kubernetes, this stage typically runs as batch jobs using distributed data-processing frameworks: Apache Spark, Dask, Flink, or Ray. These frameworks distribute data-processing work across many pods, enabling parallelism that reduces hours-long preprocessing to minutes. A typical Kubernetes manifest for this stage would use a Job resource that launches a Spark driver pod coordinating a cluster of executor pods. [Source: https://www.kubeflow.org/docs/started/architecture/]
The output of this stage — cleaned datasets and feature tables — is written to a feature store or an object storage bucket, making it available to the training stage without re-computation.
Model Training and Experimentation
With prepared data in hand, the team enters the model training stage. This is where most of the GPU spend happens. Teams often run many experiments in parallel — varying hyperparameters, architectures, or training data — to find the best-performing model configuration.
Distributed training extends a single training run across multiple pods, each handling a partition of the data or a slice of the model. Frameworks like PyTorch’s DistributedDataParallel and DeepSpeed coordinate gradient synchronization across workers. On Kubernetes, the Kubeflow Training Operator provides Kubernetes-native custom resources (PyTorchJob, JAXJob, XGBoostJob) that manage the pod lifecycle for these distributed patterns; its successor, Kubeflow Trainer, consolidates them behind a single TrainJob resource. [Source: https://www.kubeflow.org/]
The key scheduling challenge at this stage is gang scheduling: all worker pods for a distributed job must start simultaneously, because a job where 7 of 8 workers are running but waiting for the 8th is consuming 7 GPUs while doing zero useful work. The default Kubernetes scheduler has no concept of gang scheduling, which can cause GPU deadlocks where multiple jobs each hold some GPUs while waiting for others. Kueue, the emerging community standard for batch workload management on Kubernetes, addresses this with atomic job admission. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]
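The deadlock is easy to reproduce in a toy model. The sketch below compares pod-by-pod placement with all-or-nothing admission for two 8-worker jobs on a cluster with 8 GPUs. It is a simplified illustration of the idea only, not Kueue's actual admission algorithm; the job names and cluster size are assumptions.

```python
def greedy_pod_by_pod(free_gpus, jobs):
    """Place single-GPU worker pods one at a time, interleaved across jobs.

    jobs maps job name -> number of workers it needs.
    Returns job name -> workers actually placed.
    """
    placed = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and placed[name] < needed:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


def atomic_admission(free_gpus, jobs):
    """Admit a job only if all of its workers fit; otherwise leave it queued."""
    placed = {name: 0 for name in jobs}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
    return placed


def runnable(placed, jobs):
    """A distributed job makes progress only when every worker is running."""
    return [name for name in jobs if placed[name] == jobs[name]]


jobs = {"job-a": 8, "job-b": 8}  # two 8-worker jobs (hypothetical)
cluster_gpus = 8                 # only 8 GPUs in total (assumed)

greedy = greedy_pod_by_pod(cluster_gpus, jobs)
gang = atomic_admission(cluster_gpus, jobs)
print(greedy, runnable(greedy, jobs))  # {'job-a': 4, 'job-b': 4} []
print(gang, runnable(gang, jobs))      # {'job-a': 8, 'job-b': 0} ['job-a']
```

With interleaved placement every GPU is held but no job can start; with atomic admission one job runs at full width while the other waits without holding any GPUs.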
Model Serving and Inference
Once training produces a satisfactory model artifact, that artifact must be exposed as a service that applications can query. Model serving on Kubernetes typically uses a dedicated inference server — a containerized process that loads model weights, exposes an HTTP or gRPC API, and handles request batching and preprocessing.
KServe is the Kubernetes-native standard for this. It provides a unified InferenceService custom resource that abstracts over multiple serving runtimes (Triton, TorchServe, ONNX Runtime, vLLM), handles autoscaling, and can scale to zero when no traffic is present. [Source: https://medium.com/@jushijun/deploying-machine-learning-models-with-kubeflow-and-kserve-a-comprehensive-guide-2e3d1449dc54]
For large language models specifically, NVIDIA has introduced disaggregated inference architectures that separate the prefill stage (processing the input prompt) and the decode stage (generating output tokens) into independent Kubernetes services. Because prefill and decode have fundamentally different compute profiles — prefill is compute-bound, decode is memory-bandwidth-bound — running them as separate pods enables fine-grained resource allocation and better GPU utilization. [Source: https://developer.nvidia.com/blog/deploying-disaggregated-llm-inference-workloads-on-kubernetes/]
Worked example: An e-commerce platform deploys a product recommendation model as a KServe InferenceService. During normal hours the service runs two replicas. On Black Friday, traffic spikes 40x. KServe’s Knative-based autoscaling detects the request queue growing and launches additional replicas within seconds, keeping response latency under 100ms. When traffic subsides overnight, replicas scale back down to two, reducing GPU costs.
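The arithmetic behind this example can be sketched with a simple concurrency-based rule: run enough replicas that each one handles roughly a target number of in-flight requests. This is a deliberate simplification, not Knative's actual algorithm (which also uses averaging windows and a burst mode). The target of 10 concurrent requests per replica, the baseline load, and the replica ceiling are assumed values.

```python
import math


def desired_replicas(in_flight_requests, target_per_replica, min_replicas=1, max_replicas=100):
    """Replicas needed so each handles about target_per_replica concurrent requests."""
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


baseline_load = 20  # assumed in-flight requests during normal hours

normal = desired_replicas(baseline_load, 10, min_replicas=2)
peak = desired_replicas(baseline_load * 40, 10, min_replicas=2)  # 40x spike
overnight = desired_replicas(3, 10, min_replicas=2)

print(normal, peak, overnight)  # 2 80 2
```

The min_replicas floor is what keeps two replicas warm overnight; setting it to zero would correspond to scale-to-zero, at the cost of a cold start on the first request.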
Monitoring and Retraining Loops
A deployed model is not a static artifact. Model performance degrades over time as the real-world data distribution shifts — a phenomenon called model drift. The final stage of the AI/ML lifecycle is monitoring production model behavior and triggering retraining when performance drops below acceptable thresholds.
On Kubernetes, this stage involves:
- Metrics collection: inference pods emit prediction metrics (request latency, error rates, confidence scores) to Prometheus.
- Drift detection: scheduled jobs or streaming pipelines compare incoming production data against the training distribution.
- Automated retraining: when drift is detected, a new training Job is submitted automatically, and the resulting model artifact replaces the serving model via a Kubernetes rolling update.
Kubeflow Pipelines orchestrates these multi-step workflows as directed acyclic graphs (DAGs) running on Kubernetes, enabling the team to define the entire retrain-evaluate-deploy loop as code. [Source: https://blog.kubeflow.org/fraud-detection-e2e/]
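The drift-detection step in the list above can be as simple as comparing a feature's production mean with its training mean. The sketch below flags drift when the mean moves by more than a chosen number of training standard deviations. The threshold of 2.0 and the sample values are arbitrary assumptions, and production systems typically use more robust statistics than this heuristic.

```python
import statistics


def mean_shift_in_sigmas(training_values, production_values):
    """How far the production mean has moved, in units of training std dev."""
    mu = statistics.fmean(training_values)
    sigma = statistics.stdev(training_values)
    return abs(statistics.fmean(production_values) - mu) / sigma


def drift_detected(training_values, production_values, threshold=2.0):
    """True when the mean shift exceeds the (assumed) threshold."""
    return mean_shift_in_sigmas(training_values, production_values) > threshold


# Made-up values of a single numeric feature.
training = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]
stable = [10.5, 11.0, 9.5, 10.0]
shifted = [15.0, 16.0, 14.5, 15.5]

print(drift_detected(training, stable))   # False
print(drift_detected(training, shifted))  # True
```

In a pipeline, a check like this would run as a scheduled step, and a True result would be the trigger for submitting the retraining Job.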
Figure 1.3: AI/ML lifecycle stages on Kubernetes
flowchart TD
A[Raw Data\nsensor logs, text, images] --> B[Data Preparation\nSpark / Dask / Ray\nKubernetes Job]
B --> C[Feature Store\nor Object Storage]
C --> D[Model Training\nPyTorch / DeepSpeed\nKubeflow Trainer Job]
D --> E[Model Artifact\nsaved weights]
E --> F[Model Serving\nKServe InferenceService\nautoscaling Deployment]
F --> G[Production Traffic\nuser predictions]
G --> H[Monitoring\nPrometheus / Grafana\nDrift Detection]
H -- drift detected --> D
| Lifecycle Stage | Primary Tools | Kubernetes Resource Type |
|---|---|---|
| Data preparation | Spark, Dask, Ray | Job, StatefulSet |
| Model training | PyTorch, DeepSpeed, JAX | Job (via Kubeflow Trainer) |
| Experiment tracking | MLflow, Weights & Biases | Deployment |
| Model serving | KServe, Triton, vLLM | InferenceService (CRD) |
| Pipeline orchestration | Kubeflow Pipelines | Custom resources |
| Monitoring | Prometheus, Grafana | Deployment, DaemonSet |
Key Takeaway: The AI/ML lifecycle maps cleanly onto Kubernetes: data preparation runs as batch Jobs, distributed training uses Kubeflow Trainer custom resources with gang scheduling, model serving deploys as autoscaling InferenceServices via KServe, and monitoring closes the loop through scheduled retraining pipelines. Each stage has a distinct resource profile and tooling ecosystem.
Kubernetes Architecture Refresher for AI Practitioners
This section is not a full Kubernetes tutorial — that would fill its own book. Instead, it covers the specific concepts you need to understand the rest of this textbook. If you have used Kubernetes before, treat this as a vocabulary alignment.
Control Plane and Worker Node Roles
A Kubernetes cluster consists of two types of machines: control plane nodes and worker nodes.
The control plane is the brain of the cluster. It runs the API server (the single entry point for all Kubernetes operations), the scheduler (which decides where pods run), the controller manager (which reconciles desired vs. actual cluster state), and etcd (a distributed key-value store holding all cluster configuration). For AI workloads, the control plane is rarely where your code runs — it is the management layer that handles placement and lifecycle decisions.
Worker nodes are where your actual workloads run. Each worker node runs a kubelet — an agent that communicates with the control plane and manages the containers on that node — and a container runtime (typically containerd). For AI, worker nodes are the GPU-equipped machines where training pods and inference pods are scheduled.
An analogy: think of the control plane as the corporate headquarters of a logistics company and worker nodes as the individual warehouses and delivery facilities. Headquarters receives orders, assigns work to facilities, and monitors status — but the actual packages move through the facilities, not headquarters.
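The controller manager's job of reconciling desired and actual state can be sketched as a function that compares the two and emits corrective actions. This is a toy model of the pattern, not Kubernetes controller code; the pod names are hypothetical.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move actual state toward desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones (names are placeholders).
        return [("create", f"new-{i}") for i in range(diff)]
    if diff < 0:
        # Too many pods: delete the surplus.
        return [("delete", name) for name in running_pods[:-diff]]
    return []  # actual state already matches desired state


print(reconcile(3, ["pod-a"]))                    # [('create', 'new-0'), ('create', 'new-1')]
print(reconcile(1, ["pod-a", "pod-b", "pod-c"]))  # [('delete', 'pod-a'), ('delete', 'pod-b')]
print(reconcile(2, ["pod-a", "pod-b"]))           # []
```

Real controllers run this comparison continuously, which is why a crashed inference pod is replaced without anyone issuing a command.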
Figure 1.4: Kubernetes cluster architecture for AI workloads
flowchart LR
subgraph ControlPlane ["Control Plane"]
CP1[API Server\nentry point for all operations]
CP2[Scheduler\ndecides pod placement]
CP3[Controller Manager\nreconciles desired state]
CP4[etcd\ncluster configuration store]
end
subgraph Workers ["Worker Nodes"]
subgraph WN1 ["GPU Node 1"]
K1[kubelet] --> C1[Training Pod\nnvidia.com/gpu: 4]
end
subgraph WN2 ["GPU Node 2"]
K2[kubelet] --> C2[Inference Pod\nnvidia.com/gpu: 1]
end
subgraph WN3 ["CPU Node"]
K3[kubelet] --> C3[Preprocessing Pod\nCPU only]
end
end
CP1 --> CP2
CP2 --> K1
CP2 --> K2
CP2 --> K3
K1 --> CP1
K2 --> CP1
K3 --> CP1
Pods, Deployments, Jobs, and CronJobs
Pod: the smallest schedulable unit in Kubernetes. A pod wraps one or more containers that share a network namespace and storage volumes. Every AI workload (a training process, an inference server, a preprocessing script) ultimately runs inside a pod. Pods are ephemeral: the kubelet can restart a pod's containers in place according to its restartPolicy, but a pod that is deleted, or whose node fails, is not recreated unless a higher-level resource manages it.
Deployment: a controller that manages a set of identical, long-running pods and ensures the desired number of replicas is always running. Deployments are the right resource for stateless inference servers that should always be available. If an inference pod crashes, the Deployment controller immediately schedules a replacement.
Job: a controller that runs one or more pods to completion. When all pods complete successfully, the Job is considered done. This is the natural fit for batch training runs, preprocessing pipelines, and evaluation scripts. Unlike a Deployment, a completed Job is not restarted — it sits as a record of what ran.
CronJob: a controller that creates Jobs on a schedule, expressed in cron syntax. CronJobs are useful for periodic tasks in the AI/ML lifecycle: nightly feature recomputation, weekly model retraining triggers, or hourly data ingestion from upstream sources.
| Resource | Use Case in AI | Restart Behavior |
|---|---|---|
| Pod (bare) | Rarely used directly | Not recreated if deleted or its node fails |
| Deployment | Inference serving, MLflow server | Restarted on failure |
| Job | Training run, preprocessing, evaluation | Runs to completion |
| CronJob | Scheduled retraining, data ingestion | Runs on schedule |
| StatefulSet | Distributed databases, feature stores | Restarted with stable identity |
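As a sketch of the CronJob pattern described above, the function below builds a manifest for a nightly retraining trigger and prints it as JSON. The resource name, image reference, and the 02:00 schedule are hypothetical choices for illustration.

```python
import json


def nightly_retrain_cronjob(name, image, schedule="0 2 * * *"):
    """Build a batch/v1 CronJob manifest that creates one Job per scheduled run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,           # cron syntax: 02:00 every day
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still going
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [{"name": "retrain", "image": image}],
                        }
                    }
                }
            },
        },
    }


# Hypothetical name and image, for illustration only.
cronjob = nightly_retrain_cronjob("nightly-retrain", "registry.example.com/retrain:latest")
print(json.dumps(cronjob, indent=2))
```

Note the nesting: a CronJob wraps a Job template, which in turn wraps a pod template, mirroring the way each resource in the table builds on the one below it.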
Namespaces and Resource Quotas
A namespace is a virtual partition within a Kubernetes cluster. Resources (pods, services, secrets, configmaps) in one namespace are logically isolated from resources in another. In multi-team AI environments, namespaces are typically assigned per team or per project: the nlp-team namespace, the cv-team namespace, the staging namespace.
Resource quotas are constraints applied to a namespace that cap its total resource consumption. A quota might specify that the nlp-team namespace can use at most 32 GPU units, 256 GiB of RAM, and 128 CPU cores across all pods. When a team tries to submit a job that would exceed its quota, Kubernetes rejects the request with a clear error — protecting other teams from resource starvation.
Resource requests and limits operate at the pod level and are set per container in the pod spec. A request is the amount of a resource the scheduler reserves for the pod when placing it; a limit is the maximum the pod is allowed to consume. For CPU, exceeding the limit causes throttling. For memory, exceeding the limit causes the offending container to be killed (OOMKilled). For GPUs, Kubernetes does not currently support fractional GPU allocation natively: a pod gets whole GPUs.
Worked example: A research team submits a hyperparameter sweep that accidentally requests 100 GPU pods simultaneously. Without namespace quotas, this would saturate the entire cluster. With a quota of 16 GPUs on the research namespace, Kubernetes admits only as many pods as fit within the quota and rejects the rest at creation time. Pods created through a Job are retried by the Job controller as quota frees up; orderly queuing requires a tool such as Kueue. The production inference service in a separate namespace is unaffected.
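The quota in this example can be sketched in two parts: the ResourceQuota manifest itself, and a toy admission check showing how many of the 100 single-GPU pods get through. The quota name is a placeholder, and the admission function is an illustrative model rather than the real admission controller; pods beyond the quota are refused at creation time.

```python
def gpu_quota(namespace, max_gpus):
    """Build a v1 ResourceQuota capping total GPU requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Extended resources are quota'd through the requests.<resource> key.
        "spec": {"hard": {"requests.nvidia.com/gpu": max_gpus}},
    }


def admit(pod_gpu_requests, quota_gpus):
    """Toy admission check: accept pods in order while the quota has room."""
    used = admitted = rejected = 0
    for gpus in pod_gpu_requests:
        if used + gpus <= quota_gpus:
            used += gpus
            admitted += 1
        else:
            rejected += 1
    return admitted, rejected


quota = gpu_quota("research", 16)
admitted, rejected = admit([1] * 100, quota_gpus=16)
print(admitted, rejected)  # 16 84
```

Only 16 pods run at once; the other 84 never hold a GPU, so workloads in other namespaces keep their capacity.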
Key Takeaway: Understanding the control plane / worker node split, the four core workload resource types (Pod, Deployment, Job, CronJob), and the namespace + quota system gives you the vocabulary to read and write Kubernetes configurations for AI throughout the rest of this book. These abstractions map directly to the stages of the AI/ML lifecycle: Jobs for training, Deployments for serving, CronJobs for periodic retraining.
Chapter Summary
Kubernetes has emerged as the dominant substrate for AI/ML workloads not by accident, but because its core primitives (containerized execution, declarative scheduling, resource quotas, and a rich extension API) address the exact challenges that AI teams face at scale. The 96%+ share of organizations using or evaluating Kubernetes [Source: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/] and the “great migration” of every major AI platform onto Kubernetes [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/] reflect a genuine technical fit: Kubernetes gives AI teams portability across clouds, reproducible environments via containers, GPU-aware scheduling via device plugins, and multi-tenant fairness via namespaces and quotas.
AI workloads are not just web services with bigger machines. They are long-running batch jobs with extreme GPU requirements, data-intensive I/O patterns, and a dual nature — expensive batch training followed by latency-sensitive always-on inference — that demands different scheduling and scaling strategies for each phase. The default Kubernetes scheduler was not built for gang scheduling or fair-share GPU queuing; extensions like Kueue, Kubeflow Trainer, and KServe fill these gaps, and later chapters in this book will cover each in depth.
The AI/ML lifecycle — data preparation, model training, model serving, and monitoring with retraining loops — maps cleanly onto Kubernetes resource types and tooling. Data pipelines run as Jobs, training runs use Kubeflow Trainer custom resources, inference services deploy via KServe, and orchestration pipelines stitch the stages together via Kubeflow Pipelines. By the end of this book, you will be able to design and operate each of these stages on a production Kubernetes cluster.
Key Terms
| Term | Definition |
|---|---|
| AI/ML lifecycle | The end-to-end sequence of stages for building and operating a machine learning model: data preparation, model training and experimentation, model serving, and monitoring with retraining loops. |
| Orchestration | The automated scheduling, scaling, networking, and lifecycle management of containerized workloads across a cluster of machines. In Kubernetes, the control plane performs orchestration. |
| Accelerator | A hardware component designed to perform the parallel matrix computations required by AI workloads more efficiently than a general-purpose CPU. GPUs are the most common accelerators; TPUs and inference ASICs are also in this category. |
| Batch training | The process of training a machine learning model by iterating over large datasets and updating model parameters via gradient descent. Batch training runs to completion (hours to days), is GPU-saturated, and is not user-facing. |
| Real-time inference | The deployment of a trained model as a live API endpoint that processes individual prediction requests with low latency (typically milliseconds). Also called online serving. |
| Resource quota | A Kubernetes mechanism that caps the total amount of CPU, memory, GPU, or other resources that all pods within a namespace can collectively consume, enforcing fairness among teams sharing a cluster. |
Chapter 2: GPU and Accelerator Management
Learning Objectives
By the end of this chapter, you will be able to:
- Configure GPU scheduling and resource requests in Kubernetes pods
- Deploy and manage NVIDIA device plugins and GPU Operators
- Implement GPU sharing strategies including MIG, time-slicing, and MPS
- Troubleshoot common GPU allocation and driver issues
Introduction
Running AI workloads efficiently on Kubernetes requires more than just spinning up pods — it demands precise management of the most expensive and specialized hardware in your cluster: GPUs. A single NVIDIA H100 can cost tens of thousands of dollars, and a poorly configured cluster can leave that hardware sitting at 13% utilization while your cloud bill climbs. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]
Think of GPU management in Kubernetes like managing a fleet of high-performance racing cars in a taxi company. The cars are far more capable than ordinary vehicles, they require specialized mechanics and fuel, and they are far too expensive to leave parked. Your job as the Kubernetes operator is to make sure those cars are constantly running the right passengers (workloads) to the right destinations — with the right driver qualifications (hardware compatibility) and the right scheduling policies so no car sits idle while others are overloaded.
This chapter walks through the full stack of GPU management in Kubernetes: from understanding the hardware landscape and device plugin architecture, through deploying the NVIDIA GPU Operator, to implementing advanced GPU sharing strategies that can push cluster utilization well past 80%.
Section 1: GPU Fundamentals for Kubernetes
The GPU Landscape for AI
Graphics Processing Units were originally designed for rendering pixels in parallel. That same parallel architecture, thousands of small compute cores working simultaneously, turns out to be exactly what neural network training and inference require. Today, three major accelerator vendors compete in the Kubernetes AI workload space:
| Vendor | Key Products for AI | Kubernetes Support | Notes |
|---|---|---|---|
| NVIDIA | A100, H100, L40, RTX 40-series | Mature, via device plugin + GPU Operator | Dominant ecosystem; CUDA runtime widely supported |
| AMD | Instinct MI300, RX 7900 XTX | Stable, via ROCm device plugin | Growing adoption; ROCm is the CUDA equivalent |
| Intel | Gaudi 2/3, Arc GPUs | Emerging, via Intel Device Plugin for Kubernetes | Strong in data center inference; OpenVINO runtime |
NVIDIA holds the dominant position in Kubernetes AI workloads, primarily because the CUDA (Compute Unified Device Architecture) ecosystem — the programming model and library stack for GPU-accelerated computing — has been the default target for almost every major AI framework (PyTorch, TensorFlow, JAX). This chapter focuses primarily on NVIDIA, but the architectural patterns (device plugins, resource requests, sharing strategies) apply across all vendors. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]
Device Plugin Architecture and Discovery
Kubernetes was designed around a generic resource model. Out of the box, the scheduler knows about CPU cores and memory. GPUs are not built-in concepts — they are extended resources: arbitrary named quantities that a node advertises and a pod can request. The mechanism that bridges physical GPU hardware and the Kubernetes resource model is the device plugin framework.
A device plugin is a small program that runs as a DaemonSet on each node where special hardware is present. It communicates with the kubelet (the per-node Kubernetes agent) over a Unix socket using gRPC, and it performs three core functions:
- Discovery: It detects available devices on the node — for NVIDIA GPUs, this is done via the NVML (NVIDIA Management Library).
- Advertisement: It registers those devices as extended resources (e.g., nvidia.com/gpu: 4 for a four-GPU node) so the Kubernetes API server and scheduler are aware of them.
- Allocation: When a pod is scheduled to the node and requests a GPU, the device plugin handles the actual device injection into the container, mounting the right device files and configuring the environment.
The NVIDIA device plugin listens on the Unix socket /var/lib/kubelet/device-plugins/nvidia.sock (the socket filename can differ between plugin versions) and exposes GPUs under the nvidia.com/gpu resource name. [Source: https://medium.com/@rifewang/overview-of-kubernetes-gpu-scheduling-device-plugin-cdi-nfd-and-gpu-operator-48a7c4213b28]
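The three functions above can be sketched as a toy model. This is a minimal illustration, not the NVIDIA plugin's code: the class name `ToyDevicePlugin`, the invented `GPU-<n>` device IDs, and the method names are all hypothetical, and no hardware or NVML call is involved. One simplification to keep in mind: in the real protocol the kubelet chooses the device IDs and passes them to the plugin, whereas this toy lets the plugin pick.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevicePlugin:
    """Toy stand-in for a device plugin; it never touches real hardware."""

    resource_name: str = "nvidia.com/gpu"
    device_ids: list[str] = field(default_factory=list)
    allocated: set[str] = field(default_factory=set)

    def discover(self, gpu_count: int) -> None:
        # Discovery: a real plugin would query NVML; here we invent IDs.
        self.device_ids = [f"GPU-{i}" for i in range(gpu_count)]

    def advertise(self) -> dict[str, int]:
        # Advertisement: capacity is reported as an opaque integer quantity.
        return {self.resource_name: len(self.device_ids)}

    def allocate(self, count: int) -> dict:
        # Allocation: pick free devices and describe how to inject them.
        free = [d for d in self.device_ids if d not in self.allocated]
        if count > len(free):
            raise RuntimeError(f"requested {count}, only {len(free)} free")
        chosen = free[:count]
        self.allocated.update(chosen)
        indices = [d.split("-")[1] for d in chosen]
        return {
            "device_paths": [f"/dev/nvidia{i}" for i in indices],
            # Assumption: device visibility is passed through an env var.
            "env": {"NVIDIA_VISIBLE_DEVICES": ",".join(indices)},
        }


plugin = ToyDevicePlugin()
plugin.discover(gpu_count=4)
print(plugin.advertise())
print(plugin.allocate(1))
```

The point of the sketch is the division of labor: the scheduler only ever sees the integer returned by `advertise`, while the device paths and environment variables stay a node-local concern.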
An emerging complement to the device plugin framework is the Container Device Interface (CDI), a newer specification for injecting devices into containers in a more portable and robust manner. CDI decouples device injection from the container runtime, which simplifies support across different runtimes (containerd, CRI-O). [Source: https://medium.com/@rifewang/overview-of-kubernetes-gpu-scheduling-device-plugin-cdi-nfd-and-gpu-operator-48a7c4213b28]
Analogy: Think of the device plugin like a hotel concierge for special equipment. The hotel (Kubernetes) doesn’t know by default which rooms have hot tubs. The concierge (device plugin) surveys the property, reports to the front desk (kubelet/API server) which rooms have hot tubs, and when a guest requests a hot-tub room, the concierge hands them the correct key and ensures the hot tub is ready.
Figure 2.1: Device Plugin Architecture — Discovery, Advertisement, and Allocation
flowchart LR
subgraph Node["GPU Worker Node"]
HW["Physical GPU Hardware\n(e.g., NVIDIA A100)"]
NVML["NVML\n(NVIDIA Management Library)"]
DP["Device Plugin\n(DaemonSet)"]
K["kubelet"]
CT["NVIDIA Container Toolkit"]
C["Container\n(AI Workload)"]
end
subgraph Cluster["Kubernetes Control Plane"]
API["API Server"]
SCH["Scheduler"]
end
HW -- "exposes hardware state" --> NVML
NVML -- "1. Discovery:\ndetect GPUs" --> DP
DP -- "2. Advertisement:\nnvidia.com/gpu extended resource\n(gRPC over Unix socket)" --> K
K -- "registers capacity" --> API
SCH -- "reads available capacity" --> API
DP -- "3. Allocation:\nassign device paths +\nenv vars" --> K
K -- "inject /dev/nvidia0" --> CT
CT -- "mounts GPU devices +\nCUDA libraries at runtime" --> C
Container Runtime GPU Passthrough: NVIDIA Container Toolkit
Even with a device plugin advertising GPU resources, the container itself still needs to be able to use those GPUs. Running a GPU-accelerated workload inside a container requires that the NVIDIA drivers, CUDA libraries, and device files from the host be accessible inside the container namespace.
The NVIDIA Container Toolkit (formerly nvidia-docker2) is the component that makes this passthrough transparent. It hooks into the container runtime (containerd or CRI-O) and, at container launch time, injects the correct GPU devices and driver libraries into the container without requiring those libraries to be baked into every container image. This means an AI workload container only needs to include the CUDA toolkit at the application layer — the driver itself lives on the host and is injected at runtime. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]
Worked Example — What happens when a GPU pod starts:
- Pod spec requests nvidia.com/gpu: 1.
- Kubernetes scheduler finds a node with available nvidia.com/gpu capacity.
- kubelet calls the NVIDIA device plugin to allocate one GPU.
- Device plugin returns device paths (e.g., /dev/nvidia0) and environment variables.
- NVIDIA Container Toolkit intercepts the container start, mounts /dev/nvidia0, and injects CUDA libraries.
- The container starts with full GPU access.
Figure 2.2: GPU Pod Startup Sequence
sequenceDiagram
participant User as User / Job Controller
participant SCH as Kubernetes Scheduler
participant K as kubelet
participant DP as NVIDIA Device Plugin
participant CT as NVIDIA Container Toolkit
participant C as Container
User->>SCH: Submit pod spec (nvidia.com/gpu: 1)
SCH->>SCH: Find node with available nvidia.com/gpu capacity
SCH->>K: Bind pod to node
K->>DP: Allocate 1 GPU (gRPC call)
DP-->>K: Return device paths (/dev/nvidia0) + env vars
K->>CT: Start container with device allocation
CT->>CT: Mount /dev/nvidia0 + inject CUDA libraries
CT-->>C: Container starts with full GPU access
Key Takeaway: GPUs enter Kubernetes as extended resources via the device plugin framework. The NVIDIA device plugin runs as a DaemonSet, discovers GPUs with NVML, advertises them as nvidia.com/gpu resources, and the Container Toolkit handles runtime injection of GPU drivers and device files into containers.
Section 2: NVIDIA GPU Operator
Managing GPU nodes manually — installing drivers, configuring the container toolkit, deploying the device plugin, setting up monitoring — is a fragile, error-prone process that breaks whenever a node is replaced, an OS is upgraded, or a CUDA version changes. The NVIDIA GPU Operator solves this by applying the Kubernetes Operator pattern to the entire GPU software stack.
A Kubernetes Operator is a controller that encodes operational knowledge about a specific piece of software into a custom controller loop. Rather than a human following a runbook, the Operator continuously watches cluster state and reconciles it toward the desired configuration. The GPU Operator applies this model to the full lifecycle of GPU software on every node. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html]
What the GPU Operator Manages
The GPU Operator provisions and manages the following components automatically on every GPU-enabled node: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]
| Component | Purpose |
|---|---|
| NVIDIA Drivers | Kernel module that enables CUDA on the node; the Operator installs these as a container |
| NVIDIA Container Toolkit | Hooks the container runtime to inject GPU access into pods |
| Kubernetes Device Plugin | Advertises nvidia.com/gpu extended resources to the scheduler |
| GPU Feature Discovery (GFD) | Labels nodes with detailed GPU metadata (model, memory, CUDA capabilities) |
| DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory) for Prometheus |
| MIG Manager | Configures MIG partitioning on supported hardware |
| CUDA Validator | Runs a validation workload to confirm CUDA is functional before scheduling begins |
Analogy: Without the GPU Operator, managing GPU nodes is like individually maintaining a fleet of specialized vehicles by hand — oil changes, tyre rotations, and software updates all done manually per vehicle. The GPU Operator is the automated maintenance depot that keeps every vehicle in the fleet correctly configured at all times, and refuses to send a vehicle on a job until it passes inspection.
Operator Installation and Configuration
The GPU Operator is installed via Helm. The prerequisites are straightforward: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]
- kubectl and helm CLIs available on a client machine
- A container runtime of containerd or CRI-O (not Docker directly)
- All GPU worker nodes must share the same OS version (required for the driver container approach)
- If Pod Security Admission is enforced, the GPU Operator namespace must be labeled as privileged
Worked Example — Installing the GPU Operator:
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Create the namespace and label it for privileged pod security
kubectl create namespace gpu-operator
kubectl label --overwrite namespace gpu-operator \
pod-security.kubernetes.io/enforce=privileged
# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--wait
After installation, the Operator’s controller watches for nodes with NVIDIA GPUs and begins the automated provisioning workflow. [Source: https://www.infracloud.io/blogs/guide-to-nvidia-gpu-operator-in-kubernetes/]
Automatic Driver Management and CUDA Toolkit Provisioning
One of the most operationally valuable features of the GPU Operator is containerized driver management. Instead of installing NVIDIA drivers directly onto the host OS (which creates version lock-in and complicates node upgrades), the Operator deploys drivers as a privileged container. This driver container mounts the necessary kernel interfaces and loads the NVIDIA kernel module, but the driver files themselves live inside the container image.
The automated provisioning workflow for each new GPU node follows three steps: [Source: https://sagar-parmar.medium.com/nvidia-gpu-operator-explained-simplifying-gpu-workloads-on-kubernetes-436e0a60d0ac]
- Discovery: The Operator identifies nodes with NVIDIA GPUs using node labels and hardware detection.
- Installation and Configuration: It deploys the driver container, configures the Container Toolkit in the runtime, and starts the device plugin and monitoring stack.
- Validation: The CUDA Validator runs a test workload to confirm the full stack is functional. Only after validation passes does the node become eligible to receive GPU workloads.
This validation gate is significant: it prevents misconfigured nodes from silently accepting GPU pods that would then fail at runtime.
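The three-step workflow and its validation gate can be pictured as a small retry loop. This is only a sketch under stated assumptions: `provision_node`, the `validate` callback, and the `max_attempts` cap are hypothetical names for illustration, and the real Operator reconciles continuously rather than stopping after a fixed number of attempts.

```python
from typing import Callable


def provision_node(validate: Callable[[], bool], max_attempts: int = 3) -> dict:
    """Walk the three provisioning steps and gate scheduling on validation."""
    log = []
    for attempt in range(1, max_attempts + 1):
        log.append(f"attempt {attempt}: discovery")
        log.append(f"attempt {attempt}: install and configure")
        if validate():
            # Only a passing validation makes the node eligible for GPU pods.
            log.append(f"attempt {attempt}: validation passed")
            return {"schedulable": True, "attempts": attempt, "log": log}
        log.append(f"attempt {attempt}: validation failed, retrying")
    # Validation never passed, so the node stays out of the GPU pool.
    return {"schedulable": False, "attempts": max_attempts, "log": log}


# Hypothetical scenario: the first validation run fails, the second passes.
outcomes = iter([False, True])
result = provision_node(lambda: next(outcomes))
print(result["schedulable"], result["attempts"])
```

The design point is that "schedulable" is an output of validation, never an assumption made at install time.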
Figure 2.3: GPU Operator Automated Node Provisioning Workflow
flowchart TD
A["New GPU Node Joins Cluster"] --> B["GPU Operator Controller\nDetects Node via Labels"]
B --> C["Step 1: Discovery\nIdentify NVIDIA GPU hardware\nvia node labels + hardware probes"]
C --> D["Step 2: Installation and Configuration\nDeploy driver container\nConfigure Container Toolkit in runtime\nStart device plugin + DCGM monitoring"]
D --> E["Step 3: Validation\nCUDA Validator runs test workload\nConfirms full software stack is functional"]
E --> F{Validation Passed?}
F -- "Yes" --> G["Node Marked Ready\nAccepts GPU workloads\nnvidia.com/gpu resources advertised"]
F -- "No" --> H["Node Remains Unschedulable\nOperator logs error\nRetries provisioning"]
H --> C
Node Labeling and GPU Feature Discovery
GPU Feature Discovery (GFD) is a component deployed by the GPU Operator that automatically labels Kubernetes nodes with rich GPU metadata. These labels enable fine-grained scheduling decisions.
Example labels applied by GFD to a node with an A100 GPU:
nvidia.com/gpu.present=true
nvidia.com/gpu.product=A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.count=8
nvidia.com/cuda.driver.major=525
nvidia.com/mig.capable=true
[Source: https://www.spectrocloud.com/blog/the-real-world-guide-to-the-nvidia-gpu-operator-for-kubernetes-ai]
With these labels, a pod that requires an 80 GB A100 with MIG support can express that as a node affinity rule rather than relying on the cluster operator to manually maintain node pools.
Key Takeaway: The GPU Operator automates the entire GPU software lifecycle on Kubernetes nodes — drivers, runtime configuration, device plugin deployment, and monitoring — using a containerized, validated approach. GFD labels provide rich node metadata that enables intelligent scheduling decisions without manual node management.
Section 3: GPU Scheduling and Resource Requests
Requesting GPUs in Pod Specs
Requesting a GPU in a Kubernetes pod spec uses the same resources.requests and resources.limits fields as CPU and memory, with two constraints. First, GPUs are specified under limits: you may omit requests (Kubernetes then uses the limit as the request), but if you set both they must be equal, and you cannot set a request without a limit. Second, GPUs are not divisible by default, so you request whole numbers.
Worked Example — Basic GPU pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
The tolerations section is important: GPU nodes are typically tainted with nvidia.com/gpu:NoSchedule to prevent non-GPU workloads from being scheduled onto expensive GPU nodes. A pod must explicitly tolerate this taint to be eligible for placement. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]
For multi-GPU workloads (e.g., distributed training across 8 GPUs), simply increase the request:
resources:
  limits:
    nvidia.com/gpu: 8
This requests 8 GPUs on a single node, which is every GPU on a typical 8-GPU server. For workloads that span multiple nodes, you would use a framework like PyTorch’s DistributedDataParallel combined with a Kubernetes Job controller, which is covered in Chapter 4.
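When many similar pods are generated by tooling, the same spec can be built programmatically. The sketch below is one possible approach, assuming only the fields shown in the worked example; the helper name `gpu_pod_manifest` is hypothetical. It emits JSON, which kubectl accepts in place of YAML.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int, command: list[str]) -> dict:
    """Build a Pod manifest that requests whole GPUs via limits."""
    if gpus < 1:
        raise ValueError("GPU counts must be whole numbers of at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "command": command,
                # Only limits is set; the request defaults to the same value.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
            # Tolerate the taint commonly placed on GPU nodes.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


manifest = gpu_pod_manifest(
    name="gpu-training-job",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpus=8,
    command=["python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and passing it to `kubectl apply -f` would submit the pod; the validation on `gpus` mirrors the whole-number constraint described above.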
Node Affinity and GPU Topology Awareness
When a cluster contains multiple GPU types — for example, A100 nodes for training and T4 nodes for inference — raw GPU count requests are insufficient. A pod requesting nvidia.com/gpu: 1 might land on either node type, leading to performance inconsistency or cost inefficiency.
Node affinity solves this by allowing pods to specify hardware requirements as label match expressions: [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - A100-SXM4-80GB
This pod will only schedule on nodes with an 80 GB A100, regardless of what other GPU types are available in the cluster.
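To make the matching rule concrete, here is a minimal sketch of how a required node affinity term could be evaluated against node labels. It is an illustration rather than the scheduler's implementation: the function names and the sample label sets are hypothetical. It follows the documented semantics that expressions inside one term are ANDed and that multiple terms are ORed.

```python
def _as_int(text):
    """Parse a label value as an integer, or return None if it is not one."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def term_matches(labels: dict[str, str], expressions: list[dict]) -> bool:
    """True when every match expression in a single term holds (AND)."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        present = key in labels
        if op == "In":
            ok = present and labels[key] in values
        elif op == "NotIn":
            ok = not present or labels[key] not in values
        elif op == "Exists":
            ok = present
        elif op == "DoesNotExist":
            ok = not present
        elif op in ("Gt", "Lt"):
            lhs = _as_int(labels.get(key))
            rhs = _as_int(values[0]) if values else None
            if lhs is None or rhs is None:
                ok = False
            else:
                ok = lhs > rhs if op == "Gt" else lhs < rhs
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


def node_matches(labels: dict[str, str], terms: list[list[dict]]) -> bool:
    """True when at least one term matches (terms are ORed)."""
    return any(term_matches(labels, term) for term in terms)


# Illustrative label sets; real values come from GFD on each node.
a100_node = {"nvidia.com/gpu.product": "A100-SXM4-80GB", "nvidia.com/gpu.memory": "81920"}
t4_node = {"nvidia.com/gpu.product": "Tesla-T4", "nvidia.com/gpu.memory": "15360"}
want_a100 = [{"key": "nvidia.com/gpu.product", "operator": "In", "values": ["A100-SXM4-80GB"]}]
print(term_matches(a100_node, want_a100), term_matches(t4_node, want_a100))
```

The `Gt` operator is useful with numeric labels such as `nvidia.com/gpu.memory`, for example to require more than 40000 MiB without naming a specific GPU model.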
Topology-Aware Scheduling addresses a more subtle problem in multi-GPU training: not all GPUs on a node have equal bandwidth to each other. On high-end NVIDIA hardware, GPUs are connected via NVLink, a proprietary high-speed interconnect that provides up to 600 GB/s of bandwidth on the A100 generation. GPUs that communicate over PCIe 4.0 x16 instead are limited to about 32 GB/s, nearly a 20x difference. [Source: https://oneuptime.com/blog/post/2026-01-19-kubernetes-gpu-workload-scheduling/view]
For a large language model training run that moves gradient tensors between GPUs thousands of times per second, this bandwidth difference can dominate training time. Topology-aware scheduling ensures that multi-GPU pods are placed on nodes (and across nodes) where the GPUs are interconnected via the fastest available path.
Comparison: GPU Interconnect Bandwidth
| Interconnect | Bandwidth | Typical Use Case |
|---|---|---|
| NVLink (4th gen, H100) | 900 GB/s | Multi-GPU LLM training on a single node |
| NVLink (3rd gen, A100) | 600 GB/s | Distributed training, large model sharding |
| PCIe 4.0 x16 | 32 GB/s | Inference, single-GPU training |
| InfiniBand HDR (inter-node) | 200 Gb/s (25 GB/s) | Multi-node distributed training |
[Source: https://oneuptime.com/blog/post/2026-01-19-kubernetes-gpu-workload-scheduling/view]
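A quick back-of-the-envelope calculation shows why these numbers matter. The sketch below divides a payload size by each link's peak bandwidth from the table. It is deliberately naive: it ignores latency, protocol overhead, the collective algorithm in use, and any overlap of communication with compute. The 7-billion-parameter fp16 model is a hypothetical example chosen for round numbers.

```python
# Peak bandwidths from the comparison table, normalized to GB/s.
BANDWIDTH_GB_PER_S = {
    "NVLink 4th gen (H100)": 900.0,
    "NVLink 3rd gen (A100)": 600.0,
    "PCIe 4.0 x16": 32.0,
    "InfiniBand HDR": 200.0 / 8,  # 200 Gb/s expressed in GB/s
}


def transfer_seconds(payload_gb: float, link: str) -> float:
    """Idealized time to move payload_gb across the named link."""
    return payload_gb / BANDWIDTH_GB_PER_S[link]


# Hypothetical payload: 7 billion fp16 gradients at 2 bytes each.
payload_gb = 7e9 * 2 / 1e9

for link in BANDWIDTH_GB_PER_S:
    print(f"{link}: {transfer_seconds(payload_gb, link):.4f} s")
```

Under these assumptions a full gradient exchange takes roughly 0.023 s over A100-generation NVLink and roughly 0.44 s over PCIe 4.0 x16, which is the same 18.75x ratio as the raw bandwidth figures.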
Extended Resources and Custom Schedulers
The standard Kubernetes scheduler understands extended resources like nvidia.com/gpu but treats them as opaque integer quantities. It can count available units and subtract allocations, but it cannot reason about GPU topology, memory bandwidth, or NUMA locality.
Extended resources are the mechanism by which nodes expose non-standard hardware capacities. They follow the naming convention <vendor>/<resource> (e.g., nvidia.com/gpu, amd.com/gpu, intel.com/gpu). Any resource following this pattern can be requested in pod specs and tracked by the scheduler. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]
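The "opaque integer" behavior can be sketched in a few lines. This is a simplified illustration of the accounting, not the scheduler's code; the function name `fits` and the sample quantities are hypothetical.

```python
def fits(allocatable: dict[str, int], in_use: dict[str, int], request: dict[str, int]) -> bool:
    """True when every requested quantity is still free on the node."""
    return all(
        allocatable.get(resource, 0) - in_use.get(resource, 0) >= quantity
        for resource, quantity in request.items()
    )


# Hypothetical node: 8 GPUs advertised, 6 already allocated to other pods.
node_allocatable = {"nvidia.com/gpu": 8}
node_in_use = {"nvidia.com/gpu": 6}

print(fits(node_allocatable, node_in_use, {"nvidia.com/gpu": 2}))
print(fits(node_allocatable, node_in_use, {"nvidia.com/gpu": 4}))
```

Nothing in this check knows which two GPUs are free or how they are wired together, which is exactly the gap that topology-aware placement has to fill.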
For workloads that require topology-aware placement, the NVIDIA GPU Operator integrates with the Kubernetes Topology Manager and the NUMA-aware scheduler to co-locate GPU and CPU resources on the same NUMA node — reducing memory access latency for large model training. [Source: https://www.perfectscale.io/blog/kubernetes-gpu]
Key Takeaway: GPU resource requests use the nvidia.com/gpu extended resource in pod specs. Node affinity rules (matched against GFD labels) enable targeting specific GPU models. For distributed training, topology-aware scheduling keeps GPU-to-GPU traffic on NVLink, which offers nearly 20x the bandwidth of a PCIe fallback path.
Section 4: GPU Sharing and Multi-Tenancy
A single NVIDIA A100 GPU costs roughly $10,000–$30,000. Running a single small inference workload on that GPU at 5% utilization is financially untenable at scale. GPU sharing is the practice of running multiple workloads on a single GPU simultaneously or in rapid succession, increasing utilization and reducing per-workload cost.
NVIDIA provides three distinct sharing mechanisms, each with different trade-offs across isolation, performance, and hardware requirements: [Source: https://rafay.co/ai-and-cloud-native-blog/demystifying-fractional-gpus-in-kubernetes-mig-time-slicing-and-custom-schedulers]
Figure 2.4: GPU Sharing Strategy Architecture Comparison
flowchart LR
subgraph Physical["One Physical GPU (e.g., A100 80GB)"]
subgraph MIG["MIG Partitioning\n(MIG-capable hardware required)"]
MI1["MIG Instance 1\n10 GB — dedicated HW"]
MI2["MIG Instance 2\n10 GB — dedicated HW"]
MI3["MIG Instance 3–7\n..."]
end
subgraph TS["Time-Slicing\n(any NVIDIA GPU)"]
V1["Virtual GPU 1\nshared VRAM"]
V2["Virtual GPU 2\nshared VRAM"]
V3["Virtual GPU 3\nshared VRAM"]
V4["Virtual GPU 4\nshared VRAM"]
end
subgraph MPS["MPS — Multi-Process Service\n(any NVIDIA GPU, not with MIG)"]
MPSd["Shared CUDA Context\n(MPS Daemon)"]
P1["Process 1"]
P2["Process 2"]
P3["Process 3"]
end
end
MI1 -- "hard isolation\nno interference" --> MI2
V1 -. "no isolation\ncontext switch jitter" .-> V2
P1 -- "low-overhead\nmultiplexing" --> MPSd
P2 -- "low-overhead\nmultiplexing" --> MPSd
P3 -- "low-overhead\nmultiplexing" --> MPSd
| Strategy | Isolation Level | Hardware Requirement | Latency Impact | Best Use Case |
|---|---|---|---|---|
| MIG | Hardware (hard) | MIG-capable data center GPUs (A100, A30, H100) | None (dedicated partitions) | Production inference with SLA requirements |
| Time-Slicing | None (soft) | Any NVIDIA GPU | Yes (context switch jitter) | Dev/test, bursty workloads |
| MPS | Process-level (soft) | Any NVIDIA GPU (not with MIG) | Low overhead | Throughput-focused inference, same model replicas |
NVIDIA Multi-Instance GPU (MIG) Partitioning
MIG (Multi-Instance GPU) is a hardware feature introduced with NVIDIA’s Ampere architecture (A100, A30) and available on later data center GPUs such as the H100 and H200. It is not present on every Ampere-or-newer card; the L40, for example, does not support MIG. It allows a single physical GPU to be partitioned into up to seven independent GPU instances, each with its own dedicated compute engines, memory bandwidth, L2 cache, and DRAM allocation. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
The key word is dedicated: MIG partitions are enforced in hardware. One workload running in a MIG instance cannot read memory from another instance, cannot monopolize shared resources, and cannot interfere with another instance’s performance. This is fundamentally different from software-based sharing.
Analogy: MIG is like subdividing a large conference room into smaller breakout rooms using permanent walls. Each team gets their own space with guaranteed resources. The shared lobby (PCIe interface) is the only truly shared resource.
MIG partition profiles on an A100 80GB GPU:
| Profile | Compute Fraction | Memory | Max Instances |
|---|---|---|---|
| 1g.10gb | 1/7 | ~10 GB | 7 |
| 2g.20gb | 2/7 | ~20 GB | 3 |
| 3g.40gb | 3/7 | ~40 GB | 2 |
| 4g.40gb | 4/7 | ~40 GB | 1 |
| 7g.80gb | 7/7 | ~80 GB | 1 (whole GPU) |
[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
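One practical use of the table is picking the smallest slice that fits a model. The sketch below encodes the table rows as data and does a simple lookup. It assumes the nominal memory figures from the table (usable memory is somewhat lower in practice), and the helper names are hypothetical. The resource-name helper assumes the `nvidia.com/mig-<profile>` naming used elsewhere in this section.

```python
# (profile, compute slices out of 7, nominal memory in GB, max instances)
A100_80GB_PROFILES = [
    ("1g.10gb", 1, 10, 7),
    ("2g.20gb", 2, 20, 3),
    ("3g.40gb", 3, 40, 2),
    ("4g.40gb", 4, 40, 1),
    ("7g.80gb", 7, 80, 1),
]


def smallest_profile(required_gb: float) -> str:
    """Return the first (smallest) profile whose nominal memory fits."""
    for name, _slices, memory_gb, _max_instances in A100_80GB_PROFILES:
        if memory_gb >= required_gb:
            return name
    raise ValueError(f"{required_gb} GB does not fit on a single A100 80GB")


def mig_resource_name(profile: str) -> str:
    """Extended resource name a pod would request for this profile."""
    return f"nvidia.com/mig-{profile}"


for need_gb in (8, 16, 30, 60):
    profile = smallest_profile(need_gb)
    print(f"{need_gb} GB -> {profile} ({mig_resource_name(profile)})")
```

A real sizing decision should leave headroom for activations, KV caches, and framework overhead rather than matching the model's weight size exactly.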
Worked Example — Enabling MIG with the GPU Operator:
The GPU Operator includes a MIG Manager component that handles MIG configuration. To enable MIG on a node and create seven 1g.10gb partitions:
# Label the node with the desired MIG configuration
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
# The GPU Operator's MIG Manager detects the label change and
# reconfigures the GPU automatically. Verify the result:
kubectl describe node <gpu-node> | grep nvidia.com/mig
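After configuration, the node advertises the slices as an extended resource. With the Operator’s mixed MIG strategy this appears as nvidia.com/mig-1g.10gb: 7; with the single strategy the slices are exposed under nvidia.com/gpu instead. Pods request MIG slices explicitly: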
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
Time-Slicing GPUs Across Pods
Time-slicing is the simplest GPU sharing mechanism and works on any NVIDIA GPU. It configures the device plugin to advertise a GPU as multiple virtual units. The GPU hardware rapidly switches between the active workloads, giving each workload a time slice of compute — analogous to how a CPU shares time between processes.
Analogy: Time-slicing is like one chef trying to cook for multiple tables simultaneously by rapidly switching attention between them. Each table gets served, but none gets the chef’s undivided attention, and complex dishes may take longer because of the constant context switching.
Time-slicing is configured via a ConfigMap applied to the device plugin: [Source: https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/]
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        # Each physical GPU is advertised as 4 schedulable replicas
        - name: nvidia.com/gpu
          replicas: 4
This configuration tells the device plugin to advertise each physical GPU as 4 virtual GPUs. A node with 2 physical GPUs will appear to have nvidia.com/gpu: 8. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
Critical limitations of time-slicing:
- No memory isolation: all pods share the full GPU VRAM with no per-pod limit, so one pod can exhaust memory and trigger out-of-memory errors in the others
- No compute isolation: a compute-intensive workload can starve others
- Latency jitter: the context switching introduces unpredictable latency spikes
- No fair scheduling: the GPU does not guarantee equal time slices
Time-slicing is appropriate for development clusters, CI/CD pipelines, or bursty workloads where occasional latency is acceptable. It is not suitable for production inference with strict SLA requirements. [Source: https://rafay.co/ai-and-cloud-native-blog/demystifying-fractional-gpus-in-kubernetes-mig-time-slicing-and-custom-schedulers]
MPS (Multi-Process Service) for Concurrent Access
MPS (Multi-Process Service) is an NVIDIA software feature that allows multiple CUDA processes to share a single GPU with lower switching overhead than time-slicing. Instead of context-switching between separate CUDA contexts, MPS funnels all processes through a single CUDA context via a server daemon. This reduces overhead and increases throughput when running multiple instances of the same model. [Source: https://github.com/rh-aiservices-bu/gpu-partitioning-guide]
MPS is ideal for replicated inference: running four identical copies of a medium-sized model on one GPU to handle parallel requests with higher aggregate throughput than time-slicing would provide.
Important constraints on MPS: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
- MPS and time-slicing cannot be enabled simultaneously
- MPS is not supported on devices where MIG is enabled
- MPS provides limited fault isolation: a crash in one CUDA process can destabilize the shared context
- MPS is best suited for workloads where all tenants are trusted (i.e., internal inference replicas, not multi-tenant user workloads)
Enabling MPS through the GPU Operator follows the same ConfigMap-based pattern as time-slicing, but uses the mps sharing strategy instead.
Combining Strategies: MIG + Time-Slicing
For maximum density on A100 or H100 hardware, MIG partitioning and time-slicing can be combined. A single A100 can be divided into 7 MIG instances (1g.10gb), and each MIG instance can then be time-sliced into 4 virtual replicas. The result: up to 28 GPU resource allocations from a single physical GPU. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
A100 (1 physical GPU)
├── MIG Instance 1 (1g.10gb) → 4 time-sliced replicas
├── MIG Instance 2 (1g.10gb) → 4 time-sliced replicas
├── ...
└── MIG Instance 7 (1g.10gb) → 4 time-sliced replicas
= 28 total pod allocations
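This configuration is powerful for workloads that are small enough to fit in 10 GB but benefit from MIG’s hard memory isolation between the 7 instances. The replicas that time-slice a single instance still share that instance’s memory and compute without isolation from one another.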
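The replica arithmetic for time-slicing, with or without MIG, is a simple product. The sketch below reproduces the two figures quoted in this section; the function names are hypothetical, and the even memory split is only a planning heuristic because time-slicing does not enforce any per-replica memory limit.

```python
def advertised_slots(physical_gpus: int, replicas: int, mig_instances_per_gpu: int = 1) -> int:
    """Number of schedulable units advertised for the given configuration."""
    if min(physical_gpus, replicas, mig_instances_per_gpu) < 1:
        raise ValueError("all counts must be at least 1")
    return physical_gpus * mig_instances_per_gpu * replicas


def even_share_gb(memory_gb: float, replicas: int) -> float:
    """Memory per replica if tenants split it evenly (not enforced)."""
    return memory_gb / replicas


# Two physical GPUs, time-sliced into 4 replicas each.
print(advertised_slots(physical_gpus=2, replicas=4))  # 8

# One A100 split into 7 MIG instances, each time-sliced into 4 replicas.
print(advertised_slots(physical_gpus=1, replicas=4, mig_instances_per_gpu=7))  # 28

# Planning figure for a 10 GB MIG instance shared by 4 replicas.
print(even_share_gb(10, 4))  # 2.5
```

The last number is the useful sanity check: at 28 allocations per A100, each workload has roughly 2.5 GB to work with if its neighbors behave.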
Cost Optimization Through GPU Sharing Strategies
The financial impact of GPU sharing is substantial. Production case studies from CNCF member organizations show that applying advanced GPU scheduling and sharing strategies can improve cluster GPU utilization from approximately 13% (typical for naively scheduled clusters) to over 80%. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]
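The effect on cost can be estimated by dividing the hourly price of a GPU by the fraction of time it does useful work. The sketch below applies that to the 13% and 80% utilization figures above. The 4.00 per hour price is a hypothetical placeholder, not a quoted rate, and the function name is illustrative.

```python
def cost_per_utilized_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one hour of useful GPU work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


HOURLY_PRICE = 4.00  # hypothetical price per GPU-hour

before = cost_per_utilized_gpu_hour(HOURLY_PRICE, 0.13)
after = cost_per_utilized_gpu_hour(HOURLY_PRICE, 0.80)

print(f"at 13% utilization: {before:.2f} per useful GPU-hour")
print(f"at 80% utilization: {after:.2f} per useful GPU-hour")
print(f"improvement factor: {before / after:.2f}x")
```

The improvement factor depends only on the two utilization figures (0.80 / 0.13, about 6.15x), so it holds whatever the actual hourly price is.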
Decision Framework — Choosing a GPU Sharing Strategy:
Does your workload require strict memory and compute isolation?
  YES → Use MIG (requires MIG-capable hardware)
  NO ↓
Does your hardware support MIG?
  NO → Is throughput more important than latency?
    YES → Use MPS (same-model replicas, trusted tenants)
    NO → Use Time-Slicing (dev/test, bursty, or heterogeneous workloads)
  YES ↓
Do you need more than 7 partitions per GPU?
  YES → Combine MIG + Time-Slicing (up to 28 allocations per A100)
  NO → Use MIG alone for cleanest isolation
Figure 2.5: GPU Sharing Strategy Decision Tree
flowchart TD
A["New GPU Workload to Schedule"] --> B{Requires strict memory\nand compute isolation?}
B -- "Yes" --> C{Hardware\nsupports MIG?}
C -- "Yes" --> D{Need more than\n7 partitions per GPU?}
D -- "Yes" --> E["MIG + Time-Slicing\n(up to 28 allocations per A100)"]
D -- "No" --> F["MIG Alone\nHardware-enforced isolation\nBest for production SLA workloads"]
C -- "No\n(no MIG support)" --> G{Is throughput more\nimportant than latency?}
B -- "No" --> G
G -- "Yes\n(trusted tenants,\nsame-model replicas)" --> H["MPS\nLow-overhead shared CUDA context\nBest for inference replicas"]
G -- "No\n(dev/test,\nbursty workloads)" --> I["Time-Slicing\nSimplest setup, no isolation\nBest for development clusters"]
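The decision tree in Figure 2.5 can also be written as a small function, which is handy for encoding the policy in admission tooling or documentation tests. This is a sketch of the figure's branches only; the function and parameter names are hypothetical.

```python
def choose_sharing_strategy(
    needs_isolation: bool,
    mig_capable: bool,
    throughput_over_latency: bool,
    needs_more_than_7: bool = False,
) -> str:
    """Follow the branches of the GPU sharing decision tree."""
    if needs_isolation and mig_capable:
        # Hardware-enforced isolation is both required and available.
        return "MIG + time-slicing" if needs_more_than_7 else "MIG"
    # Either isolation is not required or the hardware cannot provide it.
    return "MPS" if throughput_over_latency else "time-slicing"


# Production inference with an SLA on MIG-capable hardware.
print(choose_sharing_strategy(True, True, False))

# Replicated inference from trusted tenants on hardware without MIG.
print(choose_sharing_strategy(False, False, True))

# Development cluster with bursty, mixed workloads.
print(choose_sharing_strategy(False, True, False))
```

Note the fall-through case: a workload that needs isolation on hardware without MIG still lands on MPS or time-slicing, neither of which provides that isolation, so in practice that combination is a signal to move the workload to different hardware.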
Key Takeaway: MIG provides hardware-enforced isolation and is the right choice for production workloads with SLA requirements. Time-slicing is simpler but offers no isolation and introduces latency jitter — best for dev/test environments. MPS reduces sharing overhead for throughput-oriented inference replicas but provides weaker fault isolation. The strategies cannot all be combined: MIG + MPS is unsupported, and time-slicing + MPS cannot coexist. The right choice depends on isolation requirements first, hardware second, workload patterns third.
Chapter Summary
GPU management is one of the most operationally complex domains in Kubernetes, but the tooling has matured significantly. The key architectural pieces fit together as a layered stack:
- Hardware foundation: GPU vendor choice (NVIDIA, AMD, Intel) determines the available management tooling and software ecosystem.
- Device plugin layer: Extended resources (nvidia.com/gpu) bridge physical hardware and the Kubernetes scheduler via gRPC DaemonSets.
- Container runtime layer: The NVIDIA Container Toolkit injects driver access into containers at runtime, decoupling workload images from host driver versions.
- Operator layer: The GPU Operator automates the full GPU software lifecycle — drivers, toolkit, device plugin, GFD labels, DCGM monitoring — with a validation gate that prevents misconfigured nodes from receiving workloads.
- Scheduling layer: Node affinity rules on GFD-generated labels, topology-aware scheduling for NVLink bandwidth, and extended resource requests enable intelligent workload placement.
- Sharing layer: MIG, time-slicing, and MPS unlock multi-tenancy and cost optimization, with trade-offs in isolation, latency, and hardware requirements.
Clusters that implement this full stack can move GPU utilization from the typical 13% baseline to over 80%, dramatically reducing the cost of running AI workloads at scale. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]
Key Terms
| Term | Definition |
|---|---|
| CUDA | Compute Unified Device Architecture; NVIDIA’s parallel computing platform and programming model used by AI frameworks like PyTorch and TensorFlow |
| Device Plugin | A Kubernetes framework component that runs as a DaemonSet to discover, advertise, and allocate special hardware resources (GPUs, FPGAs, etc.) as extended resources |
| Extended Resources | Custom, vendor-namespaced resource types (e.g., nvidia.com/gpu) that nodes can advertise and pods can request beyond the built-in CPU and memory |
| GPU Operator | An NVIDIA Kubernetes Operator that automates the full GPU software stack lifecycle: drivers, Container Toolkit, device plugin, GFD, and DCGM monitoring |
| MIG (Multi-Instance GPU) | A hardware-level GPU partitioning feature available on select NVIDIA data center GPUs from the Ampere architecture onward (e.g., A100, A30, H100) that divides one physical GPU into up to seven isolated instances with dedicated compute and memory |
| Time-Slicing | A software-based GPU sharing mechanism that overcommits a GPU by advertising it as multiple virtual units, rapidly context-switching between workloads; provides no memory or compute isolation |
| MPS (Multi-Process Service) | An NVIDIA software feature enabling multiple CUDA processes to share a GPU through a single shared CUDA context, reducing overhead compared to time-slicing; best for throughput-oriented inference |
| Topology Awareness | The practice of scheduling multi-GPU pods to nodes and NUMA domains where GPUs are interconnected via high-bandwidth paths (NVLink) rather than lower-bandwidth PCIe |
| GFD (GPU Feature Discovery) | A GPU Operator component that automatically labels Kubernetes nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) to enable label-based scheduling |
| NVML | NVIDIA Management Library; the C-based API used by the device plugin and monitoring tools to query GPU hardware state, utilization, and configuration |
| DCGM | Data Center GPU Manager; NVIDIA’s monitoring and management toolkit that the GPU Operator deploys to expose GPU metrics (utilization, temperature, memory) to Prometheus |
| NVLink | NVIDIA’s high-speed GPU interconnect providing up to 600 GB/s (A100) or 900 GB/s (H100) of aggregate bandwidth per GPU between GPUs on the same node; critical for distributed training performance |
| CDI (Container Device Interface) | An emerging standard for injecting hardware devices into containers in a runtime-agnostic manner, complementing the device plugin framework |
Chapter 3: Storage and Data Management for AI Pipelines
Learning Objectives
By the end of this chapter, you will be able to:
- Select appropriate storage solutions for different AI workload phases (training, checkpointing, inference)
- Configure PersistentVolumes, PVCs, StorageClasses, and CSI drivers for high-performance training data access
- Implement data caching strategies using Fluid, Alluxio, and JuiceFS to accelerate training I/O
- Design data pipelines that integrate with Kubernetes-native storage abstractions
Introduction
Running AI workloads on Kubernetes is not simply a question of scheduling GPUs: data is the other half of the equation. A model training job that must wait for its next batch of images from a slow network file system will leave expensive GPUs idle, wasting both time and money. Conversely, a checkpoint written to ephemeral or non-durable storage can be lost in a single pod eviction, taking weeks of training progress with it.
Think of storage for AI pipelines the way a professional kitchen thinks about ingredient staging: raw supplies come from a warehouse (object storage), prep stations keep frequently used items close at hand (cache tiers), and the chef’s worktop holds only what is needed right now (local NVMe). Getting the staging wrong means the chef waits — and in AI terms, the GPU waits.
This chapter covers the full journey from storage fundamentals through to advanced caching strategies, with concrete Kubernetes configuration examples at every step.
Section 1: Storage Requirements for AI Workloads
Before choosing a storage technology, it is essential to understand what AI workloads actually demand. The access patterns during training differ sharply from those during inference, and checkpoint storage introduces yet another distinct set of requirements.
Dataset Sizes and I/O Patterns in Training vs Inference
Training is overwhelmingly read-heavy. A large language model (LLM) training run may iterate over terabytes of tokenized text hundreds of times across multiple epochs. Each epoch performs a full sequential or randomised pass over the dataset, generating sustained, high-bandwidth read traffic — often tens to hundreds of gigabytes per second when aggregated across a distributed job. Individual read requests are typically small (tens of kilobytes for image patches, a few hundred kilobytes for audio segments), so IOPS (input/output operations per second) and throughput both matter simultaneously [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].
Inference is quite different. Model weights are loaded once at startup (a large sequential read), after which each serving request triggers small, latency-sensitive reads such as key-value cache or embedding lookups. Inference storage optimisation therefore focuses on low latency for random reads rather than raw throughput [Source: https://www.hyperstack.cloud/blog/case-study/types-of-storage-for-ai-workloads-what-you-need-to-know].
The table below summarises these contrasting profiles:
| Characteristic | Training | Inference |
|---|---|---|
| Primary operation | Sequential/random reads | Random reads (small KV) |
| Access frequency | Repeated (multi-epoch) | Continuous, latency-sensitive |
| Data volume | Tens to hundreds of TB | Model weights + serving cache |
| Throughput priority | Very high (GB/s aggregate) | Moderate |
| Latency priority | Moderate | Very low (ms) |
| Write pattern | Checkpoints (periodic burst) | Logs, output tokens (small) |
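To make the training column concrete, the aggregate read demand of a job can be estimated from per-GPU sample rates. The sketch below is a back-of-the-envelope calculation rather than a benchmark: the function name and the example figures (64 GPUs, 2,000 samples per second per GPU, 100 KB per sample) are illustrative assumptions, and it ignores caching entirely.

```python
def training_read_demand(num_gpus, samples_per_sec_per_gpu, avg_sample_bytes):
    """Return (aggregate_bytes_per_sec, aggregate_reads_per_sec) for a job.

    Assumes one storage read per sample and no caching, so the result is an
    upper bound on what the storage tier must sustain.
    """
    reads_per_sec = num_gpus * samples_per_sec_per_gpu
    bytes_per_sec = reads_per_sec * avg_sample_bytes
    return bytes_per_sec, reads_per_sec


# Hypothetical job: 64 GPUs, 2,000 samples/s per GPU, 100 KB per sample.
bandwidth, reads = training_read_demand(64, 2_000, 100_000)
print(f"{bandwidth / 1e9:.1f} GB/s aggregate, {reads:,} reads/s")
```

Dividing both results by the number of nodes gives a rough per-node target to compare against a volume's rated throughput and IOPS.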
POSIX vs Object Storage Tradeoffs
Most AI frameworks (PyTorch DataLoader, TensorFlow tf.data) were written expecting a POSIX filesystem — the standard interface that provides familiar directory trees, file opens, seeks, and symbolic links. POSIX storage maps naturally to training pipelines because no code changes are required: your dataset loader works identically whether the data lives on a local disk or a remote network filesystem that speaks POSIX.
Object storage (Amazon S3, Google Cloud Storage, MinIO) organises data as key-value blobs rather than hierarchical files. It offers massive scalability, pay-per-byte economics, and high durability, making it the natural home for raw datasets and model artefacts [Source: https://www.min.io/use-cases/ai-training]. The tradeoff is that object storage does not natively support POSIX operations such as truncate, append, fallocate, or symbolic links. Training code that relies on those operations will fail or require adaptation.
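The operations in question are easy to take for granted on a local disk. The following minimal sketch exercises append, truncate, and symbolic links on an ordinary POSIX filesystem; these are the calls that have no native equivalent on an object store. The file names are hypothetical and the example assumes a Unix-like host.

```python
import os
import tempfile


def posix_ops_demo(workdir):
    """Exercise append, truncate, and symlink on a POSIX filesystem."""
    path = os.path.join(workdir, "shard-000.bin")
    with open(path, "wb") as f:
        f.write(b"abcdef")

    # Append: extends the file in place. Object stores rewrite the whole object.
    with open(path, "ab") as f:
        f.write(b"gh")
    size_after_append = os.path.getsize(path)

    # Truncate: shrinks the file in place.
    os.truncate(path, 4)
    size_after_truncate = os.path.getsize(path)

    # Symlink: a second name that resolves to the same file.
    link = os.path.join(workdir, "latest.bin")
    os.symlink(path, link)
    with open(link, "rb") as f:
        data = f.read()

    return size_after_append, size_after_truncate, data


with tempfile.TemporaryDirectory() as workdir:
    print(posix_ops_demo(workdir))
```

Running a small probe like this against a mounted volume is a quick way to check whether a storage layer really behaves like POSIX before pointing a training job at it.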
Practical AI pipelines commonly adopt a hybrid approach: raw datasets live in object storage for durability and cost efficiency, while a POSIX-compatible cache layer (such as JuiceFS or a parallel file system) presents the data to training pods as a conventional filesystem [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].
Checkpoint and Model Artifact Storage Needs
Checkpoints are periodic snapshots of model weights, optimizer state, and training metadata written to disk so that a job can resume after a failure without restarting from epoch zero. Their storage requirements are distinctive:
- Write burst: a large model (hundreds of gigabytes) is flushed to disk in seconds, creating a very high momentary write load
- Read rarity: checkpoints are read only on restart, so read latency is less critical than write throughput
- Durability: loss of a checkpoint after a node failure defeats the entire purpose
Platform teams should therefore provision separate storage tiers for checkpoints vs training data, with high burst write throughput and strong durability guarantees for the former. Storage Quality-of-Service (QoS) — per-volume IOPS and bandwidth limits — prevents checkpoint writes from crowding out the read bandwidth that training pods depend on [Source: https://portworx.com/knowledge-hub/kubernetes-ai/].
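A rough sizing calculation helps when specifying the checkpoint tier. The sketch below assumes synchronous checkpointing (training pauses while the write completes); the function names and the example figures are illustrative assumptions, not measurements.

```python
def checkpoint_write_bandwidth(checkpoint_bytes, max_stall_seconds):
    """Bytes/s the checkpoint tier must absorb to finish within the stall budget."""
    if max_stall_seconds <= 0:
        raise ValueError("max_stall_seconds must be positive")
    return checkpoint_bytes / max_stall_seconds


def checkpoint_overhead(stall_seconds, interval_seconds):
    """Fraction of wall-clock time spent blocked on synchronous checkpoint writes."""
    return stall_seconds / (interval_seconds + stall_seconds)


# Hypothetical: a 300 GB checkpoint every 30 minutes with a 20 s stall budget.
needed = checkpoint_write_bandwidth(300e9, 20)
stalled = checkpoint_overhead(20, 30 * 60)
print(f"{needed / 1e9:.0f} GB/s burst write, {stalled:.2%} of wall-clock time stalled")
```

The two numbers pull in opposite directions: a tighter stall budget lowers overhead but raises the burst bandwidth the tier must be provisioned for.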
Completed model artefacts (the final trained weights, tokenizers, configuration files) are typically written once and then read frequently by inference servers. Object storage is ideal here: it provides key-based retrieval, object versioning, and global distribution at low cost [Source: https://docs.cloud.google.com/architecture/ai-ml/storage-for-ai-ml].
Figure 3.1: Storage tier requirements across AI workload phases
flowchart LR
subgraph Training["Training Phase"]
direction TB
T1["High-throughput reads\n(GB/s aggregate)"]
T2["Repeated multi-epoch\nsequential/random reads"]
T3["Periodic checkpoint\nwrite bursts"]
end
subgraph Checkpointing["Checkpointing"]
direction TB
C1["High burst\nwrite throughput"]
C2["Strong durability\nguarantees"]
C3["Read-rare\n(restart only)"]
end
subgraph Inference["Inference Phase"]
direction TB
I1["One-time large\nsequential read (weights)"]
I2["Low-latency\nrandom reads (KV cache)"]
I3["Small output\ntoken writes"]
end
subgraph Storage["Recommended Storage Tier"]
S1["Parallel FS\n(Lustre / BeeGFS)"]
S2["High-write durable\nblock or object storage"]
S3["Low-latency block\nor local NVMe"]
end
Training --> S1
Checkpointing --> S2
Inference --> S3
Key Takeaway: Training, checkpointing, and inference each carry distinct I/O profiles. Choosing a single storage technology for all three is usually the wrong answer; effective AI platforms provision purpose-built tiers for each phase and enforce QoS to prevent workloads from interfering with each other.
Section 2: Kubernetes Storage Primitives
Kubernetes provides a well-defined set of abstractions for attaching durable storage to workloads. Understanding these primitives is essential before exploring specific high-performance backends.
PersistentVolumes, PVCs, and StorageClasses
A PersistentVolume (PV) is a cluster-level resource that represents a piece of storage — a disk, a network share, or an object bucket — that has been provisioned either manually by an administrator or automatically by a storage driver. A PV exists independently of any pod; it persists across pod restarts, rescheduling, and node failures [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].
A PersistentVolumeClaim (PVC) is a namespaced request for storage made by a workload. A pod does not reference a PV directly; instead it references a PVC, and Kubernetes matches (or dynamically provisions) the appropriate PV. This indirection decouples application configuration from infrastructure decisions.
A StorageClass defines a profile of storage — its provisioner, performance tier, reclaim policy, and any driver-specific parameters. When a PVC references a StorageClass, the associated provisioner creates a PV automatically. This dynamic provisioning eliminates the need for administrators to pre-create hundreds of individual volumes.
The following example illustrates a StorageClass backed by NVMe SSDs, followed by a PVC referencing it:
# StorageClass: high-performance NVMe-backed block storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-training
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver
parameters:
  type: io2
  iopsPerGB: "50"              # provisioned IOPS per GiB of requested capacity
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # provision only once a consuming pod is scheduled
---
# PVC requesting 500 GiB of NVMe storage for training scratch space
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch
  namespace: ml-training
spec:
  storageClassName: nvme-training
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
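Because the StorageClass uses WaitForFirstConsumer, nothing is provisioned when the PVC is created. Once a training pod that references training-scratch is scheduled, Kubernetes dynamically provisions a 500 GiB io2 EBS volume in that node's availability zone and attaches it to the node where the pod runs.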
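The pod side of this relationship is a volume entry that names the claim plus a volumeMount that places it in the container. The sketch below builds such a Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name, container name, and image are hypothetical; the claim name and mount path follow the example in this section.

```python
import json


def training_pod_manifest(name, namespace, claim_name, mount_path, image):
    """Build a Pod manifest that mounts an existing PVC at mount_path."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # The mount refers to the volume by name, not to the PVC directly.
                    "volumeMounts": [{"name": "scratch", "mountPath": mount_path}],
                }
            ],
            "volumes": [
                {
                    "name": "scratch",
                    # The pod names the claim; Kubernetes resolves the bound PV.
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        },
    }


pod = training_pod_manifest(
    name="trainer-0",
    namespace="ml-training",
    claim_name="training-scratch",
    mount_path="/data/scratch",
    image="registry.example.com/trainer:latest",  # hypothetical image
)
print(json.dumps(pod, indent=2))
```

Note that the manifest never mentions the PersistentVolume or the StorageClass: the claim name is the only coupling between the workload and the storage backend.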
Figure 3.2: Kubernetes storage abstraction — from StorageClass to mounted volume
flowchart TD
SC["StorageClass\n(nvme-training)\nprovisioner: ebs.csi.aws.com"]
PVC["PersistentVolumeClaim\n(training-scratch)\n500 GiB, ReadWriteOnce"]
PV["PersistentVolume\n(auto-provisioned)\n500 GiB io2 EBS"]
POD["Training Pod\nmountPath: /data/scratch"]
DISK["Physical EBS Volume\n(AWS infrastructure)"]
SC -->|"dynamic provisioning triggers"| PV
PVC -->|"bound to"| PV
POD -->|"references"| PVC
PV -->|"backed by"| DISK
CSI Drivers for Cloud and On-Premises Storage
The Container Storage Interface (CSI) is an industry standard that decouples storage provisioning from the Kubernetes core. Before CSI, storage plugins were compiled directly into the Kubernetes binary, making it impossible to ship updates independently. CSI allows any storage vendor to publish a driver that Kubernetes can load without modifying the platform [Source: https://portworx.com/knowledge-hub/a-complete-guide-to-kubernetes-csi/].
A CSI driver typically consists of:
- Controller plugin — runs as a Deployment and handles volume lifecycle operations (create, delete, attach, detach)
- Node plugin — runs as a DaemonSet on every node and handles mount/unmount operations local to that node
- External provisioner sidecar — watches for new PVCs and calls the controller plugin to create volumes
Key CSI drivers relevant to AI workloads are listed below:
| Driver | Backend | Best Fit |
|---|---|---|
| ebs.csi.aws.com | Amazon EBS (SSD, io2 NVMe) | Single-node training scratch |
| efs.csi.aws.com | Amazon EFS (NFS) | Shared datasets, ReadWriteMany |
| fsx.csi.aws.com | Amazon FSx for Lustre | High-throughput distributed training |
| pd.csi.storage.gke.io | GCP Persistent Disk | Single-node GKE training |
| lustre.csi.storage.gke.io | GKE Managed Lustre | HPC/ML parallel workloads on GKE |
| rook-ceph.rbd.csi.ceph.com | Rook-Ceph (block) | On-premises block storage |
| rook-ceph.cephfs.csi.ceph.com | Rook-Ceph (file) | On-premises ReadWriteMany |
| csi.juicefs.com | JuiceFS | Distributed POSIX cache layer |
[Source: https://kubernetes-csi.github.io/docs/drivers.html] [Source: https://storageclass.info/drivers]
ReadWriteMany Access Modes for Distributed Training
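Kubernetes PVs support four access modes. The three that matter most for training data are described below; the fourth, ReadWriteOncePod (RWOP), restricts a volume to a single pod.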
- ReadWriteOnce (RWO): the volume can be mounted read-write by a single node at a time. Block storage (EBS, local NVMe) falls into this category.
- ReadOnlyMany (ROX): the volume can be mounted read-only by many nodes simultaneously.
- ReadWriteMany (RWX): the volume can be mounted read-write by many nodes simultaneously.
ReadWriteMany is the critical mode for distributed training. When a PyTorch DistributedDataParallel job launches 100 pods across 20 nodes, every pod must mount the same dataset PVC, and many jobs also write shared checkpoints or logs to it. (A dataset that is strictly read-only can be served through ReadOnlyMany instead.) Block storage backed by a single NVMe disk cannot satisfy RWX; only network file systems and parallel file systems support it natively [Source: https://aws.amazon.com/blogs/storage/using-high-performance-persistent-storage-for-machine-learning-workloads-on-kubernetes/].
For static provisioning on Lustre, a single PV can be shared across multiple clusters and jobs simultaneously, supporting large parallel training runs with hundreds of pods reading the same dataset [Source: https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads].
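The choice between the three modes can be reduced to two questions: how many nodes mount the volume, and does any of them write to it? The helper below is a hypothetical rule of thumb that encodes that decision; it is a planning aid, not something Kubernetes itself evaluates.

```python
def pick_access_mode(num_nodes, needs_write):
    """Suggest a PVC access mode from node count and write requirement."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        # A single node can use fast block storage.
        return "ReadWriteOnce"
    # Multiple nodes: shared filesystem semantics are required.
    return "ReadWriteMany" if needs_write else "ReadOnlyMany"


print(pick_access_mode(1, True))    # single-node scratch volume
print(pick_access_mode(20, False))  # shared, read-only dataset
print(pick_access_mode(20, True))   # shared dataset plus checkpoints or logs
```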
Key Takeaway: PVs, PVCs, StorageClasses, and CSI drivers form the foundational Kubernetes storage vocabulary. Distributed training jobs require ReadWriteMany volumes backed by network or parallel file systems; single-node jobs can use faster block storage with ReadWriteOnce semantics.
Section 3: High-Performance Storage Solutions
With the primitives understood, this section surveys the storage backends that are actually deployed in production AI clusters.
Network File Systems: NFS, Lustre, and BeeGFS
NFS (Network File System) is the simplest shared filesystem. A central NFS server exports one or more directories, and any Kubernetes node with network access can mount them as ReadWriteMany PVs. NFS is easy to operate and works well for small teams or inference model serving (where weights are read-only). However, it does not scale to multi-GB/s aggregate bandwidth because a single NFS server becomes a bottleneck [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].
Lustre is a high-performance parallel filesystem originally developed for supercomputers and now ubiquitous in GPU clusters. Lustre separates metadata (directory trees, file attributes) from data (file content) across distinct server roles — Metadata Servers (MDS) and Object Storage Servers (OSS). Data is striped across multiple OSSes, allowing aggregate read bandwidth that scales linearly with the number of servers. This architecture is why Lustre is the recommended backend for the most demanding distributed training workloads [Source: https://aws.amazon.com/blogs/storage/using-high-performance-persistent-storage-for-machine-learning-workloads-on-kubernetes/].
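Striping is what lets a single large file be served by several servers at once. The sketch below models simple round-robin striping, in which consecutive fixed-size stripes of a file are placed on consecutive storage targets. It is a simplified illustration under that assumption: the function names are hypothetical, and real Lustre layouts have additional options that are not modelled here.

```python
def stripe_target(offset, stripe_size, stripe_count):
    """0-based index of the target holding the byte at `offset`."""
    return (offset // stripe_size) % stripe_count


def targets_for_read(offset, length, stripe_size, stripe_count):
    """Distinct targets touched by a read, in the order they are first hit."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_stripe = offset // stripe_size
    last_stripe = (offset + length - 1) // stripe_size
    touched = []
    for stripe in range(first_stripe, last_stripe + 1):
        target = stripe % stripe_count
        if target not in touched:
            touched.append(target)
        if len(touched) == stripe_count:
            break  # every target is already involved
    return touched


MIB = 1 << 20
# A 4 MiB read over a file striped 1 MiB wide across 4 targets hits all four.
print(targets_for_read(0, 4 * MIB, MIB, 4))
```

Because a large sequential read fans out across every target in the stripe set, adding servers (and widening the stripe count) raises the bandwidth available to a single reader as well as to the cluster as a whole.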
Both AWS (FSx for Lustre) and Google Cloud (GKE Managed Lustre) provide fully managed Lustre instances with integrated CSI drivers [Source: https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads]. A Kubernetes job with 100 parallel pods can all mount the same Lustre PVC and read disjoint subsets of a dataset concurrently, with each pod working on a different data shard [Source: https://aws.amazon.com/blogs/machine-learning/configure-and-verify-a-distributed-training-cluster-with-aws-deep-learning-containers-on-amazon-eks/].
Figure 3.3: Lustre parallel filesystem architecture for distributed training
flowchart TD
subgraph Clients["Kubernetes Training Pods (100x)"]
P1["Pod A\n(shard 0-99)"]
P2["Pod B\n(shard 100-199)"]
P3["Pod C\n(shard 200-299)"]
Px["... 97 more pods"]
end
subgraph Lustre["Lustre Filesystem"]
MDS["Metadata Server (MDS)\nDirectory trees, file attributes,\nblock locations"]
OSS1["Object Storage Server 1\n(data stripe 1)"]
OSS2["Object Storage Server 2\n(data stripe 2)"]
OSS3["Object Storage Server 3\n(data stripe 3)"]
OSSn["... N more OSSes"]
end
P1 & P2 & P3 & Px -->|"metadata lookup"| MDS
MDS -->|"directs reads to"| OSS1 & OSS2 & OSS3 & OSSn
P1 -->|"parallel data read"| OSS1
P2 -->|"parallel data read"| OSS2
P3 -->|"parallel data read"| OSS3
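The disjoint shard ranges shown in Figure 3.3 can be derived from each pod's rank. The sketch below is one common way to do it, assuming shards are numbered from zero and each pod knows its rank and the total pod count (for example from environment variables set by the training operator); the function name is hypothetical.

```python
def shard_range(num_shards, rank, world_size):
    """Contiguous, non-overlapping range of shard indices for one pod.

    When num_shards is not divisible by world_size, the first
    (num_shards % world_size) ranks each take one extra shard.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_shards, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# 10,000 shards over 100 pods: pod 0 reads shards 0-99, pod 1 reads 100-199.
print(shard_range(10_000, 0, 100), shard_range(10_000, 1, 100))
```

Since every pod computes its own range from the same three numbers, no coordination service is needed to keep reads disjoint.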
BeeGFS is another parallel filesystem popular in on-premises HPC environments. Like Lustre, it separates metadata from data and supports wide striping, but it uses a simpler client daemon and is often favoured in enterprise settings because it is easier to operate than Lustre.
The diagram below orders these filesystems by operational complexity and achievable throughput:
Low complexity / Low throughput
NFS
|
BeeGFS (on-prem)
|
Lustre (on-prem / managed)
|
High complexity / High throughput
Object Storage Integration: S3, MinIO, and GCS
MinIO is one of the most widely deployed self-hosted object storage systems for AI pipelines. It speaks the S3 API, which means that code written for Amazon S3 can generally use MinIO without modification. MinIO manages unstructured data (images, video, tokenized text, model weights) at petabyte scale with high durability [Source: https://www.min.io/use-cases/ai-training].
Object storage is not normally mounted as a POSIX filesystem. Instead, training pipelines access it through the S3 SDK or through tools like s3fs or goofys that present an S3 bucket as a virtual filesystem. The caching strategies in Section 4 address the performance gap this creates.
For model artefact distribution — shipping trained weights to inference servers — object storage is the correct choice. A model registry backed by MinIO or GCS provides version control, content addressability, and the ability to serve weights to inference pods running anywhere in a multi-cloud deployment.
Local NVMe and hostPath for Scratch Data
The fastest storage available to a Kubernetes pod is the local NVMe SSD attached directly to the node. Where local capacity is not available, NVMe/TCP, which runs the NVMe protocol over standard Ethernet, brings networked volumes close to that performance: sub-millisecond latency and multi-GB/s throughput per volume without requiring specialised fabric hardware [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].
Local NVMe is ideal for scratch data: temporary working space that does not need to persist beyond the lifetime of the pod. A training job that preprocesses its dataset into a fast local cache before beginning epoch iteration can eliminate network filesystem latency during the hot loop.
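A staging step of this kind can be very small. The following sketch copies dataset files from a shared mount into a local scratch directory once and skips files that are already present with the same size, so a restarted pod on the same node does not repeat the copy. The function name and the `.partial` suffix are illustrative choices, and the size comparison is a deliberately cheap stand-in for a checksum.

```python
import os
import shutil


def stage_to_scratch(src_dir, scratch_dir):
    """Copy regular files from src_dir to scratch_dir; return how many were copied."""
    os.makedirs(scratch_dir, exist_ok=True)
    copied = 0
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(scratch_dir, name)
        # Skip files that were staged by an earlier run.
        if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
            continue
        # Copy to a temporary name, then rename, so readers never see a partial file.
        tmp = dst + ".partial"
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)
        copied += 1
    return copied
```

A training entrypoint might call `stage_to_scratch("/data/shared/train", "/data/scratch/train")` before constructing its data loader; both paths here are hypothetical mount points.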
In Kubernetes, local storage can be exposed via:
- hostPath volumes: mount a directory from the host node directly into the pod. Simple but tied to a specific node — if the pod is rescheduled, it may land on a different node with an empty cache.
- Local PersistentVolumes: a PV backed by a specific block device on a known node, with node-affinity rules that pin the pod to that node.
# Local PV backed by an NVMe device on a specific node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node01
spec:
  capacity:
    storage: 2Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - gpu-node-01
The nodeAffinity field ensures that the pod using this PV always runs on gpu-node-01 where the physical NVMe device is present.
Rook-Ceph for Distributed Block and File Storage
Rook-Ceph is a Kubernetes operator that deploys and manages Ceph — an open-source distributed storage system — entirely within the cluster. Rook translates Kubernetes-native CRD declarations into Ceph cluster configuration, making it possible to run a production-grade distributed storage system on-premises without external storage appliances [Source: https://portworx.com/knowledge-hub/kubernetes-ai/].
Ceph, as managed by Rook, provides three storage interfaces simultaneously:
| Interface | Kubernetes Access | Use Case |
|---|---|---|
| RBD (block) | CSI rook-ceph.rbd.csi.ceph.com | Boot volumes, single-pod scratch, checkpoints |
| CephFS (file) | CSI rook-ceph.cephfs.csi.ceph.com | ReadWriteMany shared datasets |
| RGW (object) | S3-compatible API | Raw dataset storage, model artefacts |
A key advantage of Rook-Ceph for on-premises AI clusters is that it turns commodity server disks into a resilient, self-healing storage pool. When a node fails, Ceph automatically re-replicates the affected data across surviving nodes — no manual intervention required.
Multi-level caching architectures commonly use Ceph as the third storage tier: memory page cache (first tier) → local NVMe SSDs (second tier) → Ceph distributed storage (third tier) [Source: https://juicefs.com/en/blog/user-stories/storage-evolution-of-unisound-hpc-platform-with-juicefs].
Key Takeaway: No single storage system serves all AI workload phases. Lustre or Rook-CephFS satisfies ReadWriteMany training data access; object storage (MinIO, S3) provides economical raw dataset and artefact storage; local NVMe handles scratch computation; and Rook-Ceph can provide block, file, and object interfaces on-premises through a single operator-managed platform.
Section 4: Data Caching and Pipeline Optimization
High-performance storage solves the infrastructure problem; caching solves the access pattern problem. Even a Lustre cluster with 100 GB/s of aggregate bandwidth can become a bottleneck when hundreds of pods repeatedly read the same training samples across many epochs. Caching brings frequently accessed data closer to the compute, transforming repeated remote reads into local memory or SSD accesses.
Fluid: Dataset Acceleration on Kubernetes
Fluid is a CNCF sandbox project that acts as a Kubernetes-native orchestration layer for distributed dataset caching. Rather than requiring data engineers to manage caching infrastructure manually, Fluid exposes two new Kubernetes resource types — Dataset and Runtime — that describe what data should be cached and how [Source: https://fluid-cloudnative.github.io/docs].
The core innovation is the Elastic Dataset abstraction. A Dataset CRD declares the location of training data (an S3 bucket, an NFS share, an HDFS path) without coupling the application to the underlying storage technology. The Runtime CRD selects the caching engine (Alluxio, JuiceFS, Vineyard, Jindo) and configures its parameters. Fluid then instantiates the chosen engine as Kubernetes pods and registers the cached dataset as a PVC that training pods can mount normally [Source: https://billychen1.github.io/fluid-website-demo/docs/core-concepts/introduction].
Data-affinity scheduling is Fluid’s second major contribution. Fluid tracks which nodes hold cached copies of each dataset partition and annotates them with this information. The Kubernetes scheduler can then place training pods on nodes that already hold their required data, eliminating network reads entirely for the cached portions [Source: https://fluid-cloudnative.github.io/docs/core-concepts/architecture-and-concepts].
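The idea behind data-affinity placement can be shown with a toy scoring function: prefer the node that already caches the most partitions the pod needs. This is an illustration of the principle only, not Fluid's actual scheduling logic; the function name, node names, and partition identifiers are all hypothetical.

```python
def pick_node(required_partitions, node_cache):
    """Return the node caching the most required partitions.

    node_cache maps node name -> iterable of cached partition ids.
    Ties are broken alphabetically by node name.
    """
    if not node_cache:
        raise ValueError("no candidate nodes")
    required = set(required_partitions)
    # max() keeps the first maximal element, so sorting makes ties deterministic.
    return max(
        sorted(node_cache),
        key=lambda node: len(required & set(node_cache[node])),
    )


cache_state = {
    "gpu-node-01": ["p1", "p2", "p3"],
    "gpu-node-02": ["p0", "p1"],
    "gpu-node-03": [],
}
print(pick_node(["p1", "p2"], cache_state))
```

A real scheduler also has to weigh GPU availability and other constraints, so cache overlap is one score among several rather than the sole criterion.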
Fluid also supports automated data operations via CRDs:
- DataLoad: pre-warm the cache before a training job starts
- DataMigrate: move data between storage tiers
- DataBackup: snapshot cached data to durable storage
These operations support one-time, scheduled, and event-driven trigger modes, enabling GitOps-style automation of data pipeline stages [Source: https://aws.amazon.com/blogs/containers/build-deep-learning-model-training-apps-using-cncf-fluid-with-amazon-eks/].
# Fluid Dataset: declare ImageNet stored in S3
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
  namespace: ml-training
spec:
  mounts:
    - mountPoint: s3://my-bucket/imagenet/
      name: imagenet
      # Option keys are runtime-specific; verify them against the Fluid docs for your runtime
      options:
        fs.s3a.endpoint: s3.amazonaws.com
---
# Fluid JuiceFSRuntime: cache the dataset locally
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: imagenet
  namespace: ml-training
spec:
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: SSD
        # Example SSD mount point; /dev/shm is RAM-backed tmpfs and would pair with mediumtype: MEM
        path: /mnt/nvme0/cache
        quota: 200Gi
        high: "0.95"
        low: "0.7"
When a training pod mounts the imagenet PVC, Fluid routes reads through the JuiceFS cache layer. On the first epoch, data is fetched from S3; on subsequent epochs, it is served from the local SSD cache — dramatically reducing network egress costs and storage latency.
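The high and low values in the tieredstore block read naturally as eviction watermarks: once usage crosses the high mark, the cache evicts until it falls back to the low mark. The sketch below models that behaviour with least-recently-used eviction. It is written under that assumption and is not the runtime's implementation; the class name and the LRU policy are illustrative choices.

```python
from collections import OrderedDict


class WatermarkCache:
    """Toy cache that evicts least-recently-used entries between two watermarks."""

    def __init__(self, quota_bytes, high=0.95, low=0.7):
        if not 0 < low < high <= 1:
            raise ValueError("require 0 < low < high <= 1")
        self.quota = quota_bytes
        self.high = high
        self.low = low
        self.used = 0
        self._items = OrderedDict()  # key -> size, least recently used first

    def get(self, key):
        """Return True on a hit and mark the entry as recently used."""
        if key not in self._items:
            return False
        self._items.move_to_end(key)
        return True

    def put(self, key, size):
        """Insert or replace an entry, evicting if the high watermark is crossed."""
        if key in self._items:
            self.used -= self._items.pop(key)
        self._items[key] = size
        self.used += size
        if self.used > self.high * self.quota:
            self._evict_to_low()

    def _evict_to_low(self):
        while self._items and self.used > self.low * self.quota:
            _, size = self._items.popitem(last=False)
            self.used -= size


cache = WatermarkCache(quota_bytes=100)
for i in range(10):
    cache.put(f"block-{i}", 10)
print(cache.used, cache.get("block-0"), cache.get("block-9"))
```

The gap between the two watermarks means eviction happens in batches rather than on every insert, which keeps cache maintenance off the hot read path.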
Figure 3.4: Fluid dataset caching architecture — from S3 to training pod
flowchart LR
subgraph Control["Fluid Control Plane"]
DS["Dataset CRD\n(imagenet)"]
RT["JuiceFSRuntime CRD\n(4 replicas, SSD cache)"]
DL["DataLoad CRD\n(pre-warm /train)"]
SCHED["Data-Affinity\nScheduler Plugin"]
end
subgraph Cache["Cache Layer (per node)"]
FUSE["FUSE Pod\n(POSIX interface)"]
WORKER["Worker Pod\n(cache lifecycle)"]
SSD["Local SSD Cache\n(200 GiB)"]
end
subgraph Backend["Remote Storage"]
S3["S3 Bucket\ns3://my-bucket/imagenet/"]
end
subgraph Compute["Training Pod"]
POD["PyTorch DataLoader\nmount: imagenet PVC"]
end
DS & RT --> FUSE & WORKER
DL -->|"pre-warms"| SSD
SCHED -->|"schedules pod to\ncached node"| POD
POD -->|"POSIX read"| FUSE
FUSE -->|"cache hit"| SSD
FUSE -->|"cache miss (epoch 1)"| S3
S3 -->|"fills cache"| SSD
Alluxio and JuiceFS for Tiered Caching
Alluxio is a data orchestration platform that sits between compute frameworks and storage systems, presenting a unified namespace over multiple disparate backends (S3, HDFS, NFS, GCS). Training pods address data through Alluxio as if it were a local filesystem, while Alluxio transparently fetches and caches the underlying objects. This makes it particularly useful in heterogeneous environments where training data spans multiple storage systems [Source: https://juicefs.com/en/blog/user-stories/juicefs-vs-alluxio-ai-storage-naver].
However, Alluxio has a documented limitation: it does not fully implement the POSIX API. Missing support for symbolic links, truncate, fallocate, append, and xattr can cause subtle failures in AI tooling that relies on those calls [Source: https://juicefs.com/en/blog/solutions/ai-data-storage-challenges-capabilities-solution-comparison].
JuiceFS provides full POSIX compliance with a metadata/data separation architecture. Metadata (directory trees, file attributes, block maps) is stored in a fast metadata engine (Redis, TiKV, or a managed database), while file content is stored as fixed-size chunks in an object store. This separation allows JuiceFS to scale metadata operations independently of data access [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].
When used with Fluid, JuiceFS launches two component types per worker node:
- FUSE Pod: mounts the JuiceFS filesystem and serves POSIX calls from training pods
- Worker Pod: manages the local cache lifecycle and eviction
Data loading performance through the three-layer JuiceFS cache stack (page cache → NVMe SSD → object store) can reach 300–800 MB/s or higher per node, enabling fast multi-epoch training without saturating the backend object store on every pass [Source: https://www.alibabacloud.com/blog/601515].
Figure 3.5: JuiceFS three-layer cache stack — read path per training node
flowchart TD
POD["Training Pod\nread request"]
L1["Layer 1: Kernel Page Cache\n(RAM — sub-microsecond latency)"]
L2["Layer 2: Local NVMe SSD\n(300–800 MB/s per node)"]
L3["Layer 3: Object Store\n(S3 / MinIO — network latency)"]
META["Metadata Engine\n(Redis / TiKV / managed DB)"]
POD -->|"1. check page cache"| L1
L1 -->|"miss → check SSD"| L2
L2 -->|"miss → fetch from object store"| L3
L3 -->|"populate SSD cache"| L2
L2 -->|"populate page cache"| L1
L1 -->|"data returned to pod"| POD
POD <-->|"file metadata lookup"| META
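The read path in Figure 3.5 can be expressed in a few lines. The sketch below uses plain dictionaries to stand in for the page cache, the SSD cache, and the object store, and counts which layer served each read. It models the lookup order only; the class name is hypothetical and there is no eviction, metadata engine, or real I/O.

```python
class TieredReader:
    """Toy model of a page cache -> SSD -> object store read path."""

    def __init__(self, object_store):
        self.object_store = object_store  # dict standing in for S3 / MinIO
        self.ssd = {}
        self.page_cache = {}
        self.hits = {"page_cache": 0, "ssd": 0, "object_store": 0}

    def read(self, key):
        if key in self.page_cache:
            self.hits["page_cache"] += 1
            return self.page_cache[key]
        if key in self.ssd:
            self.hits["ssd"] += 1
            data = self.ssd[key]
        else:
            self.hits["object_store"] += 1
            data = self.object_store[key]
            self.ssd[key] = data  # populate the SSD layer on the way back
        self.page_cache[key] = data  # populate the page cache on the way back
        return data


reader = TieredReader({"a": b"1", "b": b"2", "c": b"3"})
for epoch in range(2):
    for key in ("a", "b", "c"):
        reader.read(key)
print(reader.hits)
```

In this model only the first epoch reaches the object store; clearing `page_cache` between epochs (to mimic memory pressure) shifts the second epoch's hits to the SSD layer instead.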
The table below compares Alluxio and JuiceFS for AI workloads:
| Feature | Alluxio | JuiceFS |
|---|---|---|
| POSIX compliance | Partial (missing truncate, symlinks, xattr) | Full |
| Metadata architecture | Integrated | Separated (pluggable metadata engine) |
| Kubernetes-native CRDs | Via Fluid | Via Fluid |
| Multi-storage unification | Strong (HDFS, S3, NFS, GCS) | S3-compatible backends |
| Cache performance | High | Very high (300–800 MB/s per node) |
| Operational complexity | High | Moderate |
| On-premises support | Yes | Yes |
| Cloud-managed option | No | Yes (JuiceFS Cloud) |
[Source: https://juicefs.com/en/blog/user-stories/juicefs-vs-alluxio-ai-storage-naver] [Source: https://juicefs.com/en/blog/user-stories/ai-storage-life-sciences-solution-juicefs-vs-lustre-alluxio]
Prefetching and Data Locality Strategies
Even the fastest cache is ineffective if data arrives after the GPU is already waiting for it. Prefetching — loading the next batch of data into cache while the current batch is being processed — is the key technique for hiding I/O latency behind computation.
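At the application level, prefetching is typically a bounded producer/consumer queue. The sketch below wraps any iterable of batches so that a background thread loads up to `depth` items ahead of the consumer. It is a minimal illustration with hypothetical names: errors raised by the source simply end iteration, and a consumer that stops early leaves the daemon thread blocked until the process exits.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks when the buffer is full
        finally:
            buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


for batch in prefetch(iter(range(3))):
    print(batch)
```

Frameworks provide tuned versions of the same idea; PyTorch's DataLoader, for instance, exposes `num_workers` and `prefetch_factor` for this purpose.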
The Kubernetes ecosystem offers several mechanisms for implementing prefetching and data locality:
1. Fluid DataLoad (pre-warming)
Before submitting a training job, issue a DataLoad CRD that instructs Fluid to pull the entire dataset (or a specified subset) into the local cache. The training job then starts with a warm cache and experiences no cold-start penalty.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-preload
  namespace: ml-training
spec:
  dataset:
    name: imagenet
    namespace: ml-training
  loadMetadata: true
  target:
    - path: /train
      replicas: 1
2. List caching on CSI FUSE drivers
For the GKE Cloud Storage FUSE CSI driver, enabling list caching (via flags like --kernel-list-cache-ttl-secs) caches directory and file listings in the kernel. This is particularly beneficial for AI/ML training workloads that repeatedly iterate full directory listings to build file indexes at the start of each epoch [Source: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf].
3. File cache for multi-epoch training
The file cache feature in Cloud Storage FUSE stores frequently accessed data on local node storage. For multi-epoch training where the same files are read dozens of times, this eliminates redundant network fetches after the first epoch [Source: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf].
4. Data-affinity pod scheduling
Fluid annotates nodes with information about which dataset partitions they currently cache. Combined with nodeAffinity or podAffinity rules, training pods can be scheduled onto nodes where their required data already resides. This transforms a potential network read into a local memory or SSD read — the lowest latency option available.
5. Storage QoS tiering
Platform teams should enforce per-volume IOPS and bandwidth limits to create distinct performance tiers. Training volumes receive high burst throughput, but a cap prevents them from consuming the entire network fabric during checkpoint writes. Inference volumes receive a guaranteed IOPS floor that is protected from training load spikes [Source: https://portworx.com/knowledge-hub/kubernetes-ai/]. This QoS enforcement is the storage equivalent of CPU resource requests and limits: it prevents noisy neighbours and ensures predictable latency for production inference endpoints.
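A bandwidth cap with a burst allowance is commonly modelled as a token bucket. The sketch below illustrates the concept only: in practice the limit is enforced by the storage backend, not by application code. The class name is hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start full so an initial burst is allowed
        self.last = 0.0

    def allow(self, nbytes, now):
        """Return True and consume tokens if `nbytes` may be sent at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


bucket = TokenBucket(rate_bytes_per_sec=100, burst_bytes=200)
print(bucket.allow(200, now=0.0))  # the burst allowance covers this write
print(bucket.allow(50, now=0.0))   # no tokens left until time passes
print(bucket.allow(50, now=0.5))   # half a second refills 50 bytes
```

The burst size plays the role of the checkpoint allowance described above, while the sustained rate is the ceiling that protects neighbouring volumes.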
Key Takeaway: Caching is not optional for production AI training — it is the mechanism that prevents GPU idle time caused by storage I/O bottlenecks. Fluid provides a Kubernetes-native control plane for managing caching engines (Alluxio, JuiceFS) as observable, elastically scalable services with data-affinity scheduling, automated prefetching, and GitOps-compatible data operations.
Chapter Summary
This chapter surveyed the complete storage stack for AI pipelines on Kubernetes, from foundational abstractions to advanced caching strategies.
Section 1 established that training, checkpointing, and inference carry fundamentally different I/O profiles. Training demands sustained high-throughput sequential reads across repeated epochs; inference requires low-latency random reads; checkpointing imposes periodic write bursts. A single storage tier cannot optimally serve all three.
Section 2 covered the Kubernetes storage vocabulary — PersistentVolumes, PVCs, StorageClasses, and CSI drivers — that provides a vendor-agnostic abstraction layer over diverse backends. The ReadWriteMany access mode, supported by network and parallel filesystems but not block storage, is essential for distributed training jobs where dozens to hundreds of pods read the same dataset concurrently.
Section 3 explored the leading storage backends: Lustre and BeeGFS for high-throughput parallel file access; NFS for simple shared access; MinIO and S3-compatible object stores for dataset ingestion and model artefact distribution; local NVMe for maximum-speed scratch computation; and Rook-Ceph for a unified on-premises storage platform that provides block, file, and object interfaces from commodity hardware.
Section 4 demonstrated how Fluid, Alluxio, and JuiceFS close the performance gap between remote storage and GPU-local compute. Fluid’s Elastic Dataset abstraction and data-affinity scheduler turn distributed caching systems into first-class Kubernetes citizens. JuiceFS’s three-layer cache stack (page cache → NVMe SSD → object store) achieves 300–800 MB/s per node while maintaining full POSIX compliance. Prefetching via DataLoad CRDs, list caching, and file caching eliminate cold-start penalties before training epochs begin.
Key Terms
| Term | Definition |
|---|---|
| PersistentVolume (PV) | A cluster-level Kubernetes resource representing a piece of durable storage provisioned independently of any pod |
| PersistentVolumeClaim (PVC) | A namespaced request for storage made by a workload, decoupling pod configuration from storage infrastructure |
| StorageClass | A Kubernetes object that defines a profile of storage (provisioner, performance parameters, reclaim policy) and enables dynamic volume provisioning |
| CSI driver | A Container Storage Interface plugin that implements the CSI standard to provision, attach, and mount volumes from a specific storage backend without modifying Kubernetes core |
| ReadWriteMany (RWX) | A PersistentVolume access mode allowing the volume to be mounted read-write by multiple nodes simultaneously; required for distributed training |
| Lustre | A high-performance parallel filesystem that stripes data across multiple object storage servers; used in HPC and large-scale distributed ML training |
| Rook-Ceph | A Kubernetes operator that deploys and manages Ceph distributed storage, providing block (RBD), file (CephFS), and object (RGW) interfaces from within the cluster |
| Checkpoint | A periodic snapshot of model weights, optimizer state, and training metadata written to durable storage so training can resume after a failure |
| Data locality | The principle of scheduling compute pods on nodes where their required data is already cached, minimising network transfers and storage latency |
| Fluid | A CNCF sandbox project providing Kubernetes-native Dataset and Runtime CRDs that wrap distributed caching engines as observable, elastically scalable services with data-affinity scheduling |
| Alluxio | A data orchestration platform providing a unified namespace over multiple storage backends as a caching layer; lacks full POSIX API support |
| JuiceFS | A POSIX-compatible distributed filesystem with separated metadata and data architecture, capable of 300–800 MB/s per-node cache performance via a three-layer cache stack |
| Elastic Dataset | Fluid’s core abstraction that decouples application data access from the underlying storage platform, enabling portable data pipelines across diverse Kubernetes environments |
| IOPS | Input/Output Operations Per Second; a measure of storage performance critical for random-access workloads such as small-file dataset reads during training |
Chapter 4: Training Workloads: Jobs, Operators, and Frameworks
Learning Objectives
By the end of this chapter, you will be able to:
- Deploy single-node and distributed training jobs on Kubernetes using native Job primitives
- Install and configure the Kubeflow Training Operator and create PyTorchJob, TFJob, and MPIJob custom resources
- Select and configure the appropriate distributed training strategy — data parallel, model parallel, or pipeline parallel — for a given model and hardware topology
- Implement checkpointing, automatic restart, and elastic scaling to build fault-tolerant training pipelines
4.1 Kubernetes Job Primitives for Training
Before reaching for a specialized ML operator, it is worth understanding the building blocks that Kubernetes already provides. Native Job primitives handle a surprisingly wide range of training workloads, and the higher-level operators covered later in this chapter are themselves built on top of them.
4.1.1 Jobs and CronJobs for Batch Training
A Kubernetes Job creates one or more Pods and retries them, within its backoffLimit, until a specified number complete successfully. Once that many Pods have succeeded (one, by default), the Job is marked complete. For a single-node training run, such as a nightly fine-tuning pass on a small model, a plain Job is often all you need.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-bert-small
spec:
  backoffLimit: 3          # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: myregistry/bert-trainer:v2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
          command: ["python", "train.py", "--epochs=10", "--output=/checkpoints"]
          volumeMounts:
            - name: checkpoint-vol
              mountPath: /checkpoints
      volumes:
        - name: checkpoint-vol
          persistentVolumeClaim:
            claimName: training-checkpoints-pvc
A CronJob wraps a Job with a schedule expression, making it straightforward to run recurring model refreshes — for example, retraining a recommendation model every night from the latest day’s data.
Figure 4.1: Kubernetes Job and CronJob lifecycle for training workloads
flowchart TD
A[User applies Job manifest] --> B[Kubernetes Job Controller]
B --> C[Creates Pod]
C --> D{Pod executes training script}
D -->|Success| E[Pod status: Completed]
D -->|Failure| F{backoffLimit reached?}
F -->|No| C
F -->|Yes| G[Job status: Failed]
E --> H[Job status: Complete]
CJ[CronJob schedule triggers] -->|"0 2 * * *"| B2[Job Controller spawns new Job]
B2 --> C2[Creates Pod]
C2 --> D2[Executes retraining script]
D2 -->|Success| H2[Job Complete — next run waits for next schedule]
style A fill:#4A90D9,color:#fff
style CJ fill:#7B68EE,color:#fff
style H fill:#27AE60,color:#fff
style H2 fill:#27AE60,color:#fff
style G fill:#E74C3C,color:#fff
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"    # 02:00 UTC every day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: myregistry/rec-trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: "1"
4.1.2 Indexed Jobs for Parallel Work Distribution
When you want to fan out independent work across multiple Pods — for example, processing 100 data shards in parallel or running a hyperparameter sweep — the Indexed Job is the right primitive. Each Pod receives a unique JOB_COMPLETION_INDEX environment variable (0, 1, 2, …), which your training script can use to select its shard.
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparam-sweep
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed  # each Pod gets a unique index
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: myregistry/sweep-trainer:v1
          command:
            - python
            - sweep.py
            - "--config-index=$(JOB_COMPLETION_INDEX)"
          resources:
            limits:
              nvidia.com/gpu: "1"
Analogy: Think of an Indexed Job as a restaurant kitchen where each chef is assigned a numbered ticket. Chef 0 handles order 0, Chef 1 handles order 1, and so on. The kitchen manager (Kubernetes) ensures every ticket is completed and can re-assign a ticket if a chef calls in sick (pod failure).
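As a rough illustration of the script side of this pattern, the sketch below shows one way a sweep script could turn JOB_COMPLETION_INDEX into a concrete hyperparameter configuration. The GRID contents and the select_config helper are hypothetical and not part of the original sweep.py; the only assumption taken from the text is that each Pod sees a unique index starting at 0.

```python
import itertools
import os

# Hypothetical search space; names and values are illustrative only.
GRID = {
    "lr": [1e-4, 3e-4],
    "batch_size": [32, 64],
    "dropout": [0.0, 0.1],
}


def select_config(index, grid):
    """Map a completion index to one point of the Cartesian product of the grid."""
    keys = sorted(grid)  # fixed key order so every Pod enumerates the grid identically
    combos = list(itertools.product(*(grid[k] for k in keys)))
    if not 0 <= index < len(combos):
        raise IndexError(f"index {index} outside 0..{len(combos) - 1}")
    return dict(zip(keys, combos[index]))


# Kubernetes sets JOB_COMPLETION_INDEX in Indexed Jobs; default to 0 for local runs.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
config = select_config(index, GRID)
print(f"index={index} config={config}")
```

This grid has 2 × 2 × 2 = 8 points, which lines up with completions: 8 in the manifest above. If the grid grows, completions has to grow with it.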
Figure 4.2: Indexed Job fan-out for parallel hyperparameter sweep
flowchart LR
IJ["Indexed Job\ncompletions=8\nparallelism=8"]
IJ --> P0["Pod 0\nJOB_COMPLETION_INDEX=0\nconfig-index=0"]
IJ --> P1["Pod 1\nJOB_COMPLETION_INDEX=1\nconfig-index=1"]
IJ --> P2["Pod 2\nJOB_COMPLETION_INDEX=2\nconfig-index=2"]
IJ --> Pdot["..."]
IJ --> P7["Pod 7\nJOB_COMPLETION_INDEX=7\nconfig-index=7"]
P0 --> R0[Result 0]
P1 --> R1[Result 1]
P2 --> R2[Result 2]
P7 --> R7[Result 7]
R0 & R1 & R2 & R7 --> AGG[Aggregate best hyperparameters]
style IJ fill:#4A90D9,color:#fff
style AGG fill:#27AE60,color:#fff
4.1.3 Init Containers for Data Preparation
Training jobs often need data to be staged before the main training container starts. Init containers run to completion before any app containers start in the same Pod, making them a clean mechanism for downloading datasets, decompressing archives, or validating checksums.
spec:
  initContainers:
    - name: download-dataset
      image: amazon/aws-cli:latest
      command:
        - aws
        - s3
        - sync
        - s3://my-datasets/imagenet-shards/
        - /data/
      volumeMounts:
        - name: dataset-vol
          mountPath: /data
  containers:
    - name: trainer
      image: myregistry/resnet-trainer:v3
      volumeMounts:
        - name: dataset-vol
          mountPath: /data
The training container does not start until the download-dataset init container exits successfully. If the download fails, Kubernetes retries the init container according to the Pod’s restartPolicy, preventing training from starting with an incomplete dataset. [Source: https://collabnix.com/building-distributed-training-systems-on-kubernetes-a-complete-guide/]
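Checksum validation, mentioned above as a typical init-container task, could look roughly like the following. This is a minimal sketch using only the Python standard library; the manifest format (file name mapped to SHA-256 hex digest) and the shard file names are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(data_dir, manifest):
    """Return the sorted names of files that are missing or have the wrong digest."""
    bad = []
    for name, expected in manifest.items():
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or sha256_of(path) != expected:
            bad.append(name)
    return sorted(bad)


# Demo with a temporary directory standing in for the /data volume.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "shard-0000.tar"), "wb") as f:
    f.write(b"example shard contents")

manifest = {  # hypothetical manifest: file name -> expected SHA-256
    "shard-0000.tar": hashlib.sha256(b"example shard contents").hexdigest(),
    "shard-0001.tar": hashlib.sha256(b"never downloaded").hexdigest(),
}
problems = verify_manifest(data_dir, manifest)
print("failed validation:", problems)
```

In a real init container the script would exit with a non-zero status when the list is non-empty, so that the training container never starts against an incomplete dataset.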
Figure 4.3: Init container sequence for data preparation before training
sequenceDiagram
participant K as kubelet
participant I as Init Container<br/>(download-dataset)
participant S as Shared Volume<br/>(/data PVC)
participant T as Training Container<br/>(resnet-trainer)
K->>I: Start init container
I->>S: aws s3 sync → /data/
alt Download succeeds
I-->>K: Exit 0 (success)
K->>T: Start training container
T->>S: Read /data/ for training
T-->>K: Training complete → Exit 0
else Download fails
I-->>K: Exit non-zero (failure)
K->>I: Retry init container (restartPolicy)
end
Key Takeaway: Native Kubernetes Job primitives — plain Jobs, CronJobs, and Indexed Jobs — cover single-node and embarrassingly-parallel training workloads without additional tooling. Init containers provide a clean pre-flight hook for data preparation and validation.
4.2 Kubeflow Training Operator
Once you need tightly coupled distributed training — where multiple workers must communicate gradient updates in real time — native Jobs become awkward. You would need to manually wire up environment variables, coordinate startup order, and handle failure recovery. The Kubeflow Training Operator solves this by providing Kubernetes-native custom resources (CRDs) for the most popular deep learning frameworks. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/]
4.2.1 Installing and Configuring Training Operator
The simplest installation applies the standalone overlay from the official manifest:
kubectl apply -k \
"github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
This creates the Training Operator deployment in the kubeflow namespace, registers the PyTorchJob, TFJob, MPIJob, MXNetJob, and XGBoostJob CRDs, and sets up the RBAC rules the operator needs to manage Pods on your behalf. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/getting-started/]
Verify the operator is running:
kubectl get pods -n kubeflow
# NAME READY STATUS RESTARTS
# training-operator-5d97c5f67-xk9qp 1/1 Running 0
4.2.2 PyTorchJob, TFJob, and MPIJob Custom Resources
The three most commonly used custom resources are summarized below.
| Custom Resource | Framework | Communication Pattern | Typical Use Case |
|---|---|---|---|
| PyTorchJob | PyTorch | NCCL / Gloo / MPI | DDP, FSDP, tensor parallelism |
| TFJob | TensorFlow | gRPC parameter server or AllReduce | Classic TF distributed strategies |
| MPIJob | Any MPI-based | MPI (OpenMPI / MPICH) | Horovod, custom HPC workloads |
PyTorchJob example — 1 master + 3 workers:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-ddp
  namespace: training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/ddp-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=8
                - train_ddp.py
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/ddp-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=8
                - train_ddp.py
When the Training Operator reconciles this resource, it automatically injects the following environment variables into every Pod: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK. Your training code uses these directly through torch.distributed.init_process_group() — no manual coordination required. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/reference/distributed-training/]
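To make the role of these variables concrete, the sketch below reads and sanity-checks them the way a training script might before initializing the process group. It uses only the standard library; the Rendezvous dataclass, the read_rendezvous helper, and the example host name and port are illustrative assumptions, not part of the Training Operator API.

```python
import os
from dataclasses import dataclass

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


@dataclass(frozen=True)
class Rendezvous:
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_rendezvous(environ=None):
    """Parse the rendezvous variables and fail early with a clear error."""
    environ = os.environ if environ is None else environ
    missing = [key for key in REQUIRED if key not in environ]
    if missing:
        raise RuntimeError(f"missing rendezvous variables: {missing}")
    rv = Rendezvous(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )
    if not 0 <= rv.rank < rv.world_size:
        raise ValueError(f"rank {rv.rank} is outside world size {rv.world_size}")
    return rv


# Illustrative values for a 4-Pod job; real values come from the operator.
example_env = {
    "MASTER_ADDR": "pytorch-distributed-ddp-master-0",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "4",
    "RANK": "2",
}
rv = read_rendezvous(example_env)
print(rv)
```

Failing fast on a missing or inconsistent variable tends to produce a clearer error than a rendezvous timeout several minutes into Pod startup.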
Figure 4.4: Kubeflow Training Operator reconciliation flow for a PyTorchJob
flowchart TD
User["User applies PyTorchJob YAML"] --> API[Kubernetes API Server]
API --> CRD["PyTorchJob CRD stored in etcd"]
CRD --> TO["Training Operator Controller\n(watches PyTorchJob events)"]
TO --> MP["Create Master Pod\n(rank=0, 8 GPUs)"]
TO --> W1["Create Worker Pod 1\n(rank=1, 8 GPUs)"]
TO --> W2["Create Worker Pod 2\n(rank=2, 8 GPUs)"]
TO --> W3["Create Worker Pod 3\n(rank=3, 8 GPUs)"]
MP & W1 & W2 & W3 --> ENV["Inject env vars:\nMASTER_ADDR, MASTER_PORT\nWORLD_SIZE=4, RANK=n"]
ENV --> DDP["torch.distributed.init_process_group\nNCCL backend → training begins"]
TO --> STATUS["Update PyTorchJob\nstatus conditions"]
style User fill:#4A90D9,color:#fff
style TO fill:#7B68EE,color:#fff
style DDP fill:#27AE60,color:#fff
TFJob example — parameter server strategy:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-parameter-server
  namespace: training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: myregistry/tf-ps-trainer:v2
              resources:
                limits:
                  nvidia.com/gpu: "0"   # PS is CPU-bound
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: myregistry/tf-ps-trainer:v2
              resources:
                limits:
                  nvidia.com/gpu: "4"
MPIJob example — Horovod training:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-resnet
  namespace: training
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: myregistry/horovod-trainer:v1
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "16"
                - python
                - train_horovod.py
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpi-worker
              image: myregistry/horovod-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
4.2.3 Worker Pod Topology and Launcher/Worker Patterns
The Training Operator supports two broad topologies:
Master/Worker (PyTorchJob, TFJob): One designated master Pod coordinates the distributed rendezvous, while worker Pods connect back to it. (TFJob calls this role Chief; when no Chief is defined, worker 0 typically takes it.) The master is typically rank 0 in the process group. This pattern fits tightly-coupled synchronous training where all ranks participate in AllReduce operations.
Launcher/Worker (MPIJob): A lightweight launcher Pod runs mpirun to spawn and coordinate worker processes across the worker Pods. The launcher itself performs no model computation; it only manages the MPI process topology. Workers execute the actual training code. [Source: https://oneuptime.com/blog/post/2026-02-09-pytorch-kubeflow-training-operator/view]
┌──────────────────────────────────────────────────────────────┐
│  PyTorchJob / TFJob Pattern          MPIJob Pattern          │
│                                                              │
│  ┌─────────┐                         ┌──────────┐            │
│  │ Master  │◄──rendezvous──┐         │ Launcher │            │
│  │ (rank 0)│               │         │ (mpirun) │            │
│  └────┬────┘               │         └────┬─────┘            │
│       │ AllReduce          │              │ SSH/MPI          │
│  ┌────▼────┐         ┌─────┴───┐     ┌────▼────┐  ┌────────┐ │
│  │Worker 1 │         │Worker 2 │     │Worker 0 │  │Worker 1│ │
│  │(rank 1) │         │(rank 2) │     │(slots:8)│  │        │ │
│  └─────────┘         └─────────┘     └─────────┘  └────────┘ │
└──────────────────────────────────────────────────────────────┘
4.2.4 Gang Scheduling with Volcano or Coscheduling
A critical challenge with distributed training on shared Kubernetes clusters is partial scheduling deadlock: if only 3 of 4 required worker Pods can be scheduled (because node resources are fragmented), those 3 Pods sit idle waiting for the 4th — consuming GPU resources without doing any work.
Gang scheduling solves this by treating all Pods in a job as an atomic unit: either all Pods are scheduled simultaneously, or none are. [Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/]
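The difference between pod-by-pod placement and all-or-nothing placement can be seen in a toy simulation. The sketch below is not how Volcano or the coscheduling plugin is implemented; it is a simplified first-fit model, with hypothetical function names, that only illustrates the admission rule.

```python
def try_place(free_gpus, pods, gpus_per_pod):
    """Greedy first-fit: return a node index per Pod, or None where a Pod does not fit."""
    free = list(free_gpus)  # copy so the caller's view of the cluster is untouched
    placement = []
    for _ in range(pods):
        for node, available in enumerate(free):
            if available >= gpus_per_pod:
                free[node] -= gpus_per_pod
                placement.append(node)
                break
        else:
            placement.append(None)  # this Pod stays Pending
    return placement


def schedule_pod_by_pod(free_gpus, pods, gpus_per_pod):
    """Default behaviour: every Pod that fits starts and holds its GPUs."""
    return try_place(free_gpus, pods, gpus_per_pod)


def schedule_gang(free_gpus, pods, gpus_per_pod):
    """Gang behaviour: admit the whole group only if every Pod fits."""
    placement = try_place(free_gpus, pods, gpus_per_pod)
    return placement if None not in placement else None


# Four nodes, one of which has only 4 free GPUs; the job needs 4 Pods x 8 GPUs.
cluster = [8, 8, 8, 4]
partial = schedule_pod_by_pod(cluster, pods=4, gpus_per_pod=8)
held = sum(8 for node in partial if node is not None)
print("pod-by-pod:", partial, "GPUs held while waiting:", held)
print("gang:", schedule_gang(cluster, pods=4, gpus_per_pod=8))
```

With pod-by-pod placement three Pods start and hold 24 GPUs while the fourth stays Pending; with gang placement nothing is admitted, and those GPUs remain available to other jobs.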
Two options are commonly used:
| Scheduler | Install Method | Key Feature |
|---|---|---|
| Volcano | kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml | Full-featured batch scheduler with queue management and fair-share policies |
| Coscheduling (scheduler-plugins) | Scheduler plugin, lower overhead | Lightweight gang scheduling as a plugin to the default kube-scheduler |
To use Volcano with a PyTorchJob, annotate the job to request gang scheduling:
metadata:
  annotations:
    scheduling.volcano.sh/group-name: pytorch-ddp-group
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 4      # all 4 Pods (1 master + 3 workers) must be schedulable before any start
Key Takeaway: The Kubeflow Training Operator abstracts away the complexity of distributed job coordination. It automatically injects rendezvous environment variables, manages Pod lifecycles, and integrates with gang schedulers like Volcano to prevent partial-allocation deadlocks.
4.3 Distributed Training Strategies
Choosing the right parallelism strategy is one of the most consequential decisions in large-scale ML training. The wrong choice can waste GPU memory, saturate the network, or make training 10x slower than necessary.
Analogy: Think of training a large model like writing an encyclopedia. Data parallelism means you have 8 identical editorial teams, each working through a different section of source material and comparing notes at the end of each day. Model parallelism means one team writes Volume 1, another writes Volume 2, etc. Pipeline parallelism is the assembly-line version: one team drafts, the next team edits, the next team fact-checks, all working in parallel on different articles.
4.3.1 Data Parallelism with PyTorch DDP and Horovod
Data parallelism is the most widely used strategy. Each worker holds a complete copy of the model and trains on a different slice of the data. At the end of each step, all workers synchronize their gradients via an AllReduce collective operation — every worker ends up with identical, averaged gradients. [Source: https://towardsdatascience.com/distributed-parallel-training-data-parallelism-and-model-parallelism-ec2d234e3214/]
Worker 0: batch_0 → forward → backward → ──┐
Worker 1: batch_1 → forward → backward → ──┤ AllReduce → averaged gradients
Worker 2: batch_2 → forward → backward → ──┤ (NCCL) → all workers update
Worker 3: batch_3 → forward → backward → ──┘
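A small numerical check can make the AllReduce step less abstract. The sketch below uses plain Python rather than NCCL: four simulated workers compute the gradient of a mean-squared-error loss on equal-sized shards, and averaging those gradients reproduces the gradient of the full batch. The one-parameter model and the data values are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Every worker contributes one value and receives the same average back."""
    average = sum(values) / len(values)
    return [average] * len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 1.5
world_size = 4

# Each worker sees a disjoint, equal-sized shard (a strided split of the batch).
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

synced = all_reduce_mean(local_grads)
full_batch = grad_mse(w, xs, ys)
print("local gradients:", local_grads)
print("after AllReduce:", synced[0], "full-batch gradient:", full_batch)
```

The equality holds because the shards have equal size; with uneven shards the per-worker gradients would need to be weighted by shard size.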
Figure 4.5: Data parallelism AllReduce synchronization across workers
flowchart TD
subgraph W0["Worker 0 (GPU 0)"]
D0[Batch shard 0] --> F0[Forward pass]
F0 --> B0[Backward pass]
B0 --> G0[Local gradients]
end
subgraph W1["Worker 1 (GPU 1)"]
D1[Batch shard 1] --> F1[Forward pass]
F1 --> B1[Backward pass]
B1 --> G1[Local gradients]
end
subgraph W2["Worker 2 (GPU 2)"]
D2[Batch shard 2] --> F2[Forward pass]
F2 --> B2[Backward pass]
B2 --> G2[Local gradients]
end
subgraph W3["Worker 3 (GPU 3)"]
D3[Batch shard 3] --> F3[Forward pass]
F3 --> B3[Backward pass]
B3 --> G3[Local gradients]
end
G0 & G1 & G2 & G3 --> AR["AllReduce\n(NCCL)\nSum + average gradients"]
AR --> UPD["All workers receive\nidentical averaged gradients"]
UPD --> OPT["optimizer.step()\nWeights updated identically\non all workers"]
style AR fill:#E67E22,color:#fff
style OPT fill:#27AE60,color:#fff
Key implementation considerations:
| Concern | Recommendation |
|---|---|
| Data sharding | Use torch.utils.data.DistributedSampler so each worker sees a unique shard with no overlap [Source: https://www.digitalocean.com/community/conceptual-articles/data-parallelism-distributed-training] |
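| Learning rate | A common starting heuristic is to scale linearly with world size, lr = base_lr * world_size, paired with warmup; validate it empirically for adaptive optimizers and very large batches [Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/] |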
| Batch size | Global batch size = per-worker batch size × world size; warmup scheduling helps convergence |
| Communication backend | NCCL for GPU-to-GPU; Gloo for CPU or debugging; NCCL with InfiniBand/RoCE for optimal throughput [Source: https://www.zettabyte.space/blog-new/kubernetes-for-ai-container-orchestration-best-practices] |
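The batch-size and learning-rate rows of the table can be written down as a short calculation. The sketch below assumes linear scaling with a linear warmup from the base rate to the scaled rate; the warmup shape and the function names are illustrative choices, not a prescription from the sources above.

```python
def global_batch_size(per_worker_batch, world_size):
    """Effective batch size seen by the optimizer in data-parallel training."""
    return per_worker_batch * world_size


def scaled_lr(base_lr, world_size, step, warmup_steps):
    """Linear scaling rule with a linear ramp from base_lr to base_lr * world_size."""
    target = base_lr * world_size
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps


world_size = 32  # for example, 4 Pods x 8 GPUs
print("global batch:", global_batch_size(64, world_size))
for step in (0, 250, 500, 1000):
    print(step, scaled_lr(1e-4, world_size, step, warmup_steps=500))
```

Starting at the single-worker rate and ramping up is one way to avoid instability in the first few hundred steps, when a large global batch combined with a large learning rate is most likely to diverge.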
PyTorch DDP training script skeleton:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK are injected by the Training Operator
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])  # set per process by torch.distributed.run
    torch.cuda.set_device(local_rank)

    # MyModel and MyDataset are placeholders for your own classes
    model = MyModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = MyDataset(...)
    sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=64)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4 * world_size)

    for epoch in range(100):
        sampler.set_epoch(epoch)  # ensures different shuffling each epoch
        for batch in loader:
            loss = model(batch)   # skeleton: assumes the model returns a scalar loss
            loss.backward()       # DDP all-reduces gradients during the backward pass
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
Horovod provides a framework-agnostic alternative for data parallelism, usable with PyTorch, TensorFlow, and MXNet through a consistent API. It is particularly well-suited to MPIJob deployments. [Source: https://www.min.io/learn/distributed-training]
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU
model = MyModel().cuda()                  # MyModel is a placeholder for your own model class
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # start every rank from identical weights
4.3.2 Model Parallelism and Tensor Parallelism
When a model is too large to fit in a single GPU’s VRAM, model parallelism distributes the model itself across multiple devices. Different layers (or layer groups) live on different GPUs. Each GPU processes the full mini-batch but only through its own portion of the network, passing activations forward and gradients backward across device boundaries. [Source: https://medium.com/@sulaiman.shamasna/distributed-model-training-5b460f2af482]
GPU 0: Embedding + Layers 0–11 ──activations──► GPU 1: Layers 12–23 ──activations──► GPU 2: Layers 24–35 + Head
                               ◄───gradients───                     ◄───gradients───
Tensor parallelism is a finer-grained form of model parallelism where individual weight matrices are sharded across GPUs. For example, in a transformer’s attention block, the Q/K/V projection matrices can be column-partitioned, computed in parallel, and the results gathered before the next operation. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/reference/distributed-training/]
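The column-partitioning idea can be checked with a few lines of plain Python. In the sketch below, nested lists stand in for tensors and a list of shards stands in for GPUs; each shard multiplies the input by its slice of the weight columns, and concatenating the partial outputs (the gather step) reproduces the unsharded result. The helper names are hypothetical, and real frameworks perform these steps with collective operations on device memory.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as nested lists."""
    return [
        [sum(x[i][t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
        for i in range(len(x))
    ]


def split_columns(w, parts):
    """Partition the columns of w into equal contiguous shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes x @ w_shard; the partial outputs are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]  # one per "GPU"
    return [sum((partial[i] for partial in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]              # two tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # projection to 4 output features
print(matmul(x, w))
print(column_parallel_matmul(x, w, parts=2))
```

Every shard needs the full input x but only its slice of w, which is why tensor parallelism reduces per-GPU weight memory at the cost of a collective operation inside each layer.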
PyTorch’s Fully Sharded Data Parallel (FSDP) combines model and data parallelism: model parameters, gradients, and optimizer states are all sharded across workers. FSDP2 (the next-generation implementation) improves memory efficiency further and is directly supported through the Training Operator. [Source: https://pytorch.org/blog/pytorch-on-kubernetes-kubeflow-trainer-joins-the-pytorch-ecosystem/]
| Strategy | Memory per GPU | Communication Overhead | Best For |
|---|---|---|---|
| Data Parallel (DDP) | Full model copy | Gradient AllReduce | Models that fit in GPU VRAM |
| Model Parallel | Fraction of model | Activation transfer between stages | Extremely large models, vertical split |
| Tensor Parallel | Fraction of tensors | AllReduce within each layer | Transformer attention layers |
| FSDP | Fraction of model + optimizer | AllGather + ReduceScatter | Large models, memory-efficient training |
4.3.3 Pipeline Parallelism with DeepSpeed and Megatron-LM
Pipeline parallelism is a specialized form of model parallelism that divides the model into sequential stages, one per device, and then streams multiple micro-batches through the pipeline simultaneously. While Stage 1 processes micro-batch 2, Stage 2 processes micro-batch 1 — overlapping computation across stages. [Source: https://towardsdatascience.com/distributed-parallel-training-data-parallelism-and-model-parallelism-ec2d234e3214/]
Time →
Stage 0: [MB1 fwd] [MB2 fwd] [MB3 fwd] [ idle  ] [ idle  ] [ idle  ] [ idle  ] [MB3 bwd] [MB2 bwd] [MB1 bwd]
Stage 1: [ idle  ] [MB1 fwd] [MB2 fwd] [MB3 fwd] [ idle  ] [ idle  ] [MB3 bwd] [MB2 bwd] [MB1 bwd] [ idle  ]
Stage 2: [ idle  ] [ idle  ] [MB1 fwd] [MB2 fwd] [MB3 fwd] [MB3 bwd] [MB2 bwd] [MB1 bwd] [ idle  ] [ idle  ]
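The idle slots in such a schedule are usually called the pipeline bubble. For a GPipe-style schedule in which every stage takes the same time per micro-batch, a commonly used estimate of the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The sketch below evaluates that estimate; it is an idealized model that ignores communication time and uneven stage costs.

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of an idealized GPipe-style schedule with equal-cost stages."""
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)


for stages, micro_batches in [(3, 3), (4, 8), (4, 32), (8, 32)]:
    fraction = bubble_fraction(stages, micro_batches)
    print(f"stages={stages} micro_batches={micro_batches} bubble={fraction:.1%}")
```

Under this model, three stages with three micro-batches leave each stage idle 40% of the time, which is why practical configurations use many more micro-batches than stages.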
DeepSpeed implements pipeline parallelism alongside its well-known ZeRO memory optimization stages. ZeRO partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and model parameters (ZeRO-3) across data-parallel ranks, dramatically reducing the per-GPU memory footprint for very large models. [Source: https://oneuptime.com/blog/post/2026-01-27-kubeflow-model-training/view]
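A back-of-the-envelope estimate shows how much each ZeRO stage saves. The sketch below assumes mixed-precision Adam training with the accounting commonly used for ZeRO: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer state (fp32 weights, momentum, and variance). It covers model states only; activations, buffers, and fragmentation are not included, and the function name is hypothetical.

```python
def zero_model_state_bytes(params, dp_degree, stage):
    """Per-GPU bytes for weights, gradients, and optimizer state under a ZeRO stage."""
    fp16_params = 2 * params
    fp16_grads = 2 * params
    optimizer_state = 12 * params
    if stage >= 1:
        optimizer_state /= dp_degree  # ZeRO-1 partitions optimizer state
    if stage >= 2:
        fp16_grads /= dp_degree       # ZeRO-2 also partitions gradients
    if stage >= 3:
        fp16_params /= dp_degree      # ZeRO-3 also partitions parameters
    return fp16_params + fp16_grads + optimizer_state


params = 7_500_000_000  # illustrative 7.5B-parameter model
dp_degree = 64          # illustrative data-parallel degree
for stage in range(4):
    gigabytes = zero_model_state_bytes(params, dp_degree, stage) / 1e9
    print(f"ZeRO stage {stage}: {gigabytes:.1f} GB of model state per GPU")
```

The step from stage 0 to stage 1 already removes most of the footprint because optimizer state dominates under Adam; stage 3 is what makes the per-GPU cost shrink in proportion to the number of data-parallel ranks.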
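A minimal DeepSpeed configuration fragment for ZeRO-1 with pipeline parallelism. DeepSpeed's pipeline engine does not support ZeRO stages 2 and 3, so ZeRO-3 with CPU offload (the offload_optimizer and offload_param settings) is an alternative to pipeline parallelism rather than a companion to it:
{
  "zero_optimization": {
    "stage": 1
  },
  "pipeline": {
    "stages": 4,
    "activation_checkpoint_interval": 1
  },
  "fp16": { "enabled": true }
}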
Megatron-LM is NVIDIA’s framework for training transformer-based language models at scale. It natively combines tensor parallelism and pipeline parallelism and is commonly used for models in the 10B–1T parameter range.
4.3.4 Hybrid Parallelism Strategies for Large Models
For the largest models — those in the tens to hundreds of billions of parameters — no single parallelism strategy is sufficient. 3D parallelism (or hybrid parallelism) combines all three axes:
| Dimension | Strategy | Example |
|---|---|---|
| Intra-layer | Tensor parallelism | Shard attention Q/K/V across 8 GPUs |
| Inter-layer | Pipeline parallelism | 4 pipeline stages across 4 node groups |
| Cross-replica | Data parallelism | 16 DDP replicas of the full pipeline |
The total GPU count is the product: 8 (tensor) × 4 (pipeline) × 16 (data) = 512 GPUs.
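One way to picture the product is to map each global rank to a coordinate along the three axes. The sketch below assumes tensor parallelism is the fastest-varying dimension, so that the 8 ranks of a tensor-parallel group are consecutive and land on the same 8-GPU node. That ordering is an illustrative assumption; frameworks such as Megatron-LM and DeepSpeed define their own rank layouts.

```python
TP, PP, DP = 8, 4, 16  # tensor, pipeline, and data-parallel degrees from the table


def rank_to_coords(rank, tp=TP, pp=PP, dp=DP):
    """Map a global rank to (tensor, pipeline, data) coordinates, tensor innermost."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError(f"rank {rank} outside 0..{tp * pp * dp - 1}")
    return {
        "tp": rank % tp,          # position inside the tensor-parallel group
        "pp": (rank // tp) % pp,  # pipeline stage
        "dp": rank // (tp * pp),  # data-parallel replica
    }


print("total GPUs:", TP * PP * DP)
for rank in (0, 7, 8, 37, 511):
    print(rank, rank_to_coords(rank))
```

With this layout, ranks 0 through 7 share pipeline stage 0 of replica 0, rank 8 starts stage 1, and rank 32 starts the second data-parallel replica.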
Figure 4.6: 3D hybrid parallelism combining tensor, pipeline, and data parallelism
flowchart LR
subgraph DP["Data Parallelism (16 replicas)"]
subgraph PP0["Pipeline replica 0"]
subgraph TP0["Tensor Parallel\n(8 GPUs)"]
S0["Stage 0\nLayers 0-11\n8× GPU shards"]
end
subgraph TP1["Tensor Parallel\n(8 GPUs)"]
S1["Stage 1\nLayers 12-23\n8× GPU shards"]
end
subgraph TP2["Tensor Parallel\n(8 GPUs)"]
S2["Stage 2\nLayers 24-35\n8× GPU shards"]
end
subgraph TP3["Tensor Parallel\n(8 GPUs)"]
S3["Stage 3\nLayer 36+Head\n8× GPU shards"]
end
S0 -->|"activations\n(pipeline)"| S1 --> S2 --> S3
end
PP0 -->|"AllReduce\ngradients"| PP1["... 15 more\nidentical replicas"]
end
TOTAL["Total GPUs: 8 × 4 × 16 = 512"]
style TOTAL fill:#E67E22,color:#fff
style DP fill:#EBF5FB
A rule of thumb for choosing dimensions:
- Tensor parallelism — keep within a single node (NVLink bandwidth is critical)
- Pipeline parallelism — span nodes; pipeline bubbles grow with stage count so keep stages ≤ 8
- Data parallelism — span across the cluster; RDMA/InfiniBand or RoCE is recommended for AllReduce at scale [Source: https://www.theriseunion.com/en/blog/Deep-Learning-with-Multiple-GPUs.html]
Key Takeaway: Data parallelism (DDP/FSDP) is the right default for models that fit in GPU memory. For larger models, layer-wise or tensor-model parallelism is required. DeepSpeed’s ZeRO stages and Megatron-LM’s 3D parallelism provide the framework machinery; the Kubeflow Training Operator provides the Kubernetes plumbing to launch and manage these workloads at scale.
4.4 Checkpointing and Fault Tolerance
A distributed training run on 256 GPUs can take days or weeks. On any modern cloud or on-premises cluster, the probability of at least one node failure during that window is non-trivial. Without checkpointing and fault tolerance, a single hardware glitch can erase days of compute. This section describes how to design training workloads that survive failures gracefully.
4.4.1 Periodic Checkpoint Saving to Shared Storage
The foundation of fault tolerance is periodic checkpointing — saving the model weights, optimizer state, and training metadata to persistent shared storage at regular intervals. If a job fails and restarts, it loads the latest checkpoint and resumes from where it left off rather than restarting from epoch 0. [Source: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput]
A standard checkpoint save pattern in PyTorch:
import os
import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/checkpoints"
CHECKPOINT_INTERVAL = 100  # save every 100 steps

def save_checkpoint(model, optimizer, step, epoch, loss):
    if dist.get_rank() == 0:  # only rank 0 writes to avoid race conditions
        path = os.path.join(CHECKPOINT_DIR, f"checkpoint_step_{step}.pt")
        torch.save({
            "step": step,
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "loss": loss,
        }, path)
        # Write a pointer to the latest valid checkpoint. Writing to a temporary
        # file and renaming it means a crash cannot leave a half-written pointer.
        tmp = os.path.join(CHECKPOINT_DIR, "latest.txt.tmp")
        with open(tmp, "w") as f:
            f.write(path)
        os.replace(tmp, os.path.join(CHECKPOINT_DIR, "latest.txt"))

def load_checkpoint(model, optimizer):
    latest_file = os.path.join(CHECKPOINT_DIR, "latest.txt")
    if not os.path.exists(latest_file):
        return 0, 0  # no checkpoint: start fresh
    with open(latest_file) as f:
        path = f.read().strip()
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["step"], checkpoint["epoch"]
Asynchronous checkpointing substantially reduces the GPU stall time associated with writing large checkpoints. Instead of blocking the training loop while serializing model state, the state is first copied from GPU HBM to host CPU memory (a fast, non-blocking operation) and then written to shared storage in a background thread. [Source: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput]
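The two-phase pattern can be sketched with the standard library alone. In the example below, copy.deepcopy stands in for the GPU-to-host copy and json stands in for the real serializer; both are simplifications chosen so that the sketch runs anywhere, and async_save is a hypothetical helper rather than a PyTorch API.

```python
import copy
import json
import os
import tempfile
import threading


def async_save(state, path):
    """Snapshot the state synchronously, then write it to disk in a background thread."""
    snapshot = copy.deepcopy(state)  # stand-in for the fast device-to-host copy

    def _write():
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint_step_100.json")

writer = async_save(state, ckpt_path)
state["weights"][0] = 9.9  # training continues and mutates the live state
writer.join()              # wait before starting the next save or exiting

with open(ckpt_path) as f:
    restored = json.load(f)
print(restored)
```

The snapshot is what makes this safe: the background thread serializes a private copy, so updates made by the training loop after the call cannot leak into the checkpoint.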
Shared storage options for checkpoints:
| Storage Type | Use Case | Notes |
|---|---|---|
| NFS / CephFS (ReadWriteMany PVC) | On-premises multi-node training | All workers can mount the same PVC |
| AWS EFS / Azure Files / Cloud Storage FUSE | Cloud-based training | Managed, high-durability shared filesystems (EFS, Azure Files) or object-store mounts (Cloud Storage FUSE) |
| Local NVMe + periodic sync | Highest-speed intermediate checkpoints | Sync to shared storage every N checkpoints |
4.4.2 Automatic Restart and Recovery from Node Failures
Kubernetes itself provides basic restart capability through the restartPolicy: OnFailure setting in Job/Pod specs. For the Training Operator, per-replica restartPolicy can be set to OnFailure or ExitCode (the latter allows fine-grained exit-code-based decisions on whether to restart or mark the job failed).
However, a node failure typically kills the entire distributed job — not just the Pod on the failed node — because the remaining workers cannot make progress without their peers. The full restart flow is:
- Node failure is detected: the node is marked NotReady, and by default its Pods are evicted once the 300-second (5-minute) toleration on the not-ready and unreachable taints expires
- Pods on the failed node enter the Unknown or Failed state
- The Training Operator restarts all Pods (if restartPolicy: OnFailure)
- Each worker calls load_checkpoint() and resumes from the last saved step
TorchFT (PyTorch Fault Tolerance) enables a more sophisticated recovery path for sub-group failures. On platforms such as AMD’s Primus-SaFE Kubernetes environment, where TorchFT is integrated with the TorchTitan training framework, a node failure causes the platform to preempt lower-priority workloads to reclaim GPU resources and automatically replace the failed process group, restoring the job to full scale without a full restart. [Source: https://rocm.blogs.amd.com/artificial-intelligence/primus-torchft/README.html]
Normal training: [Group 0] [Group 1] [Group 2] [Group 3] → AllReduce
Node failure: [Group 0] [ DEAD ] [Group 2] [Group 3] → TorchFT detects
Recovery: [Group 0] [Group 1'] [Group 2] [Group 3] → restored, resume from checkpoint
(preempted lower-priority job freed GPUs for Group 1')
For cloud environments with spot/preemptible instances, the Training Operator’s backoffLimit and checkpoint-on-exit handlers are essential. Set backoffLimit generously (e.g., 20) and ensure the training script writes a checkpoint on SIGTERM — the signal Kubernetes sends before evicting a Pod:
import signal
import sys
def sigterm_handler(signum, frame):
    # Kubernetes sends SIGTERM first, then SIGKILL after the grace period
    print("Received SIGTERM: saving emergency checkpoint")
    save_checkpoint(model, optimizer, current_step, current_epoch, current_loss)
    sys.exit(0)
signal.signal(signal.SIGTERM, sigterm_handler)
4.4.3 Elastic Training with Dynamic Scaling
Elastic training allows a distributed job to continue running even when the number of available workers changes — workers can join or leave mid-training without requiring a full restart. This makes it feasible to use spot instances for long training runs, absorbing preemptions as temporary worker departures rather than catastrophic failures. [Source: https://aws.amazon.com/blogs/containers/fault-tolerant-distributed-machine-learning-training-with-the-torchelastic-controller-for-kubernetes/]
The TorchElastic Controller for Kubernetes (TECK) provides native elastic training support. You specify minReplicas, maxReplicas, and the desired replica count:
apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: elastic-llm-pretrain
  namespace: training
spec:
  minReplicas: 4    # minimum to make progress
  maxReplicas: 16   # scale up to here when GPUs are available
  replicaSpecs:
    Worker:
      replicas: 8   # start with 8, scale between 4 and 16
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: trainer
              image: myregistry/elastic-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --rdzv_backend=etcd
                - --rdzv_endpoint=etcd-service:2379
                - --nnodes=4:16          # elastic range, min:max nodes
                - --nproc_per_node=8     # one worker process per GPU
                - train_elastic.py
When a node is added (scale-up), torchrun triggers a new rendezvous round: the elastic agents stop their current workers, the new node joins, and all workers restart with an updated WORLD_SIZE, resuming from the most recent checkpoint. When a node is lost (scale-down or preemption), the same rendezvous mechanism fires, the surviving workers re-form the group with the updated count, and training continues, provided the worker count stays at or above minReplicas. [Source: https://medium.com/pytorch/reduce-time-and-cost-by-running-distributed-elastic-pytorch-jobs-on-kubernetes-4f7ac3986307]
Hyperparameters such as learning rate and effective batch size need adjustment when the world size changes. TorchElastic restarts the worker processes after every membership change, so the training script can apply a user-defined adjustment at startup, comparing the new WORLD_SIZE with the world size recorded in the last checkpoint:
def on_worker_count_changed(optimizer, old_world_size, new_world_size):
    # Linearly scale learning rate with world size change
    scale = new_world_size / old_world_size
    for param_group in optimizer.param_groups:
        param_group['lr'] *= scale
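An alternative to rescaling the learning rate is to hold the global batch size constant and change the number of gradient-accumulation steps instead. The sketch below shows only the arithmetic; `accumulation_steps` and the batch sizes are hypothetical values chosen for illustration, not part of TorchElastic.

```python
def accumulation_steps(global_batch, per_gpu_batch, world_size):
    """Gradient-accumulation steps needed to keep the global batch size fixed."""
    per_step = per_gpu_batch * world_size
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by {per_step}"
        )
    return global_batch // per_step


# Same global batch of 1024 samples at two different world sizes.
steps_at_8 = accumulation_steps(global_batch=1024, per_gpu_batch=16, world_size=8)
steps_at_16 = accumulation_steps(global_batch=1024, per_gpu_batch=16, world_size=16)
print(steps_at_8, steps_at_16)  # 8 4
```

Because the optimizer still sees 1024 samples per update, the learning rate can stay unchanged when workers join or leave, at the cost of restricting world sizes to those that divide the global batch evenly.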
PyTorch Elastic is fully native to PyTorch — no additional frameworks like Horovod or Ray are required for basic elasticity. For users who need distributed execution beyond training (e.g., data preprocessing pipelines integrated with training), Ray’s elastic actor model is a complementary alternative. [Source: https://www.infracloud.io/blogs/distributed-parallel-processing-ray-kuberay/]
| Feature | TorchElastic (TECK) | Horovod Elastic | Ray Train |
|---|---|---|---|
| Min/max worker bounds | Yes | Yes | Yes |
| Framework native | PyTorch only | Multi-framework | Multi-framework |
| Hyperparameter adjustment on scale | In user code after worker restart | Yes | Yes |
| Kubernetes CRD | ElasticJob | Via MPIJob | Via RayCluster + RayJob |
| Checkpoint integration | User-managed (e.g., torch.save or TorchSnapshot) | Manual | Via Ray Train checkpointing API |
Key Takeaway: Robust fault tolerance requires three cooperating mechanisms: periodic checkpointing to shared persistent storage, automatic Pod restart configured in the Training Operator, and optionally elastic scaling via TorchElastic to absorb node fluctuations without job interruption. Asynchronous checkpointing minimizes the GPU time lost to checkpoint I/O.
Chapter Summary
This chapter covered the full spectrum of Kubernetes tooling for running training workloads, from simple single-GPU jobs to elastic multi-node distributed training.
Native Kubernetes Job primitives — plain Jobs, CronJobs, and Indexed Jobs — handle single-node and embarrassingly-parallel workloads without additional dependencies. Init containers cleanly separate data preparation from training execution.
The Kubeflow Training Operator is the de facto standard for tightly-coupled distributed training on Kubernetes. It provides the PyTorchJob, TFJob, and MPIJob custom resources, automatically injects distributed environment variables, and integrates with gang schedulers like Volcano to prevent partial-allocation deadlocks that would stall distributed jobs.
Distributed training strategies range from straightforward data parallelism (DDP, FSDP, Horovod AllReduce) to memory-efficient model and tensor parallelism to throughput-optimized pipeline parallelism (DeepSpeed, Megatron-LM). For the largest models, 3D hybrid parallelism combines all three axes.
Finally, fault tolerance is not optional for long training runs. Periodic checkpointing to shared storage, SIGTERM handlers for graceful preemption, and elastic training via TorchElastic together ensure that hardware failures and spot interruptions do not erase days of compute.
Key Terms
| Term | Definition |
|---|---|
| Training Operator | Kubeflow component providing Kubernetes CRDs and a controller for managing distributed ML training jobs across multiple frameworks |
| PyTorchJob | Kubeflow Training Operator custom resource for running distributed PyTorch training jobs with automatic environment variable injection |
| TFJob | Kubeflow Training Operator custom resource for running distributed TensorFlow training jobs using parameter server or AllReduce strategies |
| MPIJob | Kubeflow Training Operator custom resource for running MPI-based distributed training jobs, including Horovod workloads, via a launcher/worker pattern |
| Data parallelism | Distributed training strategy where each worker holds a full model copy and trains on a different data shard; gradients are synchronized via AllReduce at each step |
| Model parallelism | Distributed training strategy where model layers are split across multiple devices; each device processes the full mini-batch through its portion of the network |
| Pipeline parallelism | Form of model parallelism that divides the model into sequential stages and streams micro-batches through the pipeline simultaneously to improve throughput |
| Gang scheduling | Scheduling policy that requires all Pods of a distributed job to be allocated simultaneously, preventing partial-allocation deadlocks |
| Volcano | Open-source Kubernetes batch scheduler providing gang scheduling, queue management, and fair-share policies for ML and HPC workloads |
| DeepSpeed | Microsoft’s deep learning optimization library providing ZeRO memory optimization stages, pipeline parallelism, and mixed-precision training at scale |
| Horovod | Uber’s framework-agnostic distributed training library using the AllReduce pattern; integrates with PyTorch, TensorFlow, and MXNet via MPIJob |
| Elastic training | Training paradigm where the number of workers can change dynamically during a run (via TorchElastic / TECK) without stopping and restarting the job |
| FSDP | Fully Sharded Data Parallel — PyTorch strategy that shards model parameters, gradients, and optimizer states across data-parallel workers to reduce per-GPU memory usage |
| NCCL | NVIDIA Collective Communications Library — the high-performance backend for GPU-to-GPU AllReduce, AllGather, and other collective operations used in distributed training |
| TorchElastic | PyTorch native library enabling elastic and fault-tolerant distributed training; used by the TorchElastic Controller for Kubernetes (TECK) |
| AllReduce | Collective communication operation that aggregates (e.g., sums) tensors from all processes and distributes the result back to all processes; the core synchronization primitive in data-parallel training |
| ZeRO | Zero Redundancy Optimizer — DeepSpeed technique that partitions optimizer states, gradients, and/or parameters across data-parallel ranks to reduce memory redundancy |
Chapter 5: Model Serving and Inference
Learning Objectives
By the end of this chapter, you will be able to:
- Deploy models for inference using KServe, Triton, and vLLM on Kubernetes
- Configure autoscaling for inference workloads based on GPU utilization and request metrics
- Implement canary deployments and A/B testing for model rollouts
- Optimize inference latency and throughput with batching and model optimization techniques
Introduction
Training a model is only half the job. Getting it in front of users — reliably, quickly, and cost-effectively — is where the real operational work begins. Model serving on Kubernetes is the discipline of turning a trained artifact into a production-grade API endpoint that can handle bursty traffic, scale down to zero at 2am, and be updated without disrupting users.
Think of model serving like a restaurant kitchen. The trained model is the recipe, but KServe is the kitchen management system: it routes orders (requests) to the right cook (predictor), ensures backup cooks are on standby when things get busy (autoscaling), and lets the head chef trial a new dish on a small table before rolling it out to the whole menu (canary deployment). This chapter covers the full serving stack, from choosing the right runtime to shipping model updates safely.
Section 1: Model Serving Architectures
Online vs. Batch Inference Patterns
Not all inference looks the same. Two fundamental patterns govern how predictions are generated:
| Pattern | Trigger | Latency Target | Typical Use Case |
|---|---|---|---|
| Online (real-time) | Single HTTP/gRPC request | < 200ms | Chatbots, recommendation APIs, fraud detection |
| Batch | Scheduled job or queue | Minutes to hours | Offline scoring, ETL enrichment, bulk predictions |
Online inference prioritizes low latency: a user waits for each response, so the serving system must return results quickly. Batch inference prioritizes throughput: thousands of records are processed together, and nobody cares if it takes ten minutes as long as the job completes.
On Kubernetes, online serving uses Deployment or InferenceService resources fronted by a Service, while batch inference is better expressed as a Job or CronJob that runs, processes a dataset, and exits. This chapter focuses primarily on online inference, where the architectural complexity is highest.
Figure 5.1: Online vs. Batch Inference Request Flows
flowchart LR
subgraph Online["Online Inference (Real-Time)"]
direction LR
U1[User / Client] -->|HTTP / gRPC request| S1[Service]
S1 --> P1[InferenceService Pod]
P1 -->|Response < 200ms| U1
end
subgraph Batch["Batch Inference (Scheduled)"]
direction LR
CJ[CronJob / Job] -->|Reads dataset| DS[(Dataset Store)]
CJ --> BP[Batch Predictor Pod]
BP -->|Writes scored records| OUT[(Output Store)]
end
Model Server Options
Choosing the right model server is one of the most consequential decisions in your serving stack. Each server occupies a different point in the design space:
| Server | Best For | Key Feature | Protocols |
|---|---|---|---|
| TorchServe | PyTorch models, custom handlers | Native PyTorch integration, model archiver | REST, gRPC |
| Triton Inference Server | Multi-framework, mixed workloads | Multi-model concurrency, dynamic batching | HTTP/REST, gRPC |
| vLLM | Large language model text generation | PagedAttention, continuous batching | OpenAI-compatible REST |
| Text Generation Inference (TGI) | LLM serving (HuggingFace ecosystem) | Flash Attention, tensor parallelism | REST |
vLLM and TGI are specialized for transformer-based LLMs and handle the unique memory challenges of autoregressive generation. Triton is a generalist: it supports TensorFlow, PyTorch, ONNX, TensorRT, and can even run vLLM as a backend engine [Source: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/vllm_backend/README.html]. TorchServe suits teams that want tight PyTorch model lifecycle management without introducing a heavier abstraction.
Sidecar and Standalone Serving Topologies
Model servers can be deployed in two main topologies:
Standalone: The model server runs as the primary container in a pod. The pod’s main process is the inference engine. This is simpler and appropriate for most cases.
Sidecar: A lightweight proxy or transformer runs alongside the model server container in the same pod. KServe uses this pattern to inject its storage initializer as an init container that downloads model weights before the serving container starts, and optionally attaches a transformer container for pre/post-processing.
┌─────────────── Pod ──────────────────┐
│ [init: storage-initializer] │
│ ┌─────────────────────────────────┐ │
│ │ transformer container │ │ ← optional pre/post-processing
│ └────────────────┬────────────────┘ │
│ │ │
│ ┌────────────────▼────────────────┐ │
│ │ predictor container (vLLM/ │ │ ← core inference engine
│ │ Triton/TorchServe) │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────┘
Key Takeaway: Choose online or batch inference based on latency requirements, and pick your model server based on model framework and workload mix. Triton excels at heterogeneous multi-model deployments; vLLM is the default for high-throughput LLM serving.
Section 2: KServe (formerly KFServing)
KServe is the leading Kubernetes-native model serving platform, providing a single InferenceService CRD that abstracts away the complexity of autoscaling, networking, health checking, and runtime configuration [Source: https://kserve.github.io/website/]. It graduated from the Kubeflow project and is now a standalone CNCF project, with v0.15 released in June 2025 adding expanded generative AI serving capabilities [Source: https://www.cncf.io/blog/2025/06/18/announcing-kserve-v0-15-advancing-generative-ai-model-serving/].
The InferenceService Custom Resource
The InferenceService is KServe’s primary abstraction. A minimal definition for an sklearn model looks like this:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
[Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/frameworks/overview]
When this resource is applied, KServe:
- Launches a storage initializer init container to download the model from GCS
- Starts the appropriate serving runtime (sklearn-server in this case)
- Configures a Knative Service or Kubernetes Deployment depending on the installation mode
- Wires up ingress routing via Istio or Kourier
- Optionally configures autoscaling
Supported modelFormat values include sklearn, pytorch, tensorflow, onnx, xgboost, lightgbm, pmml, and custom runtimes [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/frameworks/overview].
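Once the InferenceService is ready, clients call it over KServe's v1 prediction protocol, which posts a JSON body with an `instances` list to `/v1/models/<name>:predict`. The sketch below only builds the request; the hostname is an assumption (it depends on the cluster's ingress domain), and the feature values are illustrative.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances):
    """Build (but do not send) a KServe v1-protocol predict request."""
    return urllib.request.Request(
        url=f"http://{host}/v1/models/{model_name}:predict",
        data=json.dumps({"instances": instances}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical hostname; the real one comes from the InferenceService status URL.
request = build_predict_request(
    host="sklearn-iris.default.example.com",
    model_name="sklearn-iris",
    instances=[[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
)

# To send it from a machine that can reach the ingress:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response)["predictions"])
```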
Predictor, Transformer, and Explainer Components
A full InferenceService can include up to three logical components:
| Component | Purpose | Example |
|---|---|---|
| Predictor | Core inference — runs the model | vLLM server, Triton, sklearn-server |
| Transformer | Pre/post-processing pipeline | Tokenization, feature engineering, output formatting |
| Explainer | Model explainability | SHAP values, LIME explanations |
The transformer and explainer are optional. When a transformer is present, KServe routes each request through it before (and after) the predictor, enabling clean separation between business logic and model inference.
Figure 5.2: KServe InferenceService Request Pipeline
flowchart LR
Client([Client Request]) --> T
subgraph IS["InferenceService"]
T["Transformer\n(pre-processing)"] -->|Transformed input| P["Predictor\n(model inference)"]
P -->|Raw prediction| T2["Transformer\n(post-processing)"]
P -->|Prediction| E["Explainer\n(SHAP / LIME)"]
end
T2 --> R([Response to Client])
E -->|Explanation| R
style T fill:#dbeafe,stroke:#3b82f6
style P fill:#dcfce7,stroke:#16a34a
style T2 fill:#dbeafe,stroke:#3b82f6
style E fill:#fef9c3,stroke:#ca8a04
spec:
  transformer:
    containers:
      - name: kserve-container
        image: my-org/feature-transformer:v2
        env:
          - name: STORAGE_URI
            value: "gs://my-bucket/vocab.json"
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "gs://my-bucket/models/classifier/v3"
  explainer:
    alibi:
      type: AnchorTabular
      storageUri: "gs://my-bucket/explainers/anchor"
Model Storage Initialization and Model Mesh
The storage initializer is an init container that KServe injects into every predictor pod. It supports multiple storage backends out of the box:
| Backend | URI Prefix | Example |
|---|---|---|
| Google Cloud Storage | gs:// | gs://my-bucket/models/bert |
| AWS S3 | s3:// | s3://my-bucket/models/bert |
| Azure Blob | https://&lt;account&gt;.blob.core.windows.net/ | https://myaccount.blob.core.windows.net/mycontainer/models/bert |
| HTTP/HTTPS | https:// | https://huggingface.co/... |
| PersistentVolumeClaim | pvc:// | pvc://model-store/bert-base |
Model Mesh is KServe’s multi-model serving capability. Instead of one pod per model, Model Mesh co-locates many models within a shared pool of serving pods. This dramatically reduces the per-model pod overhead when an organization maintains hundreds of smaller models. Model Mesh maintains an in-memory cache and loads/unloads models from GPU or CPU memory on demand [Source: https://github.com/kserve/kserve].
Multi-Model Serving for Cost Efficiency
The economics of model serving can be challenging. A team with 200 fraud-detection models — each serving low traffic — cannot afford a dedicated GPU pod per model. Multi-model serving solves this by packing multiple models into shared infrastructure.
The tradeoff is isolation: in a single-model pod, a poorly behaved model can only crash itself. In a shared pod, a memory leak affects all co-located models. KServe’s Model Mesh mitigates this with separate model loader processes and health monitoring per model.
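The load-on-demand behavior behind multi-model serving can be illustrated with a small least-recently-used cache. This is a toy sketch of the general idea, not Model Mesh's actual placement or eviction algorithm; `ModelCache`, the loader, and the model names are hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # callable: model name -> loaded model
        self.loaded = OrderedDict()     # model name -> model, oldest first
        self.evictions = []

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)          # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.capacity:
            evicted, _ = self.loaded.popitem(last=False)   # drop the oldest entry
            self.evictions.append(evicted)
        self.loaded[name] = self.loader(name)
        return self.loaded[name]


cache = ModelCache(capacity=2, loader=lambda name: f"<weights for {name}>")
for name in ["fraud-eu", "fraud-us", "fraud-eu", "fraud-apac"]:
    cache.get(name)

print(list(cache.loaded))   # ['fraud-eu', 'fraud-apac']
print(cache.evictions)      # ['fraud-us']
```

With 200 low-traffic models and memory for only a few dozen, most requests hit an already-loaded model, and only rarely used models pay a load penalty.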
Key Takeaway: KServe’s InferenceService CRD is the central abstraction for production ML serving on Kubernetes. Its predictor/transformer/explainer decomposition encourages clean separation of concerns, and Model Mesh enables cost-efficient multi-model deployments where per-model pod overhead would otherwise be prohibitive.
Section 3: LLM Serving on Kubernetes
Large language models present unique serving challenges. A 70-billion-parameter model in FP16 requires approximately 140GB of GPU VRAM for its weights alone (70 billion parameters × 2 bytes), more than any single consumer GPU and more than a single 80GB data-center GPU such as the A100 or H100. The autoregressive generation pattern (producing one token at a time) creates inherently sequential workloads that are hard to batch naively. This section covers the specialized tools and techniques for serving LLMs efficiently on Kubernetes.
vLLM Deployment and PagedAttention Optimization
vLLM is the dominant open-source LLM serving engine in production environments [Source: https://github.com/vllm-project/vllm]. Its defining innovation is PagedAttention, an algorithm that manages the KV (key-value) cache using virtual memory techniques borrowed from operating system page tables.
Why does KV cache management matter? During autoregressive generation, each token produced depends on all previous tokens. The model must cache the key and value vectors for every token in the context window to avoid recomputing them. With a 4096-token context window and large batch sizes, this cache can consume more memory than the model weights themselves — and traditional servers reserved it as one large contiguous VRAM block.
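The claim that the cache can outgrow the weights follows from simple arithmetic: each token stores one key and one value vector per layer. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, FP16 values); other models, especially those using grouped-query attention, will give smaller numbers.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model shape: 32 layers, 32 KV heads, head dimension 128, FP16 values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_sequence = per_token * 4096            # one full 4096-token context
batch_of_32 = per_sequence * 32            # 32 concurrent full-length sequences

print(per_token / 2**20, "MiB per token")        # 0.5
print(per_sequence / 2**30, "GiB per sequence")  # 2.0
print(batch_of_32 / 2**30, "GiB for the batch")  # 64.0
```

At 64 GiB for 32 full-length sequences, the cache is several times larger than the roughly 14GB of FP16 weights in a 7B-parameter model.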
PagedAttention stores the KV cache in fixed-size, non-contiguous “pages,” just like virtual memory in an OS. This nearly eliminates fragmentation (waste is limited to the last, partially filled page of each sequence) and allows cache pages to be shared across requests that have a common prefix. The result: up to 24x higher throughput on the same hardware compared to naive approaches [Source: https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes].
Figure 5.3: PagedAttention KV Cache — Contiguous vs. Paged Allocation
block-beta
columns 6
block:traditional["Traditional (Contiguous)"]:6
T1["Req-A\nTokens 0-15\n(reserved)"]
T2["Req-B\nTokens 0-7\n(reserved)"]
T3["Req-C\nTokens 0-31\n(reserved)"]
T4["WASTED\n(fragmentation)"]
T5["WASTED"]
T6["WASTED"]
end
block:paged["PagedAttention (Non-Contiguous Pages)"]:6
P1["Page 0\nReq-A t0-3"]
P2["Page 1\nReq-B t0-3"]
P3["Page 2\nReq-A t4-7"]
P4["Page 3\nReq-C t0-3"]
P5["Page 4\nReq-A t8-11"]
P6["Page 5\nReq-B t4-7"]
end
style T4 fill:#fecaca,stroke:#ef4444
style T5 fill:#fecaca,stroke:#ef4444
style T6 fill:#fecaca,stroke:#ef4444
style P1 fill:#dcfce7,stroke:#16a34a
style P2 fill:#dbeafe,stroke:#3b82f6
style P3 fill:#dcfce7,stroke:#16a34a
style P4 fill:#fef9c3,stroke:#ca8a04
style P5 fill:#dcfce7,stroke:#16a34a
style P6 fill:#dbeafe,stroke:#3b82f6
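The page mapping in the figure can be sketched as a block table that hands out physical pages on demand. This is a toy allocator to illustrate the idea, not vLLM's implementation; `PagedKVCache`, the page count, and the page size are hypothetical.

```python
class PagedKVCache:
    """Toy block table: maps each request's logical pages to physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # physical page ids
        self.block_tables = {}                     # request id -> [physical page ids]
        self.token_counts = {}                     # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, allocating a page only when needed."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % self.page_size == 0:            # last page is full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache is full")
            table.append(self.free_pages.pop(0))
        self.token_counts[request_id] = count + 1

    def release(self, request_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


cache = PagedKVCache(num_pages=6, page_size=4)
# Interleave two requests, as concurrent decoding would.
for _ in range(4):
    cache.append_token("req-a")
    cache.append_token("req-b")
for _ in range(4):
    cache.append_token("req-a")

print(cache.block_tables)   # {'req-a': [0, 2], 'req-b': [1]}
```

Request A ends up on physical pages 0 and 2, which are not adjacent, yet nothing is reserved beyond the tokens actually generated. Releasing a finished request makes its pages available to any other request immediately.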
A basic vLLM deployment on Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--gpu-memory-utilization"
            - "0.85"
            - "--tensor-parallel-size"
            - "1"
            - "--max-model-len"
            - "8192"
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi   # ← critical for LLM IPC
        - name: model-cache
          persistentVolumeClaim:
            claimName: huggingface-cache
A critical Kubernetes-specific detail: pods default to only 64MB of shared memory (/dev/shm), which AI frameworks use heavily for inter-process communication. Without the emptyDir with medium: Memory override, large models will crash with OSError: [Errno 28] No space left on device during initialization [Source: https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes].
vLLM’s gpu_memory_utilization defaults to 0.9 (90% of VRAM). When co-locating models or using MIG-partitioned GPUs, this should be reduced to leave headroom [Source: https://bizety.com/2025/09/29/vllm-vs-triton-competing-or-complementary/].
Text Generation Inference (TGI) Setup
HuggingFace’s Text Generation Inference is an alternative LLM server that integrates deeply with the HuggingFace Hub ecosystem. TGI supports Flash Attention 2, continuous batching, and tensor parallelism, making it competitive with vLLM for many use cases.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:2.3
    args:
      - "--model-id"
      - "mistralai/Mistral-7B-Instruct-v0.3"
      - "--num-shard"
      - "1"
      - "--quantize"
      - "bitsandbytes-nf4"
    resources:
      limits:
        nvidia.com/gpu: "1"
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
Quantization (GPTQ, AWQ, GGUF) for Reduced GPU Memory
Quantization converts model weights from high-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4), trading a small amount of accuracy for significant memory and speed gains:
| Format | Precision | Memory Reduction | Typical Accuracy Loss | Supported By |
|---|---|---|---|---|
| FP16 (baseline) | 16-bit float | — | — | All servers |
| GPTQ | INT4 | ~4x vs FP16 | < 1% on most benchmarks | vLLM, TGI, Triton |
| AWQ (Activation-Aware) | INT4 | ~4x vs FP16 | Slightly better than GPTQ | vLLM, TGI |
| GGUF (llama.cpp) | 2–8 bit mixed | Up to 8x | Varies by bit depth | llama.cpp, Ollama |
AWQ improves on GPTQ by identifying and protecting the most salient weights before quantization, yielding better perplexity at the same bit width [Source: https://www.inferless.com/learn/vllm-vs-triton-inference-server-choosing-the-best-inference-library-for-large-language-models].
To serve a GPTQ-quantized model with vLLM:
args:
  - "--model"
  - "TheBloke/Llama-2-13B-chat-GPTQ"
  - "--quantization"
  - "gptq"
  - "--gpu-memory-utilization"
  - "0.80"
A 13B parameter model in FP16 requires ~26GB VRAM. The GPTQ INT4 version requires ~7GB — fitting on a single A10G or consumer RTX 3090.
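These figures come from multiplying the parameter count by the bytes per weight. The sketch below reproduces the arithmetic for weights only; it ignores the KV cache, activations, and quantization metadata such as per-group scales, which is a plausible reason the quoted ~7GB sits slightly above the raw INT4 number.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_13b = weight_memory_gb(13e9, 16)
int4_13b = weight_memory_gb(13e9, 4)
fp16_70b = weight_memory_gb(70e9, 16)

print(fp16_13b, int4_13b, fp16_70b)   # 26.0 6.5 140.0
```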
Continuous Batching and Speculative Decoding
Continuous batching is vLLM’s scheduling strategy that keeps the GPU pipeline full at all times. In traditional static batching, the server waits for a batch of N requests to arrive before processing any of them, and the whole batch runs until its longest request finishes. If one request in the batch generates 200 tokens and another generates 10, the short request's slot sits idle for the remaining 190 decode steps.
Continuous batching solves this by treating each decode step as an opportunity to insert new requests. As soon as a sequence completes, a new request from the queue immediately fills its slot in the next forward pass [Source: https://bizety.com/2025/09/29/vllm-vs-triton-competing-or-complementary/]. This keeps GPU utilization high and minimizes time-to-first-token for queued requests.
Time →  [Step 1][Step 2][Step 3][Step 4][Step 5]
Slot A: [Req-1 ][Req-1 ][Req-1 ][Req-5 ][Req-5 ] ← Req-1 finishes at step 3; Req-5 takes the slot
Slot B: [Req-2 ][Req-2 ][Req-4 ][Req-4 ][Req-4 ] ← Req-2 finishes at step 2; Req-4 takes the slot
Slot C: [Req-3 ][Req-3 ][Req-3 ][Req-3 ][Req-3 ] ← long request keeps its slot throughout
Speculative decoding accelerates generation further by using a small draft model to propose several tokens, which the large target model then verifies in parallel. When the draft model is right (which it often is for common phrases), multiple tokens are produced per forward pass, reducing effective latency.
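The draft-and-verify loop can be sketched with two toy "models" that emit characters. This is a simplified greedy variant for illustration: `target_next` and `draft_next` are hypothetical stand-ins for real models, the verification pass is simulated with a loop rather than a batched forward pass, and the bonus token a real target model yields after a fully accepted proposal is omitted.

```python
TARGET_TEXT = "the cat sat on the mat"


def target_next(prefix):
    """Stand-in for the large model: always continues TARGET_TEXT."""
    return TARGET_TEXT[len(prefix)]


def draft_next(prefix):
    """Stand-in for the small model: agrees with the target except on 'm'."""
    token = TARGET_TEXT[len(prefix)]
    return "c" if token == "m" else token


def speculative_decode(k):
    """Greedy speculative decoding; returns (text, number of target passes)."""
    output, target_passes = "", 0
    while len(output) < len(TARGET_TEXT):
        # 1. Draft proposes up to k tokens autoregressively (cheap).
        proposal = ""
        while len(proposal) < k and len(output) + len(proposal) < len(TARGET_TEXT):
            proposal += draft_next(output + proposal)
        # 2. Target checks every proposed position in one (simulated) pass.
        target_passes += 1
        accepted = ""
        for token in proposal:
            expected = target_next(output + accepted)
            if token != expected:
                accepted += expected      # target's own token replaces the miss
                break
            accepted += token
        output += accepted
    return output, target_passes


text, passes = speculative_decode(k=4)
print(text, passes)   # the cat sat on the mat 6
```

The output is identical to what the target model would produce alone, but it takes 6 target passes instead of 22, because every pass accepts up to four draft tokens and a mismatch costs nothing beyond the corrected token.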
Figure 5.4: Continuous Batching — Dynamic Request Slot Assignment
sequenceDiagram
participant Q as Request Queue
participant S as Scheduler
participant G as GPU (Forward Pass)
Q->>S: Req-1, Req-2, Req-3 arrive
S->>G: Step 1: [Req-1, Req-2, Req-3]
G-->>S: Req-2 completes (short output)
Q->>S: Req-4 waiting in queue
S->>G: Step 2: [Req-1, Req-4, Req-3]
Note over S,G: Req-4 inserted immediately — no idle slot
G-->>S: Req-1 completes
Q->>S: Req-5 waiting in queue
S->>G: Step 3: [Req-5, Req-4, Req-3]
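The throughput difference between the two scheduling strategies can be estimated with a small simulation that counts decode steps. It is a simplified model under stated assumptions: every decode step costs the same, prefill is ignored, and the request lengths are made up for illustration.

```python
def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled on the very next step."""
    queue = list(lengths)
    active = []                      # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))             # fill any free slot
        steps += 1                                  # one forward pass for all slots
        active = [r - 1 for r in active if r > 1]   # drop finished requests
    return steps


output_lengths = [200, 10, 10, 10, 10, 10]   # tokens each request generates
static_steps = static_batching_steps(output_lengths, slots=2)
continuous_steps = continuous_batching_steps(output_lengths, slots=2)
print(static_steps, continuous_steps)   # 220 200
```

With static batching, the four remaining short requests wait behind the 200-token request; with continuous batching they all complete within the first 50 steps while the long request is still decoding.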
Key Takeaway: vLLM’s PagedAttention and continuous batching are the two most impactful optimizations for LLM serving throughput. Quantization (AWQ or GPTQ) allows models 2–4x too large for your GPU to fit in memory, often with minimal quality degradation. Always configure /dev/shm for LLM pods on Kubernetes.
Section 4: Autoscaling and Traffic Management
HPA with Custom GPU and Request Latency Metrics
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ built-in mechanism for scaling workloads based on observed metrics. For inference services, the most useful scaling signals are:
| Metric | Signal | Scaling Behavior |
|---|---|---|
| CPU utilization | General compute pressure | Scale out when CPU > threshold |
| GPU utilization | GPU saturation | Scale out when GPU DCGM metric > threshold |
| Request latency (p99) | User-visible slowdown | Scale out when p99 latency > SLO |
| Request concurrency | Inflight request queue | Scale to match target concurrency |
| Pending request queue depth | Backpressure in LLM serving | Scale out before latency degrades |
KServe’s HPA integration in Standard mode uses scaleMetric and scaleTarget fields in the InferenceService spec [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/hpa-autoscaler]:
spec:
  predictor:
    scaleMetric: cpu
    scaleTarget: 80   # scale out when avg CPU > 80%
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "gs://my-bucket/models/classifier"
For GPU-based scaling, you need to expose DCGM (Data Center GPU Manager) metrics via Prometheus and configure an HPA with a custom metric reference, which in turn requires a custom metrics adapter (for example, Prometheus Adapter) to surface those metrics to the HPA. The Metrics Server must be installed for standard CPU- and memory-based HPA to function [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/hpa-autoscaler].
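Whatever the metric, the HPA applies the same core rule: desired replicas equal the current replica count scaled by the ratio of the observed metric to its target, rounded up. The sketch below implements that rule with clamping to the replica bounds; it leaves out the tolerance band and stabilization windows that the real controller also applies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """HPA's core rule: ceil(current * currentMetric / targetMetric), then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas averaging 95% utilization against an 80% target: scale out.
scale_out = desired_replicas(current_replicas=3, current_metric=95, target_metric=80)
# 4 replicas averaging 30% utilization against an 80% target: scale in.
scale_in = desired_replicas(current_replicas=4, current_metric=30, target_metric=80)
print(scale_out, scale_in)   # 4 2
```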
Scale-to-Zero with Knative or KEDA
One of the most cost-effective features in Kubernetes ML serving is scale-to-zero: completely removing all pods when no traffic is present, and scaling back up on the first request.
Knative Pod Autoscaler (KPA) is KServe’s default autoscaler in serverless mode. It scales based on concurrent requests per pod, with built-in support for scaling to zero [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler]:
metadata:
  annotations:
    autoscaling.knative.dev/target: "5"   # 5 concurrent requests per pod
spec:
  predictor:
    minReplicas: 0   # enables scale-to-zero
    maxReplicas: 8
The tradeoff with scale-to-zero for LLMs is cold start time. A 7B parameter model takes 30–90 seconds to load from storage and warm up, making scale-to-zero unsuitable for latency-sensitive endpoints. For LLMs, minReplicas: 1 is usually the right choice; scale-to-zero works well for smaller classification or embedding models.
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with support for custom event sources and LLM-specific metrics. KServe integrates with KEDA via the serving.kserve.io/autoscalerClass: "keda" annotation [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler]:
metadata:
  annotations:
    serving.kserve.io/autoscalerClass: "keda"
    serving.kserve.io/scaleTarget: "10"
    serving.kserve.io/scaleMetric: "rps"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/recommender"
KEDA’s advantage over standard HPA for LLM workloads is its ability to scale on metrics like pending request queue depth or token throughput — signals that predict when the serving tier is about to saturate before latency degrades [Source: https://kserve.github.io/website/docs/model-serving/generative-inference/autoscaling].
The three autoscaler classes compared:
| Autoscaler | Trigger | Best For | Scale-to-Zero |
|---|---|---|---|
| KPA (Knative) | Request concurrency | Serverless, bursty traffic | Yes (via Knative activator) |
| HPA | CPU, memory, custom Prometheus metrics | Stable, predictable traffic | No (min 1 replica) |
| KEDA | Any event source — queues, LLM metrics | LLM serving, event-driven workloads | Yes |
Canary Rollouts and Traffic Splitting
Deploying a new model version to 100% of production traffic immediately is high risk: if the new version regresses on accuracy or introduces higher latency, every user is affected. Canary deployments mitigate this by routing a small percentage of traffic to the new version while the majority continues to use the stable version.
KServe natively supports percentage-based traffic splitting in the InferenceService spec [Source: https://github.com/kserve/kserve]:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-classifier"
spec:
  predictor:
    canaryTrafficPercent: 20   # 20% to new version
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/classifier/v2"   # new (canary)
The stable version is maintained in the previous revision; KServe routes 80% of traffic there and 20% to the canary predictor. No external routing tool is required.
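Percentage-based splitting amounts to a weighted random choice per request. The sketch below simulates that choice to show what a 20% canary looks like over many requests; `pick_revision` is a hypothetical illustration of the routing decision, not KServe or Istio code.

```python
import random


def pick_revision(canary_percent, rng):
    """Route one request: 'canary' with probability canary_percent / 100."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


rng = random.Random(42)                      # seeded so the run is repeatable
routes = [pick_revision(20, rng) for _ in range(10_000)]
canary_share = routes.count("canary") / len(routes)
print(f"{canary_share:.1%} of requests hit the canary")
```

Because the choice is random per request, the same user may see both versions during a canary, which is the main reason cohort-based experiments use header routing instead.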
For automated, metrics-driven promotion, Flagger integrates with KServe to progressively shift traffic based on Prometheus metrics. Flagger starts with a small canary percentage, observes error rate and latency metrics, and automatically promotes or rolls back based on configured thresholds [Source: https://oneuptime.com/blog/post/2026-01-30-mlops-canary-model-deployment/view]:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: flower-classifier
spec:
  targetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: flower-classifier
  progressDeadlineSeconds: 600
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
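This configuration starts the canary at 0% and, every minute, shifts another 10% of traffic to it as long as the request success rate stays at or above 99% and request duration stays at or below 500 ms. After five failed checks (threshold: 5), Flagger rolls back. Once the canary reaches the 50% cap (maxWeight) with all checks passing, Flagger promotes it to 100%.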
Figure 5.5: Flagger Canary Promotion Lifecycle
stateDiagram-v2
[*] --> Initializing: New InferenceService version deployed
Initializing --> Progressing: Canary pods healthy
Progressing --> Progressing: Metrics pass\nshift +10% traffic
Progressing --> Promoting: canaryWeight reaches maxWeight (50%)\nAll analysis intervals pass
Promoting --> Succeeded: Stable promoted to 100%\nCanary destroyed
Progressing --> RollingBack: Error rate > threshold\nor latency > 500ms
RollingBack --> Failed: Traffic reverted to stable 100%
Succeeded --> [*]
Failed --> [*]
A/B Testing for Model Rollouts
While canary deployments gradually migrate all traffic from one version to another, A/B testing routes traffic based on user identity or request metadata — enabling true controlled experiments where specific cohorts see specific model versions simultaneously.
KServe, backed by Istio, supports header-based routing that enables A/B segmentation [Source: https://dasroot.net/posts/2026/03/building-multi-model-inference-platform-kubernetes/]. For example, users in experiment group B can be routed to model v2 while all others continue to use v1, based on an X-Experiment-Group: B header injected by your application layer.
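The header-based split described above can be sketched as an Istio VirtualService. This is an illustration under assumptions rather than KServe's generated configuration: the host, gateway, and the two backing Service names are hypothetical, and it assumes each model version is reachable as its own Service.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: flower-classifier-ab                 # hypothetical name
spec:
  hosts:
    - flower-classifier.example.com          # hypothetical external host
  gateways:
    - inference-gateway                      # hypothetical Istio Gateway
  http:
    - match:
        - headers:
            x-experiment-group:              # Istio matches header keys in lowercase
              exact: "B"
      route:
        - destination:
            host: flower-classifier-v2.models.svc.cluster.local   # hypothetical v2 Service
    - route:                                 # default route for all other requests
        - destination:
            host: flower-classifier-v1.models.svc.cluster.local   # hypothetical v1 Service
```

Rules are evaluated in order, so the header match must come before the catch-all route.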
The key distinction:
| Technique | Traffic Split Basis | Rollback Mechanism | End State |
|---|---|---|---|
| Canary | Random percentage | Automatic (Flagger) or manual | One version wins |
| A/B Test | User cohort / header | Manual analysis | Winner determined by experiment |
| Blue/Green | All-or-nothing switch | Flip traffic weight to 0/100 | Immediate cutover |
Load Balancing Strategies for Inference Endpoints
Multiple replicas of a model server need intelligent load balancing. Naive round-robin works for stateless REST APIs, but LLM inference has special considerations:
- Prefix cache affinity: vLLM maintains a KV-cache. If a user’s system prompt is long and consistent, routing them to a replica that has already cached that prefix dramatically reduces computation. Consistent hashing or session affinity achieves this.
- Request length awareness: Routing long-context requests to pods with more available VRAM — rather than round-robin — improves resource utilization.
- Least-connection routing: For variable-length generation, routing to the pod with the fewest in-flight requests distributes load more evenly than round-robin.
Kubernetes Service with the default iptables/IPVS kube-proxy load balancing handles simple round-robin. For more sophisticated strategies, the Kubernetes Gateway API (via Envoy or Istio) supports custom load balancing policies [Source: https://wetranscloud.com/blog/kubernetes-for-ml-scaling-pipelines-across-clouds/].
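As a sketch of the strategies listed above, the two Istio DestinationRules below apply consistent hashing to one Service and least-request balancing to another. The Service hosts and the `x-session-id` header are hypothetical; the application layer would have to set that header for the hash to be stable per user.

```yaml
# Prefix-cache affinity: requests with the same header value reach the same replica
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: chat-llm-affinity                          # hypothetical name
spec:
  host: chat-llm.serving.svc.cluster.local         # hypothetical Service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-session-id               # hypothetical header set by the client
---
# Variable-length generation: prefer the replica with the fewest in-flight requests
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: batch-llm-least-request                    # hypothetical name
spec:
  host: batch-llm.serving.svc.cluster.local        # hypothetical Service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
```

Consistent hashing trades some load evenness for cache hits, so it suits workloads with long, repeated prompts more than short one-off requests.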
Key Takeaway: Combine KPA or KEDA for responsive autoscaling with KServe’s native traffic splitting for safe model rollouts. Scale-to-zero saves GPU costs for low-traffic models but requires cold start tolerance. For LLMs, KEDA’s ability to scale on queue depth and token throughput metrics makes it the most effective autoscaling option.
Chapter Summary
This chapter covered the complete lifecycle of deploying and managing model serving on Kubernetes. We began by distinguishing online from batch inference and surveying the model server landscape: Triton for heterogeneous multi-model workloads, vLLM for high-throughput LLM text generation, and TGI for the HuggingFace ecosystem.
KServe’s InferenceService CRD emerged as the central Kubernetes abstraction, encapsulating storage initialization, autoscaling, traffic management, and optional transformer and explainer sidecars in a single declarative resource. Multi-model serving via Model Mesh enables cost-effective sharing of GPU infrastructure across hundreds of low-traffic models.
For LLMs specifically, PagedAttention and continuous batching in vLLM unlock dramatically higher throughput on fixed hardware, while quantization (GPTQ, AWQ) enables models too large for a single GPU to fit in memory with manageable accuracy loss. Kubernetes-specific configuration — particularly shared memory sizing — is essential to avoid silent failures.
Autoscaling for inference requires matching the scaler to the workload: KPA concurrency scaling for serverless patterns, HPA for stable predictable traffic, and KEDA for LLM workloads where queue depth and token throughput are the most relevant signals. Canary deployments via KServe’s traffic splitting, automated by Flagger, enable safe model version promotion with automatic rollback. A/B testing via header-based routing allows controlled experiments where different user cohorts experience different model versions simultaneously.
Key Terms
| Term | Definition |
|---|---|
| KServe | A Kubernetes-native model serving platform providing the InferenceService CRD, autoscaling, and traffic management for ML models |
| InferenceService | KServe’s primary Custom Resource Definition for declaring a model serving endpoint, including predictor, transformer, and explainer components |
| Triton Inference Server | NVIDIA’s multi-framework model serving engine supporting TensorFlow, PyTorch, ONNX, TensorRT, and vLLM backends with dynamic batching |
| vLLM | An open-source LLM serving engine featuring PagedAttention and continuous batching for high-throughput text generation |
| PagedAttention | vLLM’s KV-cache management algorithm that stores attention caches in non-contiguous memory pages, reducing memory waste so more requests fit in a batch; the vLLM authors report up to 24x higher throughput than HuggingFace Transformers |
| continuous batching | A scheduling strategy that inserts new requests into the inference pipeline as soon as previous requests complete, maximizing GPU utilization |
| quantization | The process of reducing model weight precision (e.g., FP16 → INT4) to decrease memory footprint and increase throughput with minimal accuracy loss |
| HPA | Horizontal Pod Autoscaler — Kubernetes’ built-in mechanism for scaling workloads based on CPU, memory, or custom Prometheus metrics |
| KEDA | Kubernetes Event-Driven Autoscaling — extends HPA with support for custom event sources including LLM-specific metrics like queue depth and token throughput |
| canary deployment | A deployment strategy that routes a small percentage of traffic to a new model version before full rollout, enabling safe validation |
| model mesh | KServe’s multi-model serving capability that co-locates many models within a shared pool of serving pods to reduce per-model infrastructure overhead |
| KPA | Knative Pod Autoscaler — KServe’s default autoscaler in serverless mode, scaling based on request concurrency with built-in scale-to-zero support |
Chapter 6: Resource Scheduling and Cluster Optimization
Learning Objectives
By the end of this chapter, you will be able to:
- Configure advanced scheduling policies for AI workloads including priority classes, preemption, and affinity rules
- Implement resource quotas and limit ranges for multi-tenant GPU clusters
- Use batch schedulers like Volcano and Kueue for job queuing and fair-sharing across teams
- Optimize cluster bin-packing and resource utilization for mixed training and inference workloads
Introduction
Running AI workloads on Kubernetes is not simply a matter of packaging a training job in a container and letting the scheduler do the rest. GPU clusters are expensive, shared, and contended. A poorly configured cluster wastes money when GPUs sit idle waiting for the right pods, causes deadlock when distributed jobs receive only part of their requested resources, and frustrates users when a single research team monopolizes capacity for days.
Think of a large hospital with a limited number of operating rooms. You need a system that reserves the most critical surgeries, gives fair access to different departments, prevents one surgical team from booking all rooms indefinitely, and keeps rooms occupied as densely as possible throughout the day. Kubernetes scheduling for AI workloads is exactly that kind of resource governance problem — at data-center scale.
This chapter covers the tools and techniques that solve it: Kubernetes-native priority and affinity controls, the Kueue job queuing system, the Volcano batch scheduler, and cluster-level autoscaling and cost-optimization strategies.
Section 1: Kubernetes Scheduling for AI
Priority Classes and Preemption for Training Jobs
Every pod in Kubernetes can be assigned a PriorityClass, a cluster-scoped object that assigns a numerical priority value to a workload. When the scheduler cannot fit a new pod onto any node, it will evict (preempt) lower-priority pods to make room [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].
A typical AI platform uses a tiered priority scheme:
| Priority Class | Value | Typical Use |
|---|---|---|
| system-critical | 1,000,000 | Kubernetes system components |
| production-inference | 10,000 | Real-time model serving |
| training-high | 1,000 | Scheduled training jobs with SLAs |
| training-standard | 500 | Standard research experiments |
| development | 100 | Interactive notebooks, dev jobs |
With this setup, a newly submitted production inference pod can preempt a running development notebook without disrupting any higher-priority work. Critically, research jobs configured at training-standard cannot block production inference — the scheduler will always prefer the higher value [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].
Figure 6.1: Priority class hierarchy and preemption ordering
flowchart TD
A["system-critical\n(1,000,000)\nKubernetes system components"]
B["production-inference\n(10,000)\nReal-time model serving"]
C["training-high\n(1,000)\nScheduled training with SLAs"]
D["training-standard\n(500)\nResearch experiments"]
E["development\n(100)\nNotebooks, dev jobs"]
A -->|"can preempt"| B
B -->|"can preempt"| C
C -->|"can preempt"| D
D -->|"can preempt"| E
style A fill:#b91c1c,color:#fff
style B fill:#c2410c,color:#fff
style C fill:#b45309,color:#fff
style D fill:#4d7c0f,color:#fff
style E fill:#1d4ed8,color:#fff
Worked Example: Defining a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 1000
globalDefault: false
description: "High-priority training jobs with team SLAs"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-finetune-v3
spec:
  template:
    spec:
      priorityClassName: training-high
      restartPolicy: Never # Job pod templates must set Never or OnFailure
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"
The pod inherits the priority class by name. If the cluster is under pressure and only nodes with lower-priority pods are available, the scheduler will preempt those pods to place this job.
Node Affinity, Taints, and Tolerations for GPU Nodes
GPU nodes in a cluster are specialized and expensive. Without controls, ordinary CPU workloads can land on GPU nodes and waste hardware. Two mechanisms prevent this:
Taints and Tolerations act as a lock-and-key system. The cluster administrator applies a taint to every GPU node, which repels all pods that do not carry the matching toleration. GPU workloads declare that they tolerate the taint and are allowed to schedule there.
# Applied to the GPU node (by the node provisioner or admin)
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule
# Added to GPU pods
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
Node Affinity goes further by allowing pods to express a preference or requirement for specific node characteristics, such as GPU model, memory capacity, or network interconnect type [Source: https://sealos.io/blog/the-ultimate-guide-to-gpu-provisioning-and-management-in-kubernetes/].
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - "NVIDIA-A100-SXM4-80GB"
This job will only run on nodes with an A100 80GB GPU — ensuring large language model training is not accidentally placed on a smaller V100 with insufficient memory.
Figure 6.2: Taints, tolerations, and node affinity as a GPU node gate
flowchart LR
subgraph Pods["Pod Submission"]
P1["CPU Pod\n(no toleration,\nno affinity)"]
P2["GPU Pod\n(toleration: nvidia.com/gpu,\naffinity: A100-80GB)"]
end
subgraph Nodes["Cluster Nodes"]
N1["CPU Node\n(no taint)"]
N2["GPU Node — V100\n(taint: nvidia.com/gpu=present:NoSchedule\nlabel: gpu.product=V100)"]
N3["GPU Node — A100\n(taint: nvidia.com/gpu=present:NoSchedule\nlabel: gpu.product=A100-SXM4-80GB)"]
end
P1 -->|"schedules onto"| N1
P1 -->|"blocked by taint"| N2
P1 -->|"blocked by taint"| N3
P2 -->|"toleration passes,\naffinity rejects V100"| N2
P2 -->|"toleration passes,\naffinity matches A100"| N3
style P1 fill:#6b7280,color:#fff
style P2 fill:#1d4ed8,color:#fff
style N1 fill:#374151,color:#fff
style N2 fill:#7c3aed,color:#fff
style N3 fill:#065f46,color:#fff
Topology-Aware Scheduling for Multi-GPU Locality
For distributed training across multiple GPUs, the physical placement of pods relative to each other has a direct impact on performance. Pods that share an NVLink fabric or are on the same physical host communicate orders of magnitude faster than pods on separate nodes connected only by Ethernet.
Two scheduler features control this placement. The topologySpreadConstraints field controls how evenly a job's pods are spread across topology domains (such as a rack, a zone, or a hostname); it limits skew and does not by itself pack pods together. To concentrate pods within one domain, use pod affinity (podAffinity) with a topologyKey that names the domain. Combined with node affinity rules for NVLink-capable nodes, this lets a training job request co-location on the same high-bandwidth domain [Source: https://www.ajeetraina.com/kubernetes-and-gpu-the-complete-guide-to-running-ai-ml-workloads-at-scale/].
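One way to request co-location is a pod affinity rule in which each worker asks to run in the same topology domain as its peers. The fragment below is a sketch: the `training-job` label is hypothetical and must also be set on the worker pods themselves, and the hostname key can be swapped for a rack or zone label when the job spans several nodes.

```yaml
# Pod template fragment for each training worker
metadata:
  labels:
    training-job: gpt-finetune-v3          # hypothetical label shared by all workers
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              training-job: gpt-finetune-v3
          topologyKey: kubernetes.io/hostname   # same node; use a rack or zone label for wider domains
```

A required rule leaves pods Pending when the domain is full, so preferredDuringSchedulingIgnoredDuringExecution may be the safer choice on a busy cluster.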
Key Takeaway: PriorityClasses, taints, tolerations, and affinity rules form the foundation of GPU scheduling hygiene. They prevent resource leakage onto wrong hardware, guarantee preemption ordering across workload tiers, and place distributed training jobs on topologically optimal nodes.
Section 2: Kueue — Kubernetes-Native Job Queuing
The Problem Kueue Solves
The default Kubernetes scheduler assigns pods to nodes as fast as they arrive. For a multi-tenant GPU cluster, this creates problems: a single team can submit thousands of jobs, immediately consuming all GPUs. Other teams wait indefinitely. There is no mechanism for burst borrowing, fairness over time, or holding a job until all its resources are simultaneously available.
Kueue is a Kubernetes-native job queuing system that solves this by intercepting Jobs before their pods are created. It places jobs in a managed queue and only admits the entire workload to the cluster when all requested resources can be guaranteed at once [Source: https://kueue.sigs.k8s.io/docs/overview/]. The default scheduler still handles pod placement — Kueue handles admission control.
ClusterQueue, LocalQueue, and Workload Resources
Kueue’s architecture rests on three objects [Source: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/]:
| Resource | Scope | Purpose |
|---|---|---|
| ClusterQueue | Cluster-wide | Defines a resource pool with quotas, cohort membership, and sharing rules |
| LocalQueue | Namespaced | Points to a ClusterQueue; teams submit jobs to their LocalQueue |
| Workload | Namespaced | Kueue’s internal representation of a submitted job and its resource requests, whether pending or admitted (auto-created) |
The separation between ClusterQueue and LocalQueue is intentional. Cluster administrators control the resource pool through ClusterQueues. Individual teams interact only with their LocalQueue in their own namespace and cannot modify quota limits. This cleanly separates platform governance from user access.
Figure 6.3: Kueue architecture — from job submission to pod scheduling
flowchart TD
subgraph TeamResearch["Namespace: team-research"]
J1["Job\n(spec.suspend: true\nqueue-name: lq-research)"]
LQ1["LocalQueue\nlq-research"]
W1["Workload\n(auto-created by Kueue)"]
end
subgraph TeamProd["Namespace: team-prod"]
J2["Job\n(spec.suspend: true\nqueue-name: lq-prod)"]
LQ2["LocalQueue\nlq-prod"]
W2["Workload\n(auto-created by Kueue)"]
end
subgraph ClusterLevel["Cluster Level (admin-controlled)"]
CQ1["ClusterQueue\ncq-research\nnominalQuota: 8 GPUs\nborrowingLimit: 8"]
CQ2["ClusterQueue\ncq-production\nnominalQuota: 16 GPUs\nborrowingLimit: 0"]
COH["Cohort: org-cohort\n(shared idle quota pool)"]
end
subgraph Scheduling["Scheduling Layer"]
SCHED["Default Kubernetes\nScheduler\n(pod placement)"]
NODES["GPU Nodes"]
end
J1 --> LQ1 --> W1 --> CQ1
J2 --> LQ2 --> W2 --> CQ2
CQ1 <-->|"borrow/lend idle quota"| COH
CQ2 <-->|"member"| COH
CQ1 -->|"admit workload\n(unsuspend job)"| SCHED
CQ2 -->|"admit workload\n(unsuspend job)"| SCHED
SCHED --> NODES
style CQ1 fill:#1d4ed8,color:#fff
style CQ2 fill:#0f766e,color:#fff
style COH fill:#7c3aed,color:#fff
style SCHED fill:#374151,color:#fff
Worked Example: Setting Up a Two-Team Cluster
# ResourceFlavor referenced by both ClusterQueues (must exist before workloads are admitted)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
---
# ClusterQueue for the research team
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  cohort: "org-cohort"
  namespaceSelector: {} # accept workloads from any namespace; if omitted, no namespace is eligible
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "gpu-flavor"
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8 # guaranteed GPUs
              borrowingLimit: 8 # can borrow up to 8 more from cohort
---
# ClusterQueue for the production team
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-production
spec:
  cohort: "org-cohort"
  namespaceSelector: {} # accept workloads from any namespace; if omitted, no namespace is eligible
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "gpu-flavor"
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 0 # production does not need to borrow
---
# LocalQueue in the research namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-research
  namespace: team-research
spec:
  clusterQueue: cq-research
When a researcher submits a Job in the team-research namespace with spec.suspend: true and the label kueue.x-k8s.io/queue-name: lq-research, Kueue takes over. If the job's total GPU request fits within the unused part of the 8-GPU research quota, the job is admitted immediately. If it does not fit and the production queue in the same cohort has unused quota, the job can borrow up to 8 additional GPUs (the borrowingLimit). Otherwise it waits in the queue [Source: https://kubernetes.io/blog/2022/10/04/introducing-kueue/].
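A Job submitted to the research queue might look like the sketch below. The job name is hypothetical and the image is reused from the earlier examples. Two pods with four GPUs each add up to 8 GPUs, which matches the research queue's nominal quota.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: research-train-001                   # hypothetical job name
  namespace: team-research
  labels:
    kueue.x-k8s.io/queue-name: lq-research   # routes the job to the LocalQueue
spec:
  suspend: true                              # stays suspended until Kueue admits the job
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            requests:
              nvidia.com/gpu: "4"            # 2 pods x 4 GPUs = 8 GPUs in total
            limits:
              nvidia.com/gpu: "4"
```

Kueue counts the request of every pod in the job against the quota, so the whole 8-GPU demand is admitted or queued as one unit.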
Fair-Sharing and Borrowing Between Queues
When multiple ClusterQueues belong to the same cohort, their unused nominal quota becomes a shared pool that any member can borrow from. This prevents wasted GPU time when one team’s quota is idle while another team has pending jobs [Source: https://kueue.sigs.k8s.io/docs/concepts/fair_sharing/].
Fair sharing uses a scoring system: the queue consuming the least relative to its nominal entitlement receives the highest admission priority. The queue consuming the most is the first target for preemption when a lower-usage queue needs resources. This creates a self-balancing system that converges toward equitable utilization over time [Source: https://medium.com/google-cloud/mastering-workloads-in-kubernetes-with-kueue-part-3-embracing-fair-sharing-8fef4d7fc13f].
The analogy here is a shared bank account with guaranteed minimum balances per account holder, but where unused balances can be temporarily lent to others — and the system automatically claws back those loans when the original account holder needs their guaranteed amount.
Priority-Based Preemption Within Queues
Kueue supports two queueing strategies for ordering pending workloads: StrictFIFO (strict first-in, first-out) and BestEffortFIFO (first-in, first-out unless a later job fits better). On top of this, Workload Priority Classes provide a numerical weight that supersedes arrival order [Source: https://medium.com/google-cloud/mastering-workloads-in-kubernetes-with-kueue-part-1-queues-and-priorities-339257d8b4ba].
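A WorkloadPriorityClass and the label that attaches it to a job can be sketched as follows. The class name, value, and description are hypothetical.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: research-urgent                      # hypothetical class name
value: 10000                                 # ordered ahead of lower values in the queue
description: "Deadline-bound research runs"
---
# Fragment of a Job's metadata that opts into the class
metadata:
  labels:
    kueue.x-k8s.io/queue-name: lq-research
    kueue.x-k8s.io/priority-class: research-urgent
```

Unlike a pod PriorityClass, this value affects only Kueue's queue ordering and preemption, not the priority of the pods once they are running.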
The reclaimWithinCohort field in a ClusterQueue's preemption settings governs whether a pending workload that fits within its own queue's nominal quota can reclaim that quota by preempting workloads in other cohort queues that are borrowing. It accepts three values [Source: https://kueue.sigs.k8s.io/docs/concepts/preemption/]:
| reclaimWithinCohort | Behavior |
|---|---|
| Never | Do not preempt any workload in the cohort |
| LowerPriority | Preempt borrowing cohort workloads with lower priority than the pending one |
| Any | Preempt any borrowing cohort workload regardless of priority |
The LessThanInitialShare policy adds a fairness guard: preemption is only allowed if the preempting queue’s share (including its new workload) would be strictly less than the target queue’s current share. This prevents aggressive queues from using high-priority preemption to permanently monopolize resources [Source: https://kueue.sigs.k8s.io/docs/concepts/preemption/].
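These settings live in two places, sketched below under the assumption of the v1beta1 APIs: the preemption stanza sits in the ClusterQueue spec, and the fair-sharing strategies sit in the Kueue controller configuration. Both are fragments to merge into existing objects, not complete manifests.

```yaml
# Fragment to merge into a ClusterQueue spec (other fields unchanged)
spec:
  preemption:
    reclaimWithinCohort: Any            # take back nominal quota that other queues are borrowing
    withinClusterQueue: LowerPriority   # inside this queue, preempt only lower-priority workloads
---
# Fragment of the Kueue controller configuration that enables fair sharing
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
```

The strategies are tried in the order listed, so the stricter LessThanInitialShare rule applies only when the first strategy finds no candidate.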
Kueue’s design also prevents preemption loops — scenarios where workload A preempts B, then B preempts A, cycling indefinitely. The implementation detects and breaks such cycles to ensure stable allocation.
Integration with Jobs, RayJobs, and Training Operator
Kueue integrates natively with Kubernetes batch/v1 Job, but also supports custom resources through its framework: RayJob (KubeRay), PyTorchJob and TFJob (Training Operator), and JobSet. In each case, the integration wraps the custom resource in a Kueue Workload and manages its spec.suspend field — suspending the job while queued and releasing it when admitted [Source: https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.html].
# RayJob submitted via Kueue
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llm-training-run
  namespace: team-research
  labels:
    kueue.x-k8s.io/queue-name: lq-research
spec:
  suspend: true # Kueue controls this field
  rayClusterSpec:
    headGroupSpec:
      ...
    workerGroupSpecs:
      - replicas: 4
        template:
          spec:
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: "2"
Section 3: Volcano Batch Scheduler
Gang Scheduling and the Deadlock Problem
Consider a distributed training job that requires 16 GPUs across 4 nodes. The default Kubernetes scheduler will happily place 12 pods on available nodes and then block the remaining 4, waiting for capacity. The 12 placed pods begin executing and hold their GPU allocations while waiting for their peers. The remaining jobs in the cluster similarly hold partial allocations. The result is a deadlock: everyone waits, nobody makes progress, and GPU utilization collapses.
Gang scheduling solves this with an all-or-nothing guarantee. A job’s pod group is admitted only when all requested resources are simultaneously available. If 16 GPUs are not free at once, the entire job waits as a unit — but crucially, it does not hold any resources while waiting, leaving capacity available for jobs that can fit [Source: https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/].
Volcano implements gang scheduling through two objects: Queue (analogous to Kueue’s ClusterQueue) and PodGroup (a group of pods that must be co-scheduled). Volcano runs as a separate scheduler alongside the default Kubernetes scheduler: pods that set schedulerName: volcano are placed by Volcano, while all other pods continue to use the default scheduler [Source: https://volcano.sh/en/].
Figure 6.4: Partial allocation deadlock vs. gang scheduling all-or-nothing admission
flowchart LR
subgraph Deadlock["Without Gang Scheduling: DEADLOCK"]
direction TB
DA["Job A\n(needs 16 GPUs)"]
DB["Job B\n(needs 16 GPUs)"]
DN["Cluster: 20 GPUs free"]
DA -->|"12 pods placed\n(holds 12 GPUs)"| DN
DB -->|"8 pods placed\n(holds 8 GPUs)"| DN
DSTALL["Remaining pods of both jobs\nPENDING forever\nGPU utilization collapses"]
DN --> DSTALL
end
subgraph GangSched["With Gang Scheduling: ALL-OR-NOTHING"]
direction TB
GA["Job A PodGroup\n(minMember: 16)"]
GB["Job B PodGroup\n(minMember: 16)"]
GN["Cluster: 20 GPUs free"]
GW["Job B waits as unit\n(holds 0 GPUs while queued)"]
GA -->|"16 GPUs available\n→ all 16 pods placed"| GN
GB -->|"only 4 GPUs left\n→ wait"| GW
GRUN["Job A runs\nJob B queued cleanly\nno deadlock"]
GN --> GRUN
end
style DSTALL fill:#b91c1c,color:#fff
style GRUN fill:#065f46,color:#fff
style GW fill:#92400e,color:#fff
Queue and PodGroup Gang Scheduling
# A Volcano Queue with capacity and weight
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-queue
spec:
  weight: 1
  capability:
    nvidia.com/gpu: "32"
---
# A PodGroup requiring all 16 pods before any start
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-ddp-run
  namespace: team-research
spec:
  minMember: 16
  queue: research-queue
  priorityClassName: training-high
When gang scheduling is enabled in the Training Operator, a submitted PyTorchJob gets a PodGroup created for it automatically. Volcano monitors the group and holds all pods in Pending state until minMember pods can be scheduled simultaneously. Once the threshold is met, all pods are placed at once [Source: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html].
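For pods created outside an operator, membership in a PodGroup has to be declared by hand. The sketch below assumes the `scheduling.k8s.io/group-name` annotation is what links a pod to its group, as in the Volcano versions we are aware of; verify the key against your installed release. The pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-ddp-run-worker-0                    # hypothetical worker pod name
  namespace: team-research
  annotations:
    scheduling.k8s.io/group-name: pytorch-ddp-run   # assumed annotation key naming the PodGroup
spec:
  schedulerName: volcano                            # hand placement to Volcano, not the default scheduler
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-registry/trainer:latest
      resources:
        limits:
          nvidia.com/gpu: "1"
```

A pod that omits schedulerName is placed by the default scheduler and gets no gang guarantee, even if the PodGroup exists.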
Fair-Share, DRF, and Proportion Scheduling Plugins
Volcano uses a plugin architecture. Multiple scheduling plugins compose to produce a final decision. Key plugins for AI workloads:
| Plugin | Behavior |
|---|---|
| proportion | Allocates cluster capacity in proportion to queue weights |
| gang | Enforces all-or-nothing PodGroup admission |
| binpack | Packs pods densely onto fewer nodes to leave whole nodes free |
| drf | Dominant Resource Fairness — maximizes fairness across multi-resource requests |
| priority | Respects PriorityClass values when ordering jobs within a queue |
| nodeorder | Scores nodes based on resource fit, NUMA locality, and topology |
Dominant Resource Fairness (DRF) is particularly valuable for AI clusters. When different jobs consume different mixes of resources (one job is GPU-heavy, another is memory-heavy), simple quota counting is unfair. DRF identifies each job’s dominant resource (the one it uses proportionally most) and equalizes fairness across that dimension. A pure GPU training job and a high-memory data-preprocessing job are both treated fairly, even though they consume resources in completely different ratios [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].
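A small worked example may help. The cluster size and per-task requests below are hypothetical, chosen only to illustrate how DRF equalizes dominant shares between a GPU-heavy job A (running x tasks) and a memory-heavy job B (running y tasks).

```latex
\begin{aligned}
\text{Cluster capacity:}\quad & 64 \text{ GPUs},\ 4096 \text{ GiB memory} \\
\text{Job A task } (4 \text{ GPUs},\ 32 \text{ GiB}):\quad & \tfrac{4}{64} = 6.25\% \text{ GPU},\ \tfrac{32}{4096} \approx 0.78\% \text{ memory} \ \Rightarrow\ \text{dominant: GPU} \\
\text{Job B task } (1 \text{ GPU},\ 512 \text{ GiB}):\quad & \tfrac{1}{64} \approx 1.56\% \text{ GPU},\ \tfrac{512}{4096} = 12.5\% \text{ memory} \ \Rightarrow\ \text{dominant: memory} \\
\text{Equal dominant shares:}\quad & 0.0625\,x = 0.125\,y \ \Rightarrow\ x = 2y \\
\text{GPU constraint:}\quad & 4x + y = 9y \le 64 \ \Rightarrow\ y \le 7 \text{ (whole tasks)} \\
\text{Memory constraint:}\quad & 32x + 512y = 576y \le 4096 \ \Rightarrow\ y \le 7 \text{ (whole tasks)} \\
\text{Allocation:}\quad & x = 14,\ y = 7,\ \text{dominant share} = 87.5\% \text{ for each job}
\end{aligned}
```

Job A ends up with 56 GPUs and job B with 3584 GiB of memory. Counting GPUs alone would have made job B look almost idle even though it holds most of the cluster's memory.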
Preemption and Reclaim Actions
Volcano distinguishes between two corrective scheduling actions:
- Preemption: A higher-priority job forcibly evicts lower-priority pods in the same queue to make room.
- Reclaim: A queue that is using more than its fair share of cluster resources has pods evicted to return capacity to queues below their entitlement.
These actions fire as part of Volcano’s scheduling cycle, governed by configurable action lists in the scheduler configuration. Reclaim ensures that a burst of activity from one team does not permanently suppress other teams’ throughput [Source: https://volcano.sh/en/].
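The action list and plugin tiers are set in the scheduler's configuration. The sketch below assumes a default Helm-style installation, where the ConfigMap is named volcano-scheduler-configmap in the volcano-system namespace; names and the default plugin set can differ between releases.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap   # assumed name from a default install
  namespace: volcano-system           # assumed namespace from a default install
data:
  volcano-scheduler.conf: |
    # preempt and reclaim run only when listed here
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
      - plugins:
          - name: priority
          - name: gang
          - name: conformance
      - plugins:
          - name: drf
          - name: predicates
          - name: proportion
          - name: nodeorder
          - name: binpack
```

Leaving reclaim out of the action list means queues keep whatever they have borrowed until their pods finish.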
Kueue + Volcano: Better Together
The recommended pattern for large-scale GPU clusters is to combine both systems [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig]:
| Responsibility | Tool |
|---|---|
| Admission control and quota enforcement | Kueue |
| Burst borrowing across tenant boundaries | Kueue (cohorts) |
| Strict gang semantics (all-or-nothing) | Volcano |
| MPI/PyTorch/TensorFlow operator integration | Volcano |
| DRF and multi-resource fairness | Volcano |
Kueue controls whether a job is admitted to the cluster at all. Once admitted, Volcano handles the pod-level scheduling decisions. This layered architecture means teams benefit from Kueue’s quota visibility and borrowing logic while still getting Volcano’s battle-tested gang scheduling guarantees for distributed training.
Key Takeaway: Volcano solves the distributed training deadlock problem through PodGroup gang scheduling and provides a plugin-based architecture for DRF fairness, dense bin-packing, and priority preemption. For production clusters, pairing Kueue (admission control) with Volcano (gang scheduling) provides both quota governance and deadlock-free distributed job execution.
Section 4: Cluster Bin-Packing and Cost Optimization
GPU Utilization Monitoring and Right-Sizing
Before optimizing, you need visibility. Many clusters appear busy — many pods scheduled — while actual GPU compute utilization sits below 30%. The NVIDIA Data Center GPU Manager (DCGM) exposes per-GPU metrics including SM utilization, memory utilization, and NVLink bandwidth. These metrics flow into Prometheus and are visualized in Grafana dashboards.
Common utilization anti-patterns and their fixes:
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Oversized resource requests | GPU allocated, SM utilization < 10% | Right-size with DCGM data; use MIG for small jobs |
| Partial allocation deadlock | Jobs pending despite free GPUs | Enable Volcano gang scheduling |
| Single-tenant monopoly | One namespace consumes all GPUs | Kueue ClusterQueue quotas |
| Node fragmentation | Multiple nodes partially used | Enable Volcano binpack plugin |
Right-sizing is the process of adjusting resources.limits.nvidia.com/gpu in job specs to match actual consumption. For jobs that need only a fraction of a GPU (interactive notebooks, small inference endpoints), GPU time-slicing or NVIDIA MIG (Multi-Instance GPU) allows multiple workloads to share one physical GPU [Source: https://developers.redhat.com/articles/2025/05/22/improve-gpu-utilization-kueue-openshift-ai].
MIG partitions a single A100 or H100 into isolated slices with dedicated memory and compute. An A100 80GB can be split into up to seven 1g.10gb MIG instances, each appearing as an independent GPU to Kubernetes. This lets seven small training jobs or inference workloads run simultaneously on what would otherwise be a single-tenant device [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].
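Requesting a MIG slice looks like requesting any other extended resource. The sketch below assumes the NVIDIA device plugin runs with the mixed MIG strategy, which exposes each profile under its own resource name; the pod and image names are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notebook-small                      # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: notebook
      image: my-registry/notebook:latest    # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1         # one 1g.10gb slice (resource name assumes the mixed strategy)
```

Under the single strategy the slices are advertised as plain nvidia.com/gpu instead, so the resource name depends on how the cluster is configured.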
Spot and Preemptible Instances for Training Workloads
GPU instances are among the most expensive compute resources in any cloud. Spot (AWS) and preemptible (GCP) instances offer the same hardware at discounts of up to 70%, with the tradeoff that the cloud provider can reclaim them with short notice [Source: https://cloudnativenow.com/contributed-content/the-ultimate-guide-to-gpu-scaling-with-karpenter/].
Training workloads — especially long-running experiments — are well-suited for spot if they implement checkpointing. A job that saves model state every 30 minutes loses at most 30 minutes of work when a spot node is reclaimed. Inference workloads serving live traffic are less suitable, though batch inference pipelines with job-level retry logic handle spot interruptions gracefully.
Key spot strategy considerations:
| Factor | Recommendation |
|---|---|
| Instance diversity | Request multiple GPU instance families (G4dn, G5, P4) to widen the spot pool |
| Availability Zone spread | Request across all AZs; each AZ+instance type is a separate capacity pool |
| Checkpointing | Save model state to durable storage (S3/GCS) at regular intervals |
| Interruption handling | Use node termination handlers to drain pods gracefully before reclaim |
| Fallback | Configure on-demand as fallback for critical jobs that cannot checkpoint |
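Several of these recommendations can be combined in one Job spec, sketched below. It assumes nodes carry a karpenter.sh/capacity-type label (true for Karpenter-provisioned nodes, not for other provisioners); the job name, checkpoint path, and the --resume-from flag are hypothetical and stand in for whatever resume mechanism your training code provides.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-train-spot                    # hypothetical job name
spec:
  backoffLimit: 10                           # tolerate several failed pods, including spot reclaims
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 120     # time to flush a final checkpoint on drain
      nodeSelector:
        karpenter.sh/capacity-type: spot     # assumes this label is present on spot nodes
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          args: ["--resume-from", "s3://my-bucket/checkpoints/resnet/"]   # hypothetical flag and path
          resources:
            limits:
              nvidia.com/gpu: "1"
```

For jobs that cannot checkpoint, dropping the nodeSelector lets the provisioner fall back to on-demand capacity.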
Cluster Autoscaler Configuration for GPU Node Pools
The legacy Cluster Autoscaler (CA) watches for unschedulable pods and scales node groups up or down based on pre-configured node group definitions. For GPU clusters, CA requires a separate node group per GPU type (one for A100 nodes, one for T4 nodes, etc.) and scales each group independently.
CA works well for homogeneous GPU pools but struggles with diversity. When a job needs a specific GPU model not available in any current node group, CA cannot provision it on the fly — the node group must already be defined [Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/].
Scale-down is conservative by default: CA will not remove a node until it has been underutilized for a configurable period (default 10 minutes), which matters for GPU node costs.
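# Pod annotation (set on the workload pod, not on the node group)
cluster-autoscaler.kubernetes.io/safe-to-evict: "false" # CA will not evict this pod, so its node is not scaled down
# Cluster Autoscaler command-line flags (container args in the CA Deployment):
--scale-down-utilization-threshold=0.5 # node becomes a scale-down candidate below 50% requested utilization
--scale-down-gpu-utilization-threshold=0.5 # equivalent threshold applied to GPU nodes, based on GPU requests
--scale-down-unneeded-time=10m # how long a node must stay unneeded before removal (default 10m)
--scale-down-delay-after-add=5m # wait 5 min after a scale-up before evaluating scale-down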
Karpenter for Just-in-Time GPU Provisioning
Karpenter is a Kubernetes node provisioner that replaces the legacy Cluster Autoscaler with a fundamentally different model. Rather than scaling fixed node groups, Karpenter watches for pending pods, reads their resource requirements and constraints directly, and provisions exactly the right node type in real time — typically under 60 seconds [Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/].
The core Karpenter object is a NodePool, which defines a flexible set of constraints that Karpenter uses when selecting instance types:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn", "g5", "p4d"]   # multiple families = wider spot pool
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]   # spot is preferred when both are allowed
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]   # all AZs
      nodeClassRef:
        group: karpenter.k8s.aws   # v1 uses group; v1beta1 used apiVersion here
        kind: EC2NodeClass
        name: gpu-nodeclass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name; v1beta1 called this WhenUnderutilized
    consolidateAfter: 30s
Consolidation is Karpenter’s most powerful cost feature. It continuously evaluates whether running pods could be packed onto fewer nodes. If a 4-GPU node is running only one 1-GPU pod, Karpenter will drain that node and move the pod to an existing node with spare capacity, then terminate the now-empty node. This happens automatically without administrator intervention [Source: https://cloudnativenow.com/contributed-content/the-ultimate-guide-to-gpu-scaling-with-karpenter/].
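The feasibility question behind consolidation ("could these pods fit elsewhere?") can be sketched as a small bin-packing check. This is a simplified illustration, not Karpenter's algorithm: it considers GPU requests only, ignores CPU, memory, disruption budgets, and price, and the function name `plan_consolidation` is hypothetical.

```python
from typing import Dict, Optional


def plan_consolidation(
    free_gpus: Dict[str, int],
    candidate_pods: Dict[str, int],
) -> Optional[Dict[str, str]]:
    """Try to place every pod from a candidate node onto other nodes.

    free_gpus: spare GPU count per remaining node.
    candidate_pods: GPU request per pod on the node we would like to remove.
    Returns a pod -> node placement, or None if any pod does not fit.
    """
    remaining = dict(free_gpus)
    placement: Dict[str, str] = {}
    # First-fit decreasing: place the largest requests first.
    for pod, need in sorted(candidate_pods.items(), key=lambda kv: -kv[1]):
        target = next(
            (node for node, free in sorted(remaining.items()) if free >= need),
            None,
        )
        if target is None:
            return None  # the node cannot be emptied, so it stays
        remaining[target] -= need
        placement[pod] = target
    return placement


# The scenario from the text: one 1-GPU pod on an otherwise idle 4-GPU node,
# and another node with one spare GPU.
plan = plan_consolidation({"node-b": 1, "node-c": 0}, {"pod-1": 1})
```

When a placement exists, the candidate node can be drained and terminated; when the function returns `None`, removing the node would leave a pod unschedulable. For training pods, draining still means a restart, so this interacts with the checkpointing discussed earlier.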
Real-world results from organizations adopting Karpenter with Spot instances have included [Source: https://appinventiv.com/blog/kubernetes-cost-optimization-success-story/]:
- 35% reduction in monthly GPU compute spend
- 50% spot instance utilization across GPU clusters
- Scaling time reduced from approximately 5 minutes to under 30 seconds
The contrast with the legacy Cluster Autoscaler is significant:
Figure 6.5: Karpenter just-in-time provisioning and consolidation flow
flowchart TD
subgraph Trigger["Trigger: Unschedulable Pod"]
POD["Pending GPU Pod\n(requests: 4x A10G,\nspot preferred)"]
end
subgraph KarpenterLogic["Karpenter Decision Engine"]
KA["Read pod requirements\n& constraints from spec"]
KB["Select optimal instance type\n(g5.12xlarge, 4x A10G, spot)"]
KC["Provision node\nin target AZ (<60s)"]
end
subgraph Consolidation["Ongoing Consolidation Loop"]
CL["Monitor running nodes\nfor underutilization"]
CM["Can pods fit on fewer nodes?"]
CN["Drain underutilized node\nMigrate pods"]
CO["Terminate empty node\n(cost eliminated)"]
end
subgraph Compare["Legacy Cluster Autoscaler"]
CA1["Watches unschedulable pods"]
CA2["Scale up matching\npre-defined node group\n(3–5 min)"]
CA3["Scale down after\nutilization threshold\n(10 min default delay)"]
end
POD --> KA --> KB --> KC
KC --> CL --> CM
CM -->|"yes"| CN --> CO
CM -->|"no"| CL
CA1 --> CA2 --> CA3
style KC fill:#065f46,color:#fff
style CO fill:#1d4ed8,color:#fff
style CA2 fill:#7c3aed,color:#fff
style CA3 fill:#6b7280,color:#fff
| Capability | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Fixed node groups | Dynamic, constraint-based |
| Provisioning speed | 3–5 minutes | Under 60 seconds |
| Instance diversity | Per-group definition | Flexible NodePool requirements |
| Consolidation | Basic scale-down only | Active bin-packing with pod migration |
| Spot integration | Via node group configuration | First-class, automatic fallback |
| Availability zone awareness | Group-level | Per-pending-pod |
A Complete Cost Optimization Architecture
Combining the techniques from this section produces a layered cost optimization stack:
┌─────────────────────────────────────────────────────────────┐
│ Job Submission Layer │
│ Kueue: quota enforcement, fair-sharing, burst borrowing │
├─────────────────────────────────────────────────────────────┤
│ Gang Scheduling Layer │
│ Volcano: all-or-nothing admission, DRF fairness │
├─────────────────────────────────────────────────────────────┤
│ GPU Sharing Layer │
│ NVIDIA MIG / Time-slicing: sub-GPU allocation for small │
│ jobs; eliminates single-job GPU monopoly │
├─────────────────────────────────────────────────────────────┤
│ Node Provisioning Layer │
│ Karpenter: just-in-time spot provisioning, consolidation │
└─────────────────────────────────────────────────────────────┘
Each layer addresses a distinct waste vector: quota enforcement prevents monopoly, gang scheduling eliminates deadlock idle time, GPU sharing maximizes per-device utilization, and just-in-time provisioning ensures you are never paying for nodes with no work to do.
Key Takeaway: GPU cost optimization is a layered problem. Right-sizing with DCGM metrics, MIG partitioning for small workloads, spot instances for fault-tolerant training jobs, and Karpenter’s just-in-time provisioning with active consolidation together deliver the largest reductions in per-experiment compute cost while maintaining throughput for high-priority workloads.
Chapter Summary
This chapter has examined the full scheduling stack for production AI workloads on Kubernetes, from pod-level priority controls to cluster-level cost automation.
Kubernetes scheduling primitives — PriorityClasses, taints, tolerations, and node affinity — form the foundation. They ensure GPU hardware is reserved for appropriate workloads, define preemption ordering across production and research tiers, and co-locate distributed training pods on topologically optimal nodes.
Kueue introduces admission control at the job level. Rather than allowing all submitted jobs to compete for GPUs immediately, Kueue holds jobs in namespaced LocalQueues backed by ClusterQueue resource pools. Cohorts enable burst borrowing between teams. Fair sharing scoring ensures no single team permanently dominates shared capacity. Three preemption policies give platform administrators fine-grained control over how high-priority jobs displace running lower-priority ones.
Volcano addresses the distributed training deadlock problem with gang scheduling via PodGroups. Its plugin architecture supports DRF fairness, dense bin-packing, and deep integration with ML framework operators. For large-scale clusters, the recommended pattern combines Kueue for admission control with Volcano for gang-semantic pod placement.
Cluster optimization begins with visibility through DCGM GPU metrics, proceeds to right-sizing and MIG partitioning for small workloads, and uses spot instances for fault-tolerant training jobs. Karpenter replaces the legacy Cluster Autoscaler with just-in-time, constraint-based node provisioning and active consolidation — automatically eliminating idle GPU nodes and achieving up to 35% cost reduction in documented real-world deployments.
Key Terms
| Term | Definition |
|---|---|
| Kueue | A Kubernetes-native job queuing system that manages workload admission via spec.suspend, enforcing quotas and fair-sharing without replacing the default scheduler |
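| Volcano | A batch scheduling system for Kubernetes that runs as a separate scheduler (pods opt in via schedulerName) and takes over placement from the default scheduler for those pods, adding gang scheduling, DRF fairness, and deep ML framework operator integration |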
| Gang scheduling | An all-or-nothing scheduling strategy where all pods in a group must be simultaneously placeable before any are started, preventing distributed job deadlock |
| PriorityClass | A Kubernetes object assigning a numerical priority value to pods; higher-priority pods preempt lower-priority pods when node resources are exhausted |
| Preemption | The eviction of a running lower-priority pod to free resources for a higher-priority pending pod |
| Fair-share | A scheduling policy that distributes resources proportionally to entitlements over time, giving underserved queues preference and targeting overserved queues for preemption |
| DRF | Dominant Resource Fairness: a multi-resource fairness algorithm that equalizes each job’s share of its dominant resource (the resource of which it consumes the largest fraction of cluster capacity) across competing jobs |
| Bin-packing | A scheduling strategy that concentrates pods onto the fewest possible nodes to leave entire nodes free or to maximize per-node utilization |
| Karpenter | A Kubernetes node provisioner that dynamically provisions the right node type for pending pods in under 60 seconds, with active consolidation to eliminate idle nodes |
| Cluster autoscaler | The legacy Kubernetes node scaling component that adjusts the size of pre-configured node groups based on unschedulable pod counts and utilization thresholds |
| Taint | A node property that repels pods unless those pods carry a matching toleration |
| Toleration | A pod property that allows it to be scheduled onto nodes with a matching taint |
Chapter 7: MLOps and ML Pipelines on Kubernetes
Learning Objectives
By the end of this chapter, you will be able to:
- Build end-to-end ML pipelines using Kubeflow Pipelines v2 and Argo Workflows
- Implement experiment tracking and model registries integrated with Kubernetes
- Automate model retraining and deployment with CI/CD for ML
- Design reproducible ML workflows with containerized pipeline steps
Introduction
Getting a machine learning model from a data scientist’s laptop into production is famously hard. Training a model once is a craft project; training it reliably, repeatedly, and with full auditability is an engineering discipline. MLOps is that discipline — and Kubernetes has become its natural home.
Think of an ML pipeline as a factory assembly line. Raw materials (data) enter one end; finished products (deployed models) exit the other. Each workstation on the line is a container doing a discrete job — preprocessing, training, evaluation, registration, deployment. Kubernetes is the factory floor: it schedules the workstations, manages the conveyor belts between them, and restarts any station that breaks down. The tools in this chapter — Kubeflow Pipelines, Argo Workflows, MLflow, Feast, and ArgoCD — are the machines and quality-control systems that make that factory hum.
Section 1: ML Pipeline Orchestration
An ML pipeline is a directed acyclic graph (DAG) of computational steps. Each node in the graph is a containerized task; edges carry data artifacts or parameter values from one task to the next. Orchestrating that graph reliably on Kubernetes is the job of pipeline engines.
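The ordering guarantee a DAG provides can be shown with Python's standard-library `graphlib`. The step names below are illustrative; a pipeline engine does the equivalent work when it decides which containers may start. Steps returned together by `get_ready()` have no unmet dependencies and could run in parallel.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"preprocess", "train"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)  # mark them finished so dependents become ready
```

Here every wave contains a single step because `evaluate` needs both upstream outputs. A graph with two independent preprocessing branches would yield both in the first wave.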
Kubeflow Pipelines v2 Architecture and SDK
Kubeflow is an end-to-end MLOps platform purpose-built for Kubernetes. At its core sits Kubeflow Pipelines (KFP), which provides a Python SDK and domain-specific language (DSL) for defining pipelines, plus backend and frontend services for running and scheduling them on any Kubernetes cluster [Source: https://www.kubeflow.org/docs/components/pipelines/legacy-v1/introduction/].
KFP v2 introduced several significant improvements over v1:
| Feature | KFP v1 | KFP v2 |
|---|---|---|
| Component decorator | @component (limited) | @dsl.component and @dsl.container_component |
| Intermediate representation | Argo YAML | Backend-agnostic IR |
| Artifact visibility | Hidden implementation detail | First-class DAG nodes |
| Nested pipelines | Not supported | Pipelines as pipeline components |
| Single component execution | Full pipeline required | Run individual components |
The @dsl.component decorator is the primary authoring tool. It converts a plain Python function into a self-contained pipeline step that KFP can containerize and schedule. The @dsl.container_component decorator is used when you need precise control over the container command — for example, when wrapping a non-Python tool or a shell script [Source: https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2].
Worked Example — A Three-Step Training Pipeline:
from kfp import dsl
from kfp.dsl import Dataset, Model, Output, Input

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def preprocess(raw_data_path: str, dataset: Output[Dataset]):
    import pandas as pd
    df = pd.read_csv(raw_data_path).dropna()
    df.to_csv(dataset.path, index=False)

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def train(dataset: Input[Dataset], model: Output[Model], n_estimators: int = 100):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    import joblib
    df = pd.read_csv(dataset.path)
    X, y = df.drop("label", axis=1), df["label"]
    clf = RandomForestClassifier(n_estimators=n_estimators)
    clf.fit(X, y)
    joblib.dump(clf, model.path)

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def evaluate(dataset: Input[Dataset], model: Input[Model]) -> float:
    import pandas as pd
    from sklearn.metrics import accuracy_score
    import joblib
    df = pd.read_csv(dataset.path)
    X, y = df.drop("label", axis=1), df["label"]
    clf = joblib.load(model.path)
    # Scores on the training data for brevity; use a held-out split in practice
    return accuracy_score(y, clf.predict(X))

@dsl.pipeline(name="random-forest-pipeline")
def rf_pipeline(data_path: str, n_estimators: int = 100):
    prep = preprocess(raw_data_path=data_path)
    fit = train(dataset=prep.outputs["dataset"], n_estimators=n_estimators)
    evaluate(dataset=prep.outputs["dataset"], model=fit.outputs["model"])
Notice how the Dataset and Model types are declared as Output and Input parameters. KFP v2 surfaces these as first-class artifact nodes in the pipeline graph visualization, so engineers can see exactly which model artifact came from which training run [Source: https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2].
Figure 7.1: KFP v2 Three-Step Pipeline DAG with Artifact Flow
flowchart LR
A([raw_data_path\nstring param]) --> B
subgraph Pipeline["random-forest-pipeline DAG"]
B["preprocess()\n@dsl.component"]
C["train()\n@dsl.component"]
D["evaluate()\n@dsl.component"]
end
B -- "Dataset artifact" --> C
B -- "Dataset artifact" --> D
C -- "Model artifact" --> D
B --> E[(Artifact Store\nMinIO / GCS)]
C --> E
D --> F([float\naccuracy score])
style B fill:#4a90d9,color:#fff,stroke:#2c6fad
style C fill:#4a90d9,color:#fff,stroke:#2c6fad
style D fill:#4a90d9,color:#fff,stroke:#2c6fad
style E fill:#f5a623,color:#fff,stroke:#c47f00
style Pipeline fill:#f0f4ff,stroke:#9ab0d9
Pipeline Caching is a standout productivity feature. When a component’s inputs — both parameter values and artifact content hashes — match a previous execution, KFP reuses the cached output rather than re-running the step. This is invaluable during iterative development: change the n_estimators parameter, and only the train and evaluate steps re-execute; the preprocess step hits the cache and returns instantly.
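The caching idea can be sketched as a fingerprint over everything that determines a step's output. This is a conceptual illustration, not KFP's actual key derivation: the function names `cache_key` and `run_step`, the in-memory `cache` dict, and the choice of fields are all assumptions.

```python
import hashlib
import json


def cache_key(component: str, image: str, params: dict, artifact_digests: dict) -> str:
    """Derive a deterministic fingerprint for one step execution."""
    payload = json.dumps(
        {
            "component": component,
            "image": image,
            "params": params,
            "artifacts": artifact_digests,
        },
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # fingerprint -> stored output (a real backend would persist this)


def run_step(component, image, params, artifact_digests, fn):
    """Return (output, cache_hit); execute fn only on a cache miss."""
    key = cache_key(component, image, params, artifact_digests)
    if key in cache:
        return cache[key], True
    output = fn()
    cache[key] = output
    return output, False
```

With this scheme, changing `n_estimators` alters the fingerprint of the training step only. The preprocessing step's inputs are unchanged, so its fingerprint matches a previous run and its output is reused.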
Argo Workflows for ML DAGs
Kubeflow Pipelines compiles Python pipelines into an intermediate representation that is executed by Argo Workflows as the default backend [Source: https://valohai.com/blog/kubeflow-vs-argo/]. Argo can also be used directly, without KFP, when you need fine-grained control over Kubernetes-native features.
Argo’s value proposition for ML is straightforward: every step is a container, and Kubernetes is the scheduler. Teams working in multiple languages or wrapping legacy tools often prefer writing Argo YAML directly rather than using the KFP Python SDK. Argo’s native capabilities relevant to ML include DAG-based workflow execution, parameter passing between steps, artifact management with S3/GCS integration, and configurable retry and error-handling policies [Source: https://argo-workflows.readthedocs.io/en/latest/use-cases/machine-learning/].
Worked Example — Argo DAG Workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-training-
spec:
  entrypoint: ml-dag
  templates:
    - name: ml-dag
      dag:
        tasks:
          - name: preprocess
            template: run-step
            arguments:
              parameters: [{name: script, value: "preprocess.py"}]
          - name: train
            template: run-step
            dependencies: [preprocess]
            arguments:
              parameters: [{name: script, value: "train.py"}]
          - name: evaluate
            template: run-step
            dependencies: [train]
            arguments:
              parameters: [{name: script, value: "evaluate.py"}]
    - name: run-step
      inputs:
        parameters:
          - name: script
      container:
        image: my-ml-image:latest
        command: [python, "{{inputs.parameters.script}}"]
        env:
          - name: DATA_BUCKET
            value: s3://my-training-data
Tekton Pipelines for ML CI/CD
Tekton occupies a different niche than KFP or raw Argo. Where KFP focuses on ML-specific orchestration (artifact lineage, experiment tracking integration), Tekton is a general-purpose CI/CD framework on Kubernetes. Its strength in MLOps is in the CI/CD layer: building container images for pipeline steps, running linting and unit tests before training, and triggering downstream deployments after model validation. Tekton’s Pipeline and Task resources integrate naturally with tools like ArgoCD to create a full GitOps deployment chain from code commit to production serving.
Pipeline Parameterization and Caching
Effective pipelines treat all tunable values as explicit parameters — never as hardcoded constants buried in code. This enables:
- Hyperparameter sweeps: run the same pipeline multiple times with different learning rates or batch sizes
- Environment promotion: swap a data_path parameter to move from staging data to production data without changing pipeline logic
- Reproducibility auditing: every run records its exact parameter set, so any result can be recreated on demand
Both KFP and Argo store per-run parameter values in their metadata backends, providing a natural audit trail alongside the artifact lineage graph.
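A hyperparameter sweep over explicit parameters can be sketched as a grid expansion, where each resulting dictionary is the argument set for one pipeline run. The helper name `expand_grid` and the bucket path are hypothetical; submitting the runs to KFP or Argo is left out.

```python
from itertools import product


def expand_grid(base: dict, sweep: dict) -> list:
    """Return one parameter dict per combination of swept values."""
    names = sorted(sweep)  # fixed order keeps the run list reproducible
    runs = []
    for values in product(*(sweep[name] for name in names)):
        params = dict(base)  # copy so runs do not share state
        params.update(zip(names, values))
        runs.append(params)
    return runs


runs = expand_grid(
    base={"data_path": "s3://staging-bucket/train.csv"},
    sweep={"learning_rate": [0.01, 0.1], "batch_size": [32, 64, 128]},
)
```

Two learning rates and three batch sizes give six runs. Promoting the same sweep to production data means changing only the `data_path` entry in `base`.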
Key Takeaway: Kubeflow Pipelines provides a Python-first authoring experience that compiles to backend-agnostic IR executed by Argo Workflows. KFP v2’s first-class artifact support, nested pipelines, and per-step caching dramatically reduce iteration time and improve reproducibility.
Section 2: Experiment Tracking and Model Registry
MLflow on Kubernetes for Experiment Logging
Every training run produces metrics, parameters, and artifacts. Without a system to capture them, teams rapidly lose track of which configuration produced which result — the equivalent of a chemistry lab with no lab notebooks. MLflow is the most widely adopted open-source solution for this problem [Source: https://www.truefoundry.com/blog/mlops-tools].
Deploying MLflow as a production-ready service on Kubernetes uses a standard three-tier architecture:
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│                                                             │
│  ┌──────────────────┐    ┌──────────────────┐               │
│  │ MLflow Tracking  │    │    PostgreSQL    │               │
│  │      Server      │───►│ (backend store)  │               │
│  │   (Deployment)   │    │  (StatefulSet)   │               │
│  └────────┬─────────┘    └──────────────────┘               │
│           │                                                 │
│           │              ┌──────────────────┐               │
│           └─────────────►│    MinIO / S3    │               │
│                          │ (artifact store) │               │
│                          └──────────────────┘               │
└─────────────────────────────────────────────────────────────┘
Figure 7.2: MLflow Production Architecture on Kubernetes
flowchart TB
subgraph Cluster["Kubernetes Cluster (mlops namespace)"]
TServer["MLflow Tracking Server\nDeployment (2 replicas)"]
PG["PostgreSQL\nStatefulSet\n(backend store)"]
S3["MinIO / S3\n(artifact store)"]
Ing["Ingress / Service\n(port 5000)"]
end
Client["Training Pod\nmlflow.log_metric(...)"]
UI["Data Scientist\nBrowser"]
Client -->|"REST API\nlogs params & metrics"| TServer
Client -->|"uploads model\nartifacts directly"| S3
TServer -->|"persists run\nmetadata"| PG
TServer -->|"resolves artifact\nURIs"| S3
UI -->|"HTTP"| Ing
Ing --> TServer
style TServer fill:#4a90d9,color:#fff,stroke:#2c6fad
style PG fill:#7b68ee,color:#fff,stroke:#5a4fcf
style S3 fill:#f5a623,color:#fff,stroke:#c47f00
style Ing fill:#50c878,color:#fff,stroke:#2e8b57
style Cluster fill:#f0f4ff,stroke:#9ab0d9
The deployment involves four Kubernetes objects [Source: https://medium.com/@nsalexamy/deploying-mlflow-on-kubernetes-a-production-ready-guide-171d8aa2ed30]:
| Object | Purpose |
|---|---|
| Deployment | Runs the MLflow tracking server as a scalable pod |
| Service + Ingress | Exposes the UI and REST API externally |
| StatefulSet (PostgreSQL) | Persists run metadata and model registry entries |
| Secret / ConfigMap | Stores database credentials and S3 endpoint configuration |
Training code running anywhere on the cluster (or off-cluster) logs to the tracking server via the MLflow client SDK:
import mlflow

mlflow.set_tracking_uri("http://mlflow-service.mlops.svc.cluster.local:5000")
mlflow.set_experiment("fraud-detection-v2")

# clf is assumed to be an already-fitted scikit-learn estimator
with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("accuracy", 0.942)
    mlflow.log_metric("f1_score", 0.917)
    mlflow.sklearn.log_model(clf, "model")
Model Versioning and Artifact Management
Logging a model to MLflow is only half the story. The MLflow Model Registry provides the governance layer: a versioned catalog of every model artifact with structured promotion stages and team-level controls [Source: https://oneuptime.com/blog/post/2026-02-09-mlflow-model-registry-kubernetes/view].
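The registry uses semantic aliases to communicate a model’s production status. Aliases are free-form names that a team assigns to specific model versions; the four below are a common convention rather than built-in values. Rather than hard-coding version numbers in serving infrastructure, downstream systems reference aliases: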
| Alias | Meaning |
|---|---|
| champion | The model currently serving production traffic |
| challenger | A candidate model undergoing A/B testing |
| shadow | A model receiving traffic copies for offline evaluation |
| archived | A retired model retained for audit purposes |
Promotion between stages is a deliberate, audited action — not an automatic side-effect of a training run. This separation of concerns is critical: a model registry tracks versions, metadata, and deployment status, whereas raw object storage holds files but provides no context, making rollbacks and dependency tracking prohibitively difficult [Source: https://oneuptime.com/blog/post/2026-02-09-mlflow-model-registry-kubernetes/view].
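Why aliases decouple serving from version numbers can be shown with a toy in-memory registry. This is not the MLflow API: the class `ToyRegistry`, its methods, and the artifact URIs are hypothetical stand-ins. In MLflow itself, to my understanding, the corresponding operations are `MlflowClient.set_registered_model_alias` and loading through a `models:/<name>@<alias>` URI; check the documentation for the version you run.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry with aliases."""

    def __init__(self):
        self.versions = {}   # version number -> artifact URI
        self.aliases = {}    # alias -> version number
        self.audit_log = []  # (alias, previous version, new version)

    def register(self, artifact_uri: str) -> int:
        """Store a new immutable version and return its number."""
        version = len(self.versions) + 1
        self.versions[version] = artifact_uri
        return version

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at a version, recording the change for audit."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.audit_log.append((alias, self.aliases.get(alias), version))
        self.aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """What a serving system calls: alias in, artifact URI out."""
        return self.versions[self.aliases[alias]]


registry = ToyRegistry()
v1 = registry.register("s3://models/fraud/v1")
v2 = registry.register("s3://models/fraud/v2")
registry.set_alias("champion", v1)
registry.set_alias("challenger", v2)
registry.set_alias("champion", v2)  # promotion: serving config is untouched
```

The serving side only ever asks for "champion". Promotion and rollback are both a single alias reassignment, and the audit log records who pointed at what, which is the context raw object storage lacks.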
Figure 7.3: Model Registry Promotion Lifecycle
stateDiagram-v2
[*] --> Registered : mlflow.register_model()
Registered --> Challenger : Manual promotion\n(team review)
Challenger --> Shadow : A/B test approved\n(offline eval passed)
Shadow --> Champion : Shadow traffic\nvalidation passed
Champion --> Archived : New champion\npromoted
Challenger --> Archived : Validation\nfailed / rejected
Shadow --> Challenger : Shadow metrics\ninsufficient
note right of Champion
Serving production traffic
Referenced by alias "champion"
end note
note right of Challenger
Undergoing A/B test
Referenced by alias "challenger"
end note
note right of Shadow
Receiving traffic copies
offline evaluation only
end note
Weights & Biases and Neptune Integration Patterns
For teams needing richer visualization, real-time collaboration, or experiment comparison across large sweeps, managed services like Weights & Biases (W&B) and Neptune extend the MLflow pattern. Both integrate with Kubernetes workloads via environment variables and lightweight SDK calls:
import wandb
wandb.init(project="fraud-detection", config={"n_estimators": 200})
wandb.log({"accuracy": 0.942, "f1": 0.917})
The key architectural decision is whether experiment metadata leaves the cluster (managed SaaS) or stays within it (self-hosted MLflow). Self-hosted keeps data governance simple; managed SaaS reduces operational burden. Most teams choose self-hosted MLflow for regulated data and SaaS tooling for exploratory research.
Key Takeaway: MLflow on Kubernetes provides experiment tracking, model versioning, and artifact storage in a single platform. Using semantic aliases like “champion” and “challenger” decouples deployment automation from specific version numbers and enables safe canary promotions.
Section 3: Feature Stores on Kubernetes
The Training-Serving Skew Problem
Imagine a data scientist building a fraud detection model. During training, they compute a feature — “transactions in the last 24 hours” — using a batch SQL query over historical data. In production, the same feature must be computed in real time from a live event stream. If the training computation and the serving computation diverge even slightly (different time windows, different null handling), the model’s performance degrades in ways that are extremely difficult to diagnose. This is called training-serving skew, and it is one of the most common sources of silent model degradation in production.
A feature store solves this by providing a single, versioned definition of every feature that is shared between training and serving pipelines.
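A minimal illustration of how skew arises: two reasonable-looking implementations of "transactions in the last 24 hours" that disagree on the same data. The function names and timestamps are invented for the example; the training version uses the previous calendar day (a common shortcut in batch SQL), while the serving version uses a rolling window.

```python
from datetime import datetime, timedelta


def batch_txn_count_24h(timestamps, as_of):
    """Training-side logic: counts the whole previous calendar day."""
    day_start = (as_of - timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    day_end = day_start + timedelta(days=1)
    return sum(day_start <= t < day_end for t in timestamps)


def streaming_txn_count_24h(timestamps, as_of):
    """Serving-side logic: counts a rolling 24-hour window ending at as_of."""
    window_start = as_of - timedelta(hours=24)
    return sum(window_start < t <= as_of for t in timestamps)


as_of = datetime(2024, 5, 2, 15, 0)
txns = [
    datetime(2024, 5, 1, 9, 0),    # yesterday morning
    datetime(2024, 5, 1, 20, 0),   # yesterday evening
    datetime(2024, 5, 2, 10, 0),   # this morning
    datetime(2024, 5, 2, 14, 30),  # 30 minutes ago
]

train_value = batch_txn_count_24h(txns, as_of)
serve_value = streaming_txn_count_24h(txns, as_of)
```

Both functions are defensible readings of the feature name, yet the model is trained on one distribution and scored on another. Neither raises an error, which is why this kind of degradation is silent.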
Feast Deployment on Kubernetes
Feast is the most widely adopted open-source feature store, and it integrates natively with the Kubeflow ecosystem [Source: https://www.kubeflow.org/docs/external-add-ons/feast/introduction/]. Feast’s architecture separates two concerns with distinct storage backends [Source: https://docs.feast.dev]:
| Component | Storage Backend | Use Case |
|---|---|---|
| Offline store | PostgreSQL, BigQuery, Snowflake | Historical feature retrieval for model training |
| Online store | Redis | Low-latency feature lookup during model serving |
| Registry | PostgreSQL | Feature definitions, versioning, and metadata |
A production Kubernetes deployment of Feast uses the following objects [Source: https://oneuptime.com/blog/post/2026-02-09-feast-feature-store-kubernetes/view]:
# feast-server Deployment: 3 replicas for high availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feast-server
  namespace: mlops
spec:
  replicas: 3
  selector:
    matchLabels:
      app: feast-server
  template:
    metadata:
      labels:
        app: feast-server
    spec:
      containers:
        - name: feast-server
          image: my-feast-image:latest
          ports:
            - containerPort: 6566
          env:
            - name: FEAST_REGISTRY_PATH
              value: postgresql://feast-db:5432/feast
---
# PostgreSQL for registry: StatefulSet with PVC
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: feast-db
  namespace: mlops
spec:
  serviceName: feast-db
  replicas: 1
  selector:
    matchLabels:
      app: feast-db
  template:
    metadata:
      labels:
        app: feast-db   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          # POSTGRES_PASSWORD must also be supplied, e.g. from a Secret (omitted here)
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
Figure 7.4: Feast Dual-Path Feature Architecture
flowchart TB
Def["Feature Definitions\n(Python FeatureView\nin Git)"]
subgraph Offline["Offline Path (Training)"]
OStore["Offline Store\nPostgreSQL / BigQuery"]
TrainJob["Training Pipeline\nget_historical_features()"]
end
subgraph Online["Online Path (Serving)"]
Redis["Online Store\nRedis"]
InfSvc["Inference Service\nget_online_features()"]
end
Mat["Materialization Job\n(Kubernetes CronJob)"]
Reg["Feast Registry\nPostgreSQL\n(feature metadata)"]
Def --> Reg
Reg --> OStore
Reg --> Redis
OStore --> TrainJob
OStore --> Mat
Mat -->|"pushes computed\nfeatures"| Redis
Redis --> InfSvc
TrainJob -->|"trains model"| Model["Trained Model"]
InfSvc -->|"scores request"| Pred["Prediction"]
style Offline fill:#e8f4e8,stroke:#5a9a5a
style Online fill:#e8f0ff,stroke:#5a6aad
style Mat fill:#fff3cd,stroke:#c47f00
style Reg fill:#f5f5f5,stroke:#888
Online and Offline Feature Serving
The workflow is as follows:
- Feature definitions are written as Python FeatureView objects and committed to Git
- Materialization runs on a schedule (or event trigger) to push computed features from the offline store into Redis for online serving
- Training pipelines retrieve historical point-in-time correct features from the offline store via get_historical_features
- Serving infrastructure retrieves the same features at inference time via get_online_features, using identical definitions
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes feature_store.yaml is in the current directory

# Training: retrieve historical features
# entity_df is a DataFrame of entity keys plus an event_timestamp column
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:transactions_24h", "user_stats:avg_amount_7d"],
).to_df()

# Serving: retrieve online features (identical feature names)
online_features = store.get_online_features(
    features=["user_stats:transactions_24h", "user_stats:avg_amount_7d"],
    entity_rows=[{"user_id": "u-12345"}],
).to_dict()
The user_stats:transactions_24h feature definition is the same object in both calls — that is the guarantee Feast provides.
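"Point-in-time correct" retrieval means each training row sees only the feature value that was known at that row's event time, never a later one. The sketch below shows the idea with plain Python; it is not Feast's implementation (which also applies a TTL and works on DataFrames), and the function name, integer timestamps, and feature history are invented for illustration.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_time):
    """Return the latest feature value recorded at or before event_time.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None if no value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_time)  # count of entries <= event_time
    return feature_history[idx - 1][1] if idx else None


# Hypothetical history of transactions_24h for one user.
history = [(100, 3), (200, 5), (300, 9)]

training_rows = [
    {"user_id": "u-12345", "event_time": 150},
    {"user_id": "u-12345", "event_time": 300},
    {"user_id": "u-12345", "event_time": 50},
]
for row in training_rows:
    row["transactions_24h"] = point_in_time_lookup(history, row["event_time"])
```

A naive join on `user_id` alone would attach the newest value to every row, leaking future information into training. The timestamp-aware lookup prevents that, and a row with no prior value gets `None` rather than a borrowed one.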
Feature Engineering Pipelines
Feast’s default in-process materialization engine does not scale well for large data volumes. Kubernetes provides a natural scaling mechanism: run materialization as Kubernetes Jobs, processing multiple time-range partitions in parallel [Source: https://docs.feast.dev/how-to-guides/running-feast-in-production]. The Bytewax materialization engine provides a production-grade implementation of this pattern, distributing feature computation across pods while writing results to the online store.
Key Takeaway: Feast eliminates training-serving skew by providing a single versioned feature definition shared between batch training and real-time serving. On Kubernetes, a 3-replica Feast server backed by Redis and PostgreSQL gives a production-ready, horizontally scalable feature serving layer.
Section 4: CI/CD for Machine Learning
Traditional software CI/CD pipelines build code, run tests, and deploy binaries. ML CI/CD must do all of that, and also: retrain models on new data, validate statistical model quality before promotion, manage large binary artifacts (model weights), and handle the fact that “tests passing” does not guarantee a model will perform well on the next day’s data distribution.
GitOps Model Deployment with ArgoCD or Flux
GitOps is a deployment philosophy where Git is the single source of truth for both application configuration and infrastructure state. A GitOps controller (ArgoCD or Flux) watches a Git repository and continuously reconciles the cluster state to match what is declared there [Source: https://dev.to/apprecode/mlops-best-practices-10-practical-practices-teams-actually-use-h77].
For ML deployments, the pattern works as follows:
Developer pushes model config change to Git
           │
           ▼
┌─────────────────────┐
│   Git Repository    │
│ (infra/ directory)  │
│ - deployment.yaml   │
│ - serving.yaml      │
│ - model-config.yaml │
└──────────┬──────────┘
           │ ArgoCD watches for drift
           ▼
┌─────────────────────┐
│       ArgoCD        │
│   (reconciliation   │
│     controller)     │
└──────────┬──────────┘
           │ applies changes
           ▼
┌─────────────────────┐
│ Kubernetes Cluster  │
│ - KServe Inference  │
│   Service           │
│ - model version     │
│   updated           │
└─────────────────────┘
Figure 7.5: GitOps ML Deployment Flow with ArgoCD
flowchart TD
Dev["ML Engineer\npushes model config"] --> Git
subgraph GitRepo["Git Repository (infra/)"]
Git["deployment.yaml\nserving.yaml\nmodel-config.yaml"]
end
Git -->|"ArgoCD detects\nconfiguration drift"| Argo
subgraph ArgoCD["ArgoCD Controller"]
Argo["Reconciliation Loop\n(watches for drift)"]
Gate["Validation Gate\naccuracy ≥ threshold?"]
end
Argo --> Gate
Gate -->|"PASSED"| Apply
Gate -->|"FAILED\nexit non-zero"| Block["Promotion Blocked\nAlert sent to team"]
subgraph K8s["Kubernetes Cluster"]
Apply["Apply Manifests"]
KServe["KServe InferenceService\n(new model version live)"]
end
Apply --> KServe
style GitRepo fill:#f9f3e8,stroke:#c8a84b
style ArgoCD fill:#e8f0ff,stroke:#5a6aad
style K8s fill:#e8f4e8,stroke:#5a9a5a
style Block fill:#ffe8e8,stroke:#c94040,color:#8b0000
style Gate fill:#fff3cd,stroke:#c47f00
A typical GitOps pipeline for ML runs four stages: lint, validate, plan, then apply [Source: https://www.thirstysprout.com/post/mlops-best-practices]. Kustomize can layer environment-specific overlays (staging vs production model versions, replica counts, resource limits) on top of a shared base configuration, all managed declaratively in version control [Source: https://dev.to/apprecode/mlops-best-practices-10-practical-practices-teams-actually-use-h77].
Infrastructure as Code (IaC) is the enabling foundation. Because training, serving, and fine-tuning jobs have wildly different resource profiles — a training job might need 8 GPUs for 4 hours; an inference deployment needs 2 CPUs and 4 GB RAM indefinitely — IaC captures these differences explicitly and prevents the “environment drift” where staging and production slowly diverge [Source: https://kubeify.com/blog/the-role-of-kubernetes-in-mlops].
Automated Model Validation Gates
A validation gate is a pipeline step that blocks promotion unless a model meets defined quality thresholds. Gates typically check:
| Gate Type | Example Criterion | Action on Failure |
|---|---|---|
| Accuracy threshold | Accuracy ≥ 0.90 on held-out test set | Block promotion, alert team |
| Regression check | F1 score ≥ 95% of current champion’s F1 | Block promotion |
| Data quality check | Input feature distributions within 2σ of training distribution | Block promotion |
| Latency check | p99 inference latency ≤ 100ms under load | Block promotion |
| Bias/fairness audit | Equal opportunity difference ≤ 0.05 | Block promotion |
In an Argo Workflows or Tekton pipeline, these gates are ordinary container steps that exit non-zero on failure, causing the workflow to halt and the downstream deployment step never to execute.
# Validation gate component (KFP)
from kfp import dsl
@dsl.component(packages_to_install=["mlflow"])
def validate_model(model_uri: str, accuracy_threshold: float = 0.90) -> str:
    import mlflow
    client = mlflow.MlflowClient()
    # Assumes a URI of the form "runs:/<run_id>/<artifact_path>"; element 1 is the run ID
    run = client.get_run(model_uri.split("/")[1])
    accuracy = float(run.data.metrics["accuracy"])
    if accuracy < accuracy_threshold:
        # Raising makes the step fail, which halts the pipeline before deployment
        raise ValueError(
            f"Model accuracy {accuracy:.3f} below threshold {accuracy_threshold}"
        )
    return "PASSED"
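Several of the gates in the table above can be combined into one step. The sketch below uses only the standard library and the thresholds from the table. The metric key names (`f1`, `p99_latency_ms`, `equal_opportunity_diff`) and the sample values are assumptions for illustration, and the data quality gate is omitted because it needs access to feature distributions.

```python
def failed_gates(candidate: dict[str, float], champion_f1: float) -> list[str]:
    """Return the names of the gates the candidate model fails (empty list means pass)."""
    checks = {
        "accuracy": candidate["accuracy"] >= 0.90,
        "regression": candidate["f1"] >= 0.95 * champion_f1,
        "latency": candidate["p99_latency_ms"] <= 100.0,
        "fairness": abs(candidate["equal_opportunity_diff"]) <= 0.05,
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical metrics for a candidate model
candidate = {
    "accuracy": 0.93,
    "f1": 0.84,
    "p99_latency_ms": 80.0,
    "equal_opportunity_diff": 0.03,
}
failures = failed_gates(candidate, champion_f1=0.90)
exit_code = 1 if failures else 0  # a real gate step would pass this to sys.exit()
print(failures, exit_code)
```

Here the candidate clears the absolute accuracy threshold but fails the regression check, because its F1 is below 95% of the champion's. Reporting every failed gate at once, rather than stopping at the first, tends to shorten the feedback loop for the team that owns the model.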
Container Image Management for ML Workloads
ML container images have several characteristics that complicate standard image build practices:
- Size: a PyTorch training image with CUDA drivers can exceed 10 GB
- Dependency complexity: CUDA version, cuDNN version, PyTorch version, and Python version must all be mutually compatible
- Reproducibility requirement: the exact same image must be usable months later to reproduce a training result
Best practices for managing ML images on Kubernetes:
| Practice | Implementation |
|---|---|
| Multi-stage builds | Separate build-time dependencies from runtime image |
| Pinned base images | FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime (never latest) |
| Image layer caching | Structure Dockerfile to copy requirements before source code |
| Image registry | Harbor or ECR with vulnerability scanning enabled |
| Image signing | Cosign signatures verified at admission time |
| Shared base images | One organization-wide CUDA base; per-project layers on top |
Reproducibility with Pinned Dependencies and Deterministic Builds
The worst-case scenario in ML production is a training result you cannot reproduce — a model in production that you cannot recreate if you need to retrain from scratch. Reproducibility requires discipline at four layers:
Layer 4: Random seeds — set torch.manual_seed(), numpy.random.seed()
Layer 3: Data versioning — DVC or Delta Lake snapshot for training datasets
Layer 2: Code versioning — Git SHA pinned in pipeline metadata
Layer 1: Environment — Docker image digest (not tag) pinned in pipeline
Git tags and Docker image tags are mutable — someone can push a new image to the same tag. Image digests (sha256:...) are content-addressed and immutable. Pinning pipeline steps to image digests rather than tags is the single most impactful reproducibility practice available [Source: https://learn.microsoft.com/en-us/azure/aks/best-practices-ml-ops].
A complete reproducibility workflow in KFP pins the image digest at compile time:
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output
@dsl.component(
    base_image="my-registry.io/ml-base@sha256:a3f9b2c1..."  # digest, not tag
)
def train(dataset: Input[Dataset], model: Output[Model]):
    ...
The compiled pipeline IR records the exact image digest, so re-running a pipeline from its compiled IR three months later uses precisely the same environment — regardless of what has been pushed to the registry since.
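A pipeline repository can guard against accidental tag references with a small lint step. The following sketch checks that an image reference ends in a full `sha256` digest. The `is_digest_pinned` helper and the all-`a` digest are hypothetical, and the check looks only at the shape of the reference, not at whether the digest exists in a registry.

```python
import re

# A pinned reference ends with "@sha256:" followed by exactly 64 lowercase hex characters
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a content digest."""
    return bool(_DIGEST_RE.search(image_ref))


fake_digest = "a" * 64  # placeholder digest for illustration only
for ref in (
    f"my-registry.io/ml-base@sha256:{fake_digest}",
    "my-registry.io/ml-base:latest",
    "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
):
    print(ref, is_digest_pinned(ref))
```

A version tag such as `2.2.0-cuda12.1-cudnn8-runtime` is better than `latest`, but it still fails this check because a tag can be re-pushed.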
Key Takeaway: GitOps with ArgoCD makes ML deployments declarative, auditable, and automatically reconciled. Combining validation gates, pinned image digests, data versioning, and Git-tracked configuration is the foundation of genuinely reproducible ML production systems.
Chapter Summary
This chapter traced the full MLOps lifecycle on Kubernetes, from raw data through deployed model. The key architectural pattern is a layered system of concerns:
- Orchestration (KFP / Argo Workflows) — schedules and coordinates containerized pipeline steps as DAGs, with artifact lineage and caching
- Experiment tracking (MLflow) — captures every run’s parameters, metrics, and artifacts; the model registry governs promotion via semantic aliases
- Feature management (Feast) — eliminates training-serving skew with a single versioned feature definition shared between batch and real-time paths
- Deployment automation (ArgoCD / GitOps) — makes cluster state a function of Git state, with validation gates that block unsafe promotions
The thread connecting all four layers is reproducibility. Pinned image digests, versioned data, explicit parameters, and structured governance (RBAC, audit trails, policy-as-code) together ensure that every model in production can be explained, audited, and recreated.
Key Terms
| Term | Definition |
|---|---|
| Kubeflow Pipelines | An MLOps platform component providing a Python SDK and execution backend for defining and running ML pipelines as DAGs on Kubernetes |
| Argo Workflows | A Kubernetes-native workflow engine that executes DAGs of containerized steps; the default execution backend for Kubeflow Pipelines |
| MLflow | An open-source platform for experiment tracking, model registry, and artifact management, deployable as a Kubernetes service |
| Model registry | A versioned catalog of trained model artifacts with promotion stages, metadata, audit trails, and access controls |
| Feature store | A data system that provides a single, versioned definition of ML features shared between training (offline) and serving (online) pipelines |
| Feast | An open-source feature store that provides offline storage for training and online storage (Redis) for low-latency serving, with native Kubeflow integration |
| GitOps | A deployment philosophy where Git is the single source of truth for cluster configuration, with an automated controller reconciling live state to match the repository |
| ArgoCD | A Kubernetes-native GitOps continuous delivery controller that watches Git repositories and reconciles cluster state to the declared configuration |
| CI/CD | Continuous Integration / Continuous Delivery; the practice of automating build, test, and deployment stages in a software delivery pipeline |
| Experiment tracking | The systematic recording of training run parameters, metrics, and output artifacts to enable comparison, reproducibility, and governance |
| Reproducibility | The ability to recreate an identical ML training result at a later time, requiring pinned environments, versioned data, explicit parameters, and fixed random seeds |
| Tekton | A Kubernetes-native CI/CD framework using Task and Pipeline CRDs; used in ML contexts primarily for image building and deployment triggering |
Chapter 8: Networking and Security for AI Workloads
Learning Objectives
By the end of this chapter, you will be able to:
- Configure high-performance networking for distributed training using RDMA, GPUDirect RDMA, SR-IOV, and Multus CNI
- Implement Kubernetes NetworkPolicy and service mesh solutions to isolate AI workload traffic
- Secure model artifacts, training data, and inference endpoints with encryption, secrets management, and mTLS
- Apply RBAC and Pod Security Standards to harden AI namespaces against privilege escalation and lateral movement
Introduction
Running AI workloads at scale on Kubernetes is not just a scheduling problem — it is simultaneously a networking problem and a security problem. A large language model training run distributes gradient updates across dozens or hundreds of GPU nodes every few seconds. If those collective communication operations crawl over a conventional Ethernet stack, the GPU utilization collapses and the training job takes weeks instead of days. At the same time, AI clusters aggregate enormous amounts of sensitive intellectual property: proprietary model weights, licensed training datasets, inference API keys, and the PII-laden outputs of fine-tuning pipelines. A misconfigured RBAC binding or an overly permissive network policy can expose all of it.
This chapter addresses both sides of that equation. The first half covers the specialist networking technologies — RDMA, GPUDirect, SR-IOV, and Multus — that allow GPU collective communication to operate at wire speed. The second half covers the security controls — NetworkPolicy, service mesh, RBAC, Pod Security Standards, and secrets management — that protect the cluster and its data.
Section 1: High-Performance Networking for Distributed Training
Ordinary TCP/IP networking was designed for flexibility and correctness, not for the microsecond-latency, near-zero-CPU-overhead requirements of distributed deep learning. When a training job executes an AllReduce collective across 256 GPUs, each participating process must exchange gradient tensors with its peers hundreds of times per training step. The bottleneck is not the GPU computation — it is the time the gradients spend waiting in CPU buffers and kernel queues.
Think of the difference between shipping packages through a postal sorting center versus loading them directly onto a dedicated freight train. TCP/IP is the sorting center: flexible, universal, but slow and full of hand-offs. High-performance AI networking is the freight train: faster, more direct, with far fewer touch points.
RDMA and InfiniBand with Kubernetes
Remote Direct Memory Access (RDMA) allows a network adapter to read from and write to the memory of a remote host without involving the remote CPU. This eliminates the copy-to-kernel-buffer and context-switch overhead that dominates conventional networking at high message rates.
InfiniBand is the dominant physical transport for RDMA in HPC and AI data centers. It provides link-level reliability, congestion management, and switch fabrics rated at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port, with end-to-end latency measured in single-digit microseconds.
To expose InfiniBand or RoCE (RDMA over Converged Ethernet) devices to Kubernetes pods, you use the NVIDIA Network Operator. The Network Operator manages the full driver and plugin lifecycle [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:
- MOFED / DOCA-OFED drivers: the Mellanox OpenFabrics Enterprise Distribution kernel modules that expose /dev/infiniband/* devices
- RDMA shared device plugin: advertises rdma/ib resources to the Kubernetes scheduler
- nvidia-peermem module: enables peer-to-peer RDMA between the NIC and the GPU’s BAR memory
Install the GPU Operator alongside the Network Operator. The two operators cooperate to wire GPUDirect RDMA end-to-end. Three installation strategies exist depending on how MOFED drivers are managed [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:
# Strategy 1: DMA-BUF — Network Operator manages MOFED drivers (recommended)
helm install gpu-operator nvidia/gpu-operator \
--version 26.3.0 \
--namespace gpu-operator \
--set driver.rdma.useHostMofed=false
# Strategy 2: Host MOFED — drivers pre-installed on the node
helm install gpu-operator nvidia/gpu-operator \
--set driver.rdma.useHostMofed=true
# Strategy 3: Legacy nvidia-peermem (for older MOFED versions)
helm install gpu-operator nvidia/gpu-operator \
--set driver.rdma.enabled=true
Pods request InfiniBand resources through the standard Kubernetes resource request mechanism. Setting rdma/ib: 1 acts as a scheduling boolean that ensures the pod lands on a node with InfiniBand hardware [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:
resources:
  requests:
    nvidia.com/gpu: 8
    rdma/ib: 1
  limits:
    nvidia.com/gpu: 8
    rdma/ib: 1
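The example above sets requests and limits to the same values. Kubernetes does not allow extended resources such as GPUs to be overcommitted: a limit must be present, and a request, if given, must equal it. The following sketch checks a `resources` block for that rule. The `extended_resource_errors` helper is hypothetical, and treating any name that contains `/` as an extended resource is a simplifying assumption.

```python
def extended_resource_errors(resources: dict[str, dict[str, int]]) -> list[str]:
    """Check that every extended resource has a limit and that request == limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    errors = []
    for name in sorted(set(requests) | set(limits)):
        if "/" not in name:
            continue  # skip native resources such as cpu and memory
        if name not in limits:
            errors.append(f"{name}: limit must be set")
        elif name in requests and requests[name] != limits[name]:
            errors.append(f"{name}: request {requests[name]} != limit {limits[name]}")
    return errors


valid = {
    "requests": {"nvidia.com/gpu": 8, "rdma/ib": 1},
    "limits": {"nvidia.com/gpu": 8, "rdma/ib": 1},
}
invalid = {"requests": {"nvidia.com/gpu": 8, "rdma/ib": 1}, "limits": {"nvidia.com/gpu": 4}}
print(extended_resource_errors(valid))
print(extended_resource_errors(invalid))
```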
Table 8.1 — RDMA Transport Options
| Transport | Physical Medium | Typical Bandwidth | Latency | Notes |
|---|---|---|---|---|
| InfiniBand HDR | InfiniBand cable | 200 Gb/s per port | ~1 µs | Industry standard for HPC/AI |
| InfiniBand NDR | InfiniBand cable | 400 Gb/s per port | <1 µs | Latest generation |
| RoCEv2 | 25/100 GbE | 100 Gb/s per port | 2–5 µs | Runs on existing Ethernet switches |
| iWARP | Standard Ethernet | 25–100 Gb/s | 5–10 µs | Software-friendly, higher latency |
GPUDirect RDMA for GPU-to-GPU Communication
GPUDirect RDMA extends RDMA into the GPU memory address space. Without GPUDirect, gradient data on GPU A must travel: GPU memory → CPU pinned buffer → NIC → network → NIC → CPU pinned buffer → GPU memory. With GPUDirect, the NIC can DMA directly into and out of the GPU’s BAR memory region, eliminating both CPU copies [Source: https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/].
The practical impact is significant: collective communication libraries like NCCL (NVIDIA Collective Communications Library) can achieve 90–95% of theoretical InfiniBand bandwidth when GPUDirect RDMA is enabled, versus 40–60% through a CPU-mediated path.
Without GPUDirect RDMA:
GPU VRAM → [PCIe] → CPU DRAM → [PCIe] → NIC → Network → NIC → [PCIe] → CPU DRAM → [PCIe] → GPU VRAM
With GPUDirect RDMA:
GPU BAR Memory → [PCIe P2P] → NIC → Network → NIC → [PCIe P2P] → GPU BAR Memory
Figure 8.1: GPUDirect RDMA eliminates CPU-mediated copies from the gradient communication path
flowchart LR
subgraph "Without GPUDirect RDMA"
direction LR
A1["GPU VRAM\n(Node A)"] -->|PCIe| B1["CPU DRAM\n(Node A)"]
B1 -->|PCIe| C1["NIC\n(Node A)"]
C1 -->|Network| C2["NIC\n(Node B)"]
C2 -->|PCIe| B2["CPU DRAM\n(Node B)"]
B2 -->|PCIe| A2["GPU VRAM\n(Node B)"]
end
subgraph "With GPUDirect RDMA"
direction LR
D1["GPU BAR Memory\n(Node A)"] -->|PCIe P2P| E1["NIC\n(Node A)"]
E1 -->|InfiniBand| E2["NIC\n(Node B)"]
E2 -->|PCIe P2P| D2["GPU BAR Memory\n(Node B)"]
end
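To get a feel for what the efficiency gap means for training time, the sketch below estimates the duration of one ring AllReduce. It assumes the bandwidth-optimal ring algorithm, in which each GPU transfers 2(N-1)/N times the payload, and it ignores latency, intra-node links, and overlap with computation. The 4 GB payload, the 8-GPU ring, and the two efficiency values (points inside the ranges quoted above) are illustrative assumptions, and NCCL may pick a different algorithm in practice.

```python
def ring_allreduce_seconds(
    payload_bytes: float, num_gpus: int, link_gbps: float, efficiency: float
) -> float:
    """Estimate the bandwidth-bound time of one ring AllReduce, ignoring latency."""
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    effective_bytes_per_s = link_gbps * 1e9 / 8 * efficiency  # Gb/s to bytes/s
    return bytes_on_wire / effective_bytes_per_s


payload = 4e9  # e.g. one billion fp32 gradients (assumption)
fast = ring_allreduce_seconds(payload, num_gpus=8, link_gbps=200, efficiency=0.90)
slow = ring_allreduce_seconds(payload, num_gpus=8, link_gbps=200, efficiency=0.50)
print(f"GPUDirect path:    {fast:.3f} s per AllReduce")
print(f"CPU-mediated path: {slow:.3f} s per AllReduce")
print(f"slowdown factor:   {slow / fast:.2f}x")
```

Because the time is inversely proportional to effective bandwidth, the slowdown is simply the ratio of the two efficiencies, and it is paid on every gradient synchronization.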
The prerequisite software stack on each node is: NVIDIA GPU driver, MOFED/DOCA-OFED, and the nvidia-peermem kernel module (which creates the P2P DMA mapping between the GPU and the NIC). Between them, the GPU Operator and the Network Operator manage all three automatically [Source: https://developers.redhat.com/articles/2025/04/29/accelerate-model-training-openshift-ai-nvidia-gpudirect-rdma].
To validate GPUDirect RDMA performance after deployment, run the perftest benchmark from within a pod that holds both GPU and rdma/ib resources [Source: https://kubernetes.recipes/recipes/networking/validate-gpudirect-rdma-performance/]:
# On receiver pod
ib_write_bw --use_cuda=0 -d mlx5_0
# On sender pod (replace <receiver-ip> with the receiver pod's InfiniBand IP)
ib_write_bw --use_cuda=0 -d mlx5_0 <receiver-ip>
Bandwidth results above 180 Gb/s on HDR InfiniBand indicate GPUDirect RDMA is functioning correctly.
SR-IOV for High-Bandwidth Network Interfaces
SR-IOV (Single Root I/O Virtualization) is a PCIe specification that allows a single physical NIC to be partitioned into multiple lightweight Virtual Functions (VFs). Each VF is a near-full hardware NIC with its own DMA engine, queues, and interrupt lines. When a VF is assigned exclusively to a pod, that pod communicates with the NIC with near-native performance and hardware-enforced isolation — no other pod can interfere with its DMA traffic.
The analogy here is a highway with dedicated express lanes. The main NIC is the highway, and each SR-IOV VF is a dedicated lane that a specific vehicle (pod) has reserved. Traffic in one lane does not block or spy on traffic in another.
Deploying SR-IOV in Kubernetes requires three components working together [Source: https://docs.rke2.io/networking/multus_sriov]:
- SR-IOV CNI plugin — moves a VF into the pod network namespace
- SR-IOV device plugin — advertises VF resources to the scheduler
- SR-IOV Network Operator — automates VF creation and node configuration
Node labeling triggers automatic VF discovery [Source: https://docs.rke2.io/networking/multus_sriov]:
kubectl label node gpu-node-01 \
feature.node.kubernetes.io/network-sriov.capable=true
After labeling, the sriov-network-config-daemon DaemonSet deploys a discovery pod to each labeled node and populates a SriovNetworkNodeState custom resource with the available interfaces and their capabilities.
# Example SriovNetworkNodeState snippet (auto-populated by operator)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: gpu-node-01
spec:
  interfaces:
    - name: ens3f0
      numVfs: 8
      pciAddress: "0000:18:00.0"
      deviceType: netdevice
Multus CNI for Secondary Network Attachments
Standard Kubernetes pods have exactly one network interface: the one created by the cluster’s primary CNI (Calico, Cilium, Flannel). For AI training pods, this is a problem: you want management traffic (Kubernetes API calls, health probes) to travel over the primary network while gradient communication travels over a dedicated high-speed InfiniBand or SR-IOV network with different routing, QoS, and security policies.
Multus CNI solves this by acting as a meta-CNI plugin multiplexer [Source: https://kubeops.net/blog/exploring-multus-an-advanced-networking-solution-for-kubernetes]. It intercepts the pod network setup call, creates the primary interface using the cluster’s default CNI, and then creates additional interfaces using any other CNI plugins you specify.
Pod network interfaces after Multus:
eth0 → primary CNI (Calico) → Kubernetes service network
net1 → SR-IOV VF (sriov-cni) → InfiniBand / RoCE fabric
net2 → macvlan → storage network
Figure 8.2: Multus CNI attaches multiple network interfaces to a single GPU training pod
flowchart TB
subgraph "GPU Training Pod"
P["pytorch-trainer\ncontainer"]
ETH0["eth0"]
NET1["net1"]
NET2["net2"]
P --- ETH0
P --- NET1
P --- NET2
end
subgraph "Multus CNI (meta-plugin)"
M["Multus\nCNI"]
CNI1["Primary CNI\n(Calico)"]
CNI2["SR-IOV CNI\n(sriov-cni)"]
CNI3["macvlan CNI"]
M --> CNI1
M --> CNI2
M --> CNI3
end
ETH0 -->|"created by"| CNI1
NET1 -->|"created by"| CNI2
NET2 -->|"created by"| CNI3
CNI1 --> SVC["Kubernetes\nService Network"]
CNI2 --> IB["InfiniBand /\nRoCE Fabric\n(NCCL AllReduce)"]
CNI3 --> STORE["Storage\nNetwork"]
The Network Operator uses Multus internally to attach secondary RDMA interfaces to GPU training pods. It does this through two CRDs: NicClusterPolicy (which configures the desired Network Operator state) and HostDeviceNetwork (which creates the Multus NetworkAttachmentDefinition) [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html].
Pods reference the secondary network attachment via an annotation:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-network
spec:
  containers:
    - name: pytorch-trainer
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/ib: 1
        limits:  # extended resources need a limit equal to the request
          nvidia.com/gpu: 8
          rdma/ib: 1
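The Multus annotation value accepts either a comma-separated list of NetworkAttachmentDefinition names or a JSON list that can also name the interface inside the pod. The sketch below builds both forms. The `storage-net` attachment name and the helper functions are hypothetical; `ib-network` is the name used above.

```python
import json

ANNOTATION_KEY = "k8s.v1.cni.cncf.io/networks"


def short_form(names: list[str]) -> dict[str, str]:
    """Comma-separated form: interfaces are named net1, net2, ... in order."""
    return {ANNOTATION_KEY: ",".join(names)}


def json_form(attachments: list[dict[str, str]]) -> dict[str, str]:
    """JSON form: each entry may set an explicit interface name."""
    return {ANNOTATION_KEY: json.dumps(attachments)}


print(short_form(["ib-network", "storage-net"]))
print(
    json_form(
        [
            {"name": "ib-network", "interface": "net1"},
            {"name": "storage-net", "interface": "net2"},
        ]
    )
)
```

The JSON form is the safer choice when NCCL or storage clients are configured with a fixed interface name, since the name no longer depends on the order of the list.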
Table 8.2 — CNI Plugin Performance Comparison
| CNI Plugin | Performance | Hardware Requirement | Use Case |
|---|---|---|---|
| Calico / Cilium (primary) | Standard | None | Management, services |
| macvlan | Near-native | Standard NIC | Storage, data ingestion |
| IPvlan | Near-native | Standard NIC | Storage (no promiscuous mode) |
| SR-IOV + sriov-cni | Native (hardware-isolated) | SR-IOV capable NIC | AI collective comms |
| InfiniBand (via rdma-cni) | >200 Gb/s, <2 µs | ConnectX NIC | Distributed training NCCL |
Key Takeaway: High-performance AI networking requires a layered approach. Multus CNI gives training pods a secondary high-speed network interface alongside the standard cluster network. SR-IOV provides hardware-isolated near-native performance via Virtual Functions. GPUDirect RDMA eliminates CPU copies by routing gradient data directly between GPU memory and the NIC. Together, these three technologies allow NCCL collective operations to run at close to theoretical InfiniBand wire speed.
Section 2: Network Policies and Isolation
High-speed networking for training is a capability concern. Network isolation is a security concern. In a shared AI cluster, different teams run workloads with different data sensitivity levels: a research team might be fine-tuning on public datasets while a compliance team fine-tunes on HIPAA-regulated patient records in the same cluster. Without explicit network isolation, those two workloads can communicate freely at the pod level.
Kubernetes NetworkPolicy for AI Namespace Isolation
Kubernetes NetworkPolicy resources are enforced by the CNI plugin (Calico, Cilium, or Antrea are common enforcers) and operate at Layer 3/4. The most important first step is a default-deny policy applied to every AI namespace [Source: https://www.tigera.io/blog/securing-ai-workloads-in-kubernetes-why-traditional-network-security-isnt-enough/]:
# default-deny-all.yaml — apply to every AI namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-training
spec:
  podSelector: {}  # matches all pods in namespace
  policyTypes:
    - Ingress
    - Egress
After applying the default deny, add explicit allow rules only for what each workload legitimately needs [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:
# allow-training-comms.yaml — training pods may reach data storage and model registry
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-training-comms
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      role: trainer
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: data-storage
      ports:
        - port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-registry
      ports:
        - port: 5000
    - ports:  # allow DNS
        - port: 53
          protocol: UDP
Training pods should have no internet egress — there is no legitimate reason for a gradient synchronization process to reach the public internet, and allowing it creates an exfiltration path [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1].
Figure 8.3: Default-deny NetworkPolicy with explicit allow rules limits AI namespace egress to approved endpoints only
flowchart LR
subgraph NS["Namespace: ai-training"]
TP["Training Pod\n(role: trainer)"]
DP["Default-Deny\nNetworkPolicy\n(all pods)"]
AP["Allow-Training\nNetworkPolicy\n(role: trainer)"]
end
subgraph DS["Namespace: data-storage"]
OB["Object Storage\n:443"]
end
subgraph MR["Namespace: model-registry"]
REG["Model Registry\n:5000"]
end
INET["Public Internet"]
DNS["CoreDNS :53"]
DP -. "blocks all\ningress & egress\nby default" .-> TP
TP -->|"allowed by\nallow-training-comms"| OB
TP -->|"allowed by\nallow-training-comms"| REG
TP -->|"allowed by\nallow-training-comms"| DNS
TP -. "BLOCKED\n(no rule)" .-> INET
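The way the two policies combine can be modeled in a few lines. NetworkPolicies are additive: once any policy selects a pod for egress, the pod is isolated and a connection is allowed only if some rule in some selecting policy matches it. The sketch below is a deliberately simplified model of that logic, assuming destinations are identified by namespace name only. The class and function names are hypothetical, and real enforcement is done by the CNI plugin.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EgressRule:
    namespaces: Optional[frozenset]  # None means "any destination"
    ports: frozenset                 # set of (port, protocol) pairs


@dataclass
class Policy:
    pod_selector: dict               # {} selects every pod in the namespace
    egress: list = field(default_factory=list)


def selects(policy: Policy, pod_labels: dict) -> bool:
    return all(pod_labels.get(k) == v for k, v in policy.pod_selector.items())


def egress_allowed(policies, pod_labels, dest_namespace, port, protocol="TCP") -> bool:
    selecting = [p for p in policies if selects(p, pod_labels)]
    if not selecting:
        return True  # no policy selects the pod, so it is not isolated
    for policy in selecting:
        for rule in policy.egress:
            ns_ok = rule.namespaces is None or dest_namespace in rule.namespaces
            if ns_ok and (port, protocol) in rule.ports:
                return True
    return False


policies = [
    Policy(pod_selector={}),  # default-deny-all: selects everything, allows nothing
    Policy(
        pod_selector={"role": "trainer"},
        egress=[
            EgressRule(frozenset({"data-storage"}), frozenset({(443, "TCP")})),
            EgressRule(frozenset({"model-registry"}), frozenset({(5000, "TCP")})),
            EgressRule(None, frozenset({(53, "UDP")})),  # DNS to any destination
        ],
    ),
]
trainer = {"role": "trainer"}
print(egress_allowed(policies, trainer, "data-storage", 443))
print(egress_allowed(policies, trainer, None, 443))  # None stands for the public internet
print(egress_allowed(policies, {"role": "notebook"}, "data-storage", 443))
```

The model also makes one detail of the policy visible: DNS is allowed over UDP only, so a lookup that falls back to TCP on port 53 would be blocked unless a TCP entry is added.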
For advanced enforcement beyond what standard NetworkPolicy supports, Calico’s egress gateway feature routes all outbound traffic from an AI namespace through dedicated gateway pods. Those gateways become the single chokepoint where traffic logging, deep inspection, and outbound filtering happen [Source: https://www.tigera.io/blog/securing-ai-workloads-in-kubernetes-why-traditional-network-security-isnt-enough/]. Cilium provides equivalent functionality through its eBPF-based CiliumNetworkPolicy CRD, which supports Layer 7 HTTP/gRPC filtering directly in the kernel data plane.
Service Mesh (Istio, Linkerd) for Inference Traffic Management
For inference workloads — where model servers respond to external or internal API calls — a Kubernetes NetworkPolicy alone is often insufficient. It cannot route based on HTTP headers, enforce per-route traffic weights, or produce per-service latency histograms. A service mesh adds those capabilities.
A service mesh injects a sidecar proxy (Envoy in Istio’s case, a Linkerd-specific proxy in Linkerd’s case) into every pod. The sidecar transparently intercepts all inbound and outbound TCP connections and provides:
- Traffic management — canary deployments, A/B testing, circuit breaking for model versions
- Observability — per-route latency p50/p95/p99, error rates, request tracing
- Security — mutual TLS between all pods, with identity bound to Kubernetes service accounts
For AI inference specifically, the traffic management features are high-value. Imagine rolling out a new version of a Llama-3 fine-tune: with an Istio VirtualService, you can shift 5% of inference traffic to the new model while monitoring error rates before committing to a full cutover — all without redeploying pods.
# Istio VirtualService: canary inference rollout
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
  namespace: ai-inference
spec:
  hosts:
    - model-server
  http:
    - route:
        - destination:
            host: model-server
            subset: v2-candidate
          weight: 5
        - destination:
            host: model-server
            subset: v1-stable
          weight: 95
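A canary rollout usually walks through several weight settings rather than one. The sketch below generates the `route` list for each step so that the two weights always sum to 100. The step schedule and the `canary_routes` helper are hypothetical; the host and subset names are those used in the manifest above.

```python
def canary_routes(candidate_weight: int) -> list[dict]:
    """Build the two weighted destinations for a given candidate traffic share."""
    if not 0 <= candidate_weight <= 100:
        raise ValueError("candidate_weight must be between 0 and 100")
    return [
        {
            "destination": {"host": "model-server", "subset": "v2-candidate"},
            "weight": candidate_weight,
        },
        {
            "destination": {"host": "model-server", "subset": "v1-stable"},
            "weight": 100 - candidate_weight,
        },
    ]


schedule = [5, 25, 50, 100]  # hypothetical promotion steps
for step in schedule:
    routes = canary_routes(step)
    print(step, [r["weight"] for r in routes])
```

In a GitOps setup each step would be a commit that changes the weights, so the rollout history and any rollback are visible in version control.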
mTLS for Secure Inter-Pod Communication
Mutual TLS (mTLS) means both ends of a connection present and verify certificates. Without mTLS, a compromised pod inside the cluster can impersonate any service it can reach at the network level. With mTLS, each pod’s identity is cryptographically bound to its Kubernetes service account, and connections between pods that cannot present a valid certificate are rejected.
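In Istio, a single PeerAuthentication policy enforces mTLS for every workload in a namespace; creating the same policy in the Istio root namespace (istio-system by default) extends it to the whole mesh [Source: https://www.appsecengineer.com/blog/securing-ai-ml-workloads-in-kubernetes-practical-strategies-for-2025]: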
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: enforce-mtls
  namespace: ai-inference
spec:
  mtls:
    mode: STRICT
In STRICT mode, Istio rejects any unencrypted or unauthenticated connection to pods in the namespace. This means that even if an attacker compromises a node and tries to make direct TCP connections to model servers, the connections are rejected at the sidecar proxy layer before the model server process ever sees them.
Figure 8.4: Istio service mesh enforces mTLS between pods — each sidecar proxy validates peer identity before traffic reaches the application
sequenceDiagram
participant C as Client Pod<br/>(Envoy Sidecar)
participant S as Model Server Pod<br/>(Envoy Sidecar)
participant MS as Model Server<br/>Application
Note over C,S: mTLS Handshake (STRICT mode)
C->>S: ClientHello + Client Certificate<br/>(bound to Kubernetes ServiceAccount)
S->>C: ServerHello + Server Certificate<br/>(bound to Kubernetes ServiceAccount)
C->>S: Verify server cert → OK
S->>C: Verify client cert → OK
Note over C,S: Encrypted mTLS tunnel established
C->>S: Inference request (encrypted)
S->>MS: Forwarded request (plaintext, localhost)
MS->>S: Inference response
S->>C: Response (encrypted)
Note over C,S: Attacker with no valid cert is<br/>rejected at sidecar before<br/>reaching MS
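At the TLS level, the difference between ordinary TLS and mutual TLS is whether the server also demands a certificate from the client. The sketch below shows that setting with Python's standard `ssl` module. It is only an illustration of the concept: in an Istio mesh the Envoy sidecar performs this handshake on the application's behalf, and certificate loading is left as a comment because the file paths would be deployment-specific.

```python
import ssl


def one_way_tls_context() -> ssl.SSLContext:
    """Server context for ordinary TLS: the client is not asked for a certificate."""
    return ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)


def mutual_tls_context() -> ssl.SSLContext:
    """Server context for mTLS: the handshake fails without a valid client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED
    # A real server would also call ctx.load_cert_chain(...) for its own identity
    # and ctx.load_verify_locations(...) for the CA that signs client certificates.
    return ctx


print(one_way_tls_context().verify_mode)
print(mutual_tls_context().verify_mode)
```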
Key Takeaway: Network isolation for AI workloads is a defense-in-depth problem. Start with default-deny NetworkPolicies in every namespace and add only the egress rules each workload needs. Introduce a service mesh for inference workloads to gain mTLS identity verification, traffic routing, and observability. Never allow training pods to initiate connections to the public internet.
Section 3: Security Best Practices
RBAC for Multi-Team AI Clusters
RBAC misconfigurations account for over 35% of Kubernetes breaches, with attackers exploiting role bindings rather than breaking into systems [Source: https://dasroot.net/posts/2026/01/kubernetes-security-rbac-policies-scanning/]. The single most common mistake in ML platforms is granting cluster-admin to training job service accounts because it eliminates permission errors during development.
RBAC in Kubernetes is composed of four object types [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:
| Object | Scope | Purpose |
|---|---|---|
| Role | Namespace | Defines permissions within one namespace |
| ClusterRole | Cluster-wide | Defines permissions across all namespaces |
| RoleBinding | Namespace | Binds a Role or ClusterRole to subjects within a namespace |
| ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects at cluster scope |
A well-structured multi-team AI cluster uses namespace-scoped roles that match what each job type actually needs. A training job needs to read ConfigMaps (for hyperparameters), read Secrets (for data access credentials), and list and watch Pods (to track its workers). It does not need to list nodes, modify RBAC policies, or access other namespaces.
Figure 8.5: RBAC object relationships — a RoleBinding binds a Role to a ServiceAccount, granting only the permissions that role specifies
flowchart LR
SA["ServiceAccount\ntraining-sa\n(namespace: ai-training)"]
RB["RoleBinding\ntraining-job-binding\n(namespace: ai-training)"]
R["Role\ntraining-job-role\n(namespace: ai-training)"]
subgraph PERMS["Permissions Granted"]
P1["configmaps → get, list"]
P2["secrets → get"]
P3["pods → get, list, watch"]
end
subgraph DENIED["Denied (no rule = no access)"]
D1["nodes → any"]
D2["roles → any"]
D3["other namespaces → any"]
end
SA -->|"subject of"| RB
RB -->|"references"| R
R --> P1
R --> P2
R --> P3
R -. "does NOT grant" .-> D1
R -. "does NOT grant" .-> D2
R -. "does NOT grant" .-> D3
# Least-privilege role for a PyTorch training job
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-role
  namespace: ai-training
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-binding
  namespace: ai-training
subjects:
  - kind: ServiceAccount
    name: training-sa
    namespace: ai-training
roleRef:
  kind: Role
  name: training-job-role
  apiGroup: rbac.authorization.k8s.io
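RBAC is purely additive: a request is allowed only if some rule matches its API group, resource, and verb, and there are no deny rules. The sketch below evaluates the rules of the Role above in that way. It is a simplified model (it ignores resourceNames, subresources, and non-resource URLs), and the `is_allowed` helper is hypothetical.

```python
RULES = [
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
]


def _matches(allowed: list[str], value: str) -> bool:
    return "*" in allowed or value in allowed


def is_allowed(rules: list[dict], verb: str, resource: str, api_group: str = "") -> bool:
    """Return True if any rule grants the verb on the resource; the default is deny."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


for verb, resource in [("get", "secrets"), ("list", "secrets"), ("create", "pods"), ("get", "nodes")]:
    print(verb, resource, is_allowed(RULES, verb, resource))
```

Against a live cluster, the same question can be asked with `kubectl auth can-i get secrets --as=system:serviceaccount:ai-training:training-sa -n ai-training`.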
Table 8.3 — RBAC Roles for Common AI Cluster Personas
| Persona | Scope | Allowed Verbs | Denied |
|---|---|---|---|
| Data scientist | Own namespace | create/delete Jobs, read Logs | create Roles, access other namespaces |
| ML engineer | Own namespace + model-registry | create Deployments, push to registry | delete Namespaces, modify ClusterRoles |
| Training job SA | Own namespace | get ConfigMap/Secret, list Pods | everything else |
| Inference job SA | Own namespace | get Secret, create/delete Pods | create Roles, access PVCs in other NS |
| Platform admin | Cluster-wide | all | — (this role should use MFA + audit logging) |
Pod Security Standards and Admission Controllers
Pod Security Policies were deprecated in Kubernetes 1.21 and removed in 1.25. The replacement is Pod Security Admission (PSA), which enforces one of three built-in Pod Security Standards profiles at the namespace level [Source: https://dasroot.net/posts/2026/01/kubernetes-security-rbac-pod-security-standards-policy-engines/]:
| Profile | Description | When to Use |
|---|---|---|
| privileged | No restrictions | Only for system namespaces (e.g., kube-system) |
| baseline | Prevents known privilege escalations | Default for AI training namespaces |
| restricted | Enforces hardened configuration | Inference namespaces, multi-tenant environments |
Apply PSA enforcement via namespace labels:
# Enforce Baseline on AI training namespace
kubectl label namespace ai-training \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
# Enforce Restricted on AI inference namespace
kubectl label namespace ai-inference \
pod-security.kubernetes.io/enforce=restricted
The warn and audit modes are valuable during migration: they surface violations in API server responses and audit logs without blocking pod creation, so you can see what needs to change before switching enforce to restricted.
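Before switching a namespace to restricted, it helps to know what the profile asks of each container. The sketch below checks a container's `securityContext` against a subset of the restricted requirements. It is an approximation for illustration: it ignores pod-level settings that containers can inherit as well as the volume and host-namespace rules, and the `restricted_violations` helper is hypothetical.

```python
from typing import Any


def restricted_violations(container: dict[str, Any]) -> list[str]:
    """List the restricted-profile requirements this container spec does not meet."""
    sc = container.get("securityContext", {})
    violations = []
    if sc.get("privileged", False):
        violations.append("privileged must not be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if sc.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        violations.append("capabilities.drop must include ALL")
    if sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


hardened = {
    "name": "model-server",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}
print(restricted_violations(hardened))
print(restricted_violations({"name": "legacy-trainer"}))
```

A container with no securityContext at all fails most of these checks, which is why the warn and audit labels tend to produce a long list the first time they are applied.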
For more granular policy enforcement beyond the three built-in profiles, policy engines like OPA/Gatekeeper or Kyverno act as validating admission webhooks. They can enforce custom rules such as “all pods in AI namespaces must have resource limits defined” or “no training pod may mount the host filesystem.”
GPU nodes require an additional taint to prevent non-GPU workloads from landing on them. This is both a resource protection measure and a cryptomining defense [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:
kubectl taint nodes gpu-node-01 nvidia.com/gpu=true:NoSchedule
Only pods with the corresponding toleration will be scheduled onto GPU nodes, ensuring that expensive GPU capacity is reserved for legitimate AI workloads.
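As a rough illustration of the matching rule behind that statement, the sketch below models the taint and a toleration as plain dictionaries that reuse the pod spec field names (key, operator, value, effect). The helpers `tolerates` and `can_schedule` are hypothetical and simplified; they are not the scheduler's implementation and ignore effects other than NoSchedule.

```python
# Simplified model of Kubernetes taint/toleration matching (illustrative only).

def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal: key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list, taints: list) -> bool:
    """A pod fits on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint["effect"] == "NoSchedule"
    )


# The taint applied by: kubectl taint nodes gpu-node-01 nvidia.com/gpu=true:NoSchedule
gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}

# The toleration a training pod would carry under spec.tolerations.
training_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}

print(can_schedule([training_toleration], [gpu_taint]))  # True
print(can_schedule([], [gpu_taint]))                     # False
```

A toleration only permits scheduling onto the tainted node; it does not attract the pod there. Pods that must run on GPU nodes still need a GPU resource request or node affinity.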
Image Scanning for ML Container Vulnerabilities
ML containers are unusually large — a PyTorch training image including CUDA, cuDNN, and common Python libraries can easily reach 20–30 GB. More surface area means more CVEs. Common vulnerability sources in ML images include:
- Outdated CUDA base images with known driver vulnerabilities
- Python packages installed via pip with transitive dependencies that include compromised packages
- Jupyter or development tools accidentally included in production images
- Pre-trained model weights embedded in images that may have been tampered with in supply chain attacks
Integrate image scanning into the CI/CD pipeline using tools such as Trivy, Grype, or Snyk Container, and fail the build when an image has critical or high-severity CVEs. At admission time, back this up with a policy that only admits images from the trusted registry that scanned images are pushed to:
# Kyverno policy: require images to come from a trusted registry
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [ai-training, ai-inference]
      validate:
        message: "Images must come from the internal registry."
        pattern:
          spec:
            containers:
              - image: "registry.internal.company.com/*"
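One way to implement the CI-side scan gate is to have the scanner write a JSON report and fail the build on blocking findings. The sketch below assumes a report shaped like the output of `trivy image --format json`: a top-level `Results` list whose entries carry a `Vulnerabilities` list with `VulnerabilityID`, `PkgName`, and `Severity`. The function name, the blocking set, and the sample report are illustrative assumptions; check the schema against the Trivy version in use.

```python
import json

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}  # assumption: the gate blocks on these


def blocking_findings(report: dict) -> list:
    """Return (CVE id, package, severity) for every blocking vulnerability."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                findings.append(
                    (vuln.get("VulnerabilityID", "?"),
                     vuln.get("PkgName", "?"),
                     vuln["Severity"])
                )
    return findings


# Made-up sample report for illustration; real reports contain more fields.
sample_report = json.loads("""
{
  "Results": [
    {"Target": "python-pkg", "Vulnerabilities": [
      {"VulnerabilityID": "CVE-0000-0001", "PkgName": "examplelib", "Severity": "HIGH"},
      {"VulnerabilityID": "CVE-0000-0002", "PkgName": "otherlib", "Severity": "LOW"}
    ]},
    {"Target": "os-pkgs", "Vulnerabilities": null}
  ]
}
""")

findings = blocking_findings(sample_report)
for cve, pkg, severity in findings:
    print(f"{severity}: {cve} in {pkg}")
# In CI, exit non-zero when findings is non-empty so the image is never pushed.
```

Running the gate before the push means the trusted registry only ever contains images that passed the scan, which is what gives the registry-restriction policy its value.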
Secrets Management for API Keys and Model Credentials
AI workloads commonly need credentials for: cloud object storage (S3, GCS), model registries (HuggingFace Hub, internal Harbor), inference API keys (OpenAI, Anthropic), and database connections for experiment tracking.
The most important rule is: mount secrets as files, not environment variables [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]. Environment variables are visible in the process’s /proc/<pid>/environ, which any process running as the same user can read. A file mounted into the pod’s filesystem via volumeMounts is accessible only through the file path and can be given restrictive file permissions (0400).
# Correct: mount secret as a file
volumes:
  - name: s3-credentials
    secret:
      secretName: s3-access-keys
      defaultMode: 0400
containers:
  - name: trainer
    volumeMounts:
      - name: s3-credentials
        mountPath: /run/secrets/s3
        readOnly: true
# Incorrect: secret as environment variable (avoid)
env:
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: s3-access-keys
        key: secret_key
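On the application side, a file-mounted Secret appears as one file per key under the mount path. The following is a minimal sketch of reading it from training code, assuming the /run/secrets/s3 mount path and the secret_key key shown above. The `read_secret` helper is a hypothetical name, and the demonstration writes a dummy value into a temporary directory instead of touching a real mount.

```python
from pathlib import Path
import tempfile


def read_secret(name: str, secrets_dir: str = "/run/secrets/s3") -> str:
    """Read one key of a file-mounted Secret; each key is a separate file."""
    # Strip the trailing newline that is often present in secret values.
    return (Path(secrets_dir) / name).read_text(encoding="utf-8").strip()


# Local demonstration using a temporary directory in place of the mount path.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "secret_key").write_text("not-a-real-key\n", encoding="utf-8")
    secret = read_secret("secret_key", secrets_dir=demo_dir)
    print(len(secret))  # log the length, never the value itself
```

Reading the file when the credential is needed, rather than caching it at import time, lets a long-running job pick up rotated values, because the kubelet refreshes mounted Secret files when the Secret changes (subPath mounts are the exception).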
For secrets that change frequently or require audit trails, integrate an external secrets manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) using the External Secrets Operator or Vault Agent Injector. These systems provide dynamic secret generation (short-lived credentials), centralized access policies, and immutable audit logs of every secret access event.
Key Takeaway: Security hardening for AI clusters is a three-layer problem. At the identity layer, enforce least-privilege RBAC using namespace-scoped roles for every service account — never cluster-admin for training jobs. At the pod layer, apply Baseline or Restricted Pod Security Standards and taint GPU nodes to prevent resource theft. At the secrets layer, always mount credentials as files and consider external secrets managers for audit trail requirements.
Section 4: Data Security and Compliance
AI workloads present unique compliance challenges. Training datasets may contain regulated data (patient records, financial transactions, personally identifiable information). Model weights represent millions of dollars of compute investment and are themselves sensitive IP. Inference endpoints may process user queries that are subject to data residency requirements.
Encryption at Rest and in Transit for Training Data
Encryption at rest means that the raw bytes on disk are unintelligible without the encryption key. In Kubernetes, this applies at two levels:
- etcd encryption: Kubernetes Secrets and other sensitive API objects stored in etcd should be encrypted with a provider such as aescbc or aesgcm, or preferably with a kms provider so that the keys are held in an external key management service rather than on the control-plane node. Configure this through the API server’s --encryption-provider-config flag.
- Persistent volume encryption: Training datasets stored on PersistentVolumes should use volume-level encryption. Most cloud providers offer this natively (AWS EBS encryption, GCP CMEK for GCS/Filestore). For on-premises clusters, LUKS-encrypted block devices or encrypted NFS volumes provide equivalent protection.
Encryption in transit means TLS on all network paths carrying training data:
Training data flow with encryption:
Object Storage (TLS) → Data loader pod (mTLS, Istio) → GPU training pod
Gradient sync (RDMA; not covered by TLS or the mesh, needs a separate encryption layer such as NIC-level IPsec)
Note that RDMA-based collective communication has a subtlety: RDMA traffic bypasses the kernel network stack and any sidecar proxy, so TLS and mesh mTLS do not apply to it. For workloads that require encrypted gradient communication (uncommon in on-premises HPC clusters, more relevant in multi-tenant clouds), encryption has to be applied below the application layer, for example through IPsec or MACsec offload on NICs that support it. GPUDirect, SHARP, and NCCL PXN are performance features and do not encrypt traffic, so verify what your NICs and fabric actually provide before relying on them [Source: https://www.appsecengineer.com/blog/securing-ai-ml-workloads-in-kubernetes-practical-strategies-for-2025].
Data Access Auditing and Lineage Tracking
Knowing that data is encrypted is not enough for most compliance frameworks — you also need to know who accessed what data, when, and for what purpose.
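Kubernetes audit logging provides a baseline: every API request, including Secret reads, PVC creation, and pod creation, can be recorded in the API server audit log. Configure an audit policy that records who touched sensitive resources. Log Secrets at the Metadata level, because RequestResponse would write the secret values themselves into the audit log; reserve RequestResponse for resources whose bodies are safe to retain:
# audit-policy.yaml (excerpt)
rules:
  # Secrets: Metadata only, so secret payloads never reach the audit log
  - level: Metadata
    resources:
      - group: ""
        resources: [secrets]
  - level: RequestResponse
    resources:
      - group: ""
        resources: [persistentvolumeclaims]
  - level: Metadata
    resources:
      - group: ""
        resources: [pods, services]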
For AI-specific data lineage — tracking which training dataset version produced which model checkpoint — tools like MLflow, DVC (Data Version Control), and Kubeflow Pipelines provide metadata tracking at the ML workflow level. Integrating these with the Kubernetes audit log creates an end-to-end chain: API audit shows that pod X accessed PVC Y at time T, and the MLflow experiment record shows that pod X produced model checkpoint Z.
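To make the audit side of that chain concrete, here is a small sketch that filters an audit log for Secret reads. It assumes the log backend writes one JSON object per line in the audit.k8s.io/v1 Event shape (with `stage`, `verb`, `user.username`, `objectRef`, and `requestReceivedTimestamp` fields). The function name, the service account, and the sample events are made up for illustration.

```python
import json

READ_VERBS = {"get", "list", "watch"}


def secret_reads(audit_lines):
    """Yield (timestamp, username, namespace, secret name) for Secret reads."""
    for line in audit_lines:
        event = json.loads(line)
        # Each request can be logged once per stage; keep only the completed one.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") == "secrets" and event.get("verb") in READ_VERBS:
            yield (
                event.get("requestReceivedTimestamp"),
                (event.get("user") or {}).get("username"),
                ref.get("namespace"),
                ref.get("name"),
            )


# Two made-up events for illustration.
sample_log = [
    json.dumps({
        "stage": "ResponseComplete", "verb": "get",
        "user": {"username": "system:serviceaccount:ai-training:trainer"},
        "objectRef": {"resource": "secrets", "namespace": "ai-training",
                      "name": "s3-access-keys"},
        "requestReceivedTimestamp": "2025-01-01T00:00:00.000000Z",
    }),
    json.dumps({
        "stage": "ResponseComplete", "verb": "create",
        "user": {"username": "system:serviceaccount:ai-training:trainer"},
        "objectRef": {"resource": "pods", "namespace": "ai-training",
                      "name": "train-0"},
        "requestReceivedTimestamp": "2025-01-01T00:00:05.000000Z",
    }),
]

reads = list(secret_reads(sample_log))
for row in reads:
    print(row)
```

The username and timestamp recovered here are the natural join keys against experiment-tracking records, provided the training job records the same service account or pod name in its run metadata.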
Compliance Frameworks for AI Workloads
Table 8.4 — Compliance Framework Requirements Mapping
| Framework | Key Requirement | Kubernetes Implementation |
|---|---|---|
| SOC 2 Type II | Logical access controls, audit logging | RBAC least-privilege + API server audit logs |
| HIPAA | PHI encryption at rest and in transit, access controls | PV encryption + TLS + RBAC + Secrets management |
| GDPR | Data minimization, right to erasure, access logging | Namespace isolation + audit logs + data retention policies |
| FedRAMP | FIPS 140-2 encryption, continuous monitoring | FIPS-mode etcd encryption + Falco runtime monitoring |
| NIST AI RMF | Model documentation, risk assessment, bias monitoring | MLflow experiment tracking + OPA policy-as-code |
The most practically useful framing for a Kubernetes AI cluster is to treat compliance as a set of Kubernetes namespaces with different security boundaries. A HIPAA-scoped namespace for fine-tuning on patient data gets: enforce=restricted PSA, a dedicated node pool with encrypted root volumes, default-deny NetworkPolicy with egress only to approved storage endpoints, mandatory mTLS via Istio PeerAuthentication, and API server audit logs exported to an immutable log store.
Runtime threat detection using Falco adds a final defensive layer. Falco runs as an agent on each node and watches the kernel syscall stream (through an eBPF probe or kernel module) for every container on that node. It can alert on behaviors that indicate compromise: spawning a shell inside a model server container, attempting to read /etc/shadow, or making outbound connections on unusual ports [Source: https://www.anantacloud.com/post/kubernetes-security-in-2026-modern-threats-and-how-to-defend-against-them].
Key Takeaway: Data security for AI workloads requires protecting data at every layer: encryption at rest for PersistentVolumes and etcd, encryption in transit via TLS and mTLS, and comprehensive access auditing through Kubernetes audit logs. Map your compliance obligations (SOC 2, HIPAA, GDPR) to specific Kubernetes controls and enforce them programmatically via admission policies and namespace labels — manual processes drift over time, but policy-as-code does not.
Chapter Summary
This chapter covered the two critical infrastructure concerns that determine whether an AI Kubernetes cluster is both capable and trustworthy: high-performance networking and security hardening.
On the networking side, you learned how RDMA and GPUDirect RDMA eliminate CPU-mediated copies from the gradient communication path, how InfiniBand and RoCE provide the underlying transport at hundreds of gigabits per second, and how the NVIDIA Network Operator manages the full driver and plugin lifecycle required to expose these capabilities to Kubernetes pods. You saw how Multus CNI gives training pods secondary network interfaces that separate management traffic from the high-speed collective communication fabric, and how SR-IOV VFs provide hardware-isolated near-native performance for each pod.
On the security side, you learned how default-deny NetworkPolicies create an explicit-allowlist model for pod communication, how service meshes like Istio add mTLS, traffic management, and observability for inference workloads, how RBAC least-privilege principles prevent training jobs from becoming lateral movement vectors, and how Pod Security Standards enforce hardened pod configurations cluster-wide. You saw how secrets must be mounted as files rather than environment variables, and how compliance requirements like HIPAA and SOC 2 map to specific Kubernetes controls.
These two concerns — performance and security — are not in tension. A cluster with properly configured RDMA networking can also have properly enforced network policies, because the high-speed data plane and the management plane are on separate network interfaces managed by Multus. A cluster with strict RBAC can still give training jobs the precise permissions they need through well-scoped namespace roles. The goal is a cluster where GPUs run as fast as the physics allows and where the data, models, and access paths are defended in depth.
Key Terms
| Term | Definition |
|---|---|
| RDMA | Remote Direct Memory Access — a networking technique that allows a NIC to read/write remote host memory without involving the remote CPU, eliminating kernel and CPU copy overhead |
| InfiniBand | A high-speed, low-latency interconnect fabric used in HPC and AI clusters, providing 200–400+ Gb/s per port and single-digit microsecond latency |
| GPUDirect | NVIDIA technology enabling direct DMA between GPU memory and a third-party device (e.g., a NIC) over PCIe, used to bypass CPU copies in distributed training |
| SR-IOV | Single Root I/O Virtualization — a PCIe specification that partitions a single physical NIC into multiple hardware-isolated Virtual Functions (VFs), each assignable to a pod for near-native performance |
| Multus | A Kubernetes meta-CNI plugin that enables pods to have multiple network interfaces by delegating to multiple CNI plugins simultaneously |
| NetworkPolicy | A Kubernetes resource that defines Layer 3/4 rules controlling which pods can communicate with each other and with external endpoints; enforced by the CNI plugin |
| Service mesh | An infrastructure layer (e.g., Istio, Linkerd) that uses sidecar proxies to add mTLS, traffic management, observability, and policy enforcement to inter-pod communication transparently |
| mTLS | Mutual TLS — a TLS handshake in which both the client and server present and verify certificates, ensuring cryptographic identity verification in both directions |
| RBAC | Role-Based Access Control — Kubernetes’s authorization mechanism, using Roles/ClusterRoles and RoleBindings/ClusterRoleBindings to grant permissions to users, groups, and service accounts |
| Pod Security Standards | Three built-in Kubernetes security profiles (Privileged, Baseline, Restricted) enforced via Pod Security Admission at the namespace level, replacing the deprecated PodSecurityPolicy |
| Admission controller | A Kubernetes API server plugin (built-in or webhook-based) that intercepts API requests and can validate, mutate, or reject them before objects are persisted — used to enforce security policies |
Chapter 9: Monitoring, Observability, and Troubleshooting
Running AI workloads on Kubernetes introduces a new category of operational concern that traditional application monitoring was never designed to handle. A web service going down produces an error; a GPU silently throttling its clock speed costs you a week of training time without a single alert. Effective observability for AI on Kubernetes means watching three interlocking layers simultaneously: the hardware (GPUs and their interconnects), the platform (Kubernetes scheduling and resource allocation), and the application (model loss curves, inference latency, and prediction quality). This chapter builds out that full-stack observability picture, from deploying DCGM Exporter on every GPU node to writing Prometheus alert rules that catch training stalls before the team notices at standup.
Learning Objectives
By the end of this chapter, you will be able to:
- Deploy GPU and AI workload monitoring using Prometheus, Grafana, and NVIDIA DCGM Exporter
- Implement custom Prometheus metrics for training progress, loss curves, and inference latency
- Build alerting rules for GPU failures, OOM kills, and training stalls
- Troubleshoot the most common AI workload failures on Kubernetes, including GPU detection errors, NCCL communication failures, and thermal throttling
Section 1: GPU Monitoring with DCGM and Prometheus
The Three-Layer Observability Model
Think of AI workload observability as a hospital monitoring a patient. The pulse oximeter on the finger measures oxygen saturation — the equivalent of GPU utilization. The blood pressure cuff measures systemic health — the equivalent of GPU memory pressure. And the ECG monitors electrical activity — the equivalent of Xid error counts from the GPU driver. No single reading tells the full story; you need all three layers reading simultaneously to catch problems early. [Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]
The three observability layers for AI on Kubernetes are:
| Layer | What It Measures | Key Tools |
|---|---|---|
| Infrastructure | GPU utilization, memory, temperature, power, ECC errors | DCGM Exporter, node_exporter |
| Platform | Pod health, scheduling events, resource quotas, job status | kube-state-metrics, Kubernetes Events |
| Application | Training loss, learning rate, inference latency, throughput | Custom Prometheus metrics, MLflow, W&B |
[Source: https://www.wiz.io/academy/ai-security/ai-ml-kubernetes-best-practices]
Figure 9.1: Three-Layer Observability Model for AI Workloads on Kubernetes
graph TD
subgraph APP["Application Layer"]
A1[Training Loss]
A2[Inference Latency]
A3[Gradient Norm]
A4[Samples/sec]
end
subgraph PLAT["Platform Layer"]
P1[Pod Health]
P2[Scheduling Events]
P3[Resource Quotas]
P4[Job Status]
end
subgraph INFRA["Infrastructure Layer"]
I1[GPU Utilization]
I2[GPU Memory]
I3[Temperature]
I4[ECC Errors / Xid]
end
APP -->|Custom Prometheus metrics| PROM[Prometheus]
PLAT -->|kube-state-metrics + Events| PROM
INFRA -->|DCGM Exporter| PROM
PROM --> GRAFANA[Grafana Dashboards]
PROM --> ALERTMGR[Alertmanager]
NVIDIA DCGM Exporter Deployment
DCGM (Data Center GPU Manager) is NVIDIA’s suite for managing and monitoring GPUs at scale. The DCGM Exporter component uses the DCGM Go API to collect GPU telemetry and expose it at an HTTP /metrics endpoint that Prometheus can scrape. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html]
DCGM Exporter is deployed as a Kubernetes DaemonSet, meaning one pod runs on every GPU-equipped node in the cluster. This is the correct topology: GPU metrics are node-local and cannot be collected from a central location without per-node agents. [Source: https://github.com/NVIDIA/dcgm-exporter]
Installation uses the official Helm chart:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator-resources --create-namespace
Verify the DaemonSet is running on every GPU node:
kubectl get daemonset -n gpu-operator-resources
kubectl get pods -n gpu-operator-resources -l app=dcgm-exporter -o wide
The exporter pod requires privileged access to communicate with the DCGM host engine running on each node. On hardened clusters that enforce Pod Security Admission or OPA Gatekeeper policies, you will need to create a corresponding exception for the namespace DCGM Exporter runs in. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]
Figure 9.2: DCGM Exporter DaemonSet — Per-Node Deployment and Prometheus Scrape Path
graph LR
subgraph NODE1["GPU Node 1"]
GPU1A[GPU 0]
GPU1B[GPU 1]
DCGM1[DCGM Host Engine]
EXP1[DCGM Exporter Pod\n:9400/metrics]
GPU1A --> DCGM1
GPU1B --> DCGM1
DCGM1 --> EXP1
end
subgraph NODE2["GPU Node 2"]
GPU2A[GPU 0]
GPU2B[GPU 1]
DCGM2[DCGM Host Engine]
EXP2[DCGM Exporter Pod\n:9400/metrics]
GPU2A --> DCGM2
GPU2B --> DCGM2
DCGM2 --> EXP2
end
SM[ServiceMonitor CR] -->|scrape interval: 15s| EXP1
SM -->|scrape interval: 15s| EXP2
EXP1 -->|metrics| PROM[Prometheus]
EXP2 -->|metrics| PROM
PROM --> GRAFANA[Grafana\nDashboard 12239]
Key GPU Metrics
The following table covers the most operationally significant metrics exposed by DCGM Exporter:
| Metric Name | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Alert if < 20% for sustained training |
| DCGM_FI_DEV_MEM_COPY_UTIL | GPU memory bandwidth utilization (%) | Informational |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | Alert if > 95% of total |
| DCGM_FI_DEV_GPU_TEMP | GPU core temperature (°C) | Alert if > 83°C |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) | Alert if > rated TDP |
| DCGM_FI_DEV_SM_CLOCK | Streaming Multiprocessor clock (MHz) | Alert on sustained drop during training |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Volatile single-bit ECC errors | Alert if non-zero sustained |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Volatile double-bit ECC errors | Alert immediately on any occurrence |
| DCGM_FI_DEV_XID_ERRORS | Most recent Xid error code (hardware/driver faults) | Alert on any non-zero value or change |
[Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]
Double-bit ECC errors (DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) are particularly critical: they indicate uncorrectable GPU memory faults and should immediately trigger node cordoning to prevent additional training jobs from landing on the affected hardware. [Source: https://medium.com/@penkow/tracking-gpu-usage-in-k8s-with-prometheus-and-dcgm-a-complete-guide-7c8590809d7c]
Prometheus ServiceMonitor Configuration
With the kube-prometheus-stack Helm chart providing the Prometheus Operator, you configure metric scraping declaratively using a ServiceMonitor custom resource. This tells Prometheus how to discover and scrape the DCGM Exporter service: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # must match Prometheus operator's label selector
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator-resources
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
The release: kube-prometheus-stack label in the ServiceMonitor metadata is a common stumbling block: the Prometheus Operator only picks up ServiceMonitor resources matching its own serviceMonitorSelector. If your GPU metrics do not appear in Prometheus, inspect the Prometheus Operator logs and verify the label alignment.
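If kube-prometheus-stack is not already installed, deploy it before applying the ServiceMonitor, since the chart provides the ServiceMonitor CRD: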
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
Grafana Dashboard
NVIDIA publishes an official pre-built Grafana dashboard for DCGM Exporter (Dashboard ID 12239 at [Source: https://grafana.com/grafana/dashboards/12239]). Import it into Grafana with:
# Port-forward Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Then in the Grafana UI: Dashboards → Import → enter 12239
The dashboard provides utilization heatmaps across all GPUs in the cluster, memory pressure time series, temperature trend lines, and power draw per GPU. For multi-node clusters, the node selector dropdown allows drilling down to individual hosts.
Key Takeaway: DCGM Exporter runs as a DaemonSet on every GPU node, exposing hardware telemetry at /metrics. A ServiceMonitor custom resource connects it to Prometheus, and the official Grafana dashboard ID 12239 provides out-of-the-box GPU cluster visibility. Start with ECC double-bit errors and Xid errors as your most critical alert signals.
Section 2: Training and Inference Observability
Custom Metrics for Training Workloads
Infrastructure metrics tell you whether the GPU is busy; they do not tell you whether the model is converging. A GPU can run at 95% utilization while training a model whose loss has plateaued due to a learning rate misconfiguration — you would never know from DCGM metrics alone. Application-layer metrics close this gap. [Source: https://www.mirantis.com/blog/ai-workloads-management-and-best-practices/]
The most impactful training metrics to instrument are:
| Metric | What It Reveals | Alert Condition |
|---|---|---|
| Training loss (per step) | Model convergence direction | Loss not decreasing over N steps = stall |
| Validation loss (per epoch) | Overfitting detection | Val loss diverging from train loss |
| Samples per second | GPU throughput efficiency | > 20% drop vs. baseline |
| Gradient norm | Exploding/vanishing gradient health | Norm > 100 or < 1e-7 |
| Learning rate | Warmup and decay schedule | Verify against expected schedule |
| Checkpoint write success | Recovery point availability | Alert on write failure |
Expose these metrics from your training code using the Prometheus Python client. Here is a minimal instrumentation pattern for a PyTorch training loop:
import os
import time

from prometheus_client import Gauge, start_http_server

# Start metrics server on port 8080
start_http_server(8080)

# Define metrics
training_loss = Gauge('training_loss', 'Current training loss',
                      ['job_name', 'rank'])
samples_per_sec = Gauge('training_samples_per_second',
                        'Training throughput', ['job_name', 'rank'])
gradient_norm = Gauge('training_gradient_norm',
                      'L2 norm of gradient', ['job_name', 'rank'])

job_name = os.environ.get('JOB_NAME', 'unknown')
rank = os.environ.get('RANK', '0')

# model, dataloader, loss, and batch_size come from the surrounding training script
step_start = time.monotonic()
for step, batch in enumerate(dataloader):
    # ... forward pass, backward pass ...
    loss_val = loss.item()
    training_loss.labels(job_name=job_name, rank=rank).set(loss_val)

    # Gradient norm (compute after backward(), before the gradients are zeroed)
    total_norm = sum(p.grad.detach().norm(2).item() ** 2
                     for p in model.parameters()
                     if p.grad is not None) ** 0.5
    gradient_norm.labels(job_name=job_name, rank=rank).set(total_norm)

    # Throughput: samples in this batch / elapsed seconds
    step_time_seconds = time.monotonic() - step_start
    samples_per_sec.labels(job_name=job_name, rank=rank).set(
        batch_size / step_time_seconds)
    step_start = time.monotonic()
Figure 9.3: Training Observability Pipeline — From Code to Alerting
sequenceDiagram
participant TR as Training Pod
participant PM as prometheus_client<br/>(port 8080)
participant PROM as Prometheus
participant AM as Alertmanager
participant GRAF as Grafana
TR->>PM: training_loss.set(loss_val)
TR->>PM: gradient_norm.set(total_norm)
TR->>PM: training_last_step_timestamp.set(time.time())
PROM-->>PM: scrape /metrics every 15s
PM-->>PROM: gauge values
PROM->>PROM: evaluate alert rules
alt TrainingJobStalled: no step for 10m
PROM->>AM: fire alert
AM->>AM: route to on-call
end
PROM-->>GRAF: PromQL queries
GRAF-->>GRAF: render loss curves, throughput panels
For Prometheus to scrape this endpoint, expose a named port in the pod spec and add a corresponding ServiceMonitor (or a PodMonitor if the training pods are not fronted by a Service):
# Pod spec excerpt
ports:
  - name: metrics
    containerPort: 8080
    protocol: TCP
env:
  - name: JOB_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['job-name']
  - name: RANK
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['rank']
Inference Observability: Latency Distributions and SLOs
Inference observability follows a different pattern from training. Where training favors throughput — how many samples processed per second — inference favors latency: how quickly does a single request return? For user-facing inference APIs, tail latency (p99) matters far more than mean latency, because it defines the worst experience a user receives. [Source: https://www.datadoghq.com/blog/ml-model-monitoring-in-production-best-practices/]
The standard inference metric set:
| Metric | Description | Why It Matters |
|---|---|---|
| p50 latency | Median request latency | Baseline serving speed |
| p95 latency | 95th percentile latency | Typical worst-case experience |
| p99 latency | 99th percentile latency | SLO boundary for most services |
| Requests per second (RPS) | Throughput capacity | Capacity planning, cost per prediction |
| Error rate | Fraction of failed inferences | Quality signal, downstream impact |
| Queue depth | Pending requests in inference queue | Scaling trigger |
| Time to First Token (TTFT) | Latency until first token returned | Critical for streaming LLM UX |
| Time Per Output Token (TPOT) | Generation throughput for generative models | LLM serving efficiency |
[Source: https://dzone.com/articles/ai-ml-kubernetes-mlflow-kserve-vllm]
Frameworks like vLLM and TorchServe expose many of these metrics natively in Prometheus format. For custom inference servers, the prometheus_client Histogram type is the right abstraction: histograms let Prometheus estimate arbitrary percentiles at query time, with accuracy bounded by the bucket boundaries you choose:
from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_request_duration_seconds',
    'End-to-end inference latency',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# In the inference handler:
with inference_latency.labels(
    model_name='bert-classifier',
    model_version='v2'
).time():
    result = model.predict(inputs)
Query p99 latency in PromQL:
histogram_quantile(
0.99,
sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, model_name)
)
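To see why bucket boundaries matter for this query, the sketch below re-implements the core idea in plain Python: find the bucket that contains the target rank, then interpolate linearly inside it. It is an illustrative approximation of what histogram_quantile does, not Prometheus's actual code. The `estimate_quantile` name and the sample counts are assumptions.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for 100 requests (upper bounds in seconds).
latency_buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]

p50 = estimate_quantile(0.50, latency_buckets)
p99 = estimate_quantile(0.99, latency_buckets)
print(f"p50 ~ {p50:.3f}s, p99 ~ {p99:.3f}s")
```

In this sample the p99 estimate lands in the middle of the 0.5 to 1.0 second bucket even though the two slowest requests could have taken anywhere in that range. Placing bucket boundaries close to the SLO threshold keeps the estimate meaningful where it matters.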
Distributed Training Collective Communication Monitoring
In multi-node distributed training, collective communication operations (AllReduce, AllGather, Broadcast) are the synchronization points where all GPU ranks must meet before proceeding. A single slow or dead GPU causes every other rank to stall waiting for the collective to complete.
Monitor collective communication health by tracking:
- Step time variance across ranks: If one rank consistently takes longer, it is the bottleneck
- NVLink bandwidth (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL): Drop in NVLink bandwidth indicates interconnect degradation
- PCIe throughput: Bottleneck for GPUs without NVLink on the same node
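The step-time comparison in the first bullet can be sketched as follows. The example assumes each rank reports the duration of its local compute phase, measured before the collective, since a synchronous AllReduce otherwise makes every rank's end-to-end step time look identical. The `find_stragglers` helper, the 20% tolerance, and the sample timings are illustrative assumptions.

```python
from statistics import mean, median


def find_stragglers(step_times: dict, tolerance: float = 0.2) -> list:
    """Return ranks whose mean step time exceeds the cross-rank median by more than `tolerance`."""
    per_rank_mean = {rank: mean(times) for rank, times in step_times.items()}
    baseline = median(per_rank_mean.values())
    return sorted(
        rank for rank, avg in per_rank_mean.items()
        if avg > baseline * (1 + tolerance)
    )


# Made-up per-rank compute durations in seconds.
observed = {
    0: [1.00, 1.02, 0.99],
    1: [1.01, 1.00, 1.03],
    2: [0.98, 1.01, 1.00],
    3: [1.45, 1.50, 1.48],  # consistently slower
}

print(find_stragglers(observed))  # [3]
```

Comparing against the median rather than the mean keeps a single very slow rank from dragging the baseline up and hiding itself.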
The torch.distributed.monitored_barrier() function adds an explicit timeout to barrier synchronization, surfacing which rank failed to arrive rather than hanging indefinitely: [Source: https://medium.com/@devaru.ai/debugging-nccl-errors-in-distributed-training-a-comprehensive-guide-28df87512a34]
Figure 9.4: Distributed Training Collective Communication and Straggler Detection
graph TD
subgraph RANKS["AllReduce Collective — 4 Ranks"]
R0[Rank 0\nNode A, GPU 0]
R1[Rank 1\nNode A, GPU 1]
R2[Rank 2\nNode B, GPU 0]
R3[Rank 3 STRAGGLER\nNode B, GPU 1]
end
R0 -->|gradient shard| RING[NCCL Ring / Tree]
R1 -->|gradient shard| RING
R2 -->|gradient shard| RING
R3 -.->|DELAYED / MISSING| RING
RING -->|60s timeout| MB[monitored_barrier]
MB -->|timeout fires| ERR["RuntimeError:\nRank 3 failed to arrive"]
ERR --> DIAG[Inspect DCGM Xid on Node B GPU 1]
style R3 fill:#ffcccc,stroke:#cc0000
style ERR fill:#fff3cd,stroke:#856404
import datetime
import torch.distributed as dist
# Replace dist.barrier() with monitored_barrier during debugging
# monitored_barrier is Gloo-only; with NCCL, pass group=dist.new_group(backend="gloo")
# All ranks must reach this point within 60 seconds or an error is raised
dist.monitored_barrier(timeout=datetime.timedelta(seconds=60))
Key Takeaway: Infrastructure metrics alone cannot diagnose AI workload problems. Instrument training loops with Prometheus gauges for loss, throughput, and gradient norms. Use Prometheus histograms for inference latency to enable p95/p99 quantile queries. For distributed training, monitored_barrier() is the single most useful debugging tool for identifying stragglers in collective communication.
Section 3: Alerting and SLOs for AI Workloads
Designing Alerts That Matter
A useful mental model for AI workload alerting: alerts should page someone who can act on them immediately. Alerts about metric symptoms (high GPU temperature) without actionable context (which node, which job, what to do) cause alert fatigue and get ignored. Every alert should answer three questions: what is broken, where is it broken, and what is the immediate remediation step.
Keep a concise set of service-level objectives for each inference service and tune alert rules to fire only when an SLO is threatened. [Source: https://www.finops.org/wg/scaling-kubernetes-for-ai-ml-workloads-with-finops/]
GPU Failure and Xid Error Alerts
Xid errors are the GPU driver’s hardware fault codes, emitted via dmesg and surfaced through DCGM. Common Xid codes for AI workloads: [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]
| Xid Code | Meaning | Recommended Action |
|---|---|---|
| Xid 13 | Graphics Exception (general) | Investigate workload, restart if persistent |
| Xid 31 | GPU memory page fault | Check for out-of-bounds memory access in CUDA code |
| Xid 48 | Double-bit ECC error (DBE) | Cordon node immediately, replace GPU |
| Xid 64 | Row remapping failure | Schedule GPU replacement |
| Xid 79 | GPU has fallen off the bus | Node reboot required, hardware failure likely |
Prometheus alert rules for GPU health:
groups:
  - name: gpu.rules
    rules:
      - alert: GPUXidErrorDetected
        # DCGM_FI_DEV_XID_ERRORS is a gauge holding the most recent Xid code
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "GPU Xid error on {{ $labels.instance }}"
          description: >
            GPU {{ $labels.gpu }} on node {{ $labels.instance }} last reported
            Xid {{ $value }}. Check dmesg for details.
      - alert: GPUDoubleBitECCError
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "GPU double-bit ECC error: cordon node {{ $labels.instance }}"
          description: >
            Uncorrectable GPU memory error detected. Node {{ $labels.instance }}
            should be cordoned immediately to prevent further job scheduling.
      - alert: GPUTemperatureCritical
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU thermal limit exceeded on {{ $labels.instance }}"
          description: >
            GPU {{ $labels.gpu }} temperature is {{ $value }}°C. Performance
            throttling is likely. Check datacenter cooling.
[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]
OOM Detection and Memory Pressure Alerts
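Kubernetes reports an OOM-killed container with termination reason OOMKilled and exit code 137. Host-memory OOM kills and GPU memory exhaustion are separate failure modes (a full GPU framebuffer raises a CUDA out-of-memory error inside the process rather than invoking the kernel OOM killer), so pair the pod-level signal with GPU memory pressure from DCGM to cover both: [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]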
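- alert: GPUMemoryPressure
  expr: >
    (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "GPU memory > 95% on {{ $labels.instance }}"
    description: >
      GPU {{ $labels.gpu }} is using {{ $value | humanizePercentage }}
      of framebuffer memory. CUDA out-of-memory risk is high. Consider
      reducing batch size or enabling gradient checkpointing.
- alert: PodOOMKilled
  # The last_terminated_reason series stays at 1 until the next termination,
  # so require a recent restart to avoid an alert that never resolves.
  expr: >
    (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
    and on(namespace, pod, container)
    (increase(kube_pod_container_status_restarts_total[10m]) > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Pod OOM killed: {{ $labels.pod }}"
    description: >
      Container {{ $labels.container }} in pod {{ $labels.pod }} was
      terminated by the OOM killer. Consider raising the container memory
      limit or reducing host memory use (for example, fewer data loader workers).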
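The OOMKilled reason that the PodOOMKilled rule relies on is also visible directly in a pod's status, which is useful when triaging a single pod by hand. Below is a minimal sketch that inspects the JSON produced by `kubectl get pod <pod-name> -o json`; the `find_oom_killed` helper and the sample pod are hypothetical.

```python
def find_oom_killed(pod):
    """Return (container_name, exit_code) for containers terminated by an OOM kill.

    `pod` is the dict form of `kubectl get pod <pod-name> -o json`.
    Both the current state and the previous state are checked, because a
    restarted container reports the kill under lastState.
    """
    killed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                killed.append((status["name"], terminated.get("exitCode")))
                break
    return killed


# Hypothetical pod status used only for illustration.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trainer",
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "state": {"running": {}}, "lastState": {}},
        ]
    }
}
print(find_oom_killed(sample_pod))
```

Checking the reason field rather than the exit code alone avoids misclassifying containers that received SIGKILL for other reasons.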
Training Stall Detection
A training stall is one of the most expensive failure modes in AI infrastructure: the cluster continues consuming expensive GPU-hours while producing no useful model updates. Detect stalls by alerting on the absence of training progress rather than the presence of an error: [Source: https://aws.amazon.com/blogs/containers/part-2-observing-and-scaling-mlops-infrastructure-on-amazon-eks/]
- alert: TrainingJobStalled
  expr: >
    (time() - training_last_step_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Training job {{ $labels.job_name }} stalled"
    description: >
      No training steps recorded for job {{ $labels.job_name }} for at
      least 10 minutes. Possible causes: NCCL hang, dead GPU, data loading
      deadlock. Inspect pod logs immediately.
The training_last_step_timestamp_seconds metric must be emitted by your training code:
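import time

from prometheus_client import Gauge, start_http_server

last_step_ts = Gauge('training_last_step_timestamp_seconds',
                     'Unix timestamp of last completed training step',
                     ['job_name'])

# Expose /metrics so Prometheus can scrape it (port 8000 is an example)
start_http_server(8000)

# Updated at the end of each training step; job_name identifies the run
last_step_ts.labels(job_name=job_name).set(time.time())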
SLO Definitions for Inference Services
Well-defined SLOs translate business requirements into concrete alert thresholds. KEDA (Kubernetes Event-driven Autoscaling) can use these same Prometheus metrics as scaling signals, automatically adding inference replicas when the p99 SLO is at risk. [Source: https://www.finops.org/wg/scaling-kubernetes-for-ai-ml-workloads-with-finops/]
A representative SLO table for an LLM inference service:
| SLO | Objective | Prometheus Query | Alert if |
|---|---|---|---|
| p99 latency | < 2 seconds | histogram_quantile(0.99, ...) | > 2s for 5m |
| p95 latency | < 500ms | histogram_quantile(0.95, ...) | > 500ms for 5m |
| Availability | > 99.9% | 1 - (error_rate / total_rate) | < 99.9% over 1h |
| TTFT | < 300ms | histogram_quantile(0.95, ttft_seconds_bucket) | > 300ms for 5m |
| Error rate | < 0.1% | rate(inference_errors_total[5m]) / total_rate | > 0.1% for 5m |
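The latency rows in the table depend on how histogram_quantile estimates a percentile from cumulative bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. The sketch below reproduces that idea in plain Python under simplifying assumptions (sorted buckets ending in +Inf); the bucket values are invented for illustration and this is not Prometheus's own implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound and end with a +Inf bucket,
    and interpolates linearly inside the bucket that contains the rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket boundary.
                return prev_bound
            span = count - prev_count
            fraction = (rank - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds, cumulative request counts).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
p95 = histogram_quantile(0.95, buckets)
print(f"p99={p99:.4f}s meets <2s SLO: {p99 < 2.0}")
print(f"p95={p95:.4f}s meets <0.5s SLO: {p95 < 0.5}")
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; placing a boundary at or near each SLO threshold keeps the alert comparison meaningful.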
A KEDA ScaledObject that scales inference replicas when p99 latency exceeds 1.5 seconds:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-server-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring:9090
        metricName: inference_p99_latency
        query: >
          histogram_quantile(0.99,
          sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))
        threshold: "1.5"
Key Takeaway: Alert on SLO breaches, not just metric thresholds. The most critical GPU alerts are double-bit ECC errors (page immediately, cordon node) and Xid errors (investigate promptly). Training stall detection requires application-level metrics — infrastructure metrics cannot distinguish a stalled job from one doing useful work.
Section 4: Troubleshooting Common Issues
The following section is structured as a troubleshooting playbook. Each subsection describes a symptom, its root causes, and a diagnostic sequence to isolate and resolve it.
GPU Not Detected or Driver Mismatch
Symptom: Pod runs but uses CPU only; training throughput is ~100x lower than expected; no GPU resources shown in pod description.
Root Causes:
- Missing nvidia.com/gpu resource request in the pod spec; the NVIDIA device plugin will not allocate GPUs to pods that do not request them
- NVIDIA Container Runtime not configured as the default runtime on the node
- NVIDIA device plugin DaemonSet not running on the target node
- Node driver version mismatch with the container’s CUDA version
Diagnostic Sequence:
# 1. Check pod spec for GPU resource request
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
# 2. Verify device plugin is running on the target node
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
# 3. Check node capacity includes GPUs
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
# Expected: "nvidia.com/gpu": "8"
# 4. Inspect pod events for scheduling failures
kubectl describe pod <pod-name> | grep -A 10 Events
# 5. Check CUDA version compatibility inside the container
kubectl exec -it <pod-name> -- nvidia-smi
A particularly dangerous failure mode: containers started without a GPU resource request appear to function but run CPU-only computation at roughly 1/100th the expected speed. Silent failures of this type have burned weeks of researcher time before being caught. [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]
Figure 9.5: GPU Not Detected — Diagnostic Decision Tree
flowchart TD
START([Pod running but slow?]) --> CHECK1{GPU in pod spec\nresources.limits?}
CHECK1 -->|No| FIX1[Add nvidia.com/gpu: 1\nto resources.limits and requests]
CHECK1 -->|Yes| CHECK2{Device plugin pod\nrunning on target node?}
CHECK2 -->|No| FIX2[kubectl apply device plugin DaemonSet\nor restart it]
CHECK2 -->|Yes| CHECK3{Node capacity shows\nnvidia.com/gpu?}
CHECK3 -->|No| FIX3[Check NVIDIA driver install\nand node reboot]
CHECK3 -->|Yes| CHECK4{Pod events show\nscheduling error?}
CHECK4 -->|Yes| FIX4[Add toleration for\nnvidia.com/gpu NoSchedule taint]
CHECK4 -->|No| CHECK5{nvidia-smi works\ninside container?}
CHECK5 -->|No| FIX5[Fix NVIDIA Container Runtime\nas default runtime on node]
CHECK5 -->|Yes| DONE([GPU detected — investigate\nCUDA version compatibility])
style FIX1 fill:#d4edda,stroke:#28a745
style FIX2 fill:#d4edda,stroke:#28a745
style FIX3 fill:#d4edda,stroke:#28a745
style FIX4 fill:#d4edda,stroke:#28a745
style FIX5 fill:#d4edda,stroke:#28a745
The fix in the pod spec:
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
[Source: https://docs.cloud.google.com/kubernetes-engine/docs/troubleshooting/gpus]
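Since a missing GPU request fails silently, it can be worth checking pod specs before they are submitted. The following sketch flags containers without an nvidia.com/gpu limit; it assumes every container in the pod is expected to use a GPU, which will not hold for sidecars, and the `containers_without_gpu` helper is a hypothetical name.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def containers_without_gpu(pod_spec):
    """Return names of containers that do not set an nvidia.com/gpu limit.

    `pod_spec` is the dict form of a pod's .spec section. Quantities may be
    ints or strings, so the value is normalized with int().
    """
    missing = []
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        if int(limits.get(GPU_RESOURCE, 0)) < 1:
            missing.append(container["name"])
    return missing


# Hypothetical spec: one container requests a GPU, the other forgot to.
spec = {
    "containers": [
        {"name": "trainer", "resources": {"limits": {"nvidia.com/gpu": "1"}}},
        {"name": "evaluator", "resources": {"limits": {"cpu": "4"}}},
    ]
}
print(containers_without_gpu(spec))
```

The same check could run in CI against rendered manifests, or be expressed as an admission policy, so the problem is caught before any GPU-hours are spent.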
NCCL Communication Failures in Distributed Training
Symptom: Distributed training hangs, then fails when the collective timeout expires (PyTorch's process-group timeout, 1800 seconds by default in many releases) with a message like NCCL error: unhandled cuda error, NCCL version 2.x.x.
NCCL is the NVIDIA Collective Communications Library used by PyTorch DDP, DeepSpeed, and Megatron-LM for inter-GPU communication. When NCCL hangs, the entire training job stalls: all ranks wait for a collective operation that one rank cannot complete. [Source: https://medium.com/@devaru.ai/debugging-nccl-errors-in-distributed-training-a-comprehensive-guide-28df87512a34]
Root Causes and Diagnostics:
| Cause | Diagnostic Signal | Fix |
|---|---|---|
| Wrong network interface | NCCL logs show wrong IP prefix | Set NCCL_SOCKET_IFNAME=eth0 (or correct interface) |
| Clock skew > 1ms between nodes | NCCL timeout errors, NTP logs show drift | Synchronize NTP/chrony across all nodes |
| Mismatched tensor shapes | Stack trace at AllReduce/AllGather call | Audit model code for data-dependent tensor shapes |
| Dead GPU (silent failure) | One rank stops progressing | Check DCGM Xid metrics, nvidia-smi on affected node |
| Incorrect NCCL_NTHREADS or buffer size | Bandwidth degradation | Use nccl-tests to validate baseline bandwidth |
Step 1: Enable NCCL debug logging. This is always the first diagnostic action for NCCL failures: [Source: https://forums.developer.nvidia.com/t/some-nccl-operations-have-failed-or-timed-out-due-to-the-asynchronous-nature-of-cuda-kernels/283010]
# In the training pod spec environment variables
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_ASYNC_ERROR_HANDLING
    value: "1"
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"  # replace with your high-speed network interface
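NCCL_DEBUG=INFO reveals topology detection (whether NVLink, InfiniBand, or TCP is selected), ring/tree construction, and which ranks are communicating with which. NCCL_ASYNC_ERROR_HANDLING=1 is read by PyTorch rather than by NCCL itself (recent PyTorch releases name it TORCH_NCCL_ASYNC_ERROR_HANDLING); it makes a rank abort when a collective fails or times out instead of leaving the job hung indefinitely.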
Step 2: Run nccl-tests to validate baseline connectivity before launching training. This catches network configuration issues before they waste training time: [Source: https://kubernetes.recipes/recipes/troubleshooting/validate-gpu-topology-nccl/]
# Deploy nccl-tests as a job on two nodes
kubectl apply -f nccl-tests-job.yaml
# Check all-reduce bandwidth
kubectl logs -l job-name=nccl-tests | grep "Avg bus bandwidth"
# Expected: >100 GB/s for NVLink, >25 GB/s for InfiniBand, >10 GB/s for 100GbE
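The bandwidth check can be automated as a pre-flight gate. The sketch below parses the summary line from nccl-tests output and compares it with the expectations quoted above; it assumes the log contains a line of the form `# Avg bus bandwidth    : <number>`, and the fabric names and `check_bus_bandwidth` helper are illustrative.

```python
import re

# Minimum expected bus bandwidth in GB/s, taken from the figures quoted above.
MIN_BUS_BW_GBPS = {"nvlink": 100.0, "infiniband": 25.0, "100gbe": 10.0}

# Assumed summary line shape: "# Avg bus bandwidth    : 18.4321"
BW_PATTERN = re.compile(r"Avg bus bandwidth\s*:\s*([0-9.]+)")


def check_bus_bandwidth(log_text, fabric):
    """Return (measured_gbps, passed) for the given interconnect type."""
    match = BW_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in log")
    measured = float(match.group(1))
    return measured, measured > MIN_BUS_BW_GBPS[fabric]


# Hypothetical log tail used only to exercise the parser.
sample_log = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 18.4321\n#\n"
print(check_bus_bandwidth(sample_log, "infiniband"))
print(check_bus_bandwidth(sample_log, "100gbe"))
```

Failing the launch pipeline when the check does not pass keeps a misconfigured fabric from being discovered hours into a training run.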
Step 3: Check GPU topology. Poor physical topology (GPUs on different PCIe switches communicating across the CPU) explains low NCCL bandwidth without any software bug: [Source: https://kubernetes.recipes/recipes/troubleshooting/validate-gpu-topology-nccl/]
kubectl exec -it <pod-name> -- nvidia-smi topo -m
# Output shows NV# (NVLink, # = number of links), PIX (at most one PCIe bridge),
# and SYS (crosses the interconnect between NUMA nodes) connections
# NV# > PIX > SYS for bandwidth
Step 4: Identify the straggler rank. If the training job hangs rather than fails immediately, use monitored_barrier() to identify which rank is not arriving at the synchronization point. monitored_barrier() is implemented only for the Gloo backend, so run it on a Gloo process group created alongside the NCCL group: [Source: https://github.com/huggingface/accelerate/issues/314]
import datetime
import logging

import torch.distributed as dist

logger = logging.getLogger(__name__)

# Create once, after init_process_group(); every rank must make this call.
gloo_group = dist.new_group(backend="gloo")


def checked_barrier(step):
    logger.info(f"Rank {dist.get_rank()} reaching barrier at step {step}")
    try:
        dist.monitored_barrier(
            group=gloo_group,
            timeout=datetime.timedelta(seconds=60),
            wait_all_ranks=True,  # collects all failed ranks, not just the first
        )
    except RuntimeError as e:
        logger.error(f"Rank {dist.get_rank()} barrier failed: {e}")
        raise
Pod Scheduling Failures Due to Resource Constraints
Symptom: Pods remain in Pending state indefinitely. kubectl describe pod shows events like 0/12 nodes are available: 12 Insufficient nvidia.com/gpu.
Diagnostic Sequence:
# Check overall GPU resource availability across the cluster
# (-A 10 so the nvidia.com/gpu row, listed after cpu and memory, is included)
kubectl describe nodes | grep -A 10 "Allocated resources"
# Check for GPU nodes that are NotReady or cordoned (SchedulingDisabled)
kubectl get nodes -l nvidia.com/gpu.present=true | grep -E 'NotReady|SchedulingDisabled'
# Check for existing pods holding GPU resources
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null)
| {name: .metadata.name, ns: .metadata.namespace, gpus: .spec.containers[].resources.limits["nvidia.com/gpu"]}'
# Check if the pod has correct tolerations for GPU node taints
kubectl get node <gpu-node> -o jsonpath='{.spec.taints}'
GPU nodes commonly carry a NoSchedule taint to prevent non-GPU workloads from consuming the expensive hardware. If your training pod is missing the matching toleration, it will never be scheduled:
# Required toleration for GPU nodes with NoSchedule taint
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
[Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/]
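The matching rule between a toleration and a taint can be checked offline against the output of the kubectl commands above. This is a simplified sketch of the Kubernetes matching semantics (key, operator, value, effect); it ignores tolerationSeconds, and the function names are hypothetical.

```python
def tolerates(toleration, taint):
    """Simplified check of whether one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists with no key tolerates every taint; otherwise keys must match.
        return key is None or key == taint.get("key")
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(pod_tolerations, node_taints):
    """Return the scheduling-relevant taints that no toleration covers."""
    return [
        taint for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]


# Hypothetical GPU node taint and the toleration shown above.
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(untolerated_taints([], [gpu_taint]))
print(untolerated_taints([gpu_toleration], [gpu_taint]))
```

A non-empty result from `untolerated_taints` explains a Pending pod even when the node reports free GPUs.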
Performance Degradation and Thermal Throttling
Symptom: Training throughput (samples/sec) gradually decreases over the course of a long run, with no errors logged. The model is training, but slower and slower.
Thermal throttling is a silent performance killer. When GPU temperature exceeds the thermal design point (typically 83°C for data center GPUs), the hardware automatically reduces the SM clock frequency to stay within thermal limits. A GPU throttling from 1.95 GHz down to 1.2 GHz loses approximately 38% of compute throughput — enough to extend a week-long training run by 2-3 days. [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]
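The arithmetic behind these figures is easy to reproduce. The sketch below assumes, as a simplification, that training throughput scales linearly with SM clock; the `throttling_impact` helper and the `throttled_fraction` parameter (the share of the run's work executed while throttled) are introduced here for illustration.

```python
def throttling_impact(base_ghz, throttled_ghz, run_days, throttled_fraction=1.0):
    """Return (throughput_loss, extra_days) for a run affected by throttling.

    Assumes throughput is proportional to SM clock, which is a simplification:
    memory-bound or communication-bound jobs will lose less.
    """
    speed = throttled_ghz / base_ghz
    loss = 1.0 - speed
    # Work done while throttled takes 1/speed times as long as planned.
    extra_days = run_days * throttled_fraction * (1.0 / speed - 1.0)
    return loss, extra_days


loss, extra_full = throttling_impact(1.95, 1.2, run_days=7)
_, extra_half = throttling_impact(1.95, 1.2, run_days=7, throttled_fraction=0.5)
print(f"throughput loss: {loss:.1%}")
print(f"extra days if always throttled: {extra_full:.1f}")
print(f"extra days if half the work runs throttled: {extra_half:.1f}")
```

Under these assumptions a two-to-three-day extension corresponds to roughly half to two-thirds of the work running throttled, while sustained throttling for the whole run adds more than four days.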
Detection: Correlate DCGM_FI_DEV_GPU_TEMP with DCGM_FI_DEV_SM_CLOCK in a Grafana panel. A drop in SM clock correlated with rising temperature is the definitive signature of thermal throttling:
# In a Grafana panel using two Y-axes:
# Left axis (°C):
DCGM_FI_DEV_GPU_TEMP{instance="gpu-node-01"}
# Right axis (MHz):
DCGM_FI_DEV_SM_CLOCK{instance="gpu-node-01"}
Common Causes and Remediation:
| Cause | Signal | Remediation |
|---|---|---|
| Inadequate datacenter cooling | All GPUs on a rack throttling simultaneously | Escalate to datacenter operations |
| Fan failure on specific GPU | Single GPU throttling, others nominal | Replace GPU; alert on temperature outlier vs. rack peers |
| Thermal paste degradation | Gradual onset over months | Server hardware maintenance |
| High ambient temperature (power density) | Correlated with high room temperature | Redistribute workloads, add cooling capacity |
Add a Prometheus alert that fires when an SM clock drop is correlated with elevated temperature, giving early warning before throughput impact becomes obvious:
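- alert: GPUThermalThrottling
  # Temperature is the left-hand side, so the alert value is degrees Celsius.
  # on(instance, gpu) is needed on both operators: the clock ratio keeps only
  # those two labels and would otherwise never match the temperature series.
  expr: >
    (DCGM_FI_DEV_GPU_TEMP > 78)
    and on(instance, gpu)
    ((DCGM_FI_DEV_SM_CLOCK / on(instance, gpu) (DCGM_FI_DEV_SM_CLOCK offset 30m)) < 0.85)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU thermal throttling suspected on {{ $labels.instance }}"
    description: >
      SM clock on GPU {{ $labels.gpu }} has dropped > 15% from 30 minutes ago
      while temperature is {{ $value }}°C. Training throughput is degraded.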
[Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]
Key Takeaway: The most insidious AI workload failures are silent ones — CPU-only execution masquerading as GPU training, NCCL hangs that never produce error messages, and thermal throttling that degrades throughput without triggering any alert. Build observability that detects absence of progress, not just presence of errors.
NCCL_DEBUG=INFO and monitored_barrier() are your two most powerful debugging tools for distributed training failures.
Chapter Summary
Monitoring AI workloads on Kubernetes requires observability at three layers: hardware (DCGM Exporter metrics), platform (Kubernetes scheduling and resource allocation), and application (training loss, inference latency). The foundational monitoring stack combines the kube-prometheus-stack Helm chart with DCGM Exporter deployed as a DaemonSet on every GPU node, connected via a ServiceMonitor custom resource.
Training workloads need custom Prometheus metrics — loss, throughput, gradient norm — because infrastructure metrics alone cannot distinguish a converging model from a stalled one. Inference services require latency histograms to support p50/p95/p99 SLO tracking, with KEDA enabling SLO-driven autoscaling.
Alert design should prioritize actionable signals: double-bit ECC errors and Xid 79 events require immediate node cordoning; training stall detection requires application-level last_step_timestamp metrics; thermal throttling detection requires correlating SM clock drops with temperature rises.
Troubleshooting follows a systematic playbook: verify GPU resource requests before anything else; use NCCL_DEBUG=INFO and NCCL_ASYNC_ERROR_HANDLING=1 as the first step for distributed training hangs; run nccl-tests to validate inter-node bandwidth before large training runs; and correlate SM clock with temperature to catch thermal throttling before it wastes GPU-hours.
Key Terms
| Term | Definition |
|---|---|
| DCGM | NVIDIA Data Center GPU Manager — a suite of tools for managing and monitoring NVIDIA GPUs in data center and Kubernetes environments |
| DCGM Exporter | A Go-based Kubernetes DaemonSet that exposes DCGM GPU metrics at an HTTP /metrics endpoint for Prometheus scraping |
| Prometheus | Open-source time-series database and monitoring system that scrapes and stores metrics from instrumented services |
| Grafana | Open-source visualization platform used to build dashboards from Prometheus metrics and other data sources |
| ServiceMonitor | A Prometheus Operator custom resource that declaratively configures how Prometheus discovers and scrapes metric endpoints |
| Xid error | An NVIDIA GPU driver hardware fault code emitted via dmesg and surfaced through DCGM; different Xid codes map to specific hardware failure modes |
| NCCL | NVIDIA Collective Communications Library — the inter-GPU communication backend used by PyTorch DDP, DeepSpeed, and Megatron-LM for distributed training |
| OOM | Out of Memory — a condition where a process attempts to allocate more memory than is available; in Kubernetes, the OOM killer terminates the offending container with exit code 137 |
| SLO | Service Level Objective — a measurable target for service quality (e.g., p99 latency < 2 seconds) used to define acceptable operating bounds and trigger alerts |
| p99 latency | The 99th percentile of request latency: the response time below which 99% of requests fall, so only the slowest 1% of requests take longer |
| Thermal throttling | An automatic GPU protection mechanism that reduces SM clock frequency when the GPU temperature exceeds its thermal design point, degrading compute throughput silently |
| kube-prometheus-stack | A Helm chart bundling Prometheus Operator, Grafana, Alertmanager, kube-state-metrics, and node_exporter into a complete Kubernetes monitoring stack |
| KEDA | Kubernetes Event-driven Autoscaling: extends Kubernetes HPA to support scaling based on custom metrics including Prometheus queries |
| ECC | Error-Correcting Code — a memory protection mechanism on data center GPUs; single-bit errors (SBE) are corrected in hardware, while double-bit errors (DBE) are uncorrectable and indicate hardware failure |
| Gradient norm | The L2 norm of the model’s gradient vector; used to detect exploding gradients (norm >> 1) or vanishing gradients (norm ≈ 0) during training |
Chapter 10: Production Patterns and Scaling AI Platforms
Learning Objectives
By the end of this chapter, you will be able to:
- Design a multi-tenant AI platform on Kubernetes with proper isolation and resource governance
- Implement multi-cluster and hybrid strategies for scaling AI workloads
- Apply production hardening patterns for reliability and disaster recovery
- Evaluate emerging technologies (DRA, Kubernetes AI/ML WG) shaping the future of AI on Kubernetes
Introduction
Running a proof-of-concept AI model on a single Kubernetes cluster is relatively straightforward. Running a production AI platform — one that serves dozens of data science teams, survives regional outages, recovers from disasters, and scales to hundreds of GPUs — is a fundamentally different engineering challenge.
This chapter brings together the threads from earlier chapters and examines the operational patterns that distinguish a toy ML cluster from a platform that an enterprise can rely on. We will explore how to share resources fairly between teams, how to extend capacity across cloud boundaries, and how to build the kind of resilience that lets platform engineers sleep soundly at night. We will also look ahead at the next generation of Kubernetes primitives — Dynamic Resource Allocation, LeaderWorkerSet, JobSet, and emerging accelerator support — that are reshaping what is possible.
Think of this chapter as the architectural blueprint review that happens before a building opens to the public: the foundation has been laid, the systems are in place, but now we must ensure the structure is safe, equitable, and built to last.
10.1 Multi-Tenant AI Platform Design
The Tenant Problem
Imagine a university research computing center. Every faculty lab needs GPU time, every PhD student needs to run experiments, and everyone believes their deadline is the most important one. Without structure, the researcher with the most aggressive scripts wins — and everyone else waits. The Kubernetes equivalent of this situation is called the noisy neighbor problem, and it is the central challenge of multi-tenancy.
Multi-tenancy on Kubernetes is defined along a spectrum. Soft multi-tenancy assumes tenants within the same organization are cooperative; the goal is fairness, not adversarial isolation. Hard multi-tenancy treats tenants as potentially hostile — guarding against data exfiltration, privilege escalation, and denial-of-service attacks [Source: https://kubernetes.io/docs/concepts/security/multi-tenancy/]. Most enterprise AI platforms fall somewhere between these poles: internal teams are trusted, but resource boundaries are strict.
Namespace-Based Isolation vs. Virtual Clusters
The first design decision is the isolation boundary itself.
Namespace-based isolation is the Kubernetes default. Each team or project gets one or more namespaces, and Kubernetes scopes many objects to namespaces: RBAC Roles, NetworkPolicies, ResourceQuotas, and LimitRanges. This approach is operationally simple, requires no additional tooling, and scales well for organizations where teams share a common Kubernetes API version.
Its limitation is that namespaces are a logical boundary, not a physical one. Teams share the same API server and the same control plane. Critically, Custom Resource Definitions (CRDs) are cluster-scoped, which means if the computer-vision team installs a custom operator at version v1alpha1 and the NLP team needs the same CRD at v1beta1, there is a conflict with no easy resolution [Source: https://www.vcluster.com/blog/isolating-workloads-multitenant-gpu-cluster].
Virtual clusters (exemplified by tools like vCluster) solve this by giving each tenant a fully isolated Kubernetes API server — etcd, controller manager, and all — running as pods inside the host cluster. Tenants can install arbitrary CRDs and operators without affecting one another. The host cluster still owns the actual compute (Nodes, persistent volumes), so the hardware footprint is shared, but the control plane experience is fully isolated [Source: https://www.vcluster.com/guides/gpu-cluster-to-ai-factory-multi-tenant-infrastructure].
The trade-off table below summarizes the decision:
| Dimension | Namespace Isolation | Virtual Clusters |
|---|---|---|
| Operational complexity | Low | Medium |
| CRD isolation | No — cluster-scoped | Yes — per-vcluster |
| Control plane overhead | None | One vcluster pod set per tenant |
| Security boundary | Logical | Near-physical |
| Recommended for | Internal teams, shared tooling | External tenants, CRD conflicts |
| Tooling examples | Native Kubernetes | vCluster, Kamaji |
Figure 10.1: Multi-Tenancy Isolation Models — Namespace Boundaries vs. Virtual Clusters
graph TB
subgraph HostCluster["Host Kubernetes Cluster"]
subgraph NS["Namespace Isolation"]
NS_A["Namespace: team-a\n(RBAC, ResourceQuota,\nNetworkPolicy)"]
NS_B["Namespace: team-b\n(RBAC, ResourceQuota,\nNetworkPolicy)"]
SharedAPI["Shared API Server\n& Control Plane"]
SharedCRDs["Shared CRDs\n(cluster-scoped)"]
NS_A --- SharedAPI
NS_B --- SharedAPI
SharedAPI --- SharedCRDs
end
subgraph VC["Virtual Cluster Isolation"]
VCA["vCluster: team-c\n(own API server, etcd,\ncontrollers, CRDs)"]
VCB["vCluster: team-d\n(own API server, etcd,\ncontrollers, CRDs)"]
end
SharedNodes["Shared Physical Nodes\n(GPUs, Storage)"]
end
NS_A --> SharedNodes
NS_B --> SharedNodes
VCA --> SharedNodes
VCB --> SharedNodes
style NS fill:#dbeafe,stroke:#3b82f6
style VC fill:#dcfce7,stroke:#22c55e
style SharedNodes fill:#fef9c3,stroke:#eab308
Resource Governance with Kueue and Quotas
Isolation defines who can access resources; governance defines how much they can use. Kubernetes provides two native primitives:
- ResourceQuota: caps the aggregate consumption of CPU, memory, and GPU resources within a namespace. A team with requests.nvidia.com/gpu: "8" in their quota cannot have more than eight GPUs requested across all of its pods at once.
- LimitRange: sets default and maximum resource requests and limits for individual pods, preventing any single workload from claiming unbounded resources.
For more sophisticated batch scheduling, where jobs should wait in a queue when a team's quota is exhausted rather than fail, the Kueue project (a Kubernetes SIG Scheduling subproject) provides ClusterQueue and LocalQueue objects. A ClusterQueue defines a pool of resources with nominal quotas and borrowing limits; teams submit to their LocalQueue, and unused quota can be lent to neighboring queues in the same cohort. This models the university computing center analogy well: each department has a baseline allocation, but idle capacity flows to whoever needs it.
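The interaction between nominal quota and borrowing can be illustrated with a toy admission check. This is a deliberately simplified model and not Kueue's actual algorithm (it ignores preemption, priorities, and resource flavors); the `QueueQuota` class, the `can_admit` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueQuota:
    nominal: int          # GPUs guaranteed to this queue
    borrowing_limit: int  # extra GPUs it may borrow from idle peers
    used: int = 0         # GPUs currently held by admitted workloads


def can_admit(queues, name, request):
    """Toy check: admit only within nominal + borrowing and unused pool capacity."""
    queue = queues[name]
    if queue.used + request > queue.nominal + queue.borrowing_limit:
        return False
    # Borrowed GPUs must come from capacity the pool as a whole is not using.
    pool_free = sum(q.nominal for q in queues.values()) - sum(q.used for q in queues.values())
    return request <= pool_free


# Hypothetical 32-GPU pool shared by three teams.
queues = {
    "team-a": QueueQuota(nominal=12, borrowing_limit=8, used=8),
    "team-b": QueueQuota(nominal=12, borrowing_limit=8, used=16),  # borrowing 4
    "team-c": QueueQuota(nominal=8, borrowing_limit=0, used=4),
}
print(can_admit(queues, "team-a", 8))  # stays queued: only 4 GPUs are free
print(can_admit(queues, "team-c", 4))  # admitted within its nominal quota
```

In this model team-b's borrowing is what keeps team-a's second job queued, which is the kind of trade-off that borrowing limits and preemption policies are meant to bound.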
For GPU-intensive workloads, NVIDIA Multi-Instance GPU (MIG) adds a hardware layer to this governance stack. MIG partitions a physical A100 or H100 into isolated instances with dedicated memory and compute slices. An operator can expose a 1g.10gb MIG profile to small fine-tuning jobs and a 4g.40gb profile to larger training runs, ensuring physical isolation between workloads sharing the same card [Source: https://apxml.com/courses/advanced-ai-infrastructure-design-optimization/chapter-3-advanced-kubernetes-orchestration/multi-tenancy-kubernetes-ml].
Self-Service Interfaces for Data Science Teams
Platform teams should not be ticket-queue gatekeepers. The most effective AI platforms provide self-service interfaces that let data scientists provision compute without filing a support ticket.
Common patterns include:
- Namespace-as-a-Service: A GitOps-driven workflow where a data scientist opens a pull request to a tenants repository. A controller (often built with Crossplane or a custom operator) detects the merge and provisions a namespace, RBAC roles, a LocalQueue, and default LimitRanges automatically.
- ML Platform portals: Tools like Kubeflow, MLflow on Kubernetes, or internal dashboards built on the Kubernetes API give data scientists a web UI to launch training jobs, browse experiments, and track model versions without touching kubectl.
- Policy enforcement with OPA Gatekeeper or Kyverno: These admission controllers enforce governance rules automatically (blocking pods without resource limits, requiring specific label schemas, or preventing privileged containers) without relying on manual review processes [Source: https://kubernetes.io/docs/concepts/security/multi-tenancy/].
Platform Engineering Patterns for ML Platforms
The platform engineering discipline applies software engineering practices to internal developer platforms. For ML, this means treating the AI platform as a product with internal customers.
Key patterns include:
- Golden paths: Pre-configured templates for common job types (PyTorch DDP training, vLLM inference deployment, batch feature engineering). Teams use the template, not blank YAML.
- Paved roads with guardrails: Defaults that are safe and cost-effective out of the box, with escape hatches for teams that know what they are doing.
- Chargeback and showback: Integrating cost data (via tools like OpenCost or Kubecost) with namespace labels so each team can see what their GPU spend was last month. Transparency alone changes behavior.
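A showback report reduces to aggregating GPU-hours per namespace and multiplying by a rate. The sketch below shows that calculation on invented usage records; the record fields, the `gpu_showback` helper, and the hourly rate are assumptions for illustration and do not reflect the data model of OpenCost or Kubecost.

```python
from collections import defaultdict


def gpu_showback(usage_records, rate_per_gpu_hour):
    """Aggregate GPU-hours per namespace and convert them to cost."""
    gpu_hours = defaultdict(float)
    for record in usage_records:
        gpu_hours[record["namespace"]] += record["gpus"] * record["hours"]
    return {ns: round(hours * rate_per_gpu_hour, 2) for ns, hours in gpu_hours.items()}


# Hypothetical usage records, one per finished job.
records = [
    {"namespace": "team-a", "gpus": 8, "hours": 10.0},
    {"namespace": "team-a", "gpus": 4, "hours": 2.5},
    {"namespace": "team-b", "gpus": 16, "hours": 1.0},
]
report = gpu_showback(records, rate_per_gpu_hour=2.50)
for namespace, cost in sorted(report.items()):
    print(f"{namespace}: ${cost:.2f}")
```

In practice the records would come from a cost tool's allocation export, and the rate would differ between on-prem amortized hardware and cloud instances.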
Figure 10.2: Kueue Resource Governance — From ClusterQueue to Workload Scheduling
flowchart TD
Admin["Cluster Admin\ncreates ClusterQueue\n(nominal quota + borrowing limits)"]
CQ["ClusterQueue\n'ml-platform-pool'\nnvidia.com/gpu: 32 nominal\nnvidia.com/gpu: 48 max-burst"]
LQ_A["LocalQueue\nteam-a\n(draws from ClusterQueue)"]
LQ_B["LocalQueue\nteam-b\n(draws from ClusterQueue)"]
LQ_C["LocalQueue\nteam-c\n(draws from ClusterQueue)"]
Job_A1["Training Job A1\n(8 GPUs)"]
Job_A2["Training Job A2\n(8 GPUs) — QUEUED"]
Job_B1["Training Job B1\n(16 GPUs)"]
Job_C1["Experiment C1\n(4 GPUs)"]
Scheduler["Kueue Scheduler\nfair-share + borrowing\npreemption if needed"]
Nodes["GPU Node Pool\n(32 physical GPUs)"]
Admin --> CQ
CQ --> LQ_A
CQ --> LQ_B
CQ --> LQ_C
LQ_A --> Job_A1
LQ_A --> Job_A2
LQ_B --> Job_B1
LQ_C --> Job_C1
Job_A1 --> Scheduler
Job_B1 --> Scheduler
Job_C1 --> Scheduler
Scheduler --> Nodes
style Job_A2 fill:#fee2e2,stroke:#ef4444
style Scheduler fill:#ede9fe,stroke:#8b5cf6
style Nodes fill:#fef9c3,stroke:#eab308
Key Takeaway: Multi-tenant AI platforms require layered governance — namespace or virtual-cluster isolation for boundaries, ResourceQuotas and Kueue for fair resource sharing, NVIDIA MIG for hardware-level partitioning, and self-service tooling so platform engineers scale their impact without becoming a bottleneck.
10.2 Multi-Cluster and Hybrid Architectures
Why One Cluster Is Not Enough
A single Kubernetes cluster, no matter how large, has fundamental limitations for enterprise AI platforms:
- Blast radius: A cluster-level failure (control plane outage, network partition) takes down all workloads simultaneously.
- Geographic constraints: Regulations may require data and compute to remain in specific regions. A European financial institution cannot train on customer data in a US cluster.
- Cost optimization: On-premises hardware provides lower per-GPU cost for steady-state workloads, but provisioning enough on-prem capacity to handle peak demand wastes money when that peak is infrequent.
- Technology heterogeneity: Training jobs may need A100s available on-premises, while serving workloads benefit from cloud-managed inference endpoints with global load balancing.
Training On-Prem, Serving in Cloud
A common and economically rational pattern separates the training and serving lifecycles:
- Training on-prem: Large distributed training jobs run on owned hardware where GPU costs are amortized over years. Data stays on-premises, satisfying data governance requirements. On-prem clusters are tuned for throughput: high-bandwidth networking (InfiniBand or RoCE), fast NVMe storage, and large GPU node pools.
- Serving in cloud: Trained model artifacts are pushed to a model registry (backed by object storage like S3 or GCS). Cloud-native inference deployments — using managed services or Kubernetes-on-cloud clusters — serve end-users with auto-scaling, global load balancing, and pay-per-use economics.
The synchronization bridge is the model registry: a versioned artifact store (MLflow, Weights & Biases, or a simple OCI registry) that both sides can read from. A training job publishes model:v2.1.0 to the registry; a cloud deployment pipeline detects the new version and rolls it out.
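The "detect the new version" step can be as simple as comparing version tags numerically. Here is a minimal sketch, assuming tags follow the name:vMAJOR.MINOR.PATCH shape used in the example above; the helper names are hypothetical, and a real pipeline would list tags through its registry's API.

```python
import re

# Assumed tag shape, matching the example "model:v2.1.0".
TAG_PATTERN = re.compile(
    r"^(?P<name>[\w.-]+):v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def parse_tag(tag):
    """Split 'model:v2.1.0' into ('model', (2, 1, 0))."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag}")
    version = tuple(int(match.group(part)) for part in ("major", "minor", "patch"))
    return match.group("name"), version


def newest_version(tags, name):
    """Return the highest version tuple published for `name`, or None."""
    versions = [version for tag_name, version in map(parse_tag, tags) if tag_name == name]
    return max(versions) if versions else None


def needs_rollout(registry_tags, deployed_tag):
    """True when the registry holds a newer version than the deployed one."""
    name, deployed = parse_tag(deployed_tag)
    latest = newest_version(registry_tags, name)
    return latest is not None and latest > deployed


registry_tags = ["model:v2.0.3", "model:v2.1.0", "model:v2.0.10"]
print(newest_version(registry_tags, "model"))
print(needs_rollout(registry_tags, "model:v2.0.10"))
```

Comparing integer tuples rather than strings matters: as text, v2.0.10 would sort before v2.0.3.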
Figure 10.3: Hybrid Training/Serving Architecture — On-Prem Training, Cloud Serving
flowchart LR
subgraph OnPrem["On-Premises Cluster"]
direction TB
RawData["Raw Training Data\n(NVMe / NFS)"]
TrainingJob["Distributed Training Job\n(PyTorch DDP / FSDP)\nA100 / H100 GPUs"]
Checkpoint["Model Checkpoint\n(periodic saves)"]
RawData --> TrainingJob
TrainingJob --> Checkpoint
end
subgraph Registry["Model Registry\n(Object Storage: S3 / GCS)"]
ModelArtifact["model:v2.1.0\n(immutable, versioned)"]
end
subgraph Cloud["Cloud Cluster"]
direction TB
CDPipeline["CD Pipeline\n(detects new version)"]
InferenceDeployment["Inference Deployment\n(vLLM / Triton)\nauto-scaling"]
Users["End Users\n(global load balancer)"]
CDPipeline --> InferenceDeployment
InferenceDeployment --> Users
end
Checkpoint -->|"push artifact"| ModelArtifact
ModelArtifact -->|"pull on new version"| CDPipeline
style OnPrem fill:#dbeafe,stroke:#3b82f6
style Registry fill:#fef9c3,stroke:#eab308
style Cloud fill:#dcfce7,stroke:#22c55e
Burst-to-Cloud for Peak Training Demand
Not every organization can afford enough on-premises GPU capacity for their peak training demand. The burst-to-cloud pattern addresses this: on-premises clusters handle the baseline, and cloud clusters absorb spikes.
Think of it like a town’s water supply: the municipal reservoir (on-prem cluster) handles daily demand, but during a drought or a heat wave, the town draws from an interconnected regional reservoir (cloud cluster) to meet peak demand. When the surge passes, the connection is released.
Implementing burst-to-cloud on Kubernetes typically involves:
- Cluster federation or virtual workspaces: Tools like Liqo, Open Cluster Management (OCM), or Karmada federate multiple clusters behind a single control plane, allowing a workload scheduler to place jobs on whichever cluster has available capacity.
- Cost-aware scheduling: The federation layer monitors spot instance pricing in the cloud and prefers on-prem when utilization is below a threshold, falling back to cloud spot instances when on-prem is saturated.
- Data locality considerations: Large training datasets should not cross expensive WAN links on every job. Burst-to-cloud works best when data is replicated to cloud object storage in advance, or when the burst workloads are less data-intensive (hyperparameter sweeps, evaluation runs).
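As a rough illustration of the "prefer on-prem, fall back to cloud" idea above, the following is a minimal sketch of a Karmada PropagationPolicy that uses ordered cluster affinity groups. The cluster names, namespace, and the `burst-eligible` label are hypothetical, and the fallback here is driven by scheduling feasibility rather than spot pricing; a real cost-aware setup would need additional logic that is not shown.

```yaml
# Sketch only: assumes Karmada is installed and both member clusters are registered.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: training-burst-policy      # hypothetical name
  namespace: ml-training           # hypothetical namespace
spec:
  resourceSelectors:
    - apiVersion: batch/v1
      kind: Job
      labelSelector:
        matchLabels:
          burst-eligible: "true"   # hypothetical opt-in label on burstable Jobs
  placement:
    # Affinity groups are tried in order; the second group is considered
    # only if the first cannot satisfy the placement.
    clusterAffinities:
      - affinityName: on-prem-first
        clusterNames:
          - onprem-gpu-01          # hypothetical on-prem member cluster
      - affinityName: cloud-burst
        clusterNames:
          - onprem-gpu-01
          - cloud-gpu-spot-01      # hypothetical cloud member cluster
```

Requiring an explicit opt-in label keeps data-heavy jobs on-prem by default, which matches the data locality caveat in the list above.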
Multi-Cluster Federation for GPU Pools
For organizations operating at true scale (large tech companies or national research labs, for example), a single physical cluster cannot hold all the GPU nodes they need, and the added management overhead of federation becomes justified. Multi-cluster federation creates a logical GPU pool that spans physical boundaries.
The federation control plane maintains a global view of resource availability. When a data scientist submits a 1,000-GPU training job, the scheduler can place it across two or three clusters transparently. From the user’s perspective, the job was submitted and eventually completed; the cross-cluster placement was invisible.
The main challenges are:
- Consistent networking: Jobs that span clusters need low-latency, high-bandwidth inter-node communication. This is achievable within a data center but degrades significantly across WAN links, making cross-region GPU federation impractical for tight communication patterns.
- Consistent software environments: GPU drivers, CUDA versions, and operator versions must align across clusters to avoid job failures caused by environment mismatches [Source: https://thenewstack.io/ai-workloads-kubernetes-infrastructure-drift/].
- Unified observability: Metrics, logs, and traces from federated clusters must flow to a common observability stack for a coherent operational picture.
Cross-Cluster Model and Data Synchronization
Several classes of data must move across cluster boundaries:
| Data Type | Direction | Tooling |
|---|---|---|
| Training datasets | On-prem → Cloud (for burst jobs) | Rclone, AWS DataSync, object storage replication |
| Model checkpoints | Training cluster → Model registry | MLflow, W&B, OCI artifact push |
| Serving model artifacts | Registry → Serving cluster | CD pipeline, image pull, PVC snapshot |
| Experiment metadata | All clusters → Central tracking | MLflow tracking server, remote backend |
| Cluster state backups | All clusters → Remote object store | Velero, Kasten K10 |
The key insight is that model artifacts are immutable and versioned — once a model version is published, it does not change. This immutability makes cross-cluster synchronization straightforward compared to synchronizing mutable databases.
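One way a serving cluster might take advantage of this immutability is to pin the exact model version in the Deployment manifest, so that a rollout or rollback is just a change of version string. The sketch below assumes a hypothetical inference image that reads a `MODEL_URI` environment variable; the image name, bucket path, and variable name are illustrative only.

```yaml
# Sketch only: image, bucket, and MODEL_URI are placeholders, not a real server's API.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
        model-version: v2.1.0      # surfaces the pinned version in pod labels
    spec:
      containers:
        - name: server
          image: registry.example.com/inference-server:1.0.0
          env:
            - name: MODEL_URI      # hypothetical variable read by the server at startup
              value: s3://example-model-registry/model/v2.1.0
```

Because the artifact behind `v2.1.0` never changes, every cluster that applies this manifest loads byte-identical weights.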
Key Takeaway: Multi-cluster and hybrid architectures let organizations optimize cost (on-prem for steady-state, cloud for bursts), meet data governance requirements, and eliminate single points of failure. The coordination layer — federation, model registries, and data replication pipelines — is what makes the multi-cluster boundary transparent to data science teams.
10.3 Production Hardening and Reliability
The Reliability Contract
Production AI platforms carry an implicit contract: training jobs should not be interrupted without warning, inference services should not drop requests during maintenance, and the platform should recover gracefully from failures that are inevitable at scale. This section covers the mechanisms that fulfill that contract.
Pod Disruption Budgets for Inference Services
When a node is drained for maintenance — whether for a kernel update, hardware replacement, or cluster upgrade — Kubernetes evicts the pods running on that node. For stateless web services, eviction and rescheduling is fast and relatively painless. For a high-traffic inference service with strict latency SLAs, evicting all replicas simultaneously is catastrophic.
PodDisruptionBudgets (PDBs) constrain the number of pods that can be voluntarily disrupted at any time. A PDB is an instruction to the Kubernetes eviction API: “you may only evict pods matching this selector if at least N replicas remain available.”
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2 # At least 2 replicas must stay up during disruption
  selector:
    matchLabels:
      app: llm-inference
For a three-replica inference Deployment, this PDB ensures that a rolling node drain can only remove one pod at a time, keeping the service at two-thirds capacity throughout maintenance rather than going dark entirely.
The inverse parameter, maxUnavailable, is useful for large deployments where a fixed number or percentage of pods may be disrupted at once. A PDB may set either minAvailable or maxUnavailable, but not both:
spec:
  maxUnavailable: 1 # At most 1 pod unavailable at any time
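For the percentage form, a complete manifest might look like the sketch below. The name is illustrative and the 25% figure is an arbitrary example, not a recommendation.

```yaml
# Sketch only: allows up to a quarter of matching pods to be evicted at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb-percent    # hypothetical name
spec:
  maxUnavailable: "25%"          # percentages are written as quoted strings
  selector:
    matchLabels:
      app: llm-inference
```

A percentage scales with the replica count, so the same PDB remains sensible as an autoscaler grows or shrinks the Deployment.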
PDBs only constrain voluntary disruptions (node drains, cluster upgrades). Involuntary disruptions — node hardware failure, OOM kills — bypass PDBs and require replica counts and topology spread constraints for resilience.
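To guard against involuntary disruptions, replicas can be spread across failure domains. The following is a minimal sketch using topology spread constraints; the image is a placeholder, and whether to use a hard or soft constraint per domain is a design choice that depends on the cluster.

```yaml
# Sketch only: spreads inference replicas across nodes (hard) and zones (soft).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname        # no two replicas pile up on one node
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: llm-inference
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # prefer even spread across zones
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: llm-inference
      containers:
        - name: server
          image: registry.example.com/inference-server:1.0.0   # placeholder image
```

With this layout, the loss of a single node takes out at most one replica, which complements the PDB's protection during planned drains.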
Graceful Shutdown and Drain for Long-Running Training
Training jobs present the inverse problem from inference: they are not serving external requests, but they run for hours or days and cannot simply be killed without losing progress.
When a node must be drained and a training pod is running on it, the default behavior is a SIGTERM followed by a SIGKILL after terminationGracePeriodSeconds. For a training job mid-epoch, this often means losing all progress made since the last checkpoint.
Production hardening strategies for training jobs include:
- Frequent checkpointing: The simplest mitigation. Checkpointing every 10-15 minutes limits the maximum work lost to 10-15 minutes regardless of why the job is interrupted. Frameworks like PyTorch Lightning and Hugging Face Trainer have built-in checkpoint hooks.
- Checkpoint-and-resume on SIGTERM: Training code can register a SIGTERM handler that triggers a checkpoint save before exiting. The pod then exits cleanly, the job controller reschedules it, and training resumes from the checkpoint [Source: https://nebius.com/blog/posts/how-to-use-kubernetes-for-ai-workloads].
- Extended termination grace periods: Setting terminationGracePeriodSeconds to 300-600 seconds gives the training process enough time to complete a checkpoint write before the pod is forcibly killed.
- Job prioritization: Using Kueue or PriorityClasses to mark production training jobs as higher priority than exploratory experiments ensures that when resources are scarce, experiments are preempted first, not production jobs.
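The grace-period and prioritization strategies above could be combined roughly as follows. This is a sketch under stated assumptions: the names, priority value, and image are hypothetical, and the training code is assumed to handle SIGTERM by writing a checkpoint.

```yaml
# Sketch only: a PriorityClass for production training plus a Job that uses it.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-training        # hypothetical name
value: 100000                      # arbitrary example; higher values win
globalDefault: false
description: "Production training jobs; scheduled ahead of exploratory experiments."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune               # hypothetical name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
      # Do not count node drains or preemption against backoffLimit.
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      priorityClassName: production-training
      restartPolicy: Never         # required when podFailurePolicy is set
      terminationGracePeriodSeconds: 600   # time allowed for the SIGTERM checkpoint write
      containers:
        - name: trainer
          image: registry.example.com/trainer:1.0.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

The grace period only helps if the process actually reacts to SIGTERM, so it is worth confirming that the signal reaches the training process rather than stopping at a wrapper shell.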
Disaster Recovery for Model Registries and Pipelines
AI platforms accumulate irreplaceable artifacts: trained model weights, experiment metadata, pipeline definitions, and data preprocessing code. Losing this data can set a team back weeks or months.
A tiered DR strategy for AI platforms:
Tier 1 — Cluster state backups: Tools like Velero, Stash, or Kasten K10 snapshot Kubernetes resource manifests and persistent volume contents to remote object storage (S3, GCS, Azure Blob). These backups enable restoring the entire cluster configuration after a catastrophic failure [Source: https://kubeify.com/blog/2025-02-05-how-to-implement-kubernetes-for-high-availability-and-disaster-recovery/]. The etcd datastore should be snapshotted independently on a regular schedule.
Tier 2 — Model artifact replication: Model registries backed by object storage benefit from cross-region replication. S3 Cross-Region Replication or GCS multi-region buckets copy model artifacts asynchronously to a secondary region, typically within minutes, with no application-level changes required.
Tier 3 — Active-passive pipeline replication: For organizations with strict RTO (Recovery Time Objective) requirements, an active-passive second cluster can be kept warm with the same pipeline definitions and operator versions. Failover is a matter of switching DNS or load balancer targets rather than rebuilding from scratch.
| DR Tier | What It Protects | Tooling | Typical RPO |
|---|---|---|---|
| Cluster state | YAML manifests, PVCs | Velero, Kasten K10 | Hours (scheduled backup interval) |
| Model artifacts | Trained weights, checkpoints | S3 CRR, GCS multi-region | Minutes (async replication) |
| Metadata (experiments) | MLflow runs, metrics | MLflow tracking DB backup | Hours |
| Active-passive cluster | Full platform availability | OCM, Karmada, manual DR | Set by the replication tiers above; RTO of minutes (DNS failover) |
RPO = Recovery Point Objective; how much data can be lost. RTO = Recovery Time Objective; how long recovery takes.
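The "Hours" RPO for cluster state follows directly from the backup schedule. As a minimal sketch, a Velero Schedule that backs up a hypothetical `ml-platform` namespace every six hours bounds the RPO for that tier at roughly six hours; the namespace name, interval, and retention are assumptions chosen for illustration.

```yaml
# Sketch only: assumes Velero is installed with a default backup storage location.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: ml-platform-every-6h     # hypothetical name
  namespace: velero
spec:
  schedule: "0 */6 * * *"        # cron syntax: every six hours
  template:
    includedNamespaces:
      - ml-platform              # hypothetical namespace holding platform resources
    snapshotVolumes: true        # include persistent volume snapshots
    ttl: 720h0m0s                # keep each backup for 30 days
```

Shortening the cron interval tightens the RPO at the cost of more storage and snapshot load.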
Figure 10.4: Tiered Disaster Recovery Strategy for AI Platforms
flowchart TD
subgraph Primary["Primary Cluster (Active)"]
ClusterState["Kubernetes Resources\n(Deployments, CRDs,\nOperators, PVCs)"]
ModelReg["Model Registry\n(trained weights,\ncheckpoints)"]
ExpMeta["Experiment Metadata\n(MLflow runs, metrics)"]
Pipelines["Pipeline Definitions\n& Operator configs"]
end
subgraph Tier1["Tier 1 — Cluster State Backup"]
Velero["Velero / Kasten K10\nsnapshots to object store\nRPO: Hours"]
end
subgraph Tier2["Tier 2 — Model Artifact Replication"]
S3CRR["S3 Cross-Region Replication\n/ GCS Multi-Region\nRPO: Minutes"]
end
subgraph Tier3["Tier 3 — Active-Passive Cluster"]
PassiveCluster["Passive Cluster\n(warm standby)\nDNS failover\nRTO: Minutes"]
end
ClusterState -->|"scheduled snapshots"| Velero
ExpMeta -->|"DB backup"| Velero
ModelReg -->|"async replication"| S3CRR
Pipelines -->|"GitOps sync"| PassiveCluster
S3CRR -->|"available in\nsecondary region"| PassiveCluster
style Primary fill:#dbeafe,stroke:#3b82f6
style Tier1 fill:#fef9c3,stroke:#eab308
style Tier2 fill:#dcfce7,stroke:#22c55e
style Tier3 fill:#ede9fe,stroke:#8b5cf6
Chaos Engineering for AI Infrastructure
Even well-designed systems fail in unexpected ways. Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they manifest in production [Source: https://www.wiz.io/academy/ai-ml-kubernetes-best-practices].
The analogy is a fire drill: you do not wait for a real fire to discover that the exit is blocked. You schedule a drill, find the blocked exit, and fix it.
For AI infrastructure, relevant chaos experiments include:
- Node failure injection: Terminating a GPU node mid-training to verify that checkpoint-and-resume works correctly.
- Network partition simulation: Cutting the link between the training cluster and the model registry to verify that jobs fail gracefully and retry.
- etcd latency injection: Introducing artificial latency on the control plane to observe how operator reconciliation loops behave under degraded API server performance.
- GPU memory pressure: Injecting a memory-intensive workload alongside a production training job to validate that MIG partitioning or namespace quotas prevent interference.
Tools like Chaos Mesh and LitmusChaos integrate natively with Kubernetes and provide a library of fault injection experiments that can be scheduled, monitored, and automatically validated.
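As an illustration, a Chaos Mesh experiment that kills one training worker pod might look like the sketch below. It approximates the node failure scenario at the pod level only; the namespaces and the `app: training-worker` label are hypothetical.

```yaml
# Sketch only: assumes Chaos Mesh is installed in the cluster.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-training-worker   # hypothetical name
  namespace: chaos-testing         # hypothetical namespace for experiments
spec:
  action: pod-kill
  mode: one                        # pick a single matching pod at random
  selector:
    namespaces:
      - ml-training                # hypothetical namespace of the training job
    labelSelectors:
      app: training-worker         # hypothetical label on worker pods
```

After the experiment runs, the check is whether the job controller recreates the pod and training resumes from the most recent checkpoint rather than from step zero.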
The output of a chaos engineering program is not just discovered bugs — it is resilience confidence: documented proof that the platform handles specific failure modes gracefully, which is invaluable during incident postmortems and compliance audits.
Key Takeaway: Production hardening for AI on Kubernetes combines PodDisruptionBudgets to protect inference availability during maintenance, checkpoint-and-resume patterns to protect training progress during disruptions, tiered disaster recovery for irreplaceable model artifacts, and chaos engineering to validate resilience proactively rather than discovering gaps during real incidents.
10.4 The Future of AI on Kubernetes
Dynamic Resource Allocation (DRA) for GPUs
Every previous chapter in this book used the resources.limits pattern to request GPUs: nvidia.com/gpu: 1. This works, but it is a blunt instrument. The scheduler knows only that a pod needs one GPU — it has no visibility into GPU memory capacity, NVLink connectivity, PCIe topology, or the relationship between GPUs and high-speed NICs. This opacity leads to suboptimal placements, idle hardware, and fragile manual workarounds using node labels and selectors.
Dynamic Resource Allocation (DRA) is the successor to Kubernetes Device Plugins and represents a fundamental redesign of how Kubernetes understands and allocates specialized hardware [Source: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/]. DRA reached stable (GA) status in Kubernetes 1.34 [Source: https://developers.redhat.com/articles/2026/03/25/dynamic-resource-allocation-goes-ga-red-hat-openshift-421-smarter-gpu], and is also GA in Red Hat OpenShift 4.21 as of March 2026.
DRA introduces three new API objects in the resource.k8s.io API group:
| Object | Role | Who Creates It |
|---|---|---|
| ResourceSlice | Advertises hardware available on each node, including device attributes (memory, architecture, vendor capabilities) | DRA driver / node agent |
| DeviceClass | Defines a category of devices that can be requested, e.g., “any NVIDIA H100” or “any GPU with >= 40 GB VRAM” | Cluster administrator or device driver |
| ResourceClaim | A workload’s request for a specific device or combination of devices | User / pipeline |
Attribute-based requests are the headline feature. Instead of requesting nvidia.com/gpu: 1 and hoping the scheduler places the pod on a node with the right GPU, a user can write:
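apiVersion: resource.k8s.io/v1 # GA API version as of Kubernetes 1.34
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: nvidia-gpu
        selectors:
        - cel:
            # Capacity keys are qualified by the driver's domain; "gpu.nvidia.com" is assumed here.
            expression: device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("40Gi"))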
This claim requests any NVIDIA GPU with more than 40 GiB of memory. The scheduler consults the ResourceSlice objects published by DRA drivers on each node and finds a matching device automatically — no manual node selectors required [Source: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation].
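A pod consumes the claim by listing it under `spec.resourceClaims` and referencing it from the container that needs the device. The sketch below assumes the claim named `training-gpu-claim` exists in the same namespace; the pod name and image are placeholders.

```yaml
# Sketch only: a pod that consumes an existing ResourceClaim.
apiVersion: v1
kind: Pod
metadata:
  name: training-pod               # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu                    # pod-local name for the claim
      resourceClaimName: training-gpu-claim
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0.0   # placeholder image
      resources:
        claims:
          - name: gpu              # must match the pod-local name above
```

Note that no `nvidia.com/gpu` limit appears here: the device request lives entirely in the claim.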
Topology-aware co-scheduling is equally important for distributed training. DRA supports inter-device constraints: a workload can request both a GPU and a high-speed NIC and specify that they must be attached to the same PCIe Root Complex to minimize latency. Previously, this required complex custom scheduler plugins or manual affinity rules; with DRA, it is a first-class declarative expression [Source: https://thenewstack.io/kubernetes-primer-dynamic-resource-allocation-dra-for-gpu-workloads/].
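A claim that pairs a GPU with a NIC on the same PCIe root might be sketched as follows. This assumes both drivers publish the standardized `resource.kubernetes.io/pcieRoot` attribute; the `rdma-nic` DeviceClass is hypothetical, and attribute support should be checked against the specific driver versions in use.

```yaml
# Sketch only: requests two devices and constrains them to share a PCIe root.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-nic-pair               # hypothetical name
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-gpu
      - name: nic
        exactly:
          deviceClassName: rdma-nic    # hypothetical DeviceClass for high-speed NICs
    constraints:
      # Both allocated devices must report the same value for this attribute.
      - requests: ["gpu", "nic"]
        matchAttribute: resource.kubernetes.io/pcieRoot
```

If no node has a GPU and NIC that satisfy the constraint together, the claim stays unallocated and the pod remains pending, which is usually preferable to a silently slow placement.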
Figure 10.5: Dynamic Resource Allocation (DRA) — Object Model and Scheduling Flow
flowchart TD
subgraph Drivers["Node-Level DRA Drivers"]
NVDriver["NVIDIA DRA Driver\n(node agent)"]
AMDDriver["AMD DRA Driver\n(node agent)"]
end
subgraph DRAObjects["DRA API Objects (resource.k8s.io)"]
RS_NV["ResourceSlice\n(node: gpu-node-01)\ndevice: H100 80GB\nmemory: 80Gi\nnvlink: true"]
RS_AMD["ResourceSlice\n(node: gpu-node-05)\ndevice: MI300X\nmemory: 192Gi"]
DC["DeviceClass\n'nvidia-gpu'\n(vendor selector,\ndefault attributes)"]
RC["ResourceClaim\n(workload request)\nmemory >= 40Gi\nnvlink == true"]
end
subgraph Workload["User Workload"]
Pod["Training Pod\n(references ResourceClaim)"]
end
Scheduler["Kubernetes Scheduler\n(DRA-aware)\nmatches Claims to Slices"]
NVDriver -->|"publishes"| RS_NV
AMDDriver -->|"publishes"| RS_AMD
DC -->|"scopes"| RC
Pod -->|"references"| RC
RC -->|"evaluated by"| Scheduler
RS_NV -->|"consulted by"| Scheduler
RS_AMD -->|"consulted by"| Scheduler
Scheduler -->|"binds Pod to\ngpu-node-01"| Pod
style DRAObjects fill:#dbeafe,stroke:#3b82f6
style Scheduler fill:#ede9fe,stroke:#8b5cf6
style Workload fill:#dcfce7,stroke:#22c55e
AWS has demonstrated DRA on EKS with EC2 P6e-GB200 instances (NVIDIA GB200 NVL72), and AKS has shipped DRA support for NVIDIA vGPU partitioning in Azure as of March 2026 [Source: https://blog.aks.azure.com/2026/03/06/dra-with-vGPUs-on-aks]. NVIDIA’s NIM Operator also supports DRA for deploying inference microservices [Source: https://docs.nvidia.com/nim-operator/latest/dra.html].
Kubernetes AI/ML Working Group Initiatives
The Kubernetes project formalized the AI/ML Working Group (wg-ai-ml) to coordinate the development of primitives and patterns needed for machine learning workloads. The working group focuses on use cases that cut across SIGs (Special Interest Groups): batch scheduling, accelerator management, topology-aware placement, and job lifecycle management.
Working group outputs feed directly into the APIs described in this chapter. The working group also maintains reference architectures and best practices documentation that platform engineers can use to evaluate whether their cluster configuration aligns with community consensus.
LeaderWorkerSet and JobSet APIs
Two new batch APIs have emerged from the AI/ML working group that address the limitations of plain Kubernetes Job for distributed training:
JobSet (API group jobset.x-k8s.io) is an API for managing groups of related Job objects as a single unit. A distributed training run often needs several coordinated job components: parameter servers, workers, and evaluation processes. Managing these as separate Job objects means writing custom controllers to coordinate their lifecycle. JobSet treats the group as a single logical unit with shared failure policies, indexing, and lifecycle management [Source: https://rafay.co/ai-and-cloud-native-blog/introduction-to-dynamic-resource-allocation-dra-in-kubernetes].
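A minimal JobSet with a worker group and a separate evaluator might be sketched like this. It assumes the JobSet controller is installed; the names, images, replica counts, and restart limit are illustrative.

```yaml
# Sketch only: two replicated jobs managed as one unit with a shared failure policy.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training       # hypothetical name
spec:
  failurePolicy:
    maxRestarts: 3                 # restart the whole set at most three times
  replicatedJobs:
    - name: workers
      replicas: 1
      template:                    # a standard batch/v1 Job template
        spec:
          parallelism: 4
          completions: 4
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: trainer
                  image: registry.example.com/trainer:1.0.0     # placeholder image
                  resources:
                    limits:
                      nvidia.com/gpu: 1
    - name: evaluator
      replicas: 1
      template:
        spec:
          parallelism: 1
          completions: 1
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: evaluator
                  image: registry.example.com/evaluator:1.0.0   # placeholder image
```

Setting `backoffLimit: 0` on the inner Jobs hands retry decisions to the JobSet-level failure policy, so a single worker failure restarts the group consistently instead of leaving it half-alive.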
LeaderWorkerSet (LeaderWorkerSet.leaderworkerset.x-k8s.io) targets the specific pattern of LLM training and inference, where one process (the leader) coordinates many worker processes in a tightly coupled group. Rather than expressing this as a headless Service plus a StatefulSet (a common but awkward workaround), LeaderWorkerSet provides a dedicated API with first-class concepts for separate leader and worker pod templates, worker group sizing, and restart policies that preserve group cohesion.
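As a sketch, a LeaderWorkerSet that runs two groups of four pods each (one leader and three workers per group) could look like the following. It assumes the LeaderWorkerSet controller is installed; the name and image are placeholders.

```yaml
# Sketch only: two serving groups, each with one leader pod and three worker pods.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-multihost              # hypothetical name
spec:
  replicas: 2                      # number of leader/worker groups
  leaderWorkerTemplate:
    size: 4                        # pods per group, including the leader
    restartPolicy: RecreateGroupOnPodRestart   # a pod failure recreates its whole group
    leaderTemplate:
      spec:
        containers:
          - name: leader
            image: registry.example.com/inference-server:1.0.0   # placeholder image
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: registry.example.com/inference-server:1.0.0   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

Recreating the whole group on failure reflects the tight coupling: a model sharded across four pods is generally unusable once any one of them is gone.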
Together, JobSet and LeaderWorkerSet reduce the boilerplate required to run distributed AI workloads, moving the community toward standardized, interoperable primitives rather than each MLOps platform reinventing its own coordination layer.
Emerging Hardware: TPUs, AWS Trainium/Inferentia, Intel Gaudi on Kubernetes
NVIDIA GPUs have dominated the first wave of AI on Kubernetes, but the hardware landscape is diversifying rapidly. Each accelerator brings its own Kubernetes integration patterns:
| Hardware | Vendor | Primary Use | Kubernetes Integration |
|---|---|---|---|
| TPU v4/v5 | Google | Large-scale training, inference | GKE TPU node pools, google.com/tpu resource, DRA in development |
| AWS Trainium (Trn1/Trn2) | Amazon | Cost-efficient training | EKS Neuron device plugin, aws.amazon.com/neuron resource |
| AWS Inferentia (Inf2) | Amazon | High-throughput inference | EKS Neuron device plugin, NeuronRT runtime |
| Intel Gaudi 2/3 | Intel | Training, inference (open alternative) | Gaudi device plugin, habana.ai/gaudi resource |
| AMD Instinct (MI300X) | AMD | Training, LLM inference | ROCm device plugin, upstream Device Plugin framework |
The common thread is that each new accelerator requires a vendor-specific device plugin (or DRA driver, the modern equivalent) to expose the hardware to the Kubernetes scheduler. DRA’s attribute-based model is architecturally better suited to this diversity: a workload that can run on either an NVIDIA H100 or an AMD MI300X can express that flexibility as a single ResourceClaim rather than requiring separate YAML manifests for each.
For platform engineers, the practical implication is hardware abstraction: by adopting DRA-based resource claims, AI platforms can support multiple accelerator backends without changing job definitions. The data scientist writes “I need a GPU with 80 GB of memory and NVLink”; the platform matches it to whatever hardware is available, whether that is an H100, an MI300X, or a future accelerator not yet on the market.
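One way this flexibility might be expressed is with a prioritized list of alternatives in a single claim. The sketch below is an assumption-heavy illustration: the `amd-gpu` DeviceClass is hypothetical, and the `firstAvailable` field may require a feature gate depending on the Kubernetes version in use.

```yaml
# Sketch only: accept an NVIDIA device if available, otherwise fall back to AMD.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: any-large-accelerator      # hypothetical name
spec:
  devices:
    requests:
      - name: accelerator
        firstAvailable:            # alternatives are tried in the order listed
          - name: nvidia
            deviceClassName: nvidia-gpu
          - name: amd
            deviceClassName: amd-gpu   # hypothetical DeviceClass for AMD devices
```

The job definition that references this claim stays the same regardless of which vendor's hardware ends up being allocated.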
Key Takeaway: The future of AI on Kubernetes is declarative, hardware-agnostic, and increasingly topology-aware. DRA replaces the blunt-instrument device plugin model with attribute-based scheduling that matches workloads to hardware intelligently. LeaderWorkerSet and JobSet provide first-class APIs for distributed AI job patterns, while expanding hardware diversity — TPUs, Trainium, Gaudi — is pushing the ecosystem toward abstraction layers that make AI platforms portable across accelerator generations.
Chapter Summary
This chapter assembled the production engineering patterns that transform a functional Kubernetes AI cluster into a reliable, scalable platform.
Multi-tenancy requires a deliberate isolation strategy — namespace boundaries for cooperative internal teams, virtual clusters for stronger isolation or CRD diversity — combined with ResourceQuotas, Kueue, and NVIDIA MIG for fair, enforceable resource governance. Self-service tooling and policy enforcement with OPA Gatekeeper or Kyverno turn the platform team from a bottleneck into an enabler.
Multi-cluster and hybrid architectures allow organizations to optimize cost (on-prem steady-state, cloud bursts), satisfy data sovereignty requirements, and eliminate single points of failure. The key engineering investment is the synchronization layer: model registries, data replication pipelines, and federation tooling that make cluster boundaries invisible to data scientists.
Production hardening protects both inference services (PodDisruptionBudgets) and training jobs (checkpoint-and-resume, extended grace periods). Disaster recovery for AI platforms is tiered — cluster state, model artifacts, and pipeline definitions each require different backup strategies and different recovery objectives. Chaos engineering validates that these mechanisms actually work when needed.
The future belongs to DRA — attribute-based, topology-aware hardware scheduling that replaces rigid device counts with declarative capability expressions — alongside dedicated APIs like LeaderWorkerSet and JobSet that make distributed AI workloads first-class Kubernetes citizens. As the hardware landscape diversifies beyond NVIDIA GPUs, the platform patterns established in this chapter provide the foundation for AI infrastructure that can evolve with the technology.
Key Terms
| Term | Definition |
|---|---|
| multi-tenancy | The practice of sharing a Kubernetes cluster (or platform) among multiple teams or organizations, with isolation and resource governance to prevent interference |
| virtual cluster | A fully isolated Kubernetes control plane (API server, etcd, controllers) running as pods inside a host cluster, providing per-tenant CRD isolation |
| PodDisruptionBudget | A Kubernetes policy object that constrains the number of pods that can be voluntarily evicted simultaneously, protecting service availability during node maintenance |
| DRA (Dynamic Resource Allocation) | A Kubernetes API framework (GA in 1.34) that allows workloads to request specialized hardware based on device attributes rather than simple counts, succeeding the Device Plugin model |
| LeaderWorkerSet | A Kubernetes API for managing tightly coupled leader/worker process groups in LLM training and inference, providing first-class lifecycle management for multi-process AI workloads |
| JobSet | A Kubernetes batch API for managing groups of related Job objects as a single logical unit, with shared failure policies and lifecycle management for distributed training |
| hybrid cloud | An architecture that combines on-premises infrastructure with public cloud resources, allowing workloads to be placed based on cost, latency, data governance, or availability requirements |
| burst-to-cloud | A scaling pattern where on-premises clusters handle steady-state workloads and cloud clusters absorb peak demand spikes, with federation or pipeline tooling managing workload placement |
| chaos engineering | The discipline of deliberately injecting controlled failures into a system to discover and remediate weaknesses before they occur in uncontrolled production incidents |
| platform engineering | The practice of treating internal developer infrastructure as a product, building self-service capabilities and golden paths that allow application teams to move faster with less platform team involvement |
| ResourceSlice | A DRA API object published by hardware drivers that advertises available devices on each node, including detailed attributes like memory capacity and architecture version |
| DeviceClass | A DRA API object that defines a category of requestable devices in a cluster, allowing cluster admins to specify what devices workloads can request and under what conditions |
| ResourceClaim | A DRA API object representing a workload’s request for one or more devices, matched against ResourceSlices by the Kubernetes scheduler |
| noisy neighbor | A multi-tenancy problem where one tenant’s resource-intensive workload degrades performance for other tenants sharing the same physical or logical infrastructure |
| Kueue | A Kubernetes SIG project (kubernetes-sigs/kueue) providing Kubernetes-native batch job queueing with ClusterQueue and LocalQueue objects, enabling fair resource sharing, borrowing, and preemption across teams |