Running AI Workloads on Kubernetes: From Infrastructure to Production

A comprehensive intermediate guide to deploying, scaling, and managing machine learning and AI workloads on Kubernetes clusters.

Chapter 1: Introduction to AI on Kubernetes

Learning Objectives


Why Kubernetes for AI/ML

Resource Orchestration Needs of AI Workloads

Training a large language model or running a computer vision pipeline is nothing like serving a web application. A web server might need two CPU cores and a few hundred megabytes of memory; a distributed training job might need dozens of GPUs, terabytes of RAM, and high-bandwidth interconnects — all at the same time, all on the same job. The fundamental question for any AI team is: what infrastructure layer can manage these heterogeneous, massive, and bursty resource demands without requiring a team of specialists to babysit every job?

Kubernetes — the open-source container orchestration system originally released by Google in 2014 — has become the dominant answer to that question. Orchestration here means the automated scheduling, scaling, networking, and lifecycle management of containerized workloads. Think of Kubernetes as an air-traffic control system for your compute cluster: it knows what resources are available on every node (server), what each incoming workload needs, and it lands jobs on the right runway without collisions.

For AI workloads specifically, Kubernetes handles the hard parts of resource management that would otherwise require manual intervention. Through plugins such as the NVIDIA device plugin for Kubernetes, the scheduler gains awareness of GPU hardware on each node, allowing it to place GPU-hungry training jobs only on nodes that actually have GPUs available — and to reserve exactly the number requested, no more, no less. [Source: https://collabnix.com/kubernetes-and-ai-the-ultimate-guide-to-orchestrating-machine-learning-workloads-in-2025/]

Figure 1.1: How Kubernetes schedules a GPU training job

flowchart TD
    A[User submits YAML manifest\nrequesting 4 GPUs] --> B[Kubernetes API Server\nreceives request]
    B --> C[Scheduler queries\nnode resource state]
    C --> D{Node has\n4+ GPUs available?}
    D -- No --> E[Node skipped]
    D -- Yes --> F[Pod bound to\nGPU-equipped node]
    F --> G[NVIDIA Device Plugin\nexposes nvidia.com/gpu resources]
    G --> H[Training job runs\non reserved GPUs]
    E --> C

Worked example: Imagine your team has a cluster of 10 nodes. Three have NVIDIA A100 GPUs; the rest are CPU-only. You submit a PyTorch training job requesting 4 GPUs. Without orchestration, someone has to SSH into the right machines, check GPU availability, and manually launch the job. With Kubernetes and the NVIDIA device plugin, you declare your resource requirements in a YAML manifest, and the scheduler places your pods on GPU-equipped nodes automatically — freeing your team to focus on the model, not the plumbing.
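As a rough illustration of the manifest described in the worked example, the sketch below builds a Pod definition in Python and prints it as JSON, which kubectl accepts alongside YAML. The pod name, image, and entrypoint are hypothetical placeholders, and a single pod asking for four GPUs assumes at least one node has four free.

```python
import json


def gpu_training_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests `gpus` whole NVIDIA GPUs."""
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("gpus must be a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                # Hypothetical entrypoint; replace with your own training script.
                "command": ["python", "train.py"],
                # GPUs are declared under limits using the extended resource name.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = gpu_training_pod(
    name="pytorch-train",
    image="registry.example.com/team/pytorch-train:latest",
    gpus=4,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with kubectl would hand placement to the scheduler; nothing in the manifest names a specific node.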

Beyond GPU placement, Kubernetes enforces resource quotas — hard limits on how much CPU, memory, or GPU a team or namespace can consume. This is critical when multiple data science teams share a cluster: without quotas, one team running a poorly-tuned training job could starve every other team’s workloads. [Source: https://portworx.com/knowledge-hub/kubernetes-ai/]

Portability Across Cloud and On-Premises Environments

AI teams frequently need to move workloads between environments: a data scientist develops a model locally, validates it on a staging cluster, and promotes it to a production cluster running on a different cloud provider — or on bare metal in a corporate data center. Without a common abstraction layer, each environment is a bespoke snowflake requiring custom scripts and tribal knowledge.

Kubernetes solves this through containerization. A container packages an application and all its dependencies (Python version, CUDA libraries, model weights, configuration files) into a portable, self-contained image. Because Kubernetes runs on AWS (EKS), Google Cloud (GKE), Azure (AKS), and on-premises infrastructure using the same API surface, a container image that works in development works in production — without modification.

This consistency is not merely a convenience; it is a safeguard for correctness. AI models are notoriously sensitive to environment variations: a different version of NumPy or a different CUDA toolkit can produce different numerical outputs, making debugging nightmarish. Containers remove most of that class of problem by making the user-space environment a reproducible artifact alongside the code; the host GPU driver and the hardware itself remain outside the image and still need to be kept compatible. [Source: https://www.kubermatic.com/blog/ai-and-machine-learning-integration-into-kubernetes/]

An analogy: think of a container image like a shipping container in global freight logistics. It doesn’t matter whether the container moves on a ship, a train, or a truck — the contents are sealed and consistent. Kubernetes is the port authority that knows how to unload, route, and deliver containers across all those transport modes.

Ecosystem Maturity and Community Momentum

The numbers here are striking. According to the CNCF Annual Survey 2023, more than 96% of organizations are using or evaluating Kubernetes, and 72% use it in production environments. A separate Linux Foundation study on Sovereign AI found that 82% of organizations are building custom AI solutions, and 58% use Kubernetes to support those workloads. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]

This adoption level matters because it creates a self-reinforcing ecosystem. Every major AI framework — PyTorch, TensorFlow, JAX, Hugging Face Transformers — has Kubernetes-native tooling built around it. Platforms like Kubeflow, Ray, and MLflow have all converged on Kubernetes as their deployment substrate. The CNCF launched a Certified Kubernetes AI Conformance Program in November 2025 to formalize and standardize how AI workloads run on conformant clusters. [Source: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/]

The practical consequence: when you invest in Kubernetes skills and tooling for AI, you are investing in a platform with a decade of production hardening, a vast community answering questions on GitHub and Stack Overflow, and a growing set of purpose-built AI extensions. The alternatives, building bespoke cluster managers or relying on HPC schedulers like Slurm, are an increasingly hard sell now that the broader industry has converged. [Source: https://blog.skypilot.co/slurm-vs-k8s/]

| Factor | Kubernetes | Traditional HPC (Slurm) | Cloud-Managed VMs |
| --- | --- | --- | --- |
| GPU scheduling | Native via device plugins | Native, mature | Manual or autoscaling groups |
| Portability | Excellent (cloud + on-prem) | Limited (usually on-prem) | Cloud-vendor specific |
| Ecosystem tooling | Very rich (Kubeflow, KServe, Kueue) | Limited ML-specific tooling | Vendor-specific services |
| Multi-tenancy | Namespaces + quotas | Fair-share queues | Separate accounts/projects |
| Learning curve | Steep | Moderate (for HPC teams) | Low initially, high at scale |
| Manifest verbosity | High (3x+ over Slurm scripts) | Low | Low |

Key Takeaway: Kubernetes provides the resource orchestration, environment consistency, and ecosystem momentum that AI teams need at scale. Its GPU-aware scheduling and resource quotas directly address the cost and fairness challenges of shared accelerator infrastructure. The 96%+ industry adoption rate means skills and tooling investment compounds over time.


AI/ML Workload Characteristics

Batch Training vs Real-Time Inference Patterns

Not all AI workloads look alike, and understanding the distinction between batch training and real-time inference is foundational to everything that follows in this book.

Batch training is the process of adjusting a model’s internal parameters by repeatedly feeding it large datasets and computing gradients. It is a compute marathon: a job starts, runs for hours or days consuming enormous GPU resources, and then finishes. The outcome is a saved model artifact — a file containing the learned weights. Training is not user-facing; a few extra minutes of latency are irrelevant.

Real-time inference (also called online serving) is the opposite. A trained model sits behind an API endpoint, and users or applications send requests expecting predictions in milliseconds. Latency is critical; a recommendation engine that takes two seconds to respond loses user engagement. Unlike training, inference workloads must handle unpredictable traffic spikes — a viral product launch might generate 100x normal traffic in minutes.

The contrast is stark enough to warrant a table:

| Dimension | Batch Training | Real-Time Inference |
| --- | --- | --- |
| Duration | Hours to days | Milliseconds per request |
| GPU utilization | Continuously saturated | Bursty, often underutilized |
| Latency sensitivity | Low | Very high |
| Traffic pattern | Predictable (job-scheduled) | Unpredictable spikes |
| Cost profile | Large upfront GPU cost | Pay-per-request, ongoing |
| Scaling strategy | Scale-up (more GPUs per job) | Scale-out (more replicas) |
| Failure tolerance | Restart job or checkpoint | Must be highly available |

This distinction has direct consequences for how you configure Kubernetes. A training job runs as a Job resource (discussed in Section 4) — it runs to completion and then stops. An inference service runs as a Deployment behind a Service, with autoscaling rules to add replicas under load. Using the wrong resource type for the workload type leads to wasted resources and operational headaches. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]
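To make the Job versus Deployment distinction concrete, here is a minimal sketch that builds both manifests from a shared pod template. Names, images, replica counts, and the retry limit are illustrative assumptions; a real inference service would also need a Service and autoscaling rules, which are omitted here.

```python
import json


def pod_template(app: str, image: str, gpus: int, restart_policy: str) -> dict:
    """Pod template shared by both workload types."""
    return {
        "metadata": {"labels": {"app": app}},
        "spec": {
            "restartPolicy": restart_policy,
            "containers": [{
                "name": app,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def training_job(name: str, image: str, gpus: int) -> dict:
    """Run-to-completion workload: the Job is done once its pod succeeds."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # assumed retry budget before the Job is marked failed
            # Job pod templates must use restartPolicy Never or OnFailure.
            "template": pod_template(name, image, gpus, "Never"),
        },
    }


def inference_deployment(name: str, image: str, gpus: int, replicas: int) -> dict:
    """Always-on workload: the controller keeps `replicas` pods running."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": {"app": name}},
            "template": pod_template(name, image, gpus, "Always"),
        },
    }


# Hypothetical names and images.
job = training_job("train-resnet", "registry.example.com/train:latest", gpus=4)
deploy = inference_deployment("serve-resnet", "registry.example.com/serve:latest",
                              gpus=1, replicas=2)
print(json.dumps(job, indent=2))
print(json.dumps(deploy, indent=2))
```

The two manifests differ mainly in the controller wrapped around the pod template: one stops when the work is finished, the other keeps a fixed number of replicas alive.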

Figure 1.2: Batch training vs real-time inference on Kubernetes

flowchart LR
    subgraph Training ["Batch Training (Job)"]
        T1[Dataset\nin storage] --> T2[Training pods\nwith GPUs]
        T2 --> T3[Model artifact\nsaved to storage]
    end
    subgraph Inference ["Real-Time Inference (Deployment)"]
        I1[User / Application\nrequest] --> I2[Inference pod\nbehind Service]
        I2 --> I3[Prediction\nreturned in ms]
        I2 --> I4[Autoscaler adds\nreplicas on load spike]
    end
    T3 -. model weights .-> I2

GPU and Accelerator Requirements

A GPU (Graphics Processing Unit) was originally designed to render video game graphics by performing thousands of simple mathematical operations in parallel. It turns out this parallel architecture is ideal for the matrix multiplications that dominate neural network training and inference. An accelerator is the broader category: GPUs are the most common, but AI chips from Google (TPUs), AWS (Trainium/Inferentia), and dedicated inference ASICs are all accelerators in this sense.

Kubernetes does not understand GPUs out of the box: unlike CPU cores and RAM, which Kubernetes has tracked since day one, accelerators are not built-in resource types. Instead, hardware vendors publish device plugins that expose their accelerators to Kubernetes as schedulable resources. The NVIDIA device plugin, for example, registers each GPU on a node as one unit of the nvidia.com/gpu resource. A pod can then request nvidia.com/gpu: 2 in its resource spec, and the scheduler places that pod only on a node with at least 2 available NVIDIA GPUs. [Source: https://www.anantacloud.com/post/ai-ml-workloads-on-kubernetes-running-scalable-machine-learning-pipelines-with-gpu-acceleration-and]
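The filtering step can be pictured with a toy model. The sketch below is not the real scheduler; it only shows the idea of comparing a pod's GPU request against each node's advertised and already-allocated GPU count. The Node class and the node names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpu_allocatable: int      # GPUs advertised for this node by the device plugin
    gpu_allocated: int = 0    # GPUs already promised to running pods

    @property
    def gpu_free(self) -> int:
        return self.gpu_allocatable - self.gpu_allocated


def feasible_nodes(nodes: list, gpus_requested: int) -> list:
    """Return the names of nodes with enough unallocated GPUs for the pod."""
    return [n.name for n in nodes if n.gpu_free >= gpus_requested]


# Hypothetical cluster state.
cluster = [
    Node("cpu-1", gpu_allocatable=0),
    Node("gpu-1", gpu_allocatable=4, gpu_allocated=3),
    Node("gpu-2", gpu_allocatable=4, gpu_allocated=1),
    Node("gpu-3", gpu_allocatable=8),
]

print(feasible_nodes(cluster, gpus_requested=2))  # ['gpu-2', 'gpu-3']
```

Here gpu-1 is skipped even though it has GPUs, because only one of its four is still free.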

GPU resources are expensive: a cloud A100 instance can cost $30/hour or more. Declaring accurate GPU counts in your pod spec is therefore not optional. GPUs are extended resources, so you specify them under limits, and any explicit request must equal the limit. Without an accurate declaration, the scheduler cannot bin-pack GPU workloads efficiently, pods can end up competing for the same physical device, and costs balloon. [Source: https://www.wiz.io/academy/ai-ml-kubernetes-best-practices]

Data-Intensive I/O Patterns

Training a large model requires moving enormous volumes of data — often hundreds of gigabytes or terabytes — into GPU memory as efficiently as possible. A GPU that spends more time waiting for data from disk than computing gradients is a very expensive idle resource. This is called the I/O bottleneck, and it is one of the less-obvious challenges of AI on Kubernetes.
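A back-of-the-envelope calculation shows how quickly the storage requirement grows. The figures below (8 GPUs, 2,500 samples per second per GPU, 110 KB per sample, 1.1 GB/s of storage throughput) are assumptions chosen for illustration, not measurements, and the model ignores caching and prefetching.

```python
def required_read_gb_per_s(num_gpus: int, samples_per_s_per_gpu: float,
                           bytes_per_sample: float) -> float:
    """Sustained read throughput needed to keep every GPU fed, in GB/s."""
    return num_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9


def gpu_busy_fraction(storage_gb_per_s: float, required_gb_per_s: float) -> float:
    """Upper bound on the share of time GPUs can compute when storage is the limit."""
    return min(1.0, storage_gb_per_s / required_gb_per_s)


# Assumed workload: 8 GPUs, 2,500 samples/s each, 110 KB average sample size.
needed = required_read_gb_per_s(8, 2_500, 110_000)
print(f"required read throughput: {needed:.2f} GB/s")

# Assumed storage backend delivering 1.1 GB/s.
print(f"GPU busy fraction at 1.1 GB/s: {gpu_busy_fraction(1.1, needed):.0%}")
```

Under these assumptions the job needs about 2.2 GB/s, so a 1.1 GB/s volume would leave the GPUs waiting roughly half the time.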

Traditional microservices typically read modest amounts of configuration or database records. AI training pipelines read entire datasets, often in random-access patterns, requiring high-throughput distributed storage. In a Kubernetes context, this means choosing the right PersistentVolume type — fast NVMe-backed storage for hot data, object storage (S3-compatible) for cold datasets — and ensuring the storage class is available on the nodes where training pods land.

Long-Running and Ephemeral Job Types

Kubernetes was originally built to keep long-running services alive: if a web server pod crashes, Kubernetes restarts it. But AI workloads span both extremes. A multi-day distributed training run is an extreme example of a long-running job that must tolerate node failures gracefully (via checkpointing). A feature engineering step might be ephemeral — a pod spins up, processes a batch of data, writes results to storage, and exits cleanly.

The key insight: AI workloads are fundamentally batch in nature. The infrastructure must support tens to thousands of long-running batch jobs concurrently, not millions of short-lived service requests. [Source: https://connect.redhat.com/hydra/prm/v1/business/companies/bf36e6f9100044ef903614234b0f70ad/linked-resources/72b65d25acf341a8a75ac498a349c2e8/content/public/view] This is a different operational profile from what Kubernetes was originally optimized for, which is why additional tools like Kueue (covered in later chapters) are needed to handle queue management, preemption, and fair-share scheduling across competing teams.

Key Takeaway: AI workloads split into two fundamentally different patterns — batch training (resource-heavy, run-to-completion) and real-time inference (latency-sensitive, always-on) — each requiring different Kubernetes configurations. GPU and accelerator management, data I/O throughput, and long-running job handling are the three technical challenges that most distinguish AI operations from standard microservice deployments.


The AI/ML Lifecycle on Kubernetes

Data Preparation and Feature Engineering

Every AI project begins with data. Raw data — sensor logs, transaction records, text corpora, images — rarely arrives in a form that a machine learning model can directly consume. Data preparation covers ingestion, cleaning, normalization, and transformation. Feature engineering is the more ML-specific step: extracting or constructing the numerical representations (features) that a model actually trains on.

On Kubernetes, this stage typically runs as batch jobs using distributed data-processing frameworks: Apache Spark, Dask, Flink, or Ray. These frameworks distribute data-processing work across many pods, enabling parallelism that reduces hours-long preprocessing to minutes. A typical Kubernetes manifest for this stage would use a Job resource that launches a Spark driver pod coordinating a cluster of executor pods. [Source: https://www.kubeflow.org/docs/started/architecture/]

The output of this stage — cleaned datasets and feature tables — is written to a feature store or an object storage bucket, making it available to the training stage without re-computation.

Model Training and Experimentation

With prepared data in hand, the team enters the model training stage. This is where most of the GPU spend happens. Teams often run many experiments in parallel — varying hyperparameters, architectures, or training data — to find the best-performing model configuration.

Distributed training extends a single training run across multiple pods, each handling a partition of the data or a slice of the model. Frameworks like PyTorch’s DistributedDataParallel and DeepSpeed coordinate gradient synchronization across workers. On Kubernetes, Kubeflow Trainer (formerly the Training Operator) manages the pod lifecycle for these distributed patterns through Kubernetes-native custom resources: the framework-specific PyTorchJob, JAXJob, and XGBoostJob in the Training Operator (v1) API, and a unified TrainJob resource in Trainer v2. [Source: https://www.kubeflow.org/]
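The phrase "each handling a partition of the data" can be sketched in a few lines. This is a simplified strided split, similar in spirit to what data-parallel samplers do; it is not PyTorch's actual sampler, which also shuffles and, by default, pads shards to equal length.

```python
def shard_indices(dataset_size: int, world_size: int, rank: int) -> list:
    """Return the sample indices that worker `rank` processes in one epoch."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided split: worker 0 takes 0, N, 2N, ...; worker 1 takes 1, N+1, ...
    return list(range(rank, dataset_size, world_size))


# Assumed toy setup: 10 samples spread over 4 worker pods.
shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Every sample lands on exactly one worker, so the workers together cover the dataset once per epoch while each sees only its own slice.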

The key scheduling challenge at this stage is gang scheduling: all worker pods for a distributed job must start simultaneously, because a job where 7 of 8 workers are running but waiting for the 8th is consuming 7 GPUs while doing zero useful work. The default Kubernetes scheduler has no concept of gang scheduling, which can cause GPU deadlocks where multiple jobs each hold some GPUs while waiting for others. Kueue, the emerging community standard for batch workload management on Kubernetes, addresses this with atomic job admission. [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/]
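The deadlock is easy to reproduce in a toy model. The sketch below compares pod-by-pod placement with all-or-nothing admission on an assumed 8-GPU cluster and two jobs that each need six single-GPU workers. It illustrates the idea only and is not Kueue's actual algorithm.

```python
def schedule_pod_by_pod(capacity: int, jobs: dict) -> dict:
    """Place worker pods one at a time, interleaved across jobs as they arrive."""
    placed = {name: 0 for name in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free > 0 and placed[name] < needed:
                placed[name] += 1
                free -= 1
                progress = True
    return placed


def schedule_gang(capacity: int, jobs: dict) -> dict:
    """Admit a job only if all of its workers fit at once (all-or-nothing)."""
    placed = {name: 0 for name in jobs}
    free = capacity
    for name, needed in jobs.items():
        if needed <= free:
            placed[name] = needed
            free -= needed
    return placed


def runnable(placed: dict, jobs: dict) -> list:
    """A distributed job makes progress only when every worker is running."""
    return [name for name, needed in jobs.items() if placed[name] == needed]


jobs = {"job-a": 6, "job-b": 6}   # workers needed, one GPU each (assumed)
naive = schedule_pod_by_pod(8, jobs)
gang = schedule_gang(8, jobs)
print(naive, runnable(naive, jobs))  # {'job-a': 4, 'job-b': 4} []
print(gang, runnable(gang, jobs))    # {'job-a': 6, 'job-b': 0} ['job-a']
```

With interleaved placement all eight GPUs are held and no job can start; with all-or-nothing admission one job runs and the other waits without holding anything.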

Model Serving and Inference

Once training produces a satisfactory model artifact, that artifact must be exposed as a service that applications can query. Model serving on Kubernetes typically uses a dedicated inference server — a containerized process that loads model weights, exposes an HTTP or gRPC API, and handles request batching and preprocessing.

KServe is a widely used Kubernetes-native platform for this. It provides a unified InferenceService custom resource that abstracts over multiple serving runtimes (Triton, TorchServe, ONNX Runtime, vLLM), handles autoscaling, and, in its Knative-based serverless mode, can scale to zero when no traffic is present. [Source: https://medium.com/@jushijun/deploying-machine-learning-models-with-kubeflow-and-kserve-a-comprehensive-guide-2e3d1449dc54]

For large language models specifically, NVIDIA has introduced disaggregated inference architectures that separate the prefill stage (processing the input prompt) and the decode stage (generating output tokens) into independent Kubernetes services. Because prefill and decode have fundamentally different compute profiles — prefill is compute-bound, decode is memory-bandwidth-bound — running them as separate pods enables fine-grained resource allocation and better GPU utilization. [Source: https://developer.nvidia.com/blog/deploying-disaggregated-llm-inference-workloads-on-kubernetes/]

Worked example: An e-commerce platform deploys a product recommendation model as a KServe InferenceService. During normal hours the service runs two replicas. On Black Friday, traffic spikes 40x. KServe’s Knative-based autoscaling detects the request queue growing and launches additional replicas within seconds, keeping response latency under 100ms. When traffic subsides overnight, replicas scale back down to two, reducing GPU costs.
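A manifest for a service like the one in this example might look roughly as follows. The service name, storage URI, model format, and replica ceiling are assumptions made for illustration; the field names follow the KServe v1beta1 InferenceService schema.

```python
import json


def recommender_inference_service(min_replicas: int, max_replicas: int) -> dict:
    """Build a KServe InferenceService manifest with autoscaling bounds."""
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "product-recommender"},  # hypothetical name
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,   # floor kept during quiet hours
                "maxReplicas": max_replicas,   # ceiling during traffic spikes
                "model": {
                    "modelFormat": {"name": "pytorch"},  # assumed model format
                    # Hypothetical location of the trained model artifact.
                    "storageUri": "s3://example-bucket/models/recommender",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        },
    }


# Two replicas in normal hours; the ceiling of 20 is an assumed value.
service = recommender_inference_service(min_replicas=2, max_replicas=20)
print(json.dumps(service, indent=2))
```

Setting the minimum to two mirrors the normal-hours baseline above; the maximum caps how far a spike can push GPU spend.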

Monitoring and Retraining Loops

A deployed model is not a static artifact. Model performance degrades over time as the real-world data distribution shifts — a phenomenon called model drift. The final stage of the AI/ML lifecycle is monitoring production model behavior and triggering retraining when performance drops below acceptable thresholds.
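One simple way to quantify a shift in an input feature is the Population Stability Index (PSI), sketched below with the standard library only. PSI is one of several drift metrics and is used here purely as an example; the 0.2 alert threshold is a common rule of thumb, not a value taken from this guide, and the data is synthetic.

```python
import math


def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # guard against a constant baseline

    def fractions(values: list) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp outliers to edge bins
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values: training-time baseline vs. shifted production data.
baseline = [i / 100 for i in range(1000)]
shifted = [v + 3.0 for v in baseline]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold
print("no drift:", psi(baseline, baseline) > DRIFT_THRESHOLD)
print("drift:   ", psi(baseline, shifted) > DRIFT_THRESHOLD)
```

In a retraining loop, a check like this would run on a schedule against recent request data, and a value above the threshold would trigger the training stage again.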

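On Kubernetes, this stage involves collecting serving metrics with tools such as Prometheus and Grafana, checking them for drift, and triggering retraining pipelines when drift is detected (see Figure 1.3).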

Kubeflow Pipelines orchestrates these multi-step workflows as directed acyclic graphs (DAGs) running on Kubernetes, enabling the team to define the entire retrain-evaluate-deploy loop as code. [Source: https://blog.kubeflow.org/fraud-detection-e2e/]

Figure 1.3: AI/ML lifecycle stages on Kubernetes

flowchart TD
    A[Raw Data\nsensor logs, text, images] --> B[Data Preparation\nSpark / Dask / Ray\nKubernetes Job]
    B --> C[Feature Store\nor Object Storage]
    C --> D[Model Training\nPyTorch / DeepSpeed\nKubeflow Trainer Job]
    D --> E[Model Artifact\nsaved weights]
    E --> F[Model Serving\nKServe InferenceService\nautoscaling Deployment]
    F --> G[Production Traffic\nuser predictions]
    G --> H[Monitoring\nPrometheus / Grafana\nDrift Detection]
    H -- drift detected --> D

| Lifecycle Stage | Primary Tools | Kubernetes Resource Type |
| --- | --- | --- |
| Data preparation | Spark, Dask, Ray | Job, StatefulSet |
| Model training | PyTorch, DeepSpeed, JAX | Job (via Kubeflow Trainer) |
| Experiment tracking | MLflow, Weights & Biases | Deployment |
| Model serving | KServe, Triton, vLLM | InferenceService (CRD) |
| Pipeline orchestration | Kubeflow Pipelines | Custom resources |
| Monitoring | Prometheus, Grafana | Deployment, DaemonSet |

Key Takeaway: The AI/ML lifecycle maps cleanly onto Kubernetes: data preparation runs as batch Jobs, distributed training uses Kubeflow Trainer custom resources with gang scheduling, model serving deploys as autoscaling InferenceServices via KServe, and monitoring closes the loop through scheduled retraining pipelines. Each stage has a distinct resource profile and tooling ecosystem.


Kubernetes Architecture Refresher for AI Practitioners

This section is not a full Kubernetes tutorial — that would fill its own book. Instead, it covers the specific concepts you need to understand the rest of this textbook. If you have used Kubernetes before, treat this as a vocabulary alignment.

Control Plane and Worker Node Roles

A Kubernetes cluster consists of two types of machines: control plane nodes and worker nodes.

The control plane is the brain of the cluster. It runs the API server (the single entry point for all Kubernetes operations), the scheduler (which decides where pods run), the controller manager (which reconciles desired vs. actual cluster state), and etcd (a distributed key-value store holding all cluster configuration). For AI workloads, the control plane is rarely where your code runs — it is the management layer that handles placement and lifecycle decisions.
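The reconciliation idea behind the controller manager can be sketched as a loop that compares desired state with actual state and acts on the difference. This is a conceptual toy with invented pod names, not the controller manager's real code.

```python
import itertools


def reconcile(desired: int, running: list, fresh_names) -> list:
    """One pass of a control loop: make the running set match the desired count."""
    pods = list(running)
    while len(pods) < desired:      # too few: create replacements
        pods.append(next(fresh_names))
    while len(pods) > desired:      # too many: remove the surplus
        pods.pop()
    return pods


# Hypothetical pod names, generated on demand.
fresh_names = (f"inference-{i}" for i in itertools.count(1))

pods = reconcile(3, [], fresh_names)
print(pods)   # ['inference-1', 'inference-2', 'inference-3']

pods.remove("inference-2")          # simulate a crashed pod
pods = reconcile(3, pods, fresh_names)
print(pods)   # ['inference-1', 'inference-3', 'inference-4']
```

The loop never records why a pod disappeared; it only observes that the actual count differs from the desired count and corrects it, which is the essence of declarative management.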

Worker nodes are where your actual workloads run. Each worker node runs a kubelet — an agent that communicates with the control plane and manages the containers on that node — and a container runtime (typically containerd). For AI, worker nodes are the GPU-equipped machines where training pods and inference pods are scheduled.

An analogy: think of the control plane as the corporate headquarters of a logistics company and worker nodes as the individual warehouses and delivery facilities. Headquarters receives orders, assigns work to facilities, and monitors status — but the actual packages move through the facilities, not headquarters.

Figure 1.4: Kubernetes cluster architecture for AI workloads

flowchart LR
    subgraph ControlPlane ["Control Plane"]
        CP1[API Server\nentry point for all operations]
        CP2[Scheduler\ndecides pod placement]
        CP3[Controller Manager\nreconciles desired state]
        CP4[etcd\ncluster configuration store]
    end
    subgraph Workers ["Worker Nodes"]
        subgraph WN1 ["GPU Node 1"]
            K1[kubelet] --> C1[Training Pod\nnvidia.com/gpu: 4]
        end
        subgraph WN2 ["GPU Node 2"]
            K2[kubelet] --> C2[Inference Pod\nnvidia.com/gpu: 1]
        end
        subgraph WN3 ["CPU Node"]
            K3[kubelet] --> C3[Preprocessing Pod\nCPU only]
        end
    end
    CP1 --> CP2
    CP2 --> K1
    CP2 --> K2
    CP2 --> K3
    K1 --> CP1
    K2 --> CP1
    K3 --> CP1

Pods, Deployments, Jobs, and CronJobs

Pod: the smallest schedulable unit in Kubernetes. A pod wraps one or more containers that share a network namespace and storage volumes. Every AI workload (a training process, an inference server, a preprocessing script) ultimately runs inside a pod. Pods are ephemeral: the kubelet can restart a crashed container in place according to the pod’s restartPolicy, but a pod that is deleted, evicted, or lost with its node is not recreated unless a higher-level resource manages it.

Deployment: a controller that manages a set of identical, long-running pods and ensures the desired number of replicas is always running. Deployments are the right resource for stateless inference servers that should always be available. If an inference pod crashes, the Deployment controller immediately schedules a replacement.

Job: a controller that runs one or more pods to completion. When all pods complete successfully, the Job is considered done. This is the natural fit for batch training runs, preprocessing pipelines, and evaluation scripts. Unlike a Deployment, a completed Job is not restarted — it sits as a record of what ran.

CronJob: a Job that runs on a schedule, expressed in cron syntax. CronJobs are useful for periodic tasks in the AI/ML lifecycle: nightly feature recomputation, weekly model retraining triggers, or hourly data ingestion from upstream sources.
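As an illustration of the weekly retraining case, the sketch below builds a CronJob manifest. The name, image, and schedule are assumptions; the concurrencyPolicy setting is one reasonable choice for training jobs rather than a requirement.

```python
import json


def weekly_retraining_cronjob(name: str, image: str) -> dict:
    """Build a CronJob that launches a retraining Job once a week."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            # Cron fields: minute hour day-of-month month day-of-week.
            "schedule": "0 2 * * 0",        # 02:00 every Sunday
            # Skip a new run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [{"name": "retrain", "image": image}],
                        }
                    }
                }
            },
        },
    }


# Hypothetical name and image.
cronjob = weekly_retraining_cronjob(
    "weekly-retrain", "registry.example.com/team/retrain:latest"
)
print(json.dumps(cronjob, indent=2))
```

Each scheduled run creates an ordinary Job from jobTemplate, so everything said about Jobs above applies to the runs a CronJob produces.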

| Resource | Use Case in AI | Restart Behavior |
| --- | --- | --- |
| Pod (bare) | Rarely used directly | Not recreated once deleted or evicted |
| Deployment | Inference serving, MLflow server | Restarted on failure |
| Job | Training run, preprocessing, evaluation | Runs to completion |
| CronJob | Scheduled retraining, data ingestion | Runs on schedule |
| StatefulSet | Distributed databases, feature stores | Restarted with stable identity |

Namespaces and Resource Quotas

A namespace is a virtual partition within a Kubernetes cluster. Resources (pods, services, secrets, configmaps) in one namespace are logically isolated from resources in another. In multi-team AI environments, namespaces are typically assigned per team or per project: the nlp-team namespace, the cv-team namespace, the staging namespace.

Resource quotas are constraints applied to a namespace that cap its total resource consumption. A quota might specify that the nlp-team namespace can use at most 32 GPU units, 256 GiB of RAM, and 128 CPU cores across all pods. When a team tries to submit a job that would exceed its quota, Kubernetes rejects the request with a clear error — protecting other teams from resource starvation.
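The quota described above could be expressed roughly like this. The object name is a placeholder; the namespace and the numeric limits come from the example in the text. Extended resources such as GPUs are capped through the requests. prefix.

```python
import json


def team_quota(namespace: str, gpus: int, memory_gib: int, cpu_cores: int) -> dict:
    """Build a ResourceQuota capping a namespace's total resource requests."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.nvidia.com/gpu": str(gpus),
                "requests.memory": f"{memory_gib}Gi",
                "requests.cpu": str(cpu_cores),
            }
        },
    }


quota = team_quota("nlp-team", gpus=32, memory_gib=256, cpu_cores=128)
print(json.dumps(quota, indent=2))
```

Once applied, the quota counts the requests of every pod in the namespace, so the cap holds regardless of which workload type created the pods.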

Resource requests and limits operate at the pod level. A request is the amount of a resource the scheduler reserves for the pod when placing it; a limit is the maximum the pod is allowed to consume. For CPU, exceeding the limit causes throttling. For memory, exceeding the limit causes the pod to be killed (OOMKilled). For GPUs, Kubernetes does not currently support fractional GPU allocation natively — a pod gets whole GPUs.
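A small helper can make the request/limit rules explicit. The function below is a hypothetical convenience wrapper, not part of any Kubernetes client library; it builds the resources stanza for one container and refuses fractional GPU counts.

```python
def container_resources(cpu_request: int, cpu_limit: int,
                        mem_request_gib: int, mem_limit_gib: int,
                        gpus: int) -> dict:
    """Build a container `resources` stanza with basic sanity checks."""
    if not isinstance(gpus, int) or gpus < 0:
        raise ValueError("GPUs are allocated in whole units")
    if cpu_request > cpu_limit or mem_request_gib > mem_limit_gib:
        raise ValueError("a request cannot exceed its limit")
    resources = {
        # What the scheduler reserves when placing the pod.
        "requests": {"cpu": str(cpu_request), "memory": f"{mem_request_gib}Gi"},
        # The ceiling: CPU is throttled above it, memory overuse is OOMKilled.
        "limits": {"cpu": str(cpu_limit), "memory": f"{mem_limit_gib}Gi"},
    }
    if gpus:
        # For GPUs the request and the limit must be the same whole number.
        resources["requests"]["nvidia.com/gpu"] = str(gpus)
        resources["limits"]["nvidia.com/gpu"] = str(gpus)
    return resources


# Assumed sizing for a single-GPU training container.
print(container_resources(4, 8, 16, 32, gpus=1))
```

CPU and memory may have a gap between request and limit, while the GPU count is the same on both sides.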

Worked example: A research team submits a hyperparameter sweep that accidentally requests 100 GPU pods simultaneously. Without namespace quotas, this would saturate the entire cluster. With a quota of 16 GPUs on the research namespace, Kubernetes admits only as many pods as fit within the quota and rejects creation of the rest; the controllers that own those pods (such as Jobs) retry as earlier pods finish and quota frees up. The production inference service in a separate namespace is unaffected.
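The arithmetic of this example can be checked with a toy admission function. It models only the quota comparison (running total versus cap) and assumes each sweep pod requests one GPU; it does not model retries or pod completion.

```python
def admit_under_quota(pod_gpu_requests: list, quota_gpus: int) -> tuple:
    """Split pods into those admitted under the quota and those turned away."""
    used = 0
    admitted, not_admitted = [], []
    for index, gpus in enumerate(pod_gpu_requests):
        if used + gpus <= quota_gpus:
            used += gpus
            admitted.append(index)
        else:
            not_admitted.append(index)
    return admitted, not_admitted


sweep = [1] * 100   # 100 pods, one GPU each (assumed)
admitted, not_admitted = admit_under_quota(sweep, quota_gpus=16)
print(len(admitted), "admitted;", len(not_admitted), "not admitted")
```

Only 16 of the 100 pods get through, so at most 16 GPUs are ever claimed by the sweep, whatever the size of the cluster.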

Key Takeaway: Understanding the control plane / worker node split, the four core workload resource types (Pod, Deployment, Job, CronJob), and the namespace + quota system gives you the vocabulary to read and write Kubernetes configurations for AI throughout the rest of this book. These abstractions map directly to the stages of the AI/ML lifecycle: Jobs for training, Deployments for serving, CronJobs for periodic retraining.


Chapter Summary

Kubernetes has emerged as the dominant substrate for AI/ML workloads not by accident, but because its core primitives — containerized execution, declarative scheduling, resource quotas, and a rich extension API — address the exact challenges that AI teams face at scale. The 96%+ industry adoption rate [Source: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/] and the “great migration” of every major AI platform onto Kubernetes [Source: https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/] reflect a genuine technical fit: Kubernetes gives AI teams portability across clouds, reproducible environments via containers, GPU-aware scheduling via device plugins, and multi-tenant fairness via namespaces and quotas.

AI workloads are not just web services with bigger machines. They are long-running batch jobs with extreme GPU requirements, data-intensive I/O patterns, and a dual nature — expensive batch training followed by latency-sensitive always-on inference — that demands different scheduling and scaling strategies for each phase. The default Kubernetes scheduler was not built for gang scheduling or fair-share GPU queuing; extensions like Kueue, Kubeflow Trainer, and KServe fill these gaps, and later chapters in this book will cover each in depth.

The AI/ML lifecycle — data preparation, model training, model serving, and monitoring with retraining loops — maps cleanly onto Kubernetes resource types and tooling. Data pipelines run as Jobs, training runs use Kubeflow Trainer custom resources, inference services deploy via KServe, and orchestration pipelines stitch the stages together via Kubeflow Pipelines. By the end of this book, you will be able to design and operate each of these stages on a production Kubernetes cluster.


Key Terms

| Term | Definition |
| --- | --- |
| AI/ML lifecycle | The end-to-end sequence of stages for building and operating a machine learning model: data preparation, model training and experimentation, model serving, and monitoring with retraining loops. |
| Orchestration | The automated scheduling, scaling, networking, and lifecycle management of containerized workloads across a cluster of machines. In Kubernetes, the control plane performs orchestration. |
| Accelerator | A hardware component designed to perform the parallel matrix computations required by AI workloads more efficiently than a general-purpose CPU. GPUs are the most common accelerators; TPUs and inference ASICs are also in this category. |
| Batch training | The process of training a machine learning model by iterating over large datasets and updating model parameters via gradient descent. Batch training runs to completion (hours to days), is GPU-saturated, and is not user-facing. |
| Real-time inference | The deployment of a trained model as a live API endpoint that processes individual prediction requests with low latency (typically milliseconds). Also called online serving. |
| Resource quota | A Kubernetes mechanism that caps the total amount of CPU, memory, GPU, or other resources that all pods within a namespace can collectively consume, enforcing fairness among teams sharing a cluster. |

Chapter 2: GPU and Accelerator Management

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Running AI workloads efficiently on Kubernetes requires more than just spinning up pods — it demands precise management of the most expensive and specialized hardware in your cluster: GPUs. A single NVIDIA H100 can cost tens of thousands of dollars, and a poorly configured cluster can leave that hardware sitting at 13% utilization while your cloud bill climbs. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]
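The cost impact of low utilization is simple to quantify. The sketch below divides an hourly price by the fraction of time the GPU does useful work; the $4.00 per GPU-hour price is a hypothetical figure, while the 13% and 80% utilization levels are the ones discussed in this chapter.

```python
def cost_per_useful_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one hour of actual GPU work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


PRICE = 4.00  # hypothetical USD per GPU-hour

for utilization in (0.13, 0.80):
    cost = cost_per_useful_gpu_hour(PRICE, utilization)
    print(f"{utilization:.0%} utilization -> ${cost:.2f} per useful GPU-hour")
```

At the assumed price, 13% utilization works out to about $30.77 per useful GPU-hour, versus $5.00 at 80%, a roughly sixfold difference for the same hardware.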

Think of GPU management in Kubernetes like managing a fleet of high-performance racing cars in a taxi company. The cars are far more capable than ordinary vehicles, they require specialized mechanics and fuel, and they are far too expensive to leave parked. Your job as the Kubernetes operator is to make sure those cars are constantly running the right passengers (workloads) to the right destinations — with the right driver qualifications (hardware compatibility) and the right scheduling policies so no car sits idle while others are overloaded.

This chapter walks through the full stack of GPU management in Kubernetes: from understanding the hardware landscape and device plugin architecture, through deploying the NVIDIA GPU Operator, to implementing advanced GPU sharing strategies that can push cluster utilization well past 80%.


Section 1: GPU Fundamentals for Kubernetes

The GPU Landscape for AI

Graphics Processing Units were originally designed for rendering pixels in parallel. That same parallel architecture, thousands of small compute cores working simultaneously, turns out to be exactly what neural network training and inference require. Today, three major GPU vendors compete in the Kubernetes AI workload space:

| Vendor | Key Products for AI | Kubernetes Support | Notes |
| --- | --- | --- | --- |
| NVIDIA | A100, H100, L40, RTX 40-series | Mature, via device plugin + GPU Operator | Dominant ecosystem; CUDA runtime widely supported |
| AMD | Instinct MI300, RX 7900 XTX | Stable, via ROCm device plugin | Growing adoption; ROCm is the CUDA equivalent |
| Intel | Gaudi 2/3, Arc GPUs | Emerging, via Intel Device Plugin for Kubernetes | Strong in data center inference; OpenVINO runtime |

NVIDIA holds the dominant position in Kubernetes AI workloads, primarily because the CUDA (Compute Unified Device Architecture) ecosystem — the programming model and library stack for GPU-accelerated computing — has been the default target for almost every major AI framework (PyTorch, TensorFlow, JAX). This chapter focuses primarily on NVIDIA, but the architectural patterns (device plugins, resource requests, sharing strategies) apply across all vendors. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]

Device Plugin Architecture and Discovery

Kubernetes was designed around a generic resource model. Out of the box, the scheduler knows about CPU cores and memory. GPUs are not built-in concepts — they are extended resources: arbitrary named quantities that a node advertises and a pod can request. The mechanism that bridges physical GPU hardware and the Kubernetes resource model is the device plugin framework.

A device plugin is a small program that runs as a DaemonSet on each node where special hardware is present. It communicates with the kubelet (the per-node Kubernetes agent) over a Unix socket using gRPC, and it performs three core functions:

  1. Discovery: It detects available devices on the node — for NVIDIA GPUs, this is done via the NVML (NVIDIA Management Library).
  2. Advertisement: It registers those devices as extended resources (e.g., nvidia.com/gpu: 4 for a four-GPU node) so the Kubernetes API server and scheduler are aware of them.
  3. Allocation: When a pod is scheduled to the node and requests a GPU, the device plugin handles the actual device injection into the container — mounting the right device files and configuring the environment.

The NVIDIA device plugin serves its gRPC API on a Unix socket under /var/lib/kubelet/device-plugins/ (for example nvidia.sock; the exact file name can differ between plugin versions) and exposes GPUs under the nvidia.com/gpu resource name. [Source: https://medium.com/@rifewang/overview-of-kubernetes-gpu-scheduling-device-plugin-cdi-nfd-and-gpu-operator-48a7c4213b28]
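
The three functions above can be illustrated with a toy model. This is a minimal sketch, not the real device plugin gRPC API: the class name ToyDevicePlugin and its methods are hypothetical, and in a real cluster the kubelet chooses the device IDs and passes them to the plugin, whereas the toy folds both steps into one method.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevicePlugin:
    """Toy model of a device plugin (hypothetical; not the real gRPC API)."""
    resource_name: str
    device_ids: list[str]  # stands in for the result of discovery
    allocated: set[str] = field(default_factory=set)

    def advertise(self) -> dict[str, int]:
        # Advertisement: report the device count under the extended resource name.
        return {self.resource_name: len(self.device_ids)}

    def allocate(self, count: int) -> dict:
        # Allocation: pick free devices and describe how to expose them.
        free = [d for d in self.device_ids if d not in self.allocated]
        if count > len(free):
            raise RuntimeError(f"requested {count}, only {len(free)} free")
        chosen = free[:count]
        self.allocated.update(chosen)
        return {
            "device_paths": [f"/dev/nvidia{d}" for d in chosen],
            "env": {"NVIDIA_VISIBLE_DEVICES": ",".join(chosen)},
        }


plugin = ToyDevicePlugin("nvidia.com/gpu", ["0", "1", "2", "3"])
print(plugin.advertise())
print(plugin.allocate(1))
```

The point of the sketch is that the scheduler only ever sees the integer from advertise(); device paths and environment variables stay a node-local detail.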

An emerging complement to the device plugin framework is the Container Device Interface (CDI), a newer specification for injecting devices into containers in a more portable and robust manner. CDI decouples device injection from the container runtime, which simplifies support across different runtimes (containerd, CRI-O). [Source: https://medium.com/@rifewang/overview-of-kubernetes-gpu-scheduling-device-plugin-cdi-nfd-and-gpu-operator-48a7c4213b28]

Analogy: Think of the device plugin like a hotel concierge for special equipment. The hotel (Kubernetes) doesn’t know by default which rooms have hot tubs. The concierge (device plugin) surveys the property, reports to the front desk (kubelet/API server) which rooms have hot tubs, and when a guest requests a hot-tub room, the concierge hands them the correct key and ensures the hot tub is ready.

Figure 2.1: Device Plugin Architecture — Discovery, Advertisement, and Allocation

flowchart LR
    subgraph Node["GPU Worker Node"]
        HW["Physical GPU Hardware\n(e.g., NVIDIA A100)"]
        NVML["NVML\n(NVIDIA Management Library)"]
        DP["Device Plugin\n(DaemonSet)"]
        K["kubelet"]
        CT["NVIDIA Container Toolkit"]
        C["Container\n(AI Workload)"]
    end

    subgraph Cluster["Kubernetes Control Plane"]
        API["API Server"]
        SCH["Scheduler"]
    end

    HW -- "exposes hardware state" --> NVML
    NVML -- "1. Discovery:\ndetect GPUs" --> DP
    DP -- "2. Advertisement:\nnvidia.com/gpu extended resource\n(gRPC over Unix socket)" --> K
    K -- "registers capacity" --> API
    SCH -- "reads available capacity" --> API
    DP -- "3. Allocation:\nassign device paths +\nenv vars" --> K
    K -- "inject /dev/nvidia0" --> CT
    CT -- "mounts GPU devices +\nCUDA libraries at runtime" --> C

Container Runtime GPU Passthrough: NVIDIA Container Toolkit

Even with a device plugin advertising GPU resources, the container itself still needs to be able to use those GPUs. Running a GPU-accelerated workload inside a container requires that the NVIDIA drivers, CUDA libraries, and device files from the host be accessible inside the container namespace.

The NVIDIA Container Toolkit (formerly nvidia-docker2) is the component that makes this passthrough transparent. It hooks into the container runtime (containerd or CRI-O) and, at container launch time, injects the correct GPU devices and driver libraries into the container without requiring those libraries to be baked into every container image. This means an AI workload container only needs to include the CUDA toolkit at the application layer — the driver itself lives on the host and is injected at runtime. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]

Worked Example — What happens when a GPU pod starts:

  1. Pod spec requests nvidia.com/gpu: 1.
  2. Kubernetes scheduler finds a node with available nvidia.com/gpu capacity.
  3. kubelet calls the NVIDIA device plugin to allocate one GPU.
  4. Device plugin returns device paths (e.g., /dev/nvidia0) and environment variables.
  5. NVIDIA Container Toolkit intercepts the container start, mounts /dev/nvidia0, and injects the host’s driver libraries (including the CUDA driver library).
  6. The container starts with full GPU access.
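
A quick way to confirm the outcome of this sequence from inside a container is to look for the injected device files. The following is a minimal sketch: the helper name visible_gpu_devices is hypothetical, and it assumes the device files follow the /dev/nvidiaN naming used in the steps above and that the toolkit sets the NVIDIA_VISIBLE_DEVICES environment variable.

```python
import os
import re
from pathlib import Path


def visible_gpu_devices(dev_dir: str = "/dev") -> list[str]:
    """Return GPU device file names such as 'nvidia0' found in dev_dir, sorted by name."""
    root = Path(dev_dir)
    if not root.is_dir():
        return []
    # Match only per-GPU nodes (nvidia0, nvidia1, ...), not control files like nvidiactl.
    return sorted(p.name for p in root.iterdir() if re.fullmatch(r"nvidia\d+", p.name))


if __name__ == "__main__":
    print("device files:", visible_gpu_devices())
    print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
```

An empty list in a pod that requested a GPU suggests the allocation or injection step did not happen, which narrows debugging to the device plugin or the container runtime configuration.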

Figure 2.2: GPU Pod Startup Sequence

sequenceDiagram
    participant User as User / Job Controller
    participant SCH as Kubernetes Scheduler
    participant K as kubelet
    participant DP as NVIDIA Device Plugin
    participant CT as NVIDIA Container Toolkit
    participant C as Container

    User->>SCH: Submit pod spec (nvidia.com/gpu: 1)
    SCH->>SCH: Find node with available nvidia.com/gpu capacity
    SCH->>K: Bind pod to node
    K->>DP: Allocate 1 GPU (gRPC call)
    DP-->>K: Return device paths (/dev/nvidia0) + env vars
    K->>CT: Start container with device allocation
    CT->>CT: Mount /dev/nvidia0 + inject CUDA libraries
    CT-->>C: Container starts with full GPU access

Key Takeaway: GPUs enter Kubernetes as extended resources via the device plugin framework. The NVIDIA device plugin runs as a DaemonSet, discovers GPUs with NVML, advertises them as nvidia.com/gpu resources, and the Container Toolkit handles runtime injection of GPU drivers and device files into containers.


Section 2: NVIDIA GPU Operator

Managing GPU nodes manually — installing drivers, configuring the container toolkit, deploying the device plugin, setting up monitoring — is a fragile, error-prone process that breaks whenever a node is replaced, an OS is upgraded, or a CUDA version changes. The NVIDIA GPU Operator solves this by applying the Kubernetes Operator pattern to the entire GPU software stack.

A Kubernetes Operator is a controller that encodes operational knowledge about a specific piece of software into a custom controller loop. Rather than a human following a runbook, the Operator continuously watches cluster state and reconciles it toward the desired configuration. The GPU Operator applies this model to the full lifecycle of GPU software on every node. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html]
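
The reconcile idea can be shown in a few lines. This is an illustrative sketch only, not the GPU Operator's implementation: the function reconcile and the component names are hypothetical stand-ins for the software stack described in this section.

```python
def reconcile(desired: set[str], running: set[str]) -> dict[str, list[str]]:
    """Compute the actions that move a node from `running` toward `desired`."""
    return {
        "deploy": sorted(desired - running),  # components that are missing
        "remove": sorted(running - desired),  # components no longer wanted
    }


# Hypothetical component names, loosely modelled on the GPU software stack.
desired = {"driver", "container-toolkit", "device-plugin", "dcgm-exporter"}
running = {"driver", "device-plugin"}

actions = reconcile(desired, running)
print(actions)

# After applying the actions, a second reconcile pass finds nothing left to do.
running = (running | set(actions["deploy"])) - set(actions["remove"])
print(reconcile(desired, running))
```

A real controller runs this comparison continuously, which is why a replaced node or an upgraded OS converges back to the desired state without a runbook.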

What the GPU Operator Manages

The GPU Operator provisions and manages the following components automatically on every GPU-enabled node: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]

| Component | Purpose |
|---|---|
| NVIDIA Drivers | Kernel module that enables CUDA on the node; the Operator installs these as a container |
| NVIDIA Container Toolkit | Hooks the container runtime to inject GPU access into pods |
| Kubernetes Device Plugin | Advertises nvidia.com/gpu extended resources to the scheduler |
| GPU Feature Discovery (GFD) | Labels nodes with detailed GPU metadata (model, memory, CUDA capabilities) |
| DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory) for Prometheus |
| MIG Manager | Configures MIG partitioning on supported hardware |
| CUDA Validator | Runs a validation workload to confirm CUDA is functional before scheduling begins |

Analogy: Without the GPU Operator, managing GPU nodes is like individually maintaining a fleet of specialized vehicles by hand — oil changes, tyre rotations, and software updates all done manually per vehicle. The GPU Operator is the automated maintenance depot that keeps every vehicle in the fleet correctly configured at all times, and refuses to send a vehicle on a job until it passes inspection.

Operator Installation and Configuration

The GPU Operator is installed via Helm. The example below assumes kubectl and helm are already configured against a cluster with at least one NVIDIA GPU node. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html]

Worked Example — Installing the GPU Operator:

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the namespace and label it for privileged pod security
kubectl create namespace gpu-operator
kubectl label --overwrite namespace gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --wait

After installation, the Operator’s controller watches for nodes with NVIDIA GPUs and begins the automated provisioning workflow. [Source: https://www.infracloud.io/blogs/guide-to-nvidia-gpu-operator-in-kubernetes/]

Automatic Driver Management and CUDA Toolkit Provisioning

One of the most operationally valuable features of the GPU Operator is containerized driver management. Instead of installing NVIDIA drivers directly onto the host OS (which creates version lock-in and complicates node upgrades), the Operator deploys drivers as a privileged container. This driver container mounts the necessary kernel interfaces and loads the NVIDIA kernel module, but the driver files themselves live inside the container image.

The automated provisioning workflow for each new GPU node follows three steps: [Source: https://sagar-parmar.medium.com/nvidia-gpu-operator-explained-simplifying-gpu-workloads-on-kubernetes-436e0a60d0ac]

  1. Discovery: The Operator identifies nodes with NVIDIA GPUs using node labels and hardware detection.
  2. Installation and Configuration: It deploys the driver container, configures the Container Toolkit in the runtime, and starts the device plugin and monitoring stack.
  3. Validation: The CUDA Validator runs a test workload to confirm the full stack is functional. Only after validation passes does the node become eligible to receive GPU workloads.

This validation gate is significant: it prevents misconfigured nodes from silently accepting GPU pods that would then fail at runtime.

Figure 2.3: GPU Operator Automated Node Provisioning Workflow

flowchart TD
    A["New GPU Node Joins Cluster"] --> B["GPU Operator Controller\nDetects Node via Labels"]
    B --> C["Step 1: Discovery\nIdentify NVIDIA GPU hardware\nvia node labels + hardware probes"]
    C --> D["Step 2: Installation and Configuration\nDeploy driver container\nConfigure Container Toolkit in runtime\nStart device plugin + DCGM monitoring"]
    D --> E["Step 3: Validation\nCUDA Validator runs test workload\nConfirms full software stack is functional"]
    E --> F{Validation Passed?}
    F -- "Yes" --> G["Node Marked Ready\nAccepts GPU workloads\nnvidia.com/gpu resources advertised"]
    F -- "No" --> H["Node Remains Unschedulable\nOperator logs error\nRetries provisioning"]
    H --> C

Node Labeling and GPU Feature Discovery

GPU Feature Discovery (GFD) is a component deployed by the GPU Operator that automatically labels Kubernetes nodes with rich GPU metadata. These labels enable fine-grained scheduling decisions.

Example labels applied by GFD to a node with an A100 GPU:

nvidia.com/gpu.present=true
nvidia.com/gpu.product=A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.count=8
nvidia.com/cuda.driver.major=525
nvidia.com/mig.capable=true

[Source: https://www.spectrocloud.com/blog/the-real-world-guide-to-the-nvidia-gpu-operator-for-kubernetes-ai]

With these labels, a pod that requires an 80 GB A100 with MIG support can express that as a node affinity rule rather than relying on the cluster operator to manually maintain node pools.

Key Takeaway: The GPU Operator automates the entire GPU software lifecycle on Kubernetes nodes — drivers, runtime configuration, device plugin deployment, and monitoring — using a containerized, validated approach. GFD labels provide rich node metadata that enables intelligent scheduling decisions without manual node management.


Section 3: GPU Scheduling and Resource Requests

Requesting GPUs in Pod Specs

Requesting a GPU in a Kubernetes pod spec follows the same resources pattern as CPU and memory, with a few important constraints: GPUs are specified under resources.limits (Kubernetes uses the limit as the request when no request is given), any explicit request must equal the limit, and GPUs are not divisible by default, so you request whole numbers.
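
These constraints are easy to check before a manifest reaches the cluster. The following is a minimal sketch of such a pre-flight check; the function validate_gpu_resources is hypothetical and operates on a plain dictionary shaped like a container's resources block, not on a Kubernetes client object.

```python
def validate_gpu_resources(resources: dict, name: str = "nvidia.com/gpu") -> list[str]:
    """Return a list of problems with a container's GPU requests/limits (empty if fine)."""
    errors = []
    limit = resources.get("limits", {}).get(name)
    request = resources.get("requests", {}).get(name)

    # GPUs are whole units: reject fractions, negatives, and non-integers.
    for label, value in (("limit", limit), ("request", request)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{label} must be a non-negative whole number, got {value!r}")

    # A request is only valid alongside an equal limit.
    if request is not None and limit is None:
        errors.append("a GPU request needs a matching limit")
    elif request is not None and request != limit:
        errors.append(f"request ({request}) and limit ({limit}) must be equal")
    return errors


print(validate_gpu_resources({"limits": {"nvidia.com/gpu": 1}}))
print(validate_gpu_resources({"requests": {"nvidia.com/gpu": 1}}))
```

A limits-only block, as used in the pod spec that follows, passes this check.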

Worked Example — Basic GPU pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

The tolerations section is important: GPU nodes are typically tainted with nvidia.com/gpu:NoSchedule to prevent non-GPU workloads from being scheduled onto expensive GPU nodes. A pod must explicitly tolerate this taint to be eligible for placement. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]

For multi-GPU workloads (e.g., distributed training across 8 GPUs), simply increase the request:

resources:
  limits:
    nvidia.com/gpu: 8

This requests all 8 GPUs on a single node. For workloads that span multiple nodes, you would use a framework like PyTorch’s DistributedDataParallel combined with a Kubernetes Job controller, which is covered in Chapter 4.

Node Affinity and GPU Topology Awareness

When a cluster contains multiple GPU types — for example, A100 nodes for training and T4 nodes for inference — raw GPU count requests are insufficient. A pod requesting nvidia.com/gpu: 1 might land on either node type, leading to performance inconsistency or cost inefficiency.

Node affinity solves this by allowing pods to specify hardware requirements as label match expressions: [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - A100-SXM4-80GB

This pod will only schedule on nodes with an 80 GB A100, regardless of what other GPU types are available in the cluster.
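
To make the matching rule concrete, here is a simplified sketch of how required node affinity terms are evaluated against node labels. The function names are hypothetical, only matchExpressions with the In, NotIn, Exists, and DoesNotExist operators are handled, and the real scheduler supports more (for example matchFields, Gt, and Lt).

```python
def matches_expression(labels: dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def node_matches(labels: dict[str, str], node_selector_terms: list[dict]) -> bool:
    """Terms are ORed together; expressions inside one term are ANDed."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        # An empty term matches nothing.
        if exprs and all(matches_expression(labels, e) for e in exprs):
            return True
    return False


terms = [{"matchExpressions": [
    {"key": "nvidia.com/gpu.product", "operator": "In", "values": ["A100-SXM4-80GB"]},
]}]
print(node_matches({"nvidia.com/gpu.product": "A100-SXM4-80GB"}, terms))
print(node_matches({"nvidia.com/gpu.product": "Tesla-T4"}, terms))
```

Because the label value must match exactly, it is worth checking the actual labels on your nodes before writing the affinity rule.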

Topology-Aware Scheduling addresses a more subtle problem in multi-GPU training: not all GPUs on a node have equal bandwidth to each other. On high-end NVIDIA hardware, GPUs are connected via NVLink, a proprietary high-speed interconnect that provides up to 600 GB/s of bandwidth on the A100 and up to 900 GB/s on the H100. GPUs that communicate over PCIe 4.0 x16 instead are limited to roughly 32 GB/s, nearly a 20x difference relative to A100 NVLink. [Source: https://oneuptime.com/blog/post/2026-01-19-kubernetes-gpu-workload-scheduling/view]

For a large language model training run that moves gradient tensors between GPUs thousands of times per second, this bandwidth difference can dominate training time. Topology-aware scheduling ensures that multi-GPU pods are placed on nodes (and across nodes) where the GPUs are interconnected via the fastest available path.

Comparison: GPU Interconnect Bandwidth

| Interconnect | Bandwidth | Typical Use Case |
|---|---|---|
| NVLink (4th gen, H100) | 900 GB/s | Multi-GPU LLM training on a single node |
| NVLink (3rd gen, A100) | 600 GB/s | Distributed training, large model sharding |
| PCIe 4.0 x16 | 32 GB/s | Inference, single-GPU training |
| InfiniBand HDR (inter-node) | 200 Gb/s (≈25 GB/s) | Multi-node distributed training |

[Source: https://oneuptime.com/blog/post/2026-01-19-kubernetes-gpu-workload-scheduling/view]
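
The practical effect of these figures is easiest to see as transfer time. The sketch below is a back-of-the-envelope calculation only: it assumes the peak bandwidths from the table are fully achieved, ignores latency and protocol overhead, and uses a hypothetical 14 GB payload (roughly the gradients of a 7-billion-parameter model in 16-bit precision).

```python
# Bandwidth figures taken from the comparison table, in GB/s.
BANDWIDTH_GB_PER_S = {
    "NVLink 4 (H100)": 900.0,
    "NVLink 3 (A100)": 600.0,
    "PCIe 4.0 x16": 32.0,
    "InfiniBand HDR": 200.0 / 8,  # 200 Gb/s converted to GB/s
}


def transfer_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealised time in milliseconds to move payload_gb at the given bandwidth."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


payload_gb = 14.0  # hypothetical payload size, see lead-in
for name, bw in BANDWIDTH_GB_PER_S.items():
    print(f"{name:18s} {transfer_ms(payload_gb, bw):8.1f} ms")
```

Note the unit change in the last table row: InfiniBand is quoted in gigabits per second, so it must be divided by eight before comparing it with the other rows.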

Extended Resources and Custom Schedulers

The standard Kubernetes scheduler understands extended resources like nvidia.com/gpu but treats them as opaque integer quantities. It can count available units and subtract allocations, but it cannot reason about GPU topology, memory bandwidth, or NUMA locality.

Extended resources are the mechanism by which nodes expose non-standard hardware capacities. They follow the naming convention <vendor>/<resource> (e.g., nvidia.com/gpu, amd.com/gpu, intel.com/gpu). Any resource following this pattern can be requested in pod specs and tracked by the scheduler. [Source: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/]
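
The "opaque integer" behaviour can be sketched as simple bookkeeping. This is an illustration under simplifying assumptions, not the scheduler's actual code: the functions fits and pick_node and the node dictionaries are hypothetical, and real scheduling also considers CPU, memory, taints, affinity, and scoring.

```python
def fits(allocatable: dict[str, int], allocated: dict[str, int], request: dict[str, int]) -> bool:
    """True if every requested extended resource still has enough free units."""
    return all(
        allocatable.get(name, 0) - allocated.get(name, 0) >= qty
        for name, qty in request.items()
    )


def pick_node(nodes: dict[str, dict], request: dict[str, int]):
    """Return the first node whose free capacity covers the request, else None."""
    for name, state in nodes.items():
        if fits(state["allocatable"], state["allocated"], request):
            return name
    return None


nodes = {
    "cpu-node": {"allocatable": {}, "allocated": {}},
    "gpu-node": {"allocatable": {"nvidia.com/gpu": 4}, "allocated": {"nvidia.com/gpu": 3}},
}
print(pick_node(nodes, {"nvidia.com/gpu": 1}))
print(pick_node(nodes, {"nvidia.com/gpu": 2}))
```

Nothing in this bookkeeping knows which physical GPU is free or how it is wired to its neighbours, which is exactly the gap that topology-aware components fill.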

For workloads that require topology-aware placement, the NVIDIA GPU Operator integrates with the Kubernetes Topology Manager and the NUMA-aware scheduler to co-locate GPU and CPU resources on the same NUMA node — reducing memory access latency for large model training. [Source: https://www.perfectscale.io/blog/kubernetes-gpu]

Key Takeaway: GPU resource requests use the nvidia.com/gpu extended resource in pod specs. Node affinity rules on GFD-applied labels enable targeting specific GPU models. For distributed training, topology-aware scheduling keeps GPU-to-GPU traffic on NVLink, which offers nearly 20x the bandwidth of a PCIe 4.0 fallback path.


Section 4: GPU Sharing and Multi-Tenancy

A single NVIDIA A100 GPU costs roughly $10,000–$30,000. Running a single small inference workload on that GPU at 5% utilization is financially untenable at scale. GPU sharing is the practice of running multiple workloads on a single GPU simultaneously or in rapid succession, increasing utilization and reducing per-workload cost.

NVIDIA provides three distinct sharing mechanisms, each with different trade-offs across isolation, performance, and hardware requirements: [Source: https://rafay.co/ai-and-cloud-native-blog/demystifying-fractional-gpus-in-kubernetes-mig-time-slicing-and-custom-schedulers]

Figure 2.4: GPU Sharing Strategy Architecture Comparison

flowchart LR
    subgraph Physical["One Physical GPU (e.g., A100 80GB)"]
        subgraph MIG["MIG Partitioning\n(Ampere+ hardware required)"]
            MI1["MIG Instance 1\n10 GB — dedicated HW"]
            MI2["MIG Instance 2\n10 GB — dedicated HW"]
            MI3["MIG Instance 3–7\n..."]
        end
        subgraph TS["Time-Slicing\n(any NVIDIA GPU)"]
            V1["Virtual GPU 1\nshared VRAM"]
            V2["Virtual GPU 2\nshared VRAM"]
            V3["Virtual GPU 3\nshared VRAM"]
            V4["Virtual GPU 4\nshared VRAM"]
        end
        subgraph MPS["MPS — Multi-Process Service\n(any NVIDIA GPU, not with MIG)"]
            MPSd["Shared CUDA Context\n(MPS Daemon)"]
            P1["Process 1"]
            P2["Process 2"]
            P3["Process 3"]
        end
    end

    MI1 -- "hard isolation\nno interference" --> MI2
    V1 -. "no isolation\ncontext switch jitter" .-> V2
    P1 -- "low-overhead\nmultiplexing" --> MPSd
    P2 -- "low-overhead\nmultiplexing" --> MPSd
    P3 -- "low-overhead\nmultiplexing" --> MPSd

| Strategy | Isolation Level | Hardware Requirement | Latency Impact | Best Use Case |
|---|---|---|---|---|
| MIG | Hardware (hard) | MIG-capable data center GPUs from Ampere onward (A100, A30, H100, H200) | None (dedicated partitions) | Production inference with SLA requirements |
| Time-Slicing | None (soft) | Any NVIDIA GPU | Yes (context switch jitter) | Dev/test, bursty workloads |
| MPS | Process-level (soft) | Any NVIDIA GPU (not with MIG) | Low overhead | Throughput-focused inference, same model replicas |

NVIDIA Multi-Instance GPU (MIG) Partitioning

MIG (Multi-Instance GPU) is a hardware feature introduced with NVIDIA’s Ampere architecture (A100, A30) and available on later data center GPUs such as the H100 and H200. It is not available on every GPU from these generations, so check for MIG support on your specific model. MIG allows a single physical GPU to be partitioned into up to seven independent GPU instances, each with its own dedicated compute engines, memory bandwidth, L2 cache, and DRAM allocation. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]

The key word is dedicated: MIG partitions are enforced in hardware. One workload running in a MIG instance cannot read memory from another instance, cannot monopolize shared resources, and cannot interfere with another instance’s performance. This is fundamentally different from software-based sharing.

Analogy: MIG is like subdividing a large conference room into smaller breakout rooms using permanent walls. Each team gets their own space with guaranteed resources. The shared lobby (PCIe interface) is the only truly shared resource.

MIG partition profiles on an A100 80GB GPU:

| Profile | Compute Fraction | Memory | Max Instances |
|---|---|---|---|
| 1g.10gb | 1/7 | ~10 GB | 7 |
| 2g.20gb | 2/7 | ~20 GB | 3 |
| 3g.40gb | 3/7 | ~40 GB | 2 |
| 4g.40gb | 4/7 | ~40 GB | 1 |
| 7g.80gb | 7/7 | ~80 GB | 1 (whole GPU) |

[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
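
For capacity planning, the table translates directly into a GPU count. The sketch below is a rough estimate under stated assumptions: the helper gpus_needed is hypothetical, every pod uses the same profile, and each GPU is configured uniformly with that profile. Mixed-profile layouts follow additional placement rules that are not modelled here.

```python
import math

# Maximum instances per A100 80GB, copied from the profile table above.
MAX_INSTANCES = {"1g.10gb": 7, "2g.20gb": 3, "3g.40gb": 2, "4g.40gb": 1, "7g.80gb": 1}


def gpus_needed(profile: str, pods: int) -> int:
    """Physical GPUs required to give each pod its own MIG instance of `profile`."""
    if pods < 0:
        raise ValueError("pods must be non-negative")
    return math.ceil(pods / MAX_INSTANCES[profile])


for profile, pods in (("1g.10gb", 8), ("2g.20gb", 3), ("3g.40gb", 5)):
    print(f"{pods} pods on {profile}: {gpus_needed(profile, pods)} GPU(s)")
```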

Worked Example — Enabling MIG with the GPU Operator:

The GPU Operator includes a MIG Manager component that handles MIG configuration. To enable MIG on a node and create seven 1g.10gb partitions:

# Label the node with the desired MIG configuration
# (--overwrite replaces any existing mig.config label on the node)
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite

# The GPU Operator's MIG Manager detects the label change and
# reconfigures the GPU automatically. Verify the result:
kubectl describe node <gpu-node> | grep nvidia.com/mig

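After configuration, the node advertises the new partitions as extended resources. With the mixed MIG strategy they appear as nvidia.com/mig-1g.10gb: 7; with the single strategy they are exposed under nvidia.com/gpu instead. Under the mixed strategy, pods request MIG slices explicitly: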

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]

Time-Slicing GPUs Across Pods

Time-slicing is the simplest GPU sharing mechanism and works on any NVIDIA GPU. It configures the device plugin to advertise a GPU as multiple virtual units. The GPU hardware rapidly switches between the active workloads, giving each workload a time slice of compute — analogous to how a CPU shares time between processes.

Analogy: Time-slicing is like one chef trying to cook for multiple tables simultaneously by rapidly switching attention between them. Each table gets served, but none gets the chef’s undivided attention, and complex dishes may take longer because of the constant context switching.

Time-slicing is configured via a ConfigMap applied to the device plugin: [Source: https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/]

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

This configuration tells the device plugin to advertise each physical GPU as 4 virtual GPUs. A node with 2 physical GPUs will appear to have nvidia.com/gpu: 8. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]
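
The arithmetic behind the advertised count, and its consequence for memory, can be sketched as follows. The function names are hypothetical, and the per-replica memory figure is only a planning budget: time-slicing does not partition or enforce GPU memory, so nothing stops one pod from using more than its share.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """nvidia.com/gpu units a node advertises when each GPU is split into `replicas`."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def planning_memory_per_replica_gb(gpu_memory_gb: float, replicas: int) -> float:
    """Even split of GPU memory across replicas; a budget, not an enforced limit."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return gpu_memory_gb / replicas


print(advertised_gpus(physical_gpus=2, replicas=4))
print(planning_memory_per_replica_gb(gpu_memory_gb=80, replicas=4))
```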

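Critical limitations of time-slicing: replicas share the GPU’s memory and compute with no isolation between them, and context switching between workloads introduces latency jitter.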

Time-slicing is appropriate for development clusters, CI/CD pipelines, or bursty workloads where occasional latency is acceptable. It is not suitable for production inference with strict SLA requirements. [Source: https://rafay.co/ai-and-cloud-native-blog/demystifying-fractional-gpus-in-kubernetes-mig-time-slicing-and-custom-schedulers]

MPS (Multi-Process Service) for Concurrent Access

MPS (Multi-Process Service) is an NVIDIA software feature that allows multiple CUDA processes to share a single GPU with lower switching overhead than time-slicing. Instead of context-switching between separate CUDA contexts, MPS funnels all processes through a single CUDA context via a server daemon. This reduces overhead and increases throughput when running multiple instances of the same model. [Source: https://github.com/rh-aiservices-bu/gpu-partitioning-guide]

MPS is ideal for replicated inference: running four identical copies of a medium-sized model on one GPU to handle parallel requests with higher aggregate throughput than time-slicing would provide.

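Important constraints on MPS: it cannot be combined with MIG or with time-slicing on the same GPU, and it provides weaker fault isolation than MIG, so it is best suited to trusted tenants. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]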

Enabling MPS through the GPU Operator follows the same ConfigMap-based pattern as time-slicing, but uses the mps sharing strategy instead.

Combining Strategies: MIG + Time-Slicing

For maximum density on A100 or H100 hardware, MIG partitioning and time-slicing can be combined. A single A100 can be divided into 7 MIG instances (1g.10gb), and each MIG instance can then be time-sliced into 4 virtual replicas. The result: up to 28 GPU resource allocations from a single physical GPU. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html]

A100 (1 physical GPU)
├── MIG Instance 1 (1g.10gb) → 4 time-sliced replicas
├── MIG Instance 2 (1g.10gb) → 4 time-sliced replicas
├── ...
└── MIG Instance 7 (1g.10gb) → 4 time-sliced replicas
                                    = 28 total pod allocations

This configuration is powerful for workloads that are small enough to fit in 10 GB but benefit from MIG’s hard memory isolation between the 7 instances.

Cost Optimization Through GPU Sharing Strategies

The financial impact of GPU sharing is substantial. Production case studies from CNCF member organizations show that applying advanced GPU scheduling and sharing strategies can improve cluster GPU utilization from approximately 13% (typical for naively scheduled clusters) to over 80%. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]
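
A simple way to reason about this is the cost of each GPU-hour that actually does useful work. The sketch below uses the 13% and 80% utilization figures from the text; the hourly price is a hypothetical placeholder, and the function name is illustrative.

```python
def cost_per_utilized_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one fully used GPU-hour at the given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


hourly_price = 4.00  # hypothetical price per GPU-hour, for illustration only
before = cost_per_utilized_gpu_hour(hourly_price, 0.13)
after = cost_per_utilized_gpu_hour(hourly_price, 0.80)
print(f"at 13% utilization: {before:.2f} per useful GPU-hour")
print(f"at 80% utilization: {after:.2f} per useful GPU-hour")
print(f"improvement factor: {before / after:.1f}x")
```

The improvement factor depends only on the ratio of the two utilization figures, so it holds whatever the actual hourly price is.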

Decision Framework — Choosing a GPU Sharing Strategy:

Does your workload require strict memory and compute isolation?
  YES → Does your hardware support MIG?
          YES → Do you need more than 7 partitions per GPU?
                  YES → Combine MIG + Time-Slicing (up to 28 allocations per A100)
                  NO  → Use MIG alone for cleanest isolation
          NO  → Continue with the throughput question below
  NO  ↓

Is throughput more important than latency?
  YES → Use MPS (same-model replicas, trusted tenants)
  NO  → Use Time-Slicing (dev/test, bursty, or heterogeneous workloads)

Figure 2.5: GPU Sharing Strategy Decision Tree

flowchart TD
    A["New GPU Workload to Schedule"] --> B{Requires strict memory\nand compute isolation?}
    B -- "Yes" --> C{Hardware is\nAmpere or newer?}
    C -- "Yes" --> D{Need more than\n7 partitions per GPU?}
    D -- "Yes" --> E["MIG + Time-Slicing\n(up to 28 allocations per A100)"]
    D -- "No" --> F["MIG Alone\nHardware-enforced isolation\nBest for production SLA workloads"]
    C -- "No\n(pre-Ampere hardware)" --> G{Is throughput more\nimportant than latency?}
    B -- "No" --> G
    G -- "Yes\n(trusted tenants,\nsame-model replicas)" --> H["MPS\nLow-overhead shared CUDA context\nBest for inference replicas"]
    G -- "No\n(dev/test,\nbursty workloads)" --> I["Time-Slicing\nSimplest setup, no isolation\nBest for development clusters"]

[Source: https://rafay.co/ai-and-cloud-native-blog/demystifying-fractional-gpus-in-kubernetes-mig-time-slicing-and-custom-schedulers]
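
The decision tree in Figure 2.5 can also be written as a small function, which is convenient for admission tooling or documentation tests. This is a sketch that mirrors the figure's branches; the function and parameter names are hypothetical.

```python
def choose_sharing_strategy(
    needs_isolation: bool,
    mig_capable: bool,
    needs_more_than_7_partitions: bool,
    throughput_over_latency: bool,
) -> str:
    """Return the sharing strategy suggested by the decision tree in Figure 2.5."""
    # Isolation branch: only reachable on MIG-capable hardware.
    if needs_isolation and mig_capable:
        return "MIG + time-slicing" if needs_more_than_7_partitions else "MIG"
    # Everything else falls through to the throughput question.
    return "MPS" if throughput_over_latency else "time-slicing"


print(choose_sharing_strategy(True, True, False, False))
print(choose_sharing_strategy(False, True, False, True))
print(choose_sharing_strategy(True, False, False, False))
```

The last call shows a limitation of the tree itself: a workload that needs isolation on hardware without MIG ends up with a software-based strategy that cannot provide it, so such workloads may be better served by a dedicated GPU.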

Key Takeaway: MIG provides hardware-enforced isolation and is the right choice for production workloads with SLA requirements. Time-slicing is simpler but offers no isolation and introduces latency jitter — best for dev/test environments. MPS reduces sharing overhead for throughput-oriented inference replicas but provides weaker fault isolation. The strategies cannot all be combined: MIG + MPS is unsupported, and time-slicing + MPS cannot coexist. The right choice depends on isolation requirements first, hardware second, workload patterns third.


Chapter Summary

GPU management is one of the most operationally complex domains in Kubernetes, but the tooling has matured significantly. The key architectural pieces fit together as a layered stack:

  1. Hardware foundation: GPU vendor choice (NVIDIA, AMD, Intel) determines the available management tooling and software ecosystem.
  2. Device plugin layer: Device plugins, deployed as DaemonSets and talking to the kubelet over gRPC, expose physical hardware as extended resources (nvidia.com/gpu) that the Kubernetes scheduler can count.
  3. Container runtime layer: The NVIDIA Container Toolkit injects driver access into containers at runtime, decoupling workload images from host driver versions.
  4. Operator layer: The GPU Operator automates the full GPU software lifecycle — drivers, toolkit, device plugin, GFD labels, DCGM monitoring — with a validation gate that prevents misconfigured nodes from receiving workloads.
  5. Scheduling layer: Node affinity rules on GFD-generated labels, topology-aware scheduling for NVLink bandwidth, and extended resource requests enable intelligent workload placement.
  6. Sharing layer: MIG, time-slicing, and MPS unlock multi-tenancy and cost optimization, with trade-offs in isolation, latency, and hardware requirements.

Clusters that implement this full stack can move GPU utilization from the typical 13% baseline to over 80%, dramatically reducing the cost of running AI workloads at scale. [Source: https://www.cio.com/article/4152554/how-kubernetes-is-finally-solving-the-gpu-utilization-crisis-to-save-your-ai-budget.html]


Key Terms

| Term | Definition |
|---|---|
| CUDA | Compute Unified Device Architecture; NVIDIA’s parallel computing platform and programming model used by AI frameworks like PyTorch and TensorFlow |
| Device Plugin | A Kubernetes framework component that runs as a DaemonSet to discover, advertise, and allocate special hardware resources (GPUs, FPGAs, etc.) as extended resources |
| Extended Resources | Custom, vendor-namespaced resource types (e.g., nvidia.com/gpu) that nodes can advertise and pods can request beyond the built-in CPU and memory |
| GPU Operator | An NVIDIA Kubernetes Operator that automates the full GPU software stack lifecycle: drivers, Container Toolkit, device plugin, GFD, and DCGM monitoring |
| MIG (Multi-Instance GPU) | A hardware-level GPU partitioning feature available on NVIDIA Ampere and later architectures that divides one physical GPU into up to seven isolated instances with dedicated compute and memory |
| Time-Slicing | A software-based GPU sharing mechanism that overcommits a GPU by advertising it as multiple virtual units, rapidly context-switching between workloads; provides no memory or compute isolation |
| MPS (Multi-Process Service) | An NVIDIA software feature enabling multiple CUDA processes to share a GPU through a single shared CUDA context, reducing overhead compared to time-slicing; best for throughput-oriented inference |
| Topology Awareness | The practice of scheduling multi-GPU pods to nodes and NUMA domains where GPUs are interconnected via high-bandwidth paths (NVLink) rather than lower-bandwidth PCIe |
| GFD (GPU Feature Discovery) | A GPU Operator component that automatically labels Kubernetes nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) to enable label-based scheduling |
| NVML | NVIDIA Management Library; the C-based API used by the device plugin and monitoring tools to query GPU hardware state, utilization, and configuration |
| DCGM | Data Center GPU Manager; NVIDIA’s monitoring and management toolkit that the GPU Operator deploys to expose GPU metrics (utilization, temperature, memory) to Prometheus |
| NVLink | NVIDIA’s high-speed GPU interconnect providing up to 600–900 GB/s bandwidth between GPUs on the same node; critical for distributed training performance |
| CDI (Container Device Interface) | An emerging standard for injecting hardware devices into containers in a runtime-agnostic manner, complementing the device plugin framework |

Chapter 3: Storage and Data Management for AI Pipelines

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Running AI workloads on Kubernetes is not simply a question of scheduling GPUs — data is the other half of the equation. A model training job that must wait for its next batch of images from a slow network file system will leave expensive GPUs idle, wasting both time and money. Conversely, a checkpoint written to the wrong storage tier can corrupt weeks of training progress in a single pod eviction.

Think of storage for AI pipelines the way a professional kitchen thinks about ingredient staging: raw supplies come from a warehouse (object storage), prep stations keep frequently used items close at hand (cache tiers), and the chef’s worktop holds only what is needed right now (local NVMe). Getting the staging wrong means the chef waits — and in AI terms, the GPU waits.

This chapter covers the full journey from storage fundamentals through to advanced caching strategies, with concrete Kubernetes configuration examples at every step.


Section 1: Storage Requirements for AI Workloads

Before choosing a storage technology, it is essential to understand what AI workloads actually demand. The access patterns during training differ sharply from those during inference, and checkpoint storage introduces yet another distinct set of requirements.

Dataset Sizes and I/O Patterns in Training vs Inference

Training is overwhelmingly read-heavy. A training run reads its dataset repeatedly: vision models commonly make dozens to hundreds of passes (epochs) over the data, while large language models (LLMs) usually make only one or a few passes over terabytes of tokenized text. Each epoch performs a full sequential or randomised pass over the dataset, generating sustained, high-bandwidth read traffic, often tens to hundreds of gigabytes per second when aggregated across a distributed job. Individual read requests are typically small (tens of kilobytes for image patches, a few hundred kilobytes for audio segments), so both IOPS (input/output operations per second) and throughput matter [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].

Inference is quite different. Model weights are loaded once at startup (a large sequential read), and then each serving request triggers a small, latency-sensitive read of key-value cache or embedding lookups. Inference storage optimisation therefore focuses on low latency for random reads rather than raw throughput [Source: https://www.hyperstack.cloud/blog/case-study/types-of-storage-for-ai-workloads-what-you-need-to-know].

The table below summarises these contrasting profiles:

| Characteristic | Training | Inference |
|---|---|---|
| Primary operation | Sequential/random reads | Random reads (small KV) |
| Access frequency | Repeated (multi-epoch) | Continuous, latency-sensitive |
| Data volume | Tens to hundreds of TB | Model weights + serving cache |
| Throughput priority | Very high (GB/s aggregate) | Moderate |
| Latency priority | Moderate | Very low (ms) |
| Write pattern | Checkpoints (periodic burst) | Logs, output tokens (small) |

POSIX vs Object Storage Tradeoffs

Most AI frameworks (PyTorch DataLoader, TensorFlow tf.data) were written expecting a POSIX filesystem — the standard interface that provides familiar directory trees, file opens, seeks, and symbolic links. POSIX storage maps naturally to training pipelines because no code changes are required: your dataset loader works identically whether the data lives on a local disk or a remote network filesystem that speaks POSIX.

Object storage (Amazon S3, Google Cloud Storage, MinIO) organises data as key-value blobs rather than hierarchical files. It offers massive scalability, pay-per-byte economics, and high durability, making it the natural home for raw datasets and model artefacts [Source: https://www.min.io/use-cases/ai-training]. The tradeoff is that object storage does not natively support POSIX operations such as truncate, append, fallocate, or symbolic links. Training code that relies on those operations will fail or require adaptation.

Practical AI pipelines commonly adopt a hybrid approach: raw datasets live in object storage for durability and cost efficiency, while a POSIX-compatible cache layer (such as JuiceFS or a parallel file system) presents the data to training pods as a conventional filesystem [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].

Checkpoint and Model Artifact Storage Needs

Checkpoints are periodic snapshots of model weights, optimizer state, and training metadata written to disk so that a job can resume after a failure without restarting from epoch zero. Their storage requirements are distinctive:

Platform teams should therefore provision separate storage tiers for checkpoints vs training data, with high burst write throughput and strong durability guarantees for the former. Storage Quality-of-Service (QoS) — per-volume IOPS and bandwidth limits — prevents checkpoint writes from crowding out the read bandwidth that training pods depend on [Source: https://portworx.com/knowledge-hub/kubernetes-ai/].
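At the pod level, the separation of tiers might look like the sketch below: the dataset is mounted read-only from one claim and checkpoints are written to a different claim. The claim names (shared-dataset, checkpoints), the image, and the mount paths are illustrative assumptions, not names defined elsewhere in this guide.

```yaml
# Sketch: one training pod, two storage tiers.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  namespace: ml-training
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # hypothetical image
      volumeMounts:
        - name: dataset
          mountPath: /data/train
          readOnly: true            # training data is only ever read
        - name: checkpoints
          mountPath: /checkpoints   # periodic burst writes land here
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: shared-dataset   # assumed PVC on the read-optimised tier
        readOnly: true
    - name: checkpoints
      persistentVolumeClaim:
        claimName: checkpoints      # assumed PVC on the durable, write-optimised tier
```

Keeping the two claims on different StorageClasses lets each tier be sized and rate-limited independently.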

Completed model artefacts (the final trained weights, tokenizers, configuration files) are typically written once and then read frequently by inference servers. Object storage is ideal here: it provides content-addressable retrieval, versioning, and global distribution at low cost [Source: https://docs.cloud.google.com/architecture/ai-ml/storage-for-ai-ml].

Figure 3.1: Storage tier requirements across AI workload phases

flowchart LR
    subgraph Training["Training Phase"]
        direction TB
        T1["High-throughput reads\n(GB/s aggregate)"]
        T2["Repeated multi-epoch\nsequential/random reads"]
        T3["Periodic checkpoint\nwrite bursts"]
    end
    subgraph Checkpointing["Checkpointing"]
        direction TB
        C1["High burst\nwrite throughput"]
        C2["Strong durability\nguarantees"]
        C3["Read-rare\n(restart only)"]
    end
    subgraph Inference["Inference Phase"]
        direction TB
        I1["One-time large\nsequential read (weights)"]
        I2["Low-latency\nrandom reads (KV cache)"]
        I3["Small output\ntoken writes"]
    end
    subgraph Storage["Recommended Storage Tier"]
        S1["Parallel FS\n(Lustre / BeeGFS)"]
        S2["High-write durable\nblock or object storage"]
        S3["Low-latency block\nor local NVMe"]
    end
    Training --> S1
    Checkpointing --> S2
    Inference --> S3

Key Takeaway: Training, checkpointing, and inference each carry distinct I/O profiles. Choosing a single storage technology for all three is usually the wrong answer; effective AI platforms provision purpose-built tiers for each phase and enforce QoS to prevent workloads from interfering with each other.


Section 2: Kubernetes Storage Primitives

Kubernetes provides a well-defined set of abstractions for attaching durable storage to workloads. Understanding these primitives is essential before exploring specific high-performance backends.

PersistentVolumes, PVCs, and StorageClasses

A PersistentVolume (PV) is a cluster-level resource that represents a piece of storage — a disk, a network share, or an object bucket — that has been provisioned either manually by an administrator or automatically by a storage driver. A PV exists independently of any pod; it persists across pod restarts, rescheduling, and node failures [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].

A PersistentVolumeClaim (PVC) is a namespaced request for storage made by a workload. A pod does not reference a PV directly; instead it references a PVC, and Kubernetes matches (or dynamically provisions) the appropriate PV. This indirection decouples application configuration from infrastructure decisions.

A StorageClass defines a profile of storage — its provisioner, performance tier, reclaim policy, and any driver-specific parameters. When a PVC references a StorageClass, the associated provisioner creates a PV automatically. This dynamic provisioning eliminates the need for administrators to pre-create hundreds of individual volumes.

The following example illustrates a StorageClass backed by NVMe SSDs, followed by a PVC referencing it:

# StorageClass: high-performance NVMe-backed block storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-training
provisioner: ebs.csi.aws.com        # AWS EBS CSI driver
parameters:
  type: io2
  iopsPerGB: "50"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC requesting 500 GiB from the nvme-training class for training scratch space
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch
  namespace: ml-training
spec:
  storageClassName: nvme-training
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi

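When a training pod references training-scratch, Kubernetes dynamically provisions a 500 GiB io2 EBS volume (25,000 provisioned IOPS at 50 IOPS per GiB) and attaches it to the node where the pod is scheduled. Because volumeBindingMode is WaitForFirstConsumer, provisioning is deferred until that pod is scheduled, so the volume is created in the same availability zone as the node.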
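To complete the picture, a pod consumes the claim by name. The following is a minimal sketch; the pod name and image are hypothetical, while the claim name and the /data/scratch mount path follow the example in this section.

```yaml
# Sketch: training pod that mounts the training-scratch PVC.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-trainer             # hypothetical pod name
  namespace: ml-training            # must match the PVC's namespace
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # hypothetical image
      volumeMounts:
        - name: scratch
          mountPath: /data/scratch
  volumes:
    - name: scratch
      persistentVolumeClaim:
        claimName: training-scratch
```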

Figure 3.2: Kubernetes storage abstraction — from StorageClass to mounted volume

flowchart TD
    SC["StorageClass\n(nvme-training)\nprovisioner: ebs.csi.aws.com"]
    PVC["PersistentVolumeClaim\n(training-scratch)\n500 GiB, ReadWriteOnce"]
    PV["PersistentVolume\n(auto-provisioned)\n500 GiB io2 EBS"]
    POD["Training Pod\nmountPath: /data/scratch"]
    DISK["Physical EBS Volume\n(AWS infrastructure)"]

    SC -->|"dynamic provisioning triggers"| PV
    PVC -->|"bound to"| PV
    POD -->|"references"| PVC
    PV -->|"backed by"| DISK

CSI Drivers for Cloud and On-Premises Storage

The Container Storage Interface (CSI) is an industry standard that decouples storage provisioning from the Kubernetes core. Before CSI, storage plugins were compiled directly into the Kubernetes binary, making it impossible to ship updates independently. CSI allows any storage vendor to publish a driver that Kubernetes can load without modifying the platform [Source: https://portworx.com/knowledge-hub/a-complete-guide-to-kubernetes-csi/].

A CSI driver typically consists of:

Key CSI drivers relevant to AI workloads are listed below:

| Driver | Backend | Best Fit |
|---|---|---|
| ebs.csi.aws.com | Amazon EBS (SSD, io2 NVMe) | Single-node training scratch |
| efs.csi.aws.com | Amazon EFS (NFS) | Shared datasets, ReadWriteMany |
| fsx.csi.aws.com | Amazon FSx for Lustre | High-throughput distributed training |
| pd.csi.storage.gke.io | GCP Persistent Disk | Single-node GKE training |
| lustre.csi.storage.gke.io | GKE Managed Lustre | HPC/ML parallel workloads on GKE |
| rook-ceph.rbd.csi.ceph.com | Rook-Ceph (block) | On-premises block storage |
| rook-ceph.cephfs.csi.ceph.com | Rook-Ceph (file) | On-premises ReadWriteMany |
| csi.juicefs.com | JuiceFS | Distributed POSIX cache layer |

[Source: https://kubernetes-csi.github.io/docs/drivers.html] [Source: https://storageclass.info/drivers]

ReadWriteMany Access Modes for Distributed Training

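Kubernetes PVs support four access modes: ReadWriteOnce (RWO), mounted read-write by a single node; ReadOnlyMany (ROX), mounted read-only by many nodes; ReadWriteMany (RWX), mounted read-write by many nodes; and ReadWriteOncePod (RWOP), mounted read-write by a single pod.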

ReadWriteMany is the critical mode for distributed training. When a PyTorch DistributedDataParallel job launches 100 pods across 20 nodes, every pod must read from the same dataset PVC. Block storage backed by a single NVMe disk cannot satisfy RWX; only network file systems and parallel file systems support it natively [Source: https://aws.amazon.com/blogs/storage/using-high-performance-persistent-storage-for-machine-learning-workloads-on-kubernetes/].

For static provisioning on Lustre, a single PV can be shared across multiple clusters and jobs simultaneously, supporting large parallel training runs with hundreds of pods reading the same dataset [Source: https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads].
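A claim for a shared dataset volume differs from the earlier scratch example mainly in its access mode. In the sketch below, the StorageClass name shared-fs and the claim name are assumptions; the class would need to be backed by a driver that supports RWX, such as one of the file or Lustre drivers listed above.

```yaml
# Sketch: ReadWriteMany claim for a dataset shared by many training pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset              # hypothetical claim name
  namespace: ml-training
spec:
  storageClassName: shared-fs       # hypothetical class backed by an RWX-capable file system
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Ti
```

If the backing driver does not support ReadWriteMany, the claim typically stays Pending rather than binding.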

Key Takeaway: PVs, PVCs, StorageClasses, and CSI drivers form the foundational Kubernetes storage vocabulary. Distributed training jobs require ReadWriteMany volumes backed by network or parallel file systems; single-node jobs can use faster block storage with ReadWriteOnce semantics.


Section 3: High-Performance Storage Solutions

With the primitives understood, this section surveys the storage backends that are actually deployed in production AI clusters.

Network File Systems: NFS, Lustre, and BeeGFS

NFS (Network File System) is the simplest shared filesystem. A central NFS server exports one or more directories, and any Kubernetes node with network access can mount them as ReadWriteMany PVs. NFS is easy to operate and works well for small teams or inference model serving (where weights are read-only). However, it does not scale to multi-GB/s aggregate bandwidth because a single NFS server becomes a bottleneck [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].
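For the read-only model-serving case, an NFS export can be wired in statically with a PV and a matching PVC. This is a sketch under assumptions: the server address, export path, namespace, and object names are placeholders.

```yaml
# Sketch: static NFS volume holding model weights, mounted read-only.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-weights-nfs
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain   # never delete the export's contents
  storageClassName: ""                    # no class: bind statically
  nfs:
    server: nfs.example.internal          # placeholder NFS server
    path: /exports/models                 # placeholder export path
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: ml-inference                 # hypothetical namespace
spec:
  storageClassName: ""                    # disable dynamic provisioning for this claim
  volumeName: model-weights-nfs           # bind to the PV above
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Ti
```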

Lustre is a high-performance parallel filesystem originally developed for supercomputers and now widely used in GPU clusters. Lustre separates metadata (directory trees, file attributes) from data (file content) across distinct server roles: Metadata Servers (MDS) and Object Storage Servers (OSS). Data is striped across multiple OSSes, so aggregate read bandwidth scales roughly linearly with the number of servers. This architecture is why Lustre is the recommended backend for the most demanding distributed training workloads [Source: https://aws.amazon.com/blogs/storage/using-high-performance-persistent-storage-for-machine-learning-workloads-on-kubernetes/].

Both AWS (FSx for Lustre) and Google Cloud (GKE Managed Lustre) provide fully managed Lustre instances with integrated CSI drivers [Source: https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads]. A Kubernetes job with 100 parallel pods can all mount the same Lustre PVC and read disjoint subsets of a dataset concurrently, with each pod working on a different data shard [Source: https://aws.amazon.com/blogs/machine-learning/configure-and-verify-a-distributed-training-cluster-with-aws-deep-learning-containers-on-amazon-eks/].
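One way to express "100 pods, each reading its own shard" with plain Kubernetes is an Indexed Job, which gives every pod a distinct JOB_COMPLETION_INDEX. The sketch below assumes an existing RWX claim named lustre-dataset and a training script that accepts --shard-index and --num-shards; both are hypothetical.

```yaml
# Sketch: Indexed Job in which each pod reads a disjoint shard from one shared PVC.
apiVersion: batch/v1
kind: Job
metadata:
  name: sharded-read
  namespace: ml-training
spec:
  completionMode: Indexed     # each pod receives an index from 0 to completions-1
  completions: 100
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/ml/trainer:latest   # hypothetical image
          # JOB_COMPLETION_INDEX is injected by Kubernetes for Indexed Jobs;
          # train.py and its flags are assumptions for illustration.
          command:
            - sh
            - -c
            - python train.py --shard-index=$JOB_COMPLETION_INDEX --num-shards=100
          volumeMounts:
            - name: dataset
              mountPath: /data/train
              readOnly: true
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: lustre-dataset   # assumed RWX PVC backed by Lustre
            readOnly: true
```

This pattern covers embarrassingly parallel reads; tightly coupled distributed training usually relies on the operators covered in Chapter 4.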

Figure 3.3: Lustre parallel filesystem architecture for distributed training

flowchart TD
    subgraph Clients["Kubernetes Training Pods (100x)"]
        P1["Pod A\n(shard 0-99)"]
        P2["Pod B\n(shard 100-199)"]
        P3["Pod C\n(shard 200-299)"]
        Px["... 97 more pods"]
    end
    subgraph Lustre["Lustre Filesystem"]
        MDS["Metadata Server (MDS)\nDirectory trees, file attributes,\nblock locations"]
        OSS1["Object Storage Server 1\n(data stripe 1)"]
        OSS2["Object Storage Server 2\n(data stripe 2)"]
        OSS3["Object Storage Server 3\n(data stripe 3)"]
        OSSn["... N more OSSes"]
    end
    P1 & P2 & P3 & Px -->|"metadata lookup"| MDS
    MDS -->|"directs reads to"| OSS1 & OSS2 & OSS3 & OSSn
    P1 -->|"parallel data read"| OSS1
    P2 -->|"parallel data read"| OSS2
    P3 -->|"parallel data read"| OSS3

BeeGFS is another parallel filesystem popular in on-premises HPC environments. Like Lustre it separates metadata from data and supports wide striping, but uses a simpler client daemon and is often favoured in enterprise settings for its operational simplicity relative to Lustre.

The diagram below maps these filesystems to their appropriate use cases:

Low complexity / Low throughput
        NFS
         |
    BeeGFS (on-prem)
         |
    Lustre (on-prem / managed)
         |
High complexity / High throughput

Object Storage Integration: S3, MinIO, and GCS

MinIO is one of the most widely deployed self-hosted object storage systems for AI pipelines. It speaks the S3 API, which means code written for Amazon S3 can typically use MinIO by changing only the endpoint and credentials. MinIO manages unstructured data (images, video, tokenized text, model weights) at petabyte scale with high durability [Source: https://www.min.io/use-cases/ai-training].

Object storage is not normally mounted as a POSIX filesystem. Instead, training pipelines access it through the S3 SDK or through tools like s3fs or goofys that present an S3 bucket as a virtual filesystem. The caching strategies in Section 4 address the performance gap this creates.
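For SDK-based access, the pod mainly needs credentials and an endpoint. The sketch below passes them as environment variables using the AWS SDK naming convention. The Secret name, the MinIO service URL, and the image are assumptions, and whether AWS_ENDPOINT_URL is honoured depends on the SDK version in the image, so verify that before relying on it.

```yaml
# Sketch: S3 credentials and endpoint supplied to a training pod via environment variables.
apiVersion: v1
kind: Secret
metadata:
  name: object-store-credentials    # hypothetical Secret name
  namespace: ml-training
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: REPLACE_ME
  AWS_SECRET_ACCESS_KEY: REPLACE_ME
---
apiVersion: v1
kind: Pod
metadata:
  name: s3-trainer
  namespace: ml-training
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # hypothetical image
      envFrom:
        - secretRef:
            name: object-store-credentials
      env:
        - name: AWS_ENDPOINT_URL    # assumption: supported by the SDK version in use
          value: http://minio.minio.svc.cluster.local:9000   # placeholder in-cluster MinIO service
```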

For model artefact distribution — shipping trained weights to inference servers — object storage is the correct choice. A model registry backed by MinIO or GCS provides version control, content addressability, and the ability to serve weights to inference pods running anywhere in a multi-cloud deployment.

Local NVMe and hostPath for Scratch Data

The fastest storage available to a Kubernetes pod is the local NVMe SSD attached directly to the node. Where node-local disks are unavailable or must be shared, NVMe/TCP (the NVMe protocol carried over standard Ethernet) offers a networked alternative that delivers sub-millisecond latency and multi-GB/s throughput per volume without requiring specialised fabric hardware [Source: https://simplyblock.io/glossary/kubernetes-storage-for-ai-workloads/].

Local NVMe is ideal for scratch data: temporary working space that does not need to persist beyond the lifetime of the pod. A training job that preprocesses its dataset into a fast local cache before beginning epoch iteration can eliminate network filesystem latency during the hot loop.

In Kubernetes, local storage can be exposed via:

# Local PV backed by an NVMe device on a specific node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node01
spec:
  capacity:
    storage: 2Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - gpu-node-01

The nodeAffinity field ensures that the pod using this PV always runs on gpu-node-01 where the physical NVMe device is present.
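The local-nvme class named in the PV still has to exist, and a workload needs a claim against it. A minimal sketch follows; the claim name is hypothetical. Local volumes are provisioned statically, so the class uses the no-provisioner placeholder and delays binding until a pod is scheduled.

```yaml
# Sketch: StorageClass and PVC for statically provisioned local NVMe volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner   # local PVs are created by hand or by a static provisioner
volumeBindingMode: WaitForFirstConsumer     # pick the PV only once the pod's node is known
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: preprocess-cache            # hypothetical claim name
  namespace: ml-training
spec:
  storageClassName: local-nvme
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Ti
```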

Rook-Ceph for Distributed Block and File Storage

Rook-Ceph is a Kubernetes operator that deploys and manages Ceph — an open-source distributed storage system — entirely within the cluster. Rook translates Kubernetes-native CRD declarations into Ceph cluster configuration, making it possible to run a production-grade distributed storage system on-premises without external storage appliances [Source: https://portworx.com/knowledge-hub/kubernetes-ai/].

Ceph, as managed by Rook, provides three storage interfaces simultaneously:

| Interface | Kubernetes Access | Use Case |
|---|---|---|
| RBD (block) | CSI rook-ceph.rbd.csi.ceph.com | Boot volumes, single-pod scratch, checkpoints |
| CephFS (file) | CSI rook-ceph.cephfs.csi.ceph.com | ReadWriteMany shared datasets |
| RGW (object) | S3-compatible API | Raw dataset storage, model artefacts |

A key advantage of Rook-Ceph for on-premises AI clusters is that it turns commodity server disks into a resilient, self-healing storage pool. When a node fails, Ceph automatically re-replicates the affected data across surviving nodes — no manual intervention required.
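The replication behaviour described above is declared on the pool. The following sketch shows a replicated block pool; the pool name and namespace are assumptions that follow common Rook examples.

```yaml
# Sketch: Rook CephBlockPool that keeps three replicas on three different hosts.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool                 # assumed pool name
  namespace: rook-ceph              # assumed namespace of the Rook-managed cluster
spec:
  failureDomain: host               # place each replica on a different node
  replicated:
    size: 3                         # number of copies Ceph maintains
```

With failureDomain set to host and three replicas, losing one node leaves two copies, and Ceph restores the third on a surviving node.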

Multi-level caching architectures commonly use Ceph as the third storage tier: memory page cache (first tier) → local NVMe SSDs (second tier) → Ceph distributed storage (third tier) [Source: https://juicefs.com/en/blog/user-stories/storage-evolution-of-unisound-hpc-platform-with-juicefs].

Key Takeaway: No single storage system serves all AI workload phases. Lustre or Rook-CephFS satisfies ReadWriteMany training data access; object storage (MinIO, S3) provides economical raw dataset and artefact storage; local NVMe handles scratch computation; and Rook-Ceph can unify all three backends on-premises through a single operator-managed platform.


Section 4: Data Caching and Pipeline Optimization

High-performance storage solves the infrastructure problem; caching solves the access pattern problem. Even a Lustre cluster with 100 GB/s of aggregate bandwidth can become a bottleneck when hundreds of pods repeatedly read the same training samples across many epochs. Caching brings frequently accessed data closer to the compute, transforming repeated remote reads into local memory or SSD accesses.

Fluid: Dataset Acceleration on Kubernetes

Fluid is a CNCF sandbox project that acts as a Kubernetes-native orchestration layer for distributed dataset caching. Rather than requiring data engineers to manage caching infrastructure manually, Fluid exposes two new Kubernetes resource types — Dataset and Runtime — that describe what data should be cached and how [Source: https://fluid-cloudnative.github.io/docs].

The core innovation is the Elastic Dataset abstraction. A Dataset CRD declares the location of training data (an S3 bucket, an NFS share, an HDFS path) without coupling the application to the underlying storage technology. The Runtime CRD selects the caching engine (Alluxio, JuiceFS, Vineyard, Jindo) and configures its parameters. Fluid then instantiates the chosen engine as Kubernetes pods and registers the cached dataset as a PVC that training pods can mount normally [Source: https://billychen1.github.io/fluid-website-demo/docs/core-concepts/introduction].

Data-affinity scheduling is Fluid’s second major contribution. Fluid tracks which nodes hold cached copies of each dataset partition and annotates them with this information. The Kubernetes scheduler can then place training pods on nodes that already hold their required data, eliminating network reads entirely for the cached portions [Source: https://fluid-cloudnative.github.io/docs/core-concepts/architecture-and-concepts].

Fluid also supports automated data operations via CRDs:

These operations support one-time, scheduled, and event-driven trigger modes, enabling GitOps-style automation of data pipeline stages [Source: https://aws.amazon.com/blogs/containers/build-deep-learning-model-training-apps-using-cncf-fluid-with-amazon-eks/].

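# Fluid Dataset: declare ImageNet, held in a JuiceFS volume whose data lives in S3
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
  namespace: ml-training
spec:
  mounts:
    # With JuiceFSRuntime, mountPoint is a path inside the JuiceFS volume
    # (juicefs:///<subdir>); the backing bucket is passed as an option.
    - mountPoint: juicefs:///imagenet
      name: imagenet
      options:
        storage: s3
        bucket: https://my-bucket.s3.amazonaws.com
      encryptOptions:
        # Assumption: a Secret named jfs-secret holds the metadata engine URL
        - name: metaurl
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: metaurl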
---
# Fluid JuiceFSRuntime: cache the dataset locally
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: imagenet
  namespace: ml-training
spec:
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /mnt/nvme0/fluid-cache   # example directory on a local SSD (/dev/shm would be RAM)
        quota: 200Gi
        high: "0.95"
        low: "0.7"

When a training pod mounts the imagenet PVC, Fluid routes reads through the JuiceFS cache layer. On the first epoch, data is fetched from S3; on subsequent epochs, it is served from the local SSD cache — dramatically reducing network egress costs and storage latency.
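From the training pod's point of view, the cached dataset is an ordinary claim. The sketch below assumes Fluid has created a PVC named imagenet (matching the Dataset name) in the same namespace; the pod name, image, and mount path are hypothetical.

```yaml
# Sketch: training pod that reads the Fluid-managed dataset through its PVC.
apiVersion: v1
kind: Pod
metadata:
  name: imagenet-trainer
  namespace: ml-training            # same namespace as the Dataset
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # hypothetical image
      volumeMounts:
        - name: imagenet
          mountPath: /data/imagenet
          readOnly: true
  volumes:
    - name: imagenet
      persistentVolumeClaim:
        claimName: imagenet         # PVC created by Fluid for the Dataset
        readOnly: true
```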

Figure 3.4: Fluid dataset caching architecture — from S3 to training pod

flowchart LR
    subgraph Control["Fluid Control Plane"]
        DS["Dataset CRD\n(imagenet)"]
        RT["JuiceFSRuntime CRD\n(4 replicas, SSD cache)"]
        DL["DataLoad CRD\n(pre-warm /train)"]
        SCHED["Data-Affinity\nScheduler Plugin"]
    end
    subgraph Cache["Cache Layer (per node)"]
        FUSE["FUSE Pod\n(POSIX interface)"]
        WORKER["Worker Pod\n(cache lifecycle)"]
        SSD["Local SSD Cache\n(200 GiB)"]
    end
    subgraph Backend["Remote Storage"]
        S3["S3 Bucket\ns3://my-bucket/imagenet/"]
    end
    subgraph Compute["Training Pod"]
        POD["PyTorch DataLoader\nmount: imagenet PVC"]
    end

    DS & RT --> FUSE & WORKER
    DL -->|"pre-warms"| SSD
    SCHED -->|"schedules pod to\ncached node"| POD
    POD -->|"POSIX read"| FUSE
    FUSE -->|"cache hit"| SSD
    FUSE -->|"cache miss (epoch 1)"| S3
    S3 -->|"fills cache"| SSD

Alluxio and JuiceFS for Tiered Caching

Alluxio is a data orchestration platform that sits between compute frameworks and storage systems, presenting a unified namespace over multiple disparate backends (S3, HDFS, NFS, GCS). Training pods address data through Alluxio as if it were a local filesystem, while Alluxio transparently fetches and caches the underlying objects. This makes it particularly useful in heterogeneous environments where training data spans multiple storage systems [Source: https://juicefs.com/en/blog/user-stories/juicefs-vs-alluxio-ai-storage-naver].

However, Alluxio has a documented limitation: it does not fully implement the POSIX API. Missing support for symbolic links, truncate, fallocate, append, and xattr can cause subtle failures in AI tooling that relies on those calls [Source: https://juicefs.com/en/blog/solutions/ai-data-storage-challenges-capabilities-solution-comparison].
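Under Fluid, the unified-namespace idea can be sketched as a single Dataset with several mounts, cached by an AlluxioRuntime of the same name. The bucket names, mount names, and cache size below are assumptions, and credentials are omitted for brevity.

```yaml
# Sketch: one Dataset spanning two backends, cached in memory by Alluxio.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: multi-source                # hypothetical dataset name
  namespace: ml-training
spec:
  mounts:
    - name: images                  # appears as a subdirectory of the dataset
      mountPoint: s3://bucket-a/images/   # hypothetical bucket
    - name: labels
      mountPoint: s3://bucket-b/labels/   # hypothetical bucket
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: multi-source                # must match the Dataset name
  namespace: ml-training
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm              # RAM-backed cache directory
        quota: 16Gi
        high: "0.95"
        low: "0.7"
```

Because of the POSIX gaps noted above, it is worth testing the actual data loader against such a mount before committing to it.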

JuiceFS provides full POSIX compliance with a metadata/data separation architecture. Metadata (directory trees, file attributes, block maps) is stored in a fast metadata engine (Redis, TiKV, or a managed database), while file content is stored as fixed-size chunks in an object store. This separation allows JuiceFS to scale metadata operations independently of data access [Source: https://juicefs.com/en/blog/usage-tips/train-large-language-model-kubernetes-storage].

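When used with Fluid, JuiceFS launches two component types per worker node: a FUSE pod, which exposes the POSIX interface to training pods, and a worker pod, which manages the cache lifecycle.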

Data loading performance through the three-layer JuiceFS cache stack (page cache → NVMe SSD → object store) can reach 300–800 MB/s or higher per node, enabling fast multi-epoch training without saturating the backend object store on every pass [Source: https://www.alibabacloud.com/blog/601515].

Figure 3.5: JuiceFS three-layer cache stack — read path per training node

flowchart TD
    POD["Training Pod\nread request"]
    L1["Layer 1: Kernel Page Cache\n(RAM — sub-microsecond latency)"]
    L2["Layer 2: Local NVMe SSD\n(300–800 MB/s per node)"]
    L3["Layer 3: Object Store\n(S3 / MinIO — network latency)"]
    META["Metadata Engine\n(Redis / TiKV / managed DB)"]

    POD -->|"1. check page cache"| L1
    L1 -->|"miss → check SSD"| L2
    L2 -->|"miss → fetch from object store"| L3
    L3 -->|"populate SSD cache"| L2
    L2 -->|"populate page cache"| L1
    L1 -->|"data returned to pod"| POD
    POD <-->|"file metadata lookup"| META

The table below compares Alluxio and JuiceFS for AI workloads:

| Feature | Alluxio | JuiceFS |
|---|---|---|
| POSIX compliance | Partial (missing truncate, symlinks, xattr) | Full |
| Metadata architecture | Integrated | Separated (pluggable metadata engine) |
| Kubernetes-native CRDs | Via Fluid | Via Fluid |
| Multi-storage unification | Strong (HDFS, S3, NFS, GCS) | S3-compatible backends |
| Cache performance | High | Very high (300–800 MB/s per node) |
| Operational complexity | High | Moderate |
| On-premises support | Yes | Yes |
| Cloud-managed option | No | Yes (JuiceFS Cloud) |

[Source: https://juicefs.com/en/blog/user-stories/juicefs-vs-alluxio-ai-storage-naver] [Source: https://juicefs.com/en/blog/user-stories/ai-storage-life-sciences-solution-juicefs-vs-lustre-alluxio]

Prefetching and Data Locality Strategies

Even the fastest cache is ineffective if data arrives after the GPU is already waiting for it. Prefetching — loading the next batch of data into cache while the current batch is being processed — is the key technique for hiding I/O latency behind computation.

Kubernetes provides several mechanisms to implement prefetching and data locality:

1. Fluid DataLoad (pre-warming)

Before submitting a training job, issue a DataLoad CRD that instructs Fluid to pull the entire dataset (or a specified subset) into the local cache. The training job then starts with a warm cache and experiences no cold-start penalty.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-preload
  namespace: ml-training
spec:
  dataset:
    name: imagenet
    namespace: ml-training
  loadMetadata: true
  target:
    - path: /train
      replicas: 1

2. List caching on CSI FUSE drivers

For GKE Cloud Storage FUSE and Lustre CSI drivers, enabling list caching (via flags like --kernel-list-cache-ttl-secs) caches directory and file listings in the kernel. This is particularly beneficial for AI/ML training workloads that repeatedly iterate full directory listings to build file indexes at the start of each epoch [Source: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf].

3. File cache for multi-epoch training

The file cache feature in Cloud Storage FUSE stores frequently accessed data on local node storage. For multi-epoch training where the same files are read dozens of times, this eliminates redundant network fetches after the first epoch [Source: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf].
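On GKE, both the list cache and the file cache are set through mount options on the Cloud Storage FUSE CSI volume. The sketch below is illustrative only: the bucket, service account, and image are hypothetical, and the exact option strings should be checked against the GKE documentation for the driver version in use.

```yaml
# Sketch: Cloud Storage FUSE volume with list caching and file caching enabled.
apiVersion: v1
kind: Pod
metadata:
  name: gcsfuse-trainer
  namespace: ml-training
  annotations:
    gke-gcsfuse/volumes: "true"     # asks GKE to inject the Cloud Storage FUSE sidecar
spec:
  serviceAccountName: trainer-ksa   # hypothetical service account with access to the bucket
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # hypothetical image
      volumeMounts:
        - name: dataset
          mountPath: /data/train
          readOnly: true
  volumes:
    - name: dataset
      csi:
        driver: gcsfuse.csi.storage.gke.io
        readOnly: true
        volumeAttributes:
          bucketName: my-training-bucket   # hypothetical bucket
          # Assumed option syntax: list cache TTL of one hour, unlimited file cache size
          mountOptions: "implicit-dirs,file-system:kernel-list-cache-ttl-secs:3600,file-cache:max-size-mb:-1"
```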

4. Data-affinity pod scheduling

Fluid annotates nodes with information about which dataset partitions they currently cache. Combined with nodeAffinity or podAffinity rules, training pods can be scheduled onto nodes where their required data already resides. This transforms a potential network read into a local memory or SSD read — the lowest latency option available.
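Expressed as a scheduling hint, data affinity might look like the sketch below. The node label key is an assumption about how cache nodes are labelled for the imagenet dataset in the ml-training namespace; inspect the labels actually present on the nodes before using it.

```yaml
# Sketch: prefer nodes that already cache the dataset.
apiVersion: v1
kind: Pod
metadata:
  name: affinity-trainer
  namespace: ml-training
spec:
  restartPolicy: Never
  affinity:
    nodeAffinity:
      # "preferred" keeps the pod schedulable even when no cached node has capacity
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: fluid.io/s-ml-training-imagenet   # assumed label key
                operator: In
                values:
                  - "true"
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest    # hypothetical image
```

A preferred rule trades strict locality for schedulability; a required rule would leave the pod Pending if every cached node were full.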

5. Storage QoS tiering

Platform teams should enforce per-volume IOPS and bandwidth limits to create distinct performance tiers. Training volumes receive high burst throughput, but a cap prevents them from consuming the entire network fabric during checkpoint writes. Inference volumes receive a guaranteed IOPS floor that is protected from training load spikes [Source: https://portworx.com/knowledge-hub/kubernetes-ai/]. This QoS enforcement is the storage equivalent of CPU resource requests and limits: it prevents noisy neighbours and ensures predictable latency for production inference endpoints.
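Where the backend lacks explicit QoS controls, provisioned per-volume performance can approximate tiers. The sketch below uses the EBS CSI driver from earlier in this chapter; the class names and the specific IOPS and throughput figures are assumptions chosen for illustration.

```yaml
# Sketch: two performance tiers expressed as StorageClasses.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: training-burst              # hypothetical tier for training volumes
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"                     # per-volume IOPS ceiling
  throughput: "1000"                # per-volume throughput ceiling in MiB/s
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: inference-guaranteed        # hypothetical tier for inference volumes
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "20000"                     # provisioned IOPS reserved for each volume
volumeBindingMode: WaitForFirstConsumer
```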

Key Takeaway: Caching is not optional for production AI training — it is the mechanism that prevents GPU idle time caused by storage I/O bottlenecks. Fluid provides a Kubernetes-native control plane for managing caching engines (Alluxio, JuiceFS) as observable, elastically scalable services with data-affinity scheduling, automated prefetching, and GitOps-compatible data operations.


Chapter Summary

This chapter surveyed the complete storage stack for AI pipelines on Kubernetes, from foundational abstractions to advanced caching strategies.

Section 1 established that training, checkpointing, and inference carry fundamentally different I/O profiles. Training demands sustained high-throughput sequential or randomised reads across repeated epochs; inference requires low-latency random reads; checkpointing imposes periodic write bursts. A single storage tier cannot optimally serve all three.

Section 2 covered the Kubernetes storage vocabulary — PersistentVolumes, PVCs, StorageClasses, and CSI drivers — that provides a vendor-agnostic abstraction layer over diverse backends. The ReadWriteMany access mode, supported by network and parallel filesystems but not block storage, is essential for distributed training jobs where dozens to hundreds of pods read the same dataset concurrently.

Section 3 explored the leading storage backends: Lustre and BeeGFS for high-throughput parallel file access; NFS for simple shared access; MinIO and S3-compatible object stores for dataset ingestion and model artefact distribution; local NVMe for maximum-speed scratch computation; and Rook-Ceph for a unified on-premises storage platform that provides block, file, and object interfaces from commodity hardware.

Section 4 demonstrated how Fluid, Alluxio, and JuiceFS close the performance gap between remote storage and GPU-local compute. Fluid’s Elastic Dataset abstraction and data-affinity scheduler turn distributed caching systems into first-class Kubernetes citizens. JuiceFS’s three-layer cache stack (page cache → NVMe SSD → object store) achieves 300–800 MB/s per node while maintaining full POSIX compliance. Prefetching via DataLoad CRDs, list caching, and file caching eliminate cold-start penalties before training epochs begin.


Key Terms

| Term | Definition |
|---|---|
| PersistentVolume (PV) | A cluster-level Kubernetes resource representing a piece of durable storage provisioned independently of any pod |
| PersistentVolumeClaim (PVC) | A namespaced request for storage made by a workload, decoupling pod configuration from storage infrastructure |
| StorageClass | A Kubernetes object that defines a profile of storage (provisioner, performance parameters, reclaim policy) and enables dynamic volume provisioning |
| CSI driver | A Container Storage Interface plugin that implements the CSI standard to provision, attach, and mount volumes from a specific storage backend without modifying Kubernetes core |
| ReadWriteMany (RWX) | A PersistentVolume access mode allowing the volume to be mounted read-write by multiple nodes simultaneously; required for distributed training |
| Lustre | A high-performance parallel filesystem that stripes data across multiple object storage servers; used in HPC and large-scale distributed ML training |
| Rook-Ceph | A Kubernetes operator that deploys and manages Ceph distributed storage, providing block (RBD), file (CephFS), and object (RGW) interfaces from within the cluster |
| Checkpoint | A periodic snapshot of model weights, optimizer state, and training metadata written to durable storage so training can resume after a failure |
| Data locality | The principle of scheduling compute pods on nodes where their required data is already cached, minimising network transfers and storage latency |
| Fluid | A CNCF sandbox project providing Kubernetes-native Dataset and Runtime CRDs that wrap distributed caching engines as observable, elastically scalable services with data-affinity scheduling |
| Alluxio | A data orchestration platform providing a unified namespace over multiple storage backends as a caching layer; lacks full POSIX API support |
| JuiceFS | A POSIX-compatible distributed filesystem with separated metadata and data architecture, capable of 300–800 MB/s per-node cache performance via a three-layer cache stack |
| Elastic Dataset | Fluid’s core abstraction that decouples application data access from the underlying storage platform, enabling portable data pipelines across diverse Kubernetes environments |
| IOPS | Input/Output Operations Per Second; a measure of storage performance critical for random-access workloads such as small-file dataset reads during training |

Chapter 4: Training Workloads: Jobs, Operators, and Frameworks

Learning Objectives

By the end of this chapter, you will be able to:


4.1 Kubernetes Job Primitives for Training

Before reaching for a specialized ML operator, it is worth understanding the building blocks that Kubernetes already provides. Native Job primitives handle a surprisingly wide range of training workloads, and the higher-level operators covered later in this chapter are themselves built on top of them.

4.1.1 Jobs and CronJobs for Batch Training

A Kubernetes Job creates one or more Pods and retries them until a specified number complete successfully. When that number is reached (one, by default), the Job is marked complete. For a single-node training run, such as a nightly fine-tuning pass on a small model, a plain Job is often all you need.

apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-bert-small
spec:
  backoffLimit: 3          # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: myregistry/bert-trainer:v2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
          command: ["python", "train.py", "--epochs=10", "--output=/checkpoints"]
          volumeMounts:
            - name: checkpoint-vol
              mountPath: /checkpoints
      volumes:
        - name: checkpoint-vol
          persistentVolumeClaim:
            claimName: training-checkpoints-pvc

A CronJob wraps a Job with a schedule expression, making it straightforward to run recurring model refreshes — for example, retraining a recommendation model every night from the latest day’s data.

Figure 4.1: Kubernetes Job and CronJob lifecycle for training workloads

flowchart TD
    A[User applies Job manifest] --> B[Kubernetes Job Controller]
    B --> C[Creates Pod]
    C --> D{Pod executes training script}
    D -->|Success| E[Pod status: Completed]
    D -->|Failure| F{backoffLimit reached?}
    F -->|No| C
    F -->|Yes| G[Job status: Failed]
    E --> H[Job status: Complete]

    CJ[CronJob schedule triggers] -->|"0 2 * * *"| B2[Job Controller spawns new Job]
    B2 --> C2[Creates Pod]
    C2 --> D2[Executes retraining script]
    D2 -->|Success| H2[Job Complete — next run waits for next schedule]

    style A fill:#4A90D9,color:#fff
    style CJ fill:#7B68EE,color:#fff
    style H fill:#27AE60,color:#fff
    style H2 fill:#27AE60,color:#fff
    style G fill:#E74C3C,color:#fff

The CronJob manifest for this nightly schedule:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"   # 02:00 daily in the controller's time zone (usually UTC); set spec.timeZone to pin it
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: myregistry/rec-trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: "1"

4.1.2 Indexed Jobs for Parallel Work Distribution

When you want to fan out independent work across multiple Pods — for example, processing 100 data shards in parallel or running a hyperparameter sweep — the Indexed Job is the right primitive. Each Pod receives a unique JOB_COMPLETION_INDEX environment variable (0, 1, 2, …), which your training script can use to select its shard.

apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparam-sweep
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed   # each Pod gets a unique index
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: myregistry/sweep-trainer:v1
          command:
            - python
            - sweep.py
            - "--config-index=$(JOB_COMPLETION_INDEX)"
          resources:
            limits:
              nvidia.com/gpu: "1"

Analogy: Think of an Indexed Job as a restaurant kitchen where each chef is assigned a numbered ticket. Chef 0 handles order 0, Chef 1 handles order 1, and so on. The kitchen manager (Kubernetes) ensures every ticket is completed and can re-assign a ticket if a chef calls in sick (pod failure).
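The manifest above passes the completion index to sweep.py, but the script itself is not shown. The following is a minimal sketch of how such a script might turn the index into a hyperparameter configuration. The search space (LEARNING_RATES, BATCH_SIZES) and the helper names are assumptions made for illustration; only the --config-index flag and the JOB_COMPLETION_INDEX variable come from the text.

```python
import argparse
import itertools
import os

# Hypothetical search space: 4 learning rates x 2 batch sizes = 8 configurations,
# matching completions: 8 in the manifest above.
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3]
BATCH_SIZES = [32, 64]


def config_for_index(index):
    """Map a completion index to exactly one (learning rate, batch size) pair."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES))
    if not 0 <= index < len(grid):
        raise ValueError(f"index {index} is outside the grid of size {len(grid)}")
    lr, batch_size = grid[index]
    return {"lr": lr, "batch_size": batch_size}


def resolve_index(argv=None, env=None):
    """Prefer the --config-index flag; fall back to JOB_COMPLETION_INDEX."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-index", type=int, default=None)
    args = parser.parse_args(argv)
    if args.config_index is not None:
        return args.config_index
    return int(env["JOB_COMPLETION_INDEX"])
```

Because the mapping from index to configuration is deterministic, a Pod that is retried after a failure re-runs the same configuration rather than a random one.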

Figure 4.2: Indexed Job fan-out for parallel hyperparameter sweep

flowchart LR
    IJ["Indexed Job\ncompletions=8\nparallelism=8"]
    IJ --> P0["Pod 0\nJOB_COMPLETION_INDEX=0\nconfig-index=0"]
    IJ --> P1["Pod 1\nJOB_COMPLETION_INDEX=1\nconfig-index=1"]
    IJ --> P2["Pod 2\nJOB_COMPLETION_INDEX=2\nconfig-index=2"]
    IJ --> Pdot["..."]
    IJ --> P7["Pod 7\nJOB_COMPLETION_INDEX=7\nconfig-index=7"]

    P0 --> R0[Result 0]
    P1 --> R1[Result 1]
    P2 --> R2[Result 2]
    P7 --> R7[Result 7]

    R0 & R1 & R2 & R7 --> AGG[Aggregate best hyperparameters]

    style IJ fill:#4A90D9,color:#fff
    style AGG fill:#27AE60,color:#fff

4.1.3 Init Containers for Data Preparation

Training jobs often need data to be staged before the main training container starts. Init containers run to completion before any app containers start in the same Pod, making them a clean mechanism for downloading datasets, decompressing archives, or validating checksums.

spec:
  initContainers:
    - name: download-dataset
      image: amazon/aws-cli:latest
      command:
        - aws
        - s3
        - sync
        - s3://my-datasets/imagenet-shards/
        - /data/
      volumeMounts:
        - name: dataset-vol
          mountPath: /data
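  containers:
    - name: trainer
      image: myregistry/resnet-trainer:v3
      volumeMounts:
        - name: dataset-vol
          mountPath: /data
  volumes:
    - name: dataset-vol
      persistentVolumeClaim:
        claimName: dataset-pvc   # hypothetical claim name; substitute your own PVC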

The training container does not start until the download-dataset init container exits successfully. If the download fails, the kubelet reruns the init container when the Pod’s restartPolicy is OnFailure or Always; with restartPolicy: Never, the whole Pod is marked failed. Either way, training never starts with an incomplete dataset. [Source: https://collabnix.com/building-distributed-training-systems-on-kubernetes-a-complete-guide/]
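Checksum validation, mentioned above as a typical init-container task, could be sketched as follows. The manifest format (one "sha256  relative/path" entry per line, as produced by the sha256sum tool) and the function names are assumptions for illustration, not part of any dataset described in this guide.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(data_dir, manifest_path):
    """Return the relative paths that are missing or whose checksum differs.

    Assumed manifest layout (hypothetical): "<sha256>  <relative path>" per line.
    """
    bad = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = Path(data_dir) / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

An init container running this check would exit with a non-zero status whenever verify_manifest returns a non-empty list, which keeps the training container from starting.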

Figure 4.3: Init container sequence for data preparation before training

sequenceDiagram
    participant K as Kubelet
    participant I as Init Container<br/>(download-dataset)
    participant S as Shared Volume<br/>(/data PVC)
    participant T as Training Container<br/>(resnet-trainer)

    K->>I: Start init container
    I->>S: aws s3 sync → /data/
    alt Download succeeds
        I-->>K: Exit 0 (success)
        K->>T: Start training container
        T->>S: Read /data/ for training
        T-->>K: Training complete → Exit 0
    else Download fails
        I-->>K: Exit non-zero (failure)
        K->>I: Retry init container (restartPolicy)
    end

Key Takeaway: Native Kubernetes Job primitives — plain Jobs, CronJobs, and Indexed Jobs — cover single-node and embarrassingly-parallel training workloads without additional tooling. Init containers provide a clean pre-flight hook for data preparation and validation.


4.2 Kubeflow Training Operator

Once you need tightly coupled distributed training — where multiple workers must communicate gradient updates in real time — native Jobs become awkward. You would need to manually wire up environment variables, coordinate startup order, and handle failure recovery. The Kubeflow Training Operator solves this by providing Kubernetes-native custom resources (CRDs) for the most popular deep learning frameworks. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/]

4.2.1 Installing and Configuring Training Operator

The simplest installation applies the standalone overlay from the official manifest:

kubectl apply -k \
  "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

This creates the Training Operator deployment in the kubeflow namespace, registers the PyTorchJob, TFJob, MPIJob, MXNetJob, and XGBoostJob CRDs, and sets up the RBAC rules the operator needs to manage Pods on your behalf. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/getting-started/]

Verify the operator is running:

kubectl get pods -n kubeflow
# NAME                                READY   STATUS    RESTARTS
# training-operator-5d97c5f67-xk9qp  1/1     Running   0

4.2.2 PyTorchJob, TFJob, and MPIJob Custom Resources

The three most commonly used custom resources are summarized below.

Custom Resource | Framework | Communication Pattern | Typical Use Case
PyTorchJob | PyTorch | NCCL / Gloo / MPI | DDP, FSDP, tensor parallelism
TFJob | TensorFlow | gRPC parameter server or AllReduce | Classic TF distributed strategies
MPIJob | Any MPI-based | MPI (OpenMPI / MPICH) | Horovod, custom HPC workloads

PyTorchJob example — 1 master + 3 workers:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-ddp
  namespace: training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/ddp-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=8
                - train_ddp.py
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/ddp-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=8
                - train_ddp.py

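When the Training Operator reconciles this resource, it automatically injects the following environment variables into every Pod: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK. Your training code uses these directly through torch.distributed.init_process_group(), with no manual coordination required. Note that when each Pod launches several processes through torch.distributed.run, as in the example above, the launcher assigns every process its own RANK and LOCAL_RANK and a WORLD_SIZE that counts processes rather than Pods. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/reference/distributed-training/]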
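To make the role of these four variables concrete, here is a small sketch that reads and validates them before any distributed initialization is attempted. The Rendezvous dataclass and read_rendezvous function are hypothetical helpers, not part of PyTorch or the Training Operator; only the variable names come from the text.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendezvous:
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_rendezvous(env=None):
    """Parse the four rendezvous variables, failing early if any is missing."""
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing rendezvous variables: {', '.join(missing)}")
    rdzv = Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
    if not 0 <= rdzv.rank < rdzv.world_size:
        raise RuntimeError(f"rank {rdzv.rank} is outside a world of size {rdzv.world_size}")
    return rdzv
```

Failing fast with a clear message is useful when the same image is also run outside the operator (for example in a plain Job), where these variables are not injected.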

Figure 4.4: Kubeflow Training Operator reconciliation flow for a PyTorchJob

flowchart TD
    User["User applies PyTorchJob YAML"] --> API[Kubernetes API Server]
    API --> CRD["PyTorchJob CRD stored in etcd"]
    CRD --> TO["Training Operator Controller\n(watches PyTorchJob events)"]

    TO --> MP["Create Master Pod\n(rank=0, 8 GPUs)"]
    TO --> W1["Create Worker Pod 1\n(rank=1, 8 GPUs)"]
    TO --> W2["Create Worker Pod 2\n(rank=2, 8 GPUs)"]
    TO --> W3["Create Worker Pod 3\n(rank=3, 8 GPUs)"]

    MP & W1 & W2 & W3 --> ENV["Inject env vars:\nMASTER_ADDR, MASTER_PORT\nWORLD_SIZE=4, RANK=n"]
    ENV --> DDP["torch.distributed.init_process_group\nNCCL backend → training begins"]

    TO --> STATUS["Update PyTorchJob\nstatus conditions"]

    style User fill:#4A90D9,color:#fff
    style TO fill:#7B68EE,color:#fff
    style DDP fill:#27AE60,color:#fff

TFJob example — parameter server strategy:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-parameter-server
  namespace: training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: myregistry/tf-ps-trainer:v2
              resources:
                limits:
                  nvidia.com/gpu: "0"   # PS is CPU-bound
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: myregistry/tf-ps-trainer:v2
              resources:
                limits:
                  nvidia.com/gpu: "4"

MPIJob example — Horovod training:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-resnet
  namespace: training
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: myregistry/horovod-trainer:v1
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "16"
                - python
                - train_horovod.py
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpi-worker
              image: myregistry/horovod-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"

4.2.3 Worker Pod Topology and Launcher/Worker Patterns

The Training Operator supports two broad topologies:

Master/Worker (PyTorchJob, TFJob): One designated master Pod coordinates the distributed rendezvous, while worker Pods connect back to it. The master is typically rank 0 in the process group. This pattern fits tightly-coupled synchronous training where all ranks participate in AllReduce operations.

Launcher/Worker (MPIJob): A lightweight launcher Pod runs mpirun to spawn and coordinate worker processes across the worker Pods. The launcher itself performs no model computation; it only manages the MPI process topology. Workers execute the actual training code. [Source: https://oneuptime.com/blog/post/2026-02-09-pytorch-kubeflow-training-operator/view]

┌──────────────────────────────────────────────────────────┐
│  PyTorchJob / TFJob Pattern          MPIJob Pattern        │
│                                                            │
│   ┌─────────┐                        ┌──────────┐         │
│   │ Master  │◄──rendezvous──┐        │ Launcher │         │
│   │ (rank 0)│               │        │ (mpirun) │         │
│   └────┬────┘               │        └────┬─────┘         │
│        │ AllReduce          │             │ SSH/MPI        │
│   ┌────▼────┐         ┌─────┴───┐   ┌────▼────┐  ┌──────┐│
│   │Worker 1 │         │Worker 2 │   │Worker 0 │  │Worker││
│   │(rank 1) │         │(rank 2) │   │(slots:8)│  │  1   ││
│   └─────────┘         └─────────┘   └─────────┘  └──────┘│
└──────────────────────────────────────────────────────────┘

4.2.4 Gang Scheduling with Volcano or Coscheduling

A critical challenge with distributed training on shared Kubernetes clusters is partial scheduling deadlock: if only 3 of 4 required worker Pods can be scheduled (because node resources are fragmented), those 3 Pods sit idle waiting for the 4th — consuming GPU resources without doing any work.

Gang scheduling solves this by treating all Pods in a job as an atomic unit: either all Pods are scheduled simultaneously, or none are. [Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/]
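The all-or-nothing rule can be illustrated with a toy placement function. This is a simplified sketch and not how Volcano or the coscheduling plugin is implemented: it uses a first-fit-decreasing heuristic over a hypothetical map of free GPUs per node, and it either returns a placement for every Pod or returns None without reserving anything.

```python
def gang_schedule(free_gpus_per_node, pod_gpu_requests):
    """Place every Pod in the gang, or none of them.

    free_gpus_per_node: dict mapping node name -> free GPU count.
    pod_gpu_requests: list of GPU counts, one entry per Pod in the gang.
    Returns a list of node names (one per Pod), or None if the gang does not fit.
    """
    remaining = dict(free_gpus_per_node)   # work on a copy; nothing is reserved on failure
    placement = [None] * len(pod_gpu_requests)
    # Place the largest requests first (first-fit decreasing heuristic).
    order = sorted(range(len(pod_gpu_requests)), key=lambda i: -pod_gpu_requests[i])
    for i in order:
        need = pod_gpu_requests[i]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None                    # all-or-nothing: discard the partial placement
        remaining[node] -= need
        placement[i] = node
    return placement
```

With three free 8-GPU nodes and a gang of four 8-GPU Pods the function returns None, so no GPUs are held idle; with four free nodes it returns one node per Pod.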

Two options are commonly used:

Scheduler | Install Method | Key Feature
Volcano | kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml | Full-featured batch scheduler with queue management and fair-share policies
Coscheduling (scheduler-plugins) | Scheduler plugin, lower overhead | Lightweight gang scheduling as a plugin to the default kube-scheduler

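To use Volcano with a PyTorchJob, the Training Operator itself must run with gang scheduling enabled (in recent v1 releases this is the --gang-scheduler-name=volcano operator flag; check the documentation for your version). The job then requests all-or-nothing placement:

metadata:
  annotations:
    scheduling.volcano.sh/group-name: pytorch-ddp-group
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 4   # all 4 Pods (1 master + 3 workers) must be schedulable before any start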

Key Takeaway: The Kubeflow Training Operator abstracts away the complexity of distributed job coordination. It automatically injects rendezvous environment variables, manages Pod lifecycles, and integrates with gang schedulers like Volcano to prevent partial-allocation deadlocks.


4.3 Distributed Training Strategies

Choosing the right parallelism strategy is one of the most consequential decisions in large-scale ML training. The wrong choice can waste GPU memory, saturate the network, or make training 10x slower than necessary.

Analogy: Think of training a large model like writing an encyclopedia. Data parallelism means you have 8 identical editorial teams, each working through a different section of source material and comparing notes at the end of each day. Model parallelism means one team writes Volume 1, another writes Volume 2, etc. Pipeline parallelism is the assembly-line version: one team drafts, the next team edits, the next team fact-checks, all working in parallel on different articles.

4.3.1 Data Parallelism with PyTorch DDP and Horovod

Data parallelism is the most widely used strategy. Each worker holds a complete copy of the model and trains on a different slice of the data. At the end of each step, all workers synchronize their gradients via an AllReduce collective operation — every worker ends up with identical, averaged gradients. [Source: https://towardsdatascience.com/distributed-parallel-training-data-parallelism-and-model-parallelism-ec2d234e3214/]

Worker 0: batch_0 → forward → backward → ──┐
Worker 1: batch_1 → forward → backward → ──┤ AllReduce → averaged gradients
Worker 2: batch_2 → forward → backward → ──┤  (NCCL)  → all workers update
Worker 3: batch_3 → forward → backward → ──┘
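The effect of the AllReduce step can be shown with plain Python lists standing in for gradient tensors. This is only a sketch of the arithmetic (sum, then divide by the number of workers); real implementations such as NCCL use ring or tree algorithms to compute the same result without gathering everything in one place.

```python
def allreduce_mean(per_worker_grads):
    """Return the element-wise mean gradient, replicated once per worker."""
    world_size = len(per_worker_grads)
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    summed = [sum(values) for values in zip(*per_worker_grads)]
    averaged = [s / world_size for s in summed]
    # Every worker receives its own copy of the same averaged gradient.
    return [list(averaged) for _ in range(world_size)]


grads = [
    [1.0, 2.0],   # worker 0, computed from batch_0
    [3.0, 4.0],   # worker 1, computed from batch_1
    [5.0, 6.0],   # worker 2, computed from batch_2
    [7.0, 8.0],   # worker 3, computed from batch_3
]
result = allreduce_mean(grads)   # four identical copies of [4.0, 5.0]
```

Because every worker ends up with the same gradient and applies the same optimizer update, the model replicas stay in sync without ever exchanging weights.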

Figure 4.5: Data parallelism AllReduce synchronization across workers

flowchart TD
    subgraph W0["Worker 0 (GPU 0)"]
        D0[Batch shard 0] --> F0[Forward pass]
        F0 --> B0[Backward pass]
        B0 --> G0[Local gradients]
    end
    subgraph W1["Worker 1 (GPU 1)"]
        D1[Batch shard 1] --> F1[Forward pass]
        F1 --> B1[Backward pass]
        B1 --> G1[Local gradients]
    end
    subgraph W2["Worker 2 (GPU 2)"]
        D2[Batch shard 2] --> F2[Forward pass]
        F2 --> B2[Backward pass]
        B2 --> G2[Local gradients]
    end
    subgraph W3["Worker 3 (GPU 3)"]
        D3[Batch shard 3] --> F3[Forward pass]
        F3 --> B3[Backward pass]
        B3 --> G3[Local gradients]
    end

    G0 & G1 & G2 & G3 --> AR["AllReduce\n(NCCL)\nSum + average gradients"]
    AR --> UPD["All workers receive\nidentical averaged gradients"]
    UPD --> OPT["optimizer.step()\nWeights updated identically\non all workers"]

    style AR fill:#E67E22,color:#fff
    style OPT fill:#27AE60,color:#fff

Key implementation considerations:

Concern | Recommendation
Data sharding | Use torch.utils.data.DistributedSampler so each worker sees a unique shard with no overlap [Source: https://www.digitalocean.com/community/conceptual-articles/data-parallelism-distributed-training]
Learning rate | A common heuristic is to scale linearly with world size (lr = base_lr * world_size), usually combined with warmup; validate it on your model, since the rule can break down at very large batch sizes [Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/]
Batch size | Global batch size = per-worker batch size × world size; warmup scheduling helps convergence
Communication backend | NCCL for GPU-to-GPU; Gloo for CPU or debugging; NCCL with InfiniBand/RoCE for optimal throughput [Source: https://www.zettabyte.space/blog-new/kubernetes-for-ai-container-orchestration-best-practices]
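The batch-size and learning-rate rows above reduce to a few lines of arithmetic. The sketch below assumes a linear warmup that ramps from the base rate to the scaled rate over warmup_steps; that particular warmup shape and the function name are illustrative choices, not something the recommendations prescribe.

```python
def scaled_hyperparams(base_lr, per_worker_batch, world_size, warmup_steps):
    """Return (global batch size, target learning rate, lr schedule function)."""
    global_batch = per_worker_batch * world_size
    target_lr = base_lr * world_size          # linear scaling heuristic

    def lr_at(step):
        # Ramp linearly from base_lr to target_lr, then hold the target.
        if warmup_steps <= 0 or step >= warmup_steps:
            return target_lr
        return base_lr + (target_lr - base_lr) * step / warmup_steps

    return global_batch, target_lr, lr_at


# Example: 4 Pods x 8 GPUs = 32 workers, each with a batch of 64.
global_batch, target_lr, lr_at = scaled_hyperparams(
    base_lr=1e-4, per_worker_batch=64, world_size=32, warmup_steps=100
)
```

Here the global batch size is 64 × 32 = 2048, and the learning rate climbs from 1e-4 at step 0 to the scaled target at step 100.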

PyTorch DDP training script skeleton:

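import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# MyModel and MyDataset are placeholders for your own model and dataset classes.

def main():
    # Reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK from the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])   # set per process by torch.distributed.run

    torch.cuda.set_device(local_rank)
    model = MyModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = MyDataset(...)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, sampler=sampler, batch_size=64)   # per-worker batch size

    # Linear learning-rate scaling heuristic; pair it with warmup for large world sizes.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4 * world_size)

    for epoch in range(100):
        sampler.set_epoch(epoch)   # ensures different shuffling each epoch
        for batch in loader:
            batch = batch.to(local_rank, non_blocking=True)   # assumes each batch is a tensor
            optimizer.zero_grad()
            loss = model(batch)    # placeholder: assumes forward() returns the loss
            loss.backward()        # DDP all-reduces gradients during the backward pass
            optimizer.step()

    dist.destroy_process_group()   # release communicator resources on clean exit

if __name__ == "__main__":
    main()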

Horovod provides a framework-agnostic alternative for data parallelism, usable with PyTorch, TensorFlow, and MXNet through a consistent API. It is particularly well-suited to MPIJob deployments. [Source: https://www.min.io/learn/distributed-training]

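import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, started by mpirun
torch.cuda.set_device(hvd.local_rank())      # pin this process to its local GPU

model = MyModel().cuda()                     # MyModel is a placeholder for your model class
# Linear learning-rate scaling heuristic: multiply the base rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Wrap the optimizer so gradients are averaged across workers before each step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Start every worker from rank 0's weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)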

4.3.2 Model Parallelism and Tensor Parallelism

When a model is too large to fit in a single GPU’s VRAM, model parallelism distributes the model itself across multiple devices. Different layers (or layer groups) live on different GPUs. Each GPU processes the full mini-batch but only through its own portion of the network, passing activations forward and gradients backward across device boundaries. [Source: https://medium.com/@sulaiman.shamasna/distributed-model-training-5b460f2af482]

GPU 0: Embedding + Layers 0–11  ──activations──►  GPU 1: Layers 12–23  ──►  GPU 2: Layers 24–35 + Head
                                 ◄──gradients────                         ◄──

Tensor parallelism is a finer-grained form of model parallelism where individual weight matrices are sharded across GPUs. For example, in a transformer’s attention block, the Q/K/V projection matrices can be column-partitioned, computed in parallel, and the results gathered before the next operation. [Source: https://www.kubeflow.org/docs/components/trainer/legacy-v1/reference/distributed-training/]
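The column-partitioning idea can be checked on a tiny example. The sketch below uses plain Python lists instead of GPU tensors and hypothetical helper names; it shows that splitting a projection matrix into column blocks, multiplying each block independently, and concatenating the partial outputs gives the same result as the unsplit product.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks, one per GPU."""
    cols = len(w[0])
    if cols % parts:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes x @ w_shard; concatenating the partial outputs
    column-wise (the gather step) reproduces x @ w."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partial_outputs), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]                  # 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 x 4 projection matrix
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
```

Each shard needs the full input x but only its own slice of w, which is why the per-GPU weight memory shrinks while the activations must still be communicated.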

PyTorch’s Fully Sharded Data Parallel (FSDP) combines model and data parallelism: model parameters, gradients, and optimizer states are all sharded across workers. FSDP2 (the next-generation implementation) improves memory efficiency further and is directly supported through the Training Operator. [Source: https://pytorch.org/blog/pytorch-on-kubernetes-kubeflow-trainer-joins-the-pytorch-ecosystem/]

StrategyMemory per GPUCommunication OverheadBest For
Data Parallel (DDP)Full model copyGradient AllReduceModels that fit in GPU VRAM
Model ParallelFraction of modelActivation transfer between stagesExtremely large models, vertical split
Tensor ParallelFraction of tensorsAllReduce within each layerTransformer attention layers
FSDPFraction of model + optimizerAllGather + ReduceScatterLarge models, memory-efficient training

4.3.3 Pipeline Parallelism with DeepSpeed and Megatron-LM

Pipeline parallelism is a specialized form of model parallelism that divides the model into sequential stages, one per device, and then streams multiple micro-batches through the pipeline simultaneously. While Stage 0 processes micro-batch 2, Stage 1 processes micro-batch 1, so computation overlaps across stages. [Source: https://towardsdatascience.com/distributed-parallel-training-data-parallelism-and-model-parallelism-ec2d234e3214/]

Time →
Stage 0: [MB1 fwd] [MB2 fwd] [MB3 fwd] [ idle  ] [ idle  ] [ idle  ] [ idle  ] [MB3 bwd] [MB2 bwd] [MB1 bwd]
Stage 1: [ idle  ] [MB1 fwd] [MB2 fwd] [MB3 fwd] [ idle  ] [ idle  ] [MB3 bwd] [MB2 bwd] [MB1 bwd] [ idle  ]
Stage 2: [ idle  ] [ idle  ] [MB1 fwd] [MB2 fwd] [MB3 fwd] [MB3 bwd] [MB2 bwd] [MB1 bwd] [ idle  ] [ idle  ]
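The idle slots in a pipeline schedule are often called the pipeline bubble. For a simple schedule in which all forward passes run before the backward passes and every stage takes the same time per micro-batch (both are simplifying assumptions), the idle fraction is (p − 1) / (m + p − 1) for p stages and m micro-batches. A small sketch, with a hypothetical function name:

```python
def bubble_fraction(num_stages, num_microbatches):
    """Fraction of each stage's time spent idle in a simple fill-and-drain schedule.

    Assumes equal time per stage and per micro-batch.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and micro-batches must both be at least 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


three_by_three = bubble_fraction(num_stages=3, num_microbatches=3)
many_microbatches = bubble_fraction(num_stages=4, num_microbatches=32)
```

With 3 stages and 3 micro-batches, 40% of each stage's time is idle; raising the micro-batch count to 32 on a 4-stage pipeline cuts that to 3/35, or under 9%. This is why pipeline schedules favor many small micro-batches.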

DeepSpeed implements pipeline parallelism alongside its well-known ZeRO memory optimization stages. ZeRO partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and model parameters (ZeRO-3) across data-parallel ranks, dramatically reducing the per-GPU memory footprint for very large models. [Source: https://oneuptime.com/blog/post/2026-01-27-kubeflow-model-training/view]
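The memory effect of the three ZeRO stages can be estimated with the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 master weights, momentum, and variance). The sketch below applies that accounting; it covers model state only and ignores activations, buffers, and CPU offload, and the function name is illustrative.

```python
def zero_model_state_gb(num_params, dp_degree, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 10**9 bytes).

    stage 0 = plain data parallelism; stages 1-3 = ZeRO-1/2/3.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params      # fp16 weights
    grads = 2.0 * num_params       # fp16 gradients
    optim = 12.0 * num_params      # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= dp_degree         # ZeRO-1 partitions optimizer state
    if stage >= 2:
        grads /= dp_degree         # ZeRO-2 also partitions gradients
    if stage >= 3:
        params /= dp_degree        # ZeRO-3 also partitions parameters
    return (params + grads + optim) / 1e9


# Example: a 7.5B-parameter model across 64 data-parallel ranks.
per_stage = [zero_model_state_gb(7.5e9, 64, s) for s in range(4)]
```

For this example the estimate falls from 120 GB per GPU without ZeRO to roughly 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3 respectively.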

A minimal DeepSpeed configuration for ZeRO-1 + pipeline parallelism (DeepSpeed's pipeline engine does not support ZeRO-2 or ZeRO-3, so stage 3 has to be used without the pipeline section; the stage count is typically also set in code through PipelineModule(num_stages=...)):

{
  "zero_optimization": {
    "stage": 1
  },
  "pipeline": {
    "stages": 4,
    "activation_checkpoint_interval": 1
  },
  "fp16": { "enabled": true }
}

Megatron-LM is NVIDIA’s framework for training transformer-based language models at scale. It natively combines tensor parallelism and pipeline parallelism and is commonly used for models in the 10B–1T parameter range.

4.3.4 Hybrid Parallelism Strategies for Large Models

For the largest models — those in the tens to hundreds of billions of parameters — no single parallelism strategy is sufficient. 3D parallelism (or hybrid parallelism) combines all three axes:

Dimension | Strategy | Example
Intra-layer | Tensor parallelism | Shard attention Q/K/V across 8 GPUs
Inter-layer | Pipeline parallelism | 4 pipeline stages across 4 node groups
Cross-replica | Data parallelism | 16 DDP replicas of the full pipeline

The total GPU count is the product: 8 (tensor) × 4 (pipeline) × 16 (data) = 512 GPUs.
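One way to picture the 512-GPU layout is as a mapping from a global rank to three coordinates. The ordering below (tensor-parallel index varies fastest, so each tensor-parallel group occupies 8 consecutive ranks and can sit on a single 8-GPU node) is an assumption chosen for illustration; frameworks such as Megatron-LM define their own rank ordering.

```python
def world_size(tp, pp, dp):
    """Total GPU count for a tensor x pipeline x data parallel layout."""
    return tp * pp * dp


def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (data replica, pipeline stage, tensor shard).

    Assumed ordering: tensor index fastest, then pipeline stage, then replica.
    """
    if not 0 <= rank < world_size(tp, pp, dp):
        raise ValueError(f"rank {rank} is outside the world of size {world_size(tp, pp, dp)}")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


total = world_size(tp=8, pp=4, dp=16)
first_of_stage_1 = rank_to_coords(8, tp=8, pp=4, dp=16)
last_rank = rank_to_coords(511, tp=8, pp=4, dp=16)
```

Under this ordering, ranks 0 to 7 share one set of layers (stage 0 of replica 0), rank 8 starts stage 1, and rank 32 starts the second data-parallel replica.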

Figure 4.6: 3D hybrid parallelism combining tensor, pipeline, and data parallelism

flowchart LR
    subgraph DP["Data Parallelism (16 replicas)"]
        subgraph PP0["Pipeline replica 0"]
            subgraph TP0["Tensor Parallel\n(8 GPUs)"]
                S0["Stage 0\nLayers 0-11\n8× GPU shards"]
            end
            subgraph TP1["Tensor Parallel\n(8 GPUs)"]
                S1["Stage 1\nLayers 12-23\n8× GPU shards"]
            end
            subgraph TP2["Tensor Parallel\n(8 GPUs)"]
                S2["Stage 2\nLayers 24-35\n8× GPU shards"]
            end
            subgraph TP3["Tensor Parallel\n(8 GPUs)"]
                S3["Stage 3\nLayer 36+Head\n8× GPU shards"]
            end
            S0 -->|"activations\n(pipeline)"| S1 --> S2 --> S3
        end
        PP0 -->|"AllReduce\ngradients"| PP1["... 15 more\nidentical replicas"]
    end

    TOTAL["Total GPUs: 8 × 4 × 16 = 512"]

    style TOTAL fill:#E67E22,color:#fff
    style DP fill:#EBF5FB

A rule of thumb for choosing dimensions:

Key Takeaway: Data parallelism (DDP/FSDP) is the right default for models that fit in GPU memory. For larger models, layer-wise or tensor-model parallelism is required. DeepSpeed’s ZeRO stages and Megatron-LM’s 3D parallelism provide the framework machinery; the Kubeflow Training Operator provides the Kubernetes plumbing to launch and manage these workloads at scale.


4.4 Checkpointing and Fault Tolerance

A distributed training run on 256 GPUs can take days or weeks. On any modern cloud or on-premises cluster, the probability of at least one node failure during that window is non-trivial. Without checkpointing and fault tolerance, a single hardware glitch can erase days of compute. This section describes how to design training workloads that survive failures gracefully.

4.4.1 Periodic Checkpoint Saving to Shared Storage

The foundation of fault tolerance is periodic checkpointing — saving the model weights, optimizer state, and training metadata to persistent shared storage at regular intervals. If a job fails and restarts, it loads the latest checkpoint and resumes from where it left off rather than restarting from epoch 0. [Source: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput]

A standard checkpoint save pattern in PyTorch:

import os
import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/checkpoints"
CHECKPOINT_INTERVAL = 100  # save every 100 steps

def save_checkpoint(model, optimizer, step, epoch, loss):
    if dist.get_rank() != 0:   # only rank 0 writes to avoid race conditions
        return
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint_step_{step}.pt")
    tmp_path = path + ".tmp"
    torch.save({
        "step": step,
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, tmp_path)
    # Rename only after the write finishes, so a crash mid-save never
    # leaves a truncated file under the final name.
    os.replace(tmp_path, path)
    # Write a pointer to the latest valid checkpoint, using the same pattern
    pointer = os.path.join(CHECKPOINT_DIR, "latest.txt")
    with open(pointer + ".tmp", "w") as f:
        f.write(path)
    os.replace(pointer + ".tmp", pointer)

def load_checkpoint(model, optimizer):
    latest_file = os.path.join(CHECKPOINT_DIR, "latest.txt")
    if not os.path.exists(latest_file):
        return 0, 0   # no checkpoint: start fresh
    with open(latest_file) as f:
        path = f.read().strip()
    # Load onto CPU first; load_state_dict copies values into the existing parameters.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["step"], checkpoint["epoch"]

Asynchronous checkpointing substantially reduces the GPU stall time associated with writing large checkpoints. Instead of blocking the training loop for the whole write, the state is first copied from GPU HBM to host CPU memory (a comparatively fast step, and the only one the training loop has to wait for) and then serialized to shared storage in a background thread. [Source: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput]
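The snapshot-then-write pattern described above can be sketched without any ML framework. In this illustration copy.deepcopy stands in for the GPU-to-host copy and pickle stands in for the serializer; the AsyncCheckpointer class and its method names are hypothetical, not a PyTorch or Kubeflow API.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Snapshot state synchronously, then persist it in a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None
        os.makedirs(directory, exist_ok=True)

    def save(self, state, step):
        self.wait()                          # allow at most one write in flight
        snapshot = copy.deepcopy(state)      # stands in for the GPU -> host copy
        path = os.path.join(self.directory, f"checkpoint_step_{step}.pkl")
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()                 # training can continue immediately
        return path

    @staticmethod
    def _write(snapshot, path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)                # rename only after the write completes

    def wait(self):
        """Block until the pending write (if any) has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None
```

The training loop would call save() at each checkpoint interval and wait() once before exiting. Because the snapshot is taken before the thread starts, later weight updates cannot leak into a checkpoint that is still being written.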

Shared storage options for checkpoints:

Storage Type | Use Case | Notes
NFS / CephFS (ReadWriteMany PVC) | On-premises multi-node training | All workers can mount the same PVC
AWS EFS / Azure Files / GCS Fuse | Cloud-based training | Managed, high-durability file shares (EFS, Azure Files) or object-store mounts (GCS Fuse)
Local NVMe + periodic sync | Highest-speed intermediate checkpoints | Sync to shared storage every N checkpoints

4.4.2 Automatic Restart and Recovery from Node Failures

Kubernetes itself provides basic restart capability through the restartPolicy: OnFailure setting in Job/Pod specs. For the Training Operator, each replica's restartPolicy can be Never, OnFailure, Always, or ExitCode; ExitCode lets the operator decide from the container's exit code whether to restart the Pod or mark the job failed.

However, a node failure typically kills the entire distributed job — not just the Pod on the failed node — because the remaining workers cannot make progress without their peers. The full restart flow is:

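  1. Node failure is detected: the node is marked NotReady after the node-monitor grace period (tens of seconds by default), and its Pods are evicted once the default 300-second toleration for the not-ready and unreachable taints expires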
  2. Pods on the failed node enter Unknown or Failed state
  3. Training Operator restarts all Pods (if restartPolicy: OnFailure)
  4. Each worker calls load_checkpoint() and resumes from the last saved step
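The restart flow above can be simulated end to end with a toy loop. In this sketch "training" merely accumulates a number and the checkpoint is a JSON file, so it runs anywhere; the train function, its arguments, and the latest.json file name are illustrative stand-ins for a real training script.

```python
import json
import os


def train(total_steps, checkpoint_dir, checkpoint_interval, fail_at_step=None):
    """Toy training loop that resumes from the latest checkpoint if one exists.

    Returns (step resumed from, final state).
    """
    latest = os.path.join(checkpoint_dir, "latest.json")
    state = {"step": 0, "loss_sum": 0.0}
    if os.path.exists(latest):                 # step 4 of the flow: reload state
        with open(latest) as f:
            state = json.load(f)
    start = state["step"]
    for step in range(start + 1, total_steps + 1):
        if step == fail_at_step:
            raise RuntimeError(f"simulated node failure at step {step}")
        state = {"step": step, "loss_sum": state["loss_sum"] + 1.0 / step}
        if step % checkpoint_interval == 0:
            tmp = latest + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, latest)            # never expose a half-written file
    return start, state
```

If a run checkpointing every 100 steps fails at step 250, the restarted run resumes from step 200: the 49 steps completed after the last checkpoint are recomputed, which is the cost that the checkpoint interval trades against write overhead.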

TorchFT (PyTorch Fault Tolerance) enables a more sophisticated recovery path for sub-group failures. It is integrated with the TorchTitan training framework on platforms like AMD’s Primus-SaFE Kubernetes environment. When a node fails there, the platform can preempt lower-priority workloads to reclaim GPU resources and automatically replace the failed process group, restoring the job to full scale without a full restart. [Source: https://rocm.blogs.amd.com/artificial-intelligence/primus-torchft/README.html]

Normal training: [Group 0] [Group 1] [Group 2] [Group 3] → AllReduce
Node failure:    [Group 0] [  DEAD ] [Group 2] [Group 3] → TorchFT detects
Recovery:        [Group 0] [Group 1'] [Group 2] [Group 3] → restored, resume from checkpoint
                  (preempted lower-priority job freed GPUs for Group 1')

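For cloud environments with spot/preemptible instances, the Training Operator’s backoffLimit (set under runPolicy) and checkpoint-on-exit handlers are essential. Set backoffLimit generously (e.g., 20) and ensure the training script writes a checkpoint on SIGTERM, the signal Kubernetes sends before evicting a Pod. The container is killed with SIGKILL once terminationGracePeriodSeconds (30 seconds by default) elapses, so the checkpoint must finish within that window or the grace period must be raised: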

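import signal
import sys

def sigterm_handler(signum, frame):
    # model, optimizer, and the current_* counters are the training script's own state
    print("Received SIGTERM: saving emergency checkpoint", flush=True)
    save_checkpoint(model, optimizer, current_step, current_epoch, current_loss)
    # Exit non-zero (128 + 15) so the Pod is not recorded as having completed successfully
    sys.exit(143)

# Register the handler before the training loop starts
signal.signal(signal.SIGTERM, sigterm_handler)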
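
Writing a large checkpoint from inside a signal handler can interrupt a training step at an arbitrary point. A common alternative is to let the handler only set a flag and have the training loop checkpoint at the next step boundary. The sketch below illustrates that pattern. GracefulStopper, train, and the save_fn and on_step callbacks are hypothetical names, and the loop body stands in for a real training step.

```python
import signal


class GracefulStopper:
    """Records that SIGTERM arrived so the loop can stop at a safe point."""

    def __init__(self):
        self.stop_requested = False

    def handle(self, signum, frame):
        # Keep the handler trivial: just set a flag.
        self.stop_requested = True

    def install(self):
        # Must be called from the main thread of the training process.
        signal.signal(signal.SIGTERM, self.handle)


def train(num_steps, stopper, save_fn, on_step=None):
    """Run up to num_steps; checkpoint and stop early if SIGTERM was seen.

    Returns the number of completed steps.
    """
    completed = 0
    for step in range(num_steps):
        if on_step is not None:
            on_step(step)          # stand-in for forward/backward/optimizer step
        completed = step + 1
        if stopper.stop_requested:
            save_fn(completed)     # emergency checkpoint at a step boundary
            break
    return completed
```

With this design the checkpoint always captures a consistent state, at the cost of waiting for the current step to finish. That wait still has to fit inside the Pod's termination grace period.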

4.4.3 Elastic Training with Dynamic Scaling

Elastic training allows a distributed job to continue running even when the number of available workers changes — workers can join or leave mid-training without requiring a full restart. This makes it feasible to use spot instances for long training runs, absorbing preemptions as temporary worker departures rather than catastrophic failures. [Source: https://aws.amazon.com/blogs/containers/fault-tolerant-distributed-machine-learning-training-with-the-torchelastic-controller-for-kubernetes/]

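The TorchElastic Controller for Kubernetes (TECK) provides native elastic training support. Its functionality has since been upstreamed into the Kubeflow Training Operator, where the equivalent settings live under a PyTorchJob's elasticPolicy. With TECK's ElasticJob resource, you specify minReplicas, maxReplicas, and the desired replica count: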

apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: elastic-llm-pretrain
  namespace: training
spec:
  minReplicas: 4    # minimum to make progress
  maxReplicas: 16   # scale up to here when GPUs are available
  replicaSpecs:
    Worker:
      replicas: 8   # start with 8, scale between 4 and 16
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: trainer
              image: myregistry/elastic-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: "8"
              command:
                - python
                - -m
                - torch.distributed.run
                - --rdzv_backend=etcd
                - --rdzv_endpoint=etcd-service:2379
                - --nnodes=4:16            # elastic range as MIN:MAX nodes
                - --nproc_per_node=8       # one worker process per GPU
                - train_elastic.py

When a node is added (scale-up), the elastic agents detect the membership change, stop their local workers, and run a new rendezvous round; the workers then restart with an updated WORLD_SIZE and resume from the most recent checkpoint. When a node is lost (scale-down or preemption), the same rendezvous mechanism fires, the surviving workers rejoin with the updated count, and training continues, provided the worker count stays at or above minReplicas. [Source: https://medium.com/pytorch/reduce-time-and-cost-by-running-distributed-elastic-pytorch-jobs-on-kubernetes-4f7ac3986307]

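Hyperparameters such as learning rate and effective batch size need adjustment when the world size changes. Because TorchElastic restarts every worker after a membership change, the usual place to make the adjustment is at worker startup, by comparing the new WORLD_SIZE with the world size recorded in the checkpoint:

def rescale_learning_rate(optimizer, old_world_size, new_world_size):
    # Linear scaling rule: scale the learning rate with the world size
    scale = new_world_size / old_world_size
    for param_group in optimizer.param_groups:
        param_group['lr'] *= scale

# Call once at startup, after loading the checkpoint:
#   old_world_size is read from the checkpoint,
#   new_world_size from the WORLD_SIZE environment variable set by torchrun.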
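
An alternative to rescaling the learning rate is to hold the global batch size constant by changing the number of gradient-accumulation steps. A minimal sketch, assuming a fixed per-GPU micro-batch and a world size counted in GPUs; the function name and the example numbers are illustrative:

```python
def accumulation_steps(global_batch: int, micro_batch: int, world_size: int) -> int:
    """Gradient-accumulation steps needed to keep the global batch size fixed."""
    per_pass = micro_batch * world_size  # samples processed per forward/backward pass
    if global_batch % per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"micro_batch * world_size = {per_pass}"
        )
    return global_batch // per_pass


# Global batch of 1024 with 8 samples per GPU:
for world_size in (32, 64, 128):
    print(world_size, accumulation_steps(1024, 8, world_size))
```

Keeping the global batch fixed means the learning-rate schedule does not need to change when workers join or leave, but it only works for world sizes that divide the global batch evenly.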

PyTorch Elastic is fully native to PyTorch — no additional frameworks like Horovod or Ray are required for basic elasticity. For users who need distributed execution beyond training (e.g., data preprocessing pipelines integrated with training), Ray’s elastic actor model is a complementary alternative. [Source: https://www.infracloud.io/blogs/distributed-parallel-processing-ray-kuberay/]

| Feature | TorchElastic (TECK) | Horovod Elastic | Ray Train |
|---|---|---|---|
| Min/max worker bounds | Yes | Yes | Yes |
| Framework native | PyTorch only | Multi-framework | Multi-framework |
| Hyperparameter callbacks on scale | No (adjust at worker restart) | Yes | Yes |
| Kubernetes CRD | ElasticJob | Via MPIJob | Via RayCluster + RayJob |
| Checkpoint integration | Manual (for example, torch.save or TorchSnapshot) | Manual | Via Ray Train checkpointing API |

Key Takeaway: Robust fault tolerance requires three cooperating mechanisms: periodic checkpointing to shared persistent storage, automatic Pod restart configured in the Training Operator, and optionally elastic scaling via TorchElastic to absorb node fluctuations without job interruption. Asynchronous checkpointing minimizes the GPU time lost to checkpoint I/O.


Chapter Summary

This chapter covered the full spectrum of Kubernetes tooling for running training workloads, from simple single-GPU jobs to elastic multi-node distributed training.

Native Kubernetes Job primitives — plain Jobs, CronJobs, and Indexed Jobs — handle single-node and embarrassingly-parallel workloads without additional dependencies. Init containers cleanly separate data preparation from training execution.

The Kubeflow Training Operator is the de facto standard for tightly-coupled distributed training on Kubernetes. It provides the PyTorchJob, TFJob, and MPIJob custom resources, automatically injects distributed environment variables, and integrates with gang schedulers like Volcano to prevent partial-allocation deadlocks that would stall distributed jobs.

Distributed training strategies range from straightforward data parallelism (DDP, FSDP, Horovod AllReduce) to memory-efficient model and tensor parallelism to throughput-optimized pipeline parallelism (DeepSpeed, Megatron-LM). For the largest models, 3D hybrid parallelism combines all three axes.

Finally, fault tolerance is not optional for long training runs. Periodic checkpointing to shared storage, SIGTERM handlers for graceful preemption, and elastic training via TorchElastic together ensure that hardware failures and spot interruptions do not erase days of compute.


Key Terms

| Term | Definition |
|---|---|
| Training Operator | Kubeflow component providing Kubernetes CRDs and a controller for managing distributed ML training jobs across multiple frameworks |
| PyTorchJob | Kubeflow Training Operator custom resource for running distributed PyTorch training jobs with automatic environment variable injection |
| TFJob | Kubeflow Training Operator custom resource for running distributed TensorFlow training jobs using parameter server or AllReduce strategies |
| MPIJob | Kubeflow Training Operator custom resource for running MPI-based distributed training jobs, including Horovod workloads, via a launcher/worker pattern |
| Data parallelism | Distributed training strategy where each worker holds a full model copy and trains on a different data shard; gradients are synchronized via AllReduce at each step |
| Model parallelism | Distributed training strategy where model layers are split across multiple devices; each device processes the full mini-batch through its portion of the network |
| Pipeline parallelism | Form of model parallelism that divides the model into sequential stages and streams micro-batches through the pipeline simultaneously to improve throughput |
| Gang scheduling | Scheduling policy that requires all Pods of a distributed job to be allocated simultaneously, preventing partial-allocation deadlocks |
| Volcano | Open-source Kubernetes batch scheduler providing gang scheduling, queue management, and fair-share policies for ML and HPC workloads |
| DeepSpeed | Microsoft’s deep learning optimization library providing ZeRO memory optimization stages, pipeline parallelism, and mixed-precision training at scale |
| Horovod | Uber’s framework-agnostic distributed training library using the AllReduce pattern; integrates with PyTorch, TensorFlow, and MXNet via MPIJob |
| Elastic training | Training paradigm where the number of workers can change dynamically during a run (via TorchElastic / TECK) without stopping and restarting the job |
| FSDP | Fully Sharded Data Parallel: PyTorch strategy that shards model parameters, gradients, and optimizer states across data-parallel workers to reduce per-GPU memory usage |
| NCCL | NVIDIA Collective Communications Library: the high-performance backend for GPU-to-GPU AllReduce, AllGather, and other collective operations used in distributed training |
| TorchElastic | PyTorch native library enabling elastic and fault-tolerant distributed training; used by the TorchElastic Controller for Kubernetes (TECK) |
| AllReduce | Collective communication operation that aggregates (e.g., sums) tensors from all processes and distributes the result back to all processes; the core synchronization primitive in data-parallel training |
| ZeRO | Zero Redundancy Optimizer: DeepSpeed technique that partitions optimizer states, gradients, and/or parameters across data-parallel ranks to reduce memory redundancy |

Chapter 5: Model Serving and Inference

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Training a model is only half the job. Getting it in front of users — reliably, quickly, and cost-effectively — is where the real operational work begins. Model serving on Kubernetes is the discipline of turning a trained artifact into a production-grade API endpoint that can handle bursty traffic, scale down to zero at 2am, and be updated without disrupting users.

Think of model serving like a restaurant kitchen. The trained model is the recipe, but KServe is the kitchen management system: it routes orders (requests) to the right cook (predictor), ensures backup cooks are on standby when things get busy (autoscaling), and lets the head chef trial a new dish on a small table before rolling it out to the whole menu (canary deployment). This chapter covers the full serving stack, from choosing the right runtime to shipping model updates safely.


Section 1: Model Serving Architectures

Online vs. Batch Inference Patterns

Not all inference looks the same. Two fundamental patterns govern how predictions are generated:

| Pattern | Trigger | Latency Target | Typical Use Case |
|---|---|---|---|
| Online (real-time) | Single HTTP/gRPC request | < 200ms | Chatbots, recommendation APIs, fraud detection |
| Batch | Scheduled job or queue | Minutes to hours | Offline scoring, ETL enrichment, bulk predictions |

Online inference prioritizes low latency: a user waits for each response, so the serving system must return results quickly. Batch inference prioritizes throughput: thousands of records are processed together, and nobody cares if it takes ten minutes as long as the job completes.

On Kubernetes, online serving uses Deployment or InferenceService resources fronted by a Service, while batch inference is better expressed as a Job or CronJob that runs, processes a dataset, and exits. This chapter focuses primarily on online inference, where the architectural complexity is highest.

Figure 5.1: Online vs. Batch Inference Request Flows

flowchart LR
    subgraph Online["Online Inference (Real-Time)"]
        direction LR
        U1[User / Client] -->|HTTP / gRPC request| S1[Service]
        S1 --> P1[InferenceService Pod]
        P1 -->|Response < 200ms| U1
    end

    subgraph Batch["Batch Inference (Scheduled)"]
        direction LR
        CJ[CronJob / Job] -->|Reads dataset| DS[(Dataset Store)]
        CJ --> BP[Batch Predictor Pod]
        BP -->|Writes scored records| OUT[(Output Store)]
    end

Model Server Options

Choosing the right model server is one of the most consequential decisions in your serving stack. Each server occupies a different point in the design space:

| Server | Best For | Key Feature | Protocols |
|---|---|---|---|
| TorchServe | PyTorch models, custom handlers | Native PyTorch integration, model archiver | REST, gRPC |
| Triton Inference Server | Multi-framework, mixed workloads | Multi-model concurrency, dynamic batching | HTTP/REST, gRPC |
| vLLM | Large language model text generation | PagedAttention, continuous batching | OpenAI-compatible REST |
| Text Generation Inference (TGI) | LLM serving (HuggingFace ecosystem) | Flash Attention, tensor parallelism | REST |

vLLM and TGI are specialized for transformer-based LLMs and handle the unique memory challenges of autoregressive generation. Triton is a generalist: it supports TensorFlow, PyTorch, ONNX, TensorRT, and can even run vLLM as a backend engine [Source: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/vllm_backend/README.html]. TorchServe suits teams that want tight PyTorch model lifecycle management without introducing a heavier abstraction.

Sidecar and Standalone Serving Topologies

Model servers can be deployed in two main topologies:

Standalone: The model server runs as the primary container in a pod. The pod’s main process is the inference engine. This is simpler and appropriate for most cases.

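Sidecar: A lightweight proxy or helper container runs alongside the model server container in the same pod. KServe uses related multi-container patterns: it injects its storage initializer as an init container that downloads model weights before the serving container starts, and it can optionally run a transformer container for pre/post-processing. By default the transformer is deployed as its own component; the diagram below shows the collocated variant, in which it shares the predictor's pod.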

┌─────────────── Pod ──────────────────┐
│  [init: storage-initializer]         │
│  ┌─────────────────────────────────┐ │
│  │ transformer container           │ │  ← optional pre/post-processing
│  └────────────────┬────────────────┘ │
│                   │                   │
│  ┌────────────────▼────────────────┐ │
│  │ predictor container (vLLM/      │ │  ← core inference engine
│  │   Triton/TorchServe)            │ │
│  └─────────────────────────────────┘ │
└──────────────────────────────────────┘

Key Takeaway: Choose online or batch inference based on latency requirements, and pick your model server based on model framework and workload mix. Triton excels at heterogeneous multi-model deployments; vLLM is the default for high-throughput LLM serving.


Section 2: KServe (formerly KFServing)

KServe is the leading Kubernetes-native model serving platform, providing a single InferenceService CRD that abstracts away the complexity of autoscaling, networking, health checking, and runtime configuration [Source: https://kserve.github.io/website/]. It graduated from the Kubeflow project and is now a standalone CNCF project, with v0.15 released in June 2025 adding expanded generative AI serving capabilities [Source: https://www.cncf.io/blog/2025/06/18/announcing-kserve-v0-15-advancing-generative-ai-model-serving/].

The InferenceService Custom Resource

The InferenceService is KServe’s primary abstraction. A minimal definition for an sklearn model looks like this:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "1"
          memory: "1Gi"

[Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/frameworks/overview]

When this resource is applied, KServe:

  1. Launches a storage initializer init container to download the model from GCS
  2. Starts the appropriate serving runtime (sklearn-server in this case)
  3. Configures a Knative Service or Kubernetes Deployment depending on the installation mode
  4. Wires up ingress routing via Istio or Kourier
  5. Optionally configures autoscaling

Supported modelFormat values include sklearn, pytorch, tensorflow, onnx, xgboost, lightgbm, pmml, and custom runtimes [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/frameworks/overview].

Predictor, Transformer, and Explainer Components

A full InferenceService can include up to three logical components:

| Component | Purpose | Example |
|---|---|---|
| Predictor | Core inference: runs the model | vLLM server, Triton, sklearn-server |
| Transformer | Pre/post-processing pipeline | Tokenization, feature engineering, output formatting |
| Explainer | Model explainability | SHAP values, LIME explanations |

The transformer and explainer are optional. When a transformer is present, KServe routes each request through it before (and after) the predictor, enabling clean separation between business logic and model inference.

Figure 5.2: KServe InferenceService Request Pipeline

flowchart LR
    Client([Client Request]) --> T
    subgraph IS["InferenceService"]
        T["Transformer\n(pre-processing)"] -->|Transformed input| P["Predictor\n(model inference)"]
        P -->|Raw prediction| T2["Transformer\n(post-processing)"]
        P -->|Prediction| E["Explainer\n(SHAP / LIME)"]
    end
    T2 --> R([Response to Client])
    E -->|Explanation| R

    style T fill:#dbeafe,stroke:#3b82f6
    style P fill:#dcfce7,stroke:#16a34a
    style T2 fill:#dbeafe,stroke:#3b82f6
    style E fill:#fef9c3,stroke:#ca8a04

A partial InferenceService spec that configures all three components:

spec:
  transformer:
    containers:
      - name: kserve-container
        image: my-org/feature-transformer:v2
        env:
          - name: STORAGE_URI
            value: "gs://my-bucket/vocab.json"
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "gs://my-bucket/models/classifier/v3"
  explainer:
    alibi:
      type: AnchorTabular
      storageUri: "gs://my-bucket/explainers/anchor"

Model Storage Initialization and Model Mesh

The storage initializer is an init container that KServe injects into every predictor pod. It supports multiple storage backends out of the box:

| Backend | URI Prefix | Example |
|---|---|---|
| Google Cloud Storage | gs:// | gs://my-bucket/models/bert |
| AWS S3 | s3:// | s3://my-bucket/models/bert |
| Azure Blob | https://<account>.blob.core.windows.net/ | https://myaccount.blob.core.windows.net/mycontainer/models/bert |
| HTTP/HTTPS | https:// | https://huggingface.co/... |
| PersistentVolumeClaim | pvc:// | pvc://model-store/bert-base |

Model Mesh is KServe’s multi-model serving capability. Instead of one pod per model, Model Mesh co-locates many models within a shared pool of serving pods. This dramatically reduces the per-model pod overhead when an organization maintains hundreds of smaller models. Model Mesh maintains an in-memory cache and loads/unloads models from GPU or CPU memory on demand [Source: https://github.com/kserve/kserve].
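
The load/unload behaviour can be pictured as a capacity-bounded cache with least-recently-used eviction. The sketch below is a toy illustration of that idea only. ModelCache and its loader callback are hypothetical and do not reflect Model Mesh's actual implementation or API.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Keeps at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader                  # e.g. reads weights from object storage
        self._models: "OrderedDict[str, Any]" = OrderedDict()
        self.evicted: List[str] = []          # record of unloaded model names

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)    # mark as most recently used
            return self._models[name]
        if len(self._models) >= self.capacity:
            old_name, _ = self._models.popitem(last=False)  # drop the LRU entry
            self.evicted.append(old_name)
        model = self.loader(name)             # cache miss: load on demand
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        return list(self._models)
```

A request for a model that is already resident is served immediately; a request for an evicted model pays the load cost again, which is the latency trade-off of sharing pods across many models.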

Multi-Model Serving for Cost Efficiency

The economics of model serving can be challenging. A team with 200 fraud-detection models — each serving low traffic — cannot afford a dedicated GPU pod per model. Multi-model serving solves this by packing multiple models into shared infrastructure.

The tradeoff is isolation: in a single-model pod, a poorly behaved model can only crash itself. In a shared pod, a memory leak affects all co-located models. KServe’s Model Mesh mitigates this with separate model loader processes and health monitoring per model.

Key Takeaway: KServe’s InferenceService CRD is the central abstraction for production ML serving on Kubernetes. Its predictor/transformer/explainer decomposition encourages clean separation of concerns, and Model Mesh enables cost-efficient multi-model deployments where per-model pod overhead would otherwise be prohibitive.


Section 3: LLM Serving on Kubernetes

Large language models present unique serving challenges. A 70-billion-parameter model in FP16 requires approximately 140GB of GPU VRAM for its weights alone (two bytes per parameter), which is more than a single 80GB data-center GPU provides, let alone a consumer card. The autoregressive generation pattern (producing one token at a time) creates inherently sequential workloads that are hard to batch naively. This section covers the specialized tools and techniques for serving LLMs efficiently on Kubernetes.
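
The 140GB figure follows from two bytes per parameter. A small sketch of that arithmetic is shown below. It covers weights only (KV cache, activations, and runtime overhead come on top), and the helper name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(weight_memory_gb(70e9, 16))  # 70B parameters in FP16
print(weight_memory_gb(70e9, 4))   # same model with 4-bit weights
```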

vLLM Deployment and PagedAttention Optimization

vLLM is the dominant open-source LLM serving engine in production environments [Source: https://github.com/vllm-project/vllm]. Its defining innovation is PagedAttention, an algorithm that manages the KV (key-value) cache using virtual memory techniques borrowed from operating system page tables.

Why does KV cache management matter? During autoregressive generation, each token produced depends on all previous tokens. The model must cache the key and value vectors for every token in the context window to avoid recomputing them. With a 4096-token context window and large batch sizes, this cache can consume more memory than the model weights themselves — and traditional servers reserved it as one large contiguous VRAM block.
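
To make the scale concrete, the cache footprint can be estimated from the model's shape: one key and one value vector per layer for every token. The sketch below assumes a model roughly the shape of a 7B Llama-2-class network (32 layers, 32 KV heads, head dimension 128, FP16 values). Those numbers and the function name are assumptions for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes per value)
per_token = kv_cache_bytes_per_token(32, 32, 128)
per_sequence_gib = per_token * 4096 / 2**30     # one full 4096-token context
batch_gib = per_sequence_gib * 8                # eight such sequences in flight

print(per_token, per_sequence_gib, batch_gib)
```

Under these assumptions each token costs 0.5 MiB, one full-length sequence costs 2 GiB, and eight concurrent full-length sequences need 16 GiB of cache, more than the roughly 13 GiB of FP16 weights for a 7B model.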

PagedAttention stores the KV cache in fixed-size, non-contiguous blocks (“pages”), much like virtual memory in an OS. This removes the need to reserve one contiguous region per request, nearly eliminates fragmentation, and allows cache blocks to be shared across requests with a common prefix. The vLLM project reports up to 24x higher throughput than HuggingFace Transformers on the same hardware [Source: https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes].
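
The bookkeeping behind paged allocation can be sketched as a free list of physical blocks plus a per-sequence block table. The class below is a toy model of that idea, not vLLM's implementation; PagedKVCache and its methods are hypothetical names.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}   # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:      # all allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=6, block_size=4)
for _ in range(4):
    cache.append_token("A")      # A fills its first block
for _ in range(4):
    cache.append_token("B")      # B takes the next free block
cache.append_token("A")          # A's fifth token needs a new block
print(cache.block_tables)
```

Sequence A ends up in blocks 0 and 2, which are not adjacent. Memory is claimed one block at a time as tokens are generated, so at most one partially filled block per sequence is wasted.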

Figure 5.3: PagedAttention KV Cache — Contiguous vs. Paged Allocation

block-beta
    columns 6

    block:traditional["Traditional (Contiguous)"]:6
        T1["Req-A\nTokens 0-15\n(reserved)"]
        T2["Req-B\nTokens 0-7\n(reserved)"]
        T3["Req-C\nTokens 0-31\n(reserved)"]
        T4["WASTED\n(fragmentation)"]
        T5["WASTED"]
        T6["WASTED"]
    end

    block:paged["PagedAttention (Non-Contiguous Pages)"]:6
        P1["Page 0\nReq-A t0-3"]
        P2["Page 1\nReq-B t0-3"]
        P3["Page 2\nReq-A t4-7"]
        P4["Page 3\nReq-C t0-3"]
        P5["Page 4\nReq-A t8-11"]
        P6["Page 5\nReq-B t4-7"]
    end

    style T4 fill:#fecaca,stroke:#ef4444
    style T5 fill:#fecaca,stroke:#ef4444
    style T6 fill:#fecaca,stroke:#ef4444
    style P1 fill:#dcfce7,stroke:#16a34a
    style P2 fill:#dbeafe,stroke:#3b82f6
    style P3 fill:#dcfce7,stroke:#16a34a
    style P4 fill:#fef9c3,stroke:#ca8a04
    style P5 fill:#dcfce7,stroke:#16a34a
    style P6 fill:#dbeafe,stroke:#3b82f6

A basic vLLM deployment on Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--gpu-memory-utilization"
            - "0.85"
            - "--tensor-parallel-size"
            - "1"
            - "--max-model-len"
            - "8192"
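          env:
            - name: HUGGING_FACE_HUB_TOKEN   # the Llama 3 weights are gated on the Hub
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: "1"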
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi          # ← critical for LLM IPC
        - name: model-cache
          persistentVolumeClaim:
            claimName: huggingface-cache

A critical Kubernetes-specific detail: pods default to only 64MB of shared memory (/dev/shm), which AI frameworks use heavily for inter-process communication. Without the emptyDir with medium: Memory override, large models will crash with OSError: [Errno 28] No space left on device during initialization [Source: https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes].

vLLM’s gpu_memory_utilization defaults to 0.9 (90% of VRAM). When co-locating models or using MIG-partitioned GPUs, this should be reduced to leave headroom [Source: https://bizety.com/2025/09/29/vllm-vs-triton-competing-or-complementary/].

Text Generation Inference (TGI) Setup

HuggingFace’s Text Generation Inference is an alternative LLM server that integrates deeply with the HuggingFace Hub ecosystem. TGI supports Flash Attention 2, continuous batching, and tensor parallelism, making it competitive with vLLM for many use cases.

containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:2.3
    args:
      - "--model-id"
      - "mistralai/Mistral-7B-Instruct-v0.3"
      - "--num-shard"
      - "1"
      - "--quantize"
      - "bitsandbytes-nf4"
    resources:
      limits:
        nvidia.com/gpu: "1"
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token

Quantization (GPTQ, AWQ, GGUF) for Reduced GPU Memory

Quantization converts model weights from high-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4), trading a small amount of accuracy for significant memory and speed gains:

| Format | Precision | Memory Reduction | Typical Accuracy Loss | Supported By |
|---|---|---|---|---|
| FP16 (baseline) | 16-bit float | 1x (baseline) | None (reference) | All servers |
| GPTQ | INT4 | ~4x vs FP16 | < 1% on most benchmarks | vLLM, TGI, Triton |
| AWQ (Activation-Aware) | INT4 | ~4x vs FP16 | Typically slightly lower than GPTQ | vLLM, TGI |
| GGUF (llama.cpp) | 2–8 bit mixed | Up to 8x | Varies by bit depth | llama.cpp, Ollama |

AWQ improves on GPTQ by identifying and protecting the most salient weights before quantization, yielding better perplexity at the same bit width [Source: https://www.inferless.com/learn/vllm-vs-triton-inference-server-choosing-the-best-inference-library-for-large-language-models].

To serve a GPTQ-quantized model with vLLM:

args:
  - "--model"
  - "TheBloke/Llama-2-13B-chat-GPTQ"
  - "--quantization"
  - "gptq"
  - "--gpu-memory-utilization"
  - "0.80"

A 13B parameter model in FP16 requires ~26GB VRAM. The GPTQ INT4 version requires ~7GB — fitting on a single A10G or consumer RTX 3090.

Continuous Batching and Speculative Decoding

Continuous batching is the scheduling strategy vLLM uses to keep the GPU busy. In traditional static batching, the server assembles a batch of N requests and runs it until every request in the batch has finished. If one request generates 200 tokens and another generates 10, the slot held by the short request sits idle for 190 decode steps, and queued requests cannot start until the whole batch completes.

Continuous batching solves this by treating each decode step as an opportunity to insert new requests. As soon as a sequence completes, a new request from the queue immediately fills its slot in the next forward pass [Source: https://bizety.com/2025/09/29/vllm-vs-triton-competing-or-complementary/]. This keeps GPU utilization high and minimizes time-to-first-token for queued requests.

Time →   [Step 1][Step 2][Step 3][Step 4][Step 5]
Slot A:  [Req-1 ][Req-1 ][Req-1 ][Req-5 ][Req-5 ]  ← Req-1 finishes after step 3, Req-5 takes the slot
Slot B:  [Req-2 ][Req-2 ][Req-4 ][Req-4 ][Req-4 ]  ← Req-2 finishes after step 2, Req-4 takes the slot
Slot C:  [Req-3 ][Req-3 ][Req-3 ][Req-3 ][Req-3 ]  ← long request runs throughout
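
The throughput difference can be checked with a small simulation that counts decode steps under each policy. This is a simplified model that ignores prefill cost and memory limits; the function names and the request lengths are illustrative.

```python
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled from the queue before the next step."""
    queue = list(output_lengths)
    slots: List[int] = []           # remaining tokens for each in-flight request
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))          # fill every free slot
        steps += 1                              # one forward pass for all slots
        slots = [left - 1 for left in slots if left > 1]
    return steps


lengths = [200, 10, 10, 10, 10, 10]   # tokens each request will generate
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

With two slots, static batching needs 220 steps because the short requests queue behind the 200-token one, while continuous batching finishes in 200 steps by cycling the five short requests through the second slot.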

Speculative decoding accelerates generation further by using a small draft model to propose several tokens, which the large target model then verifies in parallel. When the draft model is right (which it often is for common phrases), multiple tokens are produced per forward pass, reducing effective latency.
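
The propose-then-verify loop can be sketched with two toy "models" that are plain functions from a token prefix to the next token. This is the greedy variant only, and the verification pass is simulated by calling the target once per position; real engines score all proposed positions in one batched forward pass and use a rejection-sampling rule when sampling. All names below are illustrative.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       num_tokens: int, k: int = 4) -> Tuple[List[str], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    out: List[str] = []
    target_passes = 0
    while len(out) < num_tokens:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, num_tokens - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2. One (simulated) target pass checks every proposed position.
        target_passes += 1
        for i, token in enumerate(proposal):
            expected = target_next(out + proposal[:i])
            if token != expected:
                # First mismatch: keep the target's token, discard the rest.
                out.extend(proposal[:i] + [expected])
                break
        else:
            out.extend(proposal)    # every draft token was accepted
    return out, target_passes


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_next(prefix: List[str]) -> str:
    return SENTENCE[len(prefix)]


def draft_next(prefix: List[str]) -> str:
    # The toy draft model is wrong at position 4 only.
    return "runs" if len(prefix) == 4 else SENTENCE[len(prefix)]


tokens, passes = speculative_decode(target_next, draft_next, len(SENTENCE), k=4)
print(tokens, passes)
```

The output is identical to what the target model alone would produce, but it takes three target passes instead of nine. The savings shrink as the draft model's acceptance rate drops.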

Figure 5.4: Continuous Batching — Dynamic Request Slot Assignment

sequenceDiagram
    participant Q as Request Queue
    participant S as Scheduler
    participant G as GPU (Forward Pass)

    Q->>S: Req-1, Req-2, Req-3 arrive
    S->>G: Step 1: [Req-1, Req-2, Req-3]
    G-->>S: Req-2 completes (short output)
    Q->>S: Req-4 waiting in queue
    S->>G: Step 2: [Req-1, Req-4, Req-3]
    Note over S,G: Req-4 inserted immediately — no idle slot
    G-->>S: Req-1 completes
    Q->>S: Req-5 waiting in queue
    S->>G: Step 3: [Req-5, Req-4, Req-3]

Key Takeaway: vLLM’s PagedAttention and continuous batching are the two most impactful optimizations for LLM serving throughput. Quantization (AWQ or GPTQ) allows models 2–4x too large for your GPU to fit in memory, often with minimal quality degradation. Always configure /dev/shm for LLM pods on Kubernetes.


Section 4: Autoscaling and Traffic Management

HPA with Custom GPU and Request Latency Metrics

The Horizontal Pod Autoscaler (HPA) is Kubernetes’ built-in mechanism for scaling workloads based on observed metrics. For inference services, the most useful scaling signals are:

| Metric | Signal | Scaling Behavior |
|---|---|---|
| CPU utilization | General compute pressure | Scale out when CPU > threshold |
| GPU utilization | GPU saturation | Scale out when GPU DCGM metric > threshold |
| Request latency (p99) | User-visible slowdown | Scale out when p99 latency > SLO |
| Request concurrency | Inflight request queue | Scale to match target concurrency |
| Pending request queue depth | Backpressure in LLM serving | Scale out before latency degrades |

KServe’s HPA integration in Standard mode uses scaleMetric and scaleTarget fields in the InferenceService spec [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/hpa-autoscaler]:

spec:
  predictor:
    scaleMetric: cpu
    scaleTarget: 80          # scale out when avg CPU > 80%
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "gs://my-bucket/models/classifier"

For GPU-based scaling, expose DCGM (Data Center GPU Manager) metrics to Prometheus, publish them through the custom metrics API with an adapter such as Prometheus Adapter, and reference the resulting metric from the HPA. The Metrics Server must be installed for the standard CPU and memory metrics, but it does not serve custom metrics [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/hpa-autoscaler].
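
Whatever metric is used, the HPA applies the same documented rule: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance (10% by default). A small sketch of that rule; the function name and the example values are illustrative:

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float, target_value: float,
                         min_replicas: int = 1, max_replicas: int = 10,
                         tolerance: float = 0.1) -> int:
    """Replica count from the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas               # within tolerance: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Three replicas averaging 95% GPU utilization against an 80% target
print(hpa_desired_replicas(3, current_value=95, target_value=80))
```

In this example the ratio is about 1.19, so the HPA would scale from three replicas to four.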

Scale-to-Zero with Knative or KEDA

One of the most cost-effective features in Kubernetes ML serving is scale-to-zero: completely removing all pods when no traffic is present, and scaling back up on the first request.

Knative Pod Autoscaler (KPA) is KServe’s default autoscaler in serverless mode. It scales based on concurrent requests per pod, with built-in support for scaling to zero [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler]:

metadata:
  annotations:
    autoscaling.knative.dev/target: "5"      # 5 concurrent requests per pod
spec:
  predictor:
    minReplicas: 0                           # enables scale-to-zero
    maxReplicas: 8

The tradeoff with scale-to-zero for LLMs is cold start time. A 7B parameter model takes 30–90 seconds to load from storage and warm up, making scale-to-zero unsuitable for latency-sensitive endpoints. For LLMs, minReplicas: 1 is usually the right choice; scale-to-zero works well for smaller classification or embedding models.

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with support for custom event sources and LLM-specific metrics. KServe integrates with KEDA via the serving.kserve.io/autoscalerClass: "keda" annotation [Source: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler]:

metadata:
  annotations:
    serving.kserve.io/autoscalerClass: "keda"
    serving.kserve.io/scaleTarget: "10"
    serving.kserve.io/scaleMetric: "rps"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/recommender"

KEDA’s advantage over standard HPA for LLM workloads is its ability to scale on metrics like pending request queue depth or token throughput — signals that predict when the serving tier is about to saturate before latency degrades [Source: https://kserve.github.io/website/docs/model-serving/generative-inference/autoscaling].

The three autoscaler classes compared:

| Autoscaler | Trigger | Best For | Scale-to-Zero |
|---|---|---|---|
| KPA (Knative) | Request concurrency | Serverless, bursty traffic | Yes (via Knative activator) |
| HPA | CPU, memory, custom Prometheus metrics | Stable, predictable traffic | No (min 1 replica) |
| KEDA | Any event source, including queues and LLM metrics | LLM serving, event-driven workloads | Yes |

Canary Rollouts and Traffic Splitting

Deploying a new model version to 100% of production traffic immediately is high risk: if the new version regresses on accuracy or introduces higher latency, every user is affected. Canary deployments mitigate this by routing a small percentage of traffic to the new version while the majority continues to use the stable version.

KServe natively supports percentage-based traffic splitting in the InferenceService spec [Source: https://github.com/kserve/kserve]:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-classifier"
spec:
  predictor:
    canaryTrafficPercent: 20         # 20% to new version
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/classifier/v2"   # new (canary)

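The stable version is maintained in the previous revision; KServe routes 80% of traffic there and 20% to the canary predictor. No external routing tool is required. Because the split relies on Knative revisions, it applies to Serverless mode. To promote the canary, remove the canaryTrafficPercent field; to roll back, set it to 0.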

For automated, metrics-driven promotion, Flagger integrates with KServe to progressively shift traffic based on Prometheus metrics. Flagger starts with a small canary percentage, observes error rate and latency metrics, and automatically promotes or rolls back based on configured thresholds [Source: https://oneuptime.com/blog/post/2026-01-30-mlops-canary-model-deployment/view]:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: flower-classifier
spec:
  targetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: flower-classifier
  progressDeadlineSeconds: 600
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m

This configuration starts the canary at 0%, increments by 10% every minute if metrics hold, caps at 50%, and promotes to 100% automatically when analysis passes.
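The arithmetic behind that schedule can be sketched in a few lines. The helper below is a hypothetical illustration rather than part of Flagger; it simply enumerates the traffic weights implied by stepWeight and maxWeight and derives the shortest possible promotion and rollback times from interval and threshold.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Traffic percentages the canary passes through before promotion."""
    return list(range(step_weight, max_weight + 1, step_weight))


# Values copied from the Canary spec above.
interval_minutes = 1
threshold = 5
weights = canary_weights(step_weight=10, max_weight=50)

# Shortest promotion path: one passing analysis interval per weight step.
min_promotion_minutes = len(weights) * interval_minutes
# Rollback fires after `threshold` failed checks, one check per interval.
rollback_minutes = threshold * interval_minutes

print(weights)
print("minimum minutes to promotion:", min_promotion_minutes)
print("minutes of failing checks before rollback:", rollback_minutes)
```

Lowering stepWeight makes the rollout gentler but lengthens the minimum promotion time proportionally.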

Figure 5.5: Flagger Canary Promotion Lifecycle

stateDiagram-v2
    [*] --> Initializing: New InferenceService version deployed

    Initializing --> Progressing: Canary pods healthy

    Progressing --> Progressing: Metrics pass\nshift +10% traffic

    Progressing --> Promoting: canaryWeight reaches maxWeight (50%)\nAll analysis intervals pass

    Promoting --> Succeeded: Stable promoted to 100%\nCanary destroyed

    Progressing --> RollingBack: Error rate > threshold\nor latency > 500ms

    RollingBack --> Failed: Traffic reverted to stable 100%

    Succeeded --> [*]
    Failed --> [*]

A/B Testing for Model Rollouts

While canary deployments gradually migrate all traffic from one version to another, A/B testing routes traffic based on user identity or request metadata — enabling true controlled experiments where specific cohorts see specific model versions simultaneously.

KServe, backed by Istio, supports header-based routing that enables A/B segmentation [Source: https://dasroot.net/posts/2026/03/building-multi-model-inference-platform-kubernetes/]. For example, users in experiment group B can be routed to model v2 while all others continue to use v1, based on an X-Experiment-Group: B header injected by your application layer.
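The routing decision itself is small. The sketch below expresses it as a plain Python function purely for illustration; in a real deployment the match lives in the service mesh routing rules rather than in application code, and the function name and version labels are assumptions.

```python
def pick_model_version(headers: dict[str, str]) -> str:
    """Route experiment group B to v2; everyone else stays on v1."""
    # HTTP header names are case-insensitive, so normalize before matching.
    normalized = {name.lower(): value for name, value in headers.items()}
    return "v2" if normalized.get("x-experiment-group") == "B" else "v1"


print(pick_model_version({"X-Experiment-Group": "B"}))
print(pick_model_version({"X-Experiment-Group": "A"}))
print(pick_model_version({}))
```

Because assignment depends only on the header, a given user sees the same model version for the whole experiment, which is what separates A/B testing from a random percentage split.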

The key distinction:

| Technique | Traffic Split Basis | Rollback Mechanism | End State |
| --- | --- | --- | --- |
| Canary | Random percentage | Automatic (Flagger) or manual | One version wins |
| A/B Test | User cohort / header | Manual analysis | Winner determined by experiment |
| Blue/Green | All-or-nothing switch | Flip traffic weight to 0/100 | Immediate cutover |

Load Balancing Strategies for Inference Endpoints

Multiple replicas of a model server need intelligent load balancing. Naive round-robin works for stateless REST APIs, but LLM inference is different: the cost of a request varies widely with prompt and output length, so an equal number of requests per replica does not mean equal load.

A Kubernetes Service backed by kube-proxy offers only basic distribution: iptables mode picks a backend at random, and IPVS mode defaults to round-robin. For more sophisticated strategies, the Kubernetes Gateway API (via Envoy or Istio) supports custom load balancing policies [Source: https://wetranscloud.com/blog/kubernetes-for-ml-scaling-pipelines-across-clouds/].

Key Takeaway: Combine KPA or KEDA for responsive autoscaling with KServe’s native traffic splitting for safe model rollouts. Scale-to-zero saves GPU costs for low-traffic models but requires cold start tolerance. For LLMs, KEDA’s ability to scale on queue depth and token throughput metrics makes it the most effective autoscaling option.


Chapter Summary

This chapter covered the complete lifecycle of deploying and managing model serving on Kubernetes. We began by distinguishing online from batch inference and surveying the model server landscape: Triton for heterogeneous multi-model workloads, vLLM for high-throughput LLM text generation, and TGI for the HuggingFace ecosystem.

KServe’s InferenceService CRD emerged as the central Kubernetes abstraction, encapsulating storage initialization, autoscaling, traffic management, and optional transformer and explainer sidecars in a single declarative resource. Multi-model serving via Model Mesh enables cost-effective sharing of GPU infrastructure across hundreds of low-traffic models.

For LLMs specifically, PagedAttention and continuous batching in vLLM unlock dramatically higher throughput on fixed hardware, while quantization (GPTQ, AWQ) enables models too large for a single GPU to fit in memory with manageable accuracy loss. Kubernetes-specific configuration — particularly shared memory sizing — is essential to avoid silent failures.

Autoscaling for inference requires matching the scaler to the workload: KPA concurrency scaling for serverless patterns, HPA for stable predictable traffic, and KEDA for LLM workloads where queue depth and token throughput are the most relevant signals. Canary deployments via KServe’s traffic splitting, automated by Flagger, enable safe model version promotion with automatic rollback. A/B testing via header-based routing allows controlled experiments where different user cohorts experience different model versions simultaneously.


Key Terms

| Term | Definition |
| --- | --- |
| KServe | A Kubernetes-native model serving platform providing the InferenceService CRD, autoscaling, and traffic management for ML models |
| InferenceService | KServe’s primary Custom Resource Definition for declaring a model serving endpoint, including predictor, transformer, and explainer components |
| Triton Inference Server | NVIDIA’s multi-framework model serving engine supporting TensorFlow, PyTorch, ONNX, TensorRT, and vLLM backends with dynamic batching |
| vLLM | An open-source LLM serving engine featuring PagedAttention and continuous batching for high-throughput text generation |
| PagedAttention | vLLM’s KV-cache management algorithm that stores attention caches in non-contiguous memory pages; the vLLM authors report up to 24x higher throughput than HuggingFace Transformers |
| continuous batching | A scheduling strategy that inserts new requests into the inference pipeline as soon as previous requests complete, maximizing GPU utilization |
| quantization | The process of reducing model weight precision (e.g., FP16 → INT4) to decrease memory footprint and increase throughput with minimal accuracy loss |
| HPA | Horizontal Pod Autoscaler: Kubernetes’ built-in mechanism for scaling workloads based on CPU, memory, or custom Prometheus metrics |
| KEDA | Kubernetes Event-Driven Autoscaling: extends HPA with support for custom event sources including LLM-specific metrics like queue depth and token throughput |
| canary deployment | A deployment strategy that routes a small percentage of traffic to a new model version before full rollout, enabling safe validation |
| model mesh | KServe’s multi-model serving capability that co-locates many models within a shared pool of serving pods to reduce per-model infrastructure overhead |
| KPA | Knative Pod Autoscaler: KServe’s default autoscaler in serverless mode, scaling based on request concurrency with built-in scale-to-zero support |

Chapter 6: Resource Scheduling and Cluster Optimization

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Running AI workloads on Kubernetes is not simply a matter of packaging a training job in a container and letting the scheduler do the rest. GPU clusters are expensive, shared, and contended. A poorly configured cluster wastes money when GPUs sit idle waiting for the right pods, causes deadlock when distributed jobs receive only part of their requested resources, and frustrates users when a single research team monopolizes capacity for days.

Think of a large hospital with a limited number of operating rooms. You need a system that reserves rooms for the most critical surgeries, gives fair access to different departments, prevents one surgical team from booking all rooms indefinitely, and keeps rooms occupied as densely as possible throughout the day. Kubernetes scheduling for AI workloads is the same kind of resource governance problem, at data-center scale.

This chapter covers the tools and techniques that solve it: Kubernetes-native priority and affinity controls, the Kueue job queuing system, the Volcano batch scheduler, and cluster-level autoscaling and cost-optimization strategies.


Section 1: Kubernetes Scheduling for AI

Priority Classes and Preemption for Training Jobs

Every pod in Kubernetes can be assigned a PriorityClass, a cluster-scoped object that assigns a numerical priority value to a workload. When the scheduler cannot fit a new pod onto any node, it will evict (preempt) lower-priority pods to make room [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].

A typical AI platform uses a tiered priority scheme:

| Priority Class | Value | Typical Use |
| --- | --- | --- |
| system-cluster-critical | 2,000,000,000 | Kubernetes system components (built-in class) |
| production-inference | 10,000 | Real-time model serving |
| training-high | 1,000 | Scheduled training jobs with SLAs |
| training-standard | 500 | Standard research experiments |
| development | 100 | Interactive notebooks, dev jobs |

The top tier is the built-in system-cluster-critical class: Kubernetes reserves the system- name prefix, and user-defined classes are limited to values of 1,000,000,000 or less. With this setup, a newly submitted production inference pod can preempt a running development notebook without disrupting any higher-priority work. Critically, research jobs configured at training-standard cannot block production inference; the scheduler will always prefer the higher value [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].

Figure 6.1: Priority class hierarchy and preemption ordering

flowchart TD
    A["system-cluster-critical\n(2,000,000,000)\nKubernetes system components"]
    B["production-inference\n(10,000)\nReal-time model serving"]
    C["training-high\n(1,000)\nScheduled training with SLAs"]
    D["training-standard\n(500)\nResearch experiments"]
    E["development\n(100)\nNotebooks, dev jobs"]

    A -->|"can preempt"| B
    B -->|"can preempt"| C
    C -->|"can preempt"| D
    D -->|"can preempt"| E

    style A fill:#b91c1c,color:#fff
    style B fill:#c2410c,color:#fff
    style C fill:#b45309,color:#fff
    style D fill:#4d7c0f,color:#fff
    style E fill:#1d4ed8,color:#fff

Worked Example: Defining a PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 1000
globalDefault: false
description: "High-priority training jobs with team SLAs"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-finetune-v3
spec:
  template:
    spec:
      priorityClassName: training-high
      restartPolicy: Never   # Jobs require Never or OnFailure
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"

The pod inherits the priority class by name. If the cluster is under pressure and only nodes with lower-priority pods are available, the scheduler will preempt those pods to place this job.
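The victim-selection idea can be sketched as follows. This is a deliberately simplified model on a single node, assuming the lowest-priority pods are evicted first; the real scheduler also weighs PodDisruptionBudgets, graceful termination, and which node minimizes disruption. The Pod class, pod names, and GPU counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    priority: int
    gpus: int


def pick_victims(running: list[Pod], incoming: Pod,
                 free_gpus: int) -> Optional[list[Pod]]:
    """Return lowest-priority pods to evict so `incoming` fits, or None if it cannot."""
    needed = incoming.gpus - free_gpus
    victims: list[Pod] = []
    # Only strictly lower-priority pods are candidates; evict the lowest first.
    candidates = sorted((p for p in running if p.priority < incoming.priority),
                        key=lambda p: p.priority)
    for pod in candidates:
        if needed <= 0:
            break
        victims.append(pod)
        needed -= pod.gpus
    return victims if needed <= 0 else None


# A fully occupied 8-GPU node, using the priority values from the table above.
node = [Pod("notebook", 100, 2), Pod("experiment", 500, 4), Pod("inference", 10_000, 2)]
job = Pod("gpt-finetune-v3", 1_000, 4)
print([p.name for p in pick_victims(node, job, free_gpus=0)])
```

The inference pod is never considered because its priority is higher than the incoming job's; a development-tier pod submitted to the same node would get None back and stay pending.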

Node Affinity, Taints, and Tolerations for GPU Nodes

GPU nodes in a cluster are specialized and expensive. Without controls, ordinary CPU workloads can land on GPU nodes and waste hardware. Two mechanisms prevent this:

Taints and Tolerations act as a lock-and-key system. The cluster administrator applies a taint to every GPU node, which repels all pods that do not carry the matching toleration. GPU workloads declare that they tolerate the taint and are allowed to schedule there.

# Applied to the GPU node (by the node provisioner or admin)
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule

# Added to GPU pods
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"

Node Affinity goes further by allowing pods to express a preference or requirement for specific node characteristics, such as GPU model, memory capacity, or network interconnect type [Source: https://sealos.io/blog/the-ultimate-guide-to-gpu-provisioning-and-management-in-kubernetes/].

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - "NVIDIA-A100-SXM4-80GB"

This job will only run on nodes with an A100 80GB GPU — ensuring large language model training is not accidentally placed on a smaller V100 with insufficient memory.
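How the two filters combine can be sketched in plain Python. This is a simplified model that handles only exact-match tolerations and "In" label requirements; the node names and the V100 label value are illustrative assumptions.

```python
TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def tolerates(taints, tolerations):
    """True if every taint on the node is matched by one of the pod's tolerations."""
    return all(any(t == tol for tol in tolerations) for t in taints)


def matches_affinity(labels, required):
    """True if each required label key carries one of the allowed values."""
    return all(labels.get(key) in allowed for key, allowed in required.items())


def feasible_nodes(pod, nodes):
    return [n["name"] for n in nodes
            if tolerates(n["taints"], pod["tolerations"])
            and matches_affinity(n["labels"], pod["required_labels"])]


nodes = [
    {"name": "cpu-node", "taints": [], "labels": {}},
    {"name": "v100-node", "taints": [TAINT],
     "labels": {"nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB"}},
    {"name": "a100-node", "taints": [TAINT],
     "labels": {"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB"}},
]
cpu_pod = {"tolerations": [], "required_labels": {}}
gpu_pod = {"tolerations": [TAINT],
           "required_labels": {"nvidia.com/gpu.product": ["NVIDIA-A100-SXM4-80GB"]}}
tolerant_only = {"tolerations": [TAINT], "required_labels": {}}

print(feasible_nodes(cpu_pod, nodes))
print(feasible_nodes(gpu_pod, nodes))
print(feasible_nodes(tolerant_only, nodes))
```

The third pod shows why both mechanisms are needed: a toleration only permits scheduling onto tainted nodes, it does not attract the pod to them, so without an affinity rule the pod may still land on the CPU node.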

Figure 6.2: Taints, tolerations, and node affinity as a GPU node gate

flowchart LR
    subgraph Pods["Pod Submission"]
        P1["CPU Pod\n(no toleration,\nno affinity)"]
        P2["GPU Pod\n(toleration: nvidia.com/gpu,\naffinity: A100-80GB)"]
    end

    subgraph Nodes["Cluster Nodes"]
        N1["CPU Node\n(no taint)"]
        N2["GPU Node — V100\n(taint: nvidia.com/gpu=present:NoSchedule\nlabel: gpu.product=V100)"]
        N3["GPU Node — A100\n(taint: nvidia.com/gpu=present:NoSchedule\nlabel: gpu.product=A100-SXM4-80GB)"]
    end

    P1 -->|"schedules onto"| N1
    P1 -->|"blocked by taint"| N2
    P1 -->|"blocked by taint"| N3
    P2 -->|"toleration passes,\naffinity rejects V100"| N2
    P2 -->|"toleration passes,\naffinity matches A100"| N3

    style P1 fill:#6b7280,color:#fff
    style P2 fill:#1d4ed8,color:#fff
    style N1 fill:#374151,color:#fff
    style N2 fill:#7c3aed,color:#fff
    style N3 fill:#065f46,color:#fff

Topology-Aware Scheduling for Multi-GPU Locality

For distributed training across multiple GPUs, the physical placement of pods relative to each other has a direct impact on performance. Pods that share an NVLink fabric or sit on the same physical host communicate far faster than pods on separate nodes connected only by Ethernet.

The topologySpreadConstraints field controls how a job’s pods are distributed across topology domains (such as a rack, a zone, or a hostname). It can only spread pods; to concentrate them within one domain, use inter-pod affinity (podAffinity) with a topologyKey for that domain. Combined with node affinity rules for NVLink-capable nodes, this lets a training job request co-location on the same high-bandwidth domain [Source: https://www.ajeetraina.com/kubernetes-and-gpu-the-complete-guide-to-running-ai-ml-workloads-at-scale/].

Key Takeaway: PriorityClasses, taints, tolerations, and affinity rules form the foundation of GPU scheduling hygiene. They prevent resource leakage onto wrong hardware, guarantee preemption ordering across workload tiers, and place distributed training jobs on topologically optimal nodes.


Section 2: Kueue — Kubernetes-Native Job Queuing

The Problem Kueue Solves

The default Kubernetes scheduler assigns pods to nodes as fast as they arrive. For a multi-tenant GPU cluster, this creates problems: a single team can submit thousands of jobs, immediately consuming all GPUs. Other teams wait indefinitely. There is no mechanism for burst borrowing, fairness over time, or holding a job until all its resources are simultaneously available.

Kueue is a Kubernetes-native job queuing system that solves this by intercepting Jobs before their pods are created. It places jobs in a managed queue and only admits the entire workload to the cluster when all requested resources can be guaranteed at once [Source: https://kueue.sigs.k8s.io/docs/overview/]. The default scheduler still handles pod placement — Kueue handles admission control.

ClusterQueue, LocalQueue, and Workload Resources

Kueue’s architecture rests on three objects [Source: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/]:

| Resource | Scope | Purpose |
| --- | --- | --- |
| ClusterQueue | Cluster-wide | Defines a resource pool with quotas, cohort membership, and sharing rules |
| LocalQueue | Namespaced | Points to a ClusterQueue; teams submit jobs to their LocalQueue |
| Workload | Namespaced | Kueue’s internal representation of a submitted job, whether queued or admitted (auto-created) |

The separation between ClusterQueue and LocalQueue is intentional. Cluster administrators control the resource pool through ClusterQueues. Individual teams interact only with their LocalQueue in their own namespace and cannot modify quota limits. This cleanly separates platform governance from user access.

Figure 6.3: Kueue architecture — from job submission to pod scheduling

flowchart TD
    subgraph TeamResearch["Namespace: team-research"]
        J1["Job\n(spec.suspend: true\nqueue-name: lq-research)"]
        LQ1["LocalQueue\nlq-research"]
        W1["Workload\n(auto-created by Kueue)"]
    end

    subgraph TeamProd["Namespace: team-prod"]
        J2["Job\n(spec.suspend: true\nqueue-name: lq-prod)"]
        LQ2["LocalQueue\nlq-prod"]
        W2["Workload\n(auto-created by Kueue)"]
    end

    subgraph ClusterLevel["Cluster Level (admin-controlled)"]
        CQ1["ClusterQueue\ncq-research\nnominalQuota: 8 GPUs\nborrowingLimit: 8"]
        CQ2["ClusterQueue\ncq-production\nnominalQuota: 16 GPUs\nborrowingLimit: 0"]
        COH["Cohort: org-cohort\n(shared idle quota pool)"]
    end

    subgraph Scheduling["Scheduling Layer"]
        SCHED["Default Kubernetes\nScheduler\n(pod placement)"]
        NODES["GPU Nodes"]
    end

    J1 --> LQ1 --> W1 --> CQ1
    J2 --> LQ2 --> W2 --> CQ2
    CQ1 <-->|"borrow/lend idle quota"| COH
    CQ2 <-->|"member"| COH
    CQ1 -->|"admit workload\n(unsuspend job)"| SCHED
    CQ2 -->|"admit workload\n(unsuspend job)"| SCHED
    SCHED --> NODES

    style CQ1 fill:#1d4ed8,color:#fff
    style CQ2 fill:#0f766e,color:#fff
    style COH fill:#7c3aed,color:#fff
    style SCHED fill:#374151,color:#fff

Worked Example: Setting Up a Two-Team Cluster

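# ResourceFlavor referenced by both ClusterQueues below
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
---
# ClusterQueue for the research team
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  namespaceSelector: {}   # allow all namespaces; if unset, no namespace is eligible
  cohort: "org-cohort"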
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "gpu-flavor"
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8      # guaranteed GPUs
              borrowingLimit: 8   # can borrow up to 8 more from cohort
---
# ClusterQueue for the production team
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-production
spec:
  namespaceSelector: {}   # allow all namespaces; if unset, no namespace is eligible
  cohort: "org-cohort"
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "gpu-flavor"
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 0   # production does not need to borrow
---
# LocalQueue in the research namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-research
  namespace: team-research
spec:
  clusterQueue: cq-research

When a researcher submits a Job in the team-research namespace with spec.suspend: true and the kueue.x-k8s.io/queue-name label set to lq-research, Kueue takes over. If the job’s GPU request fits within the unused part of the research queue’s 8-GPU nominal quota, the job is admitted immediately. If it does not, and other queues in the cohort have idle quota, the job can borrow up to the queue’s borrowingLimit. Otherwise it waits in the queue [Source: https://kubernetes.io/blog/2022/10/04/introducing-kueue/].
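The admission decision for a single resource can be approximated with a few lines of arithmetic. The function below is a hypothetical simplification, not Kueue’s implementation: it ignores multiple flavors, lending limits, and preemption, and treats cohort_idle as the unused nominal quota currently available from other queues in the cohort.

```python
def can_admit(request: int, used: int, nominal: int,
              borrowing_limit: int, cohort_idle: int) -> bool:
    """Simplified single-resource quota check for one ClusterQueue."""
    own_free = max(0, nominal - used)
    already_borrowed = max(0, used - nominal)
    # Borrowing is capped both by the queue's limit and by what the cohort has idle.
    borrow_room = max(0, min(borrowing_limit - already_borrowed, cohort_idle))
    return request <= own_free + borrow_room


# cq-research from the example above: nominalQuota 8, borrowingLimit 8.
print(can_admit(request=4, used=4, nominal=8, borrowing_limit=8, cohort_idle=0))
print(can_admit(request=6, used=6, nominal=8, borrowing_limit=8, cohort_idle=10))
print(can_admit(request=6, used=6, nominal=8, borrowing_limit=8, cohort_idle=2))
# cq-production: borrowingLimit 0, so it can never exceed its nominal quota.
print(can_admit(request=1, used=16, nominal=16, borrowing_limit=0, cohort_idle=8))
```

The second call succeeds only because the cohort has idle quota to lend, while the last call shows that a borrowingLimit of 0 holds a queue to its nominal quota regardless of what is idle elsewhere.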

Fair-Sharing and Borrowing Between Queues

When multiple ClusterQueues belong to the same cohort, their unused nominal quota becomes a shared pool that any member can borrow from. This prevents wasted GPU time when one team’s quota is idle while another team has pending jobs [Source: https://kueue.sigs.k8s.io/docs/concepts/fair_sharing/].

Fair sharing uses a scoring system: the queue consuming the least relative to its nominal entitlement receives the highest admission priority. The queue consuming the most is the first target for preemption when a lower-usage queue needs resources. This creates a self-balancing system that converges toward equitable utilization over time [Source: https://medium.com/google-cloud/mastering-workloads-in-kubernetes-with-kueue-part-3-embracing-fair-sharing-8fef4d7fc13f].
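The ordering described above can be illustrated with a usage-to-nominal ratio. This is a sketch of the idea only: Kueue’s actual share value is computed differently and takes fair-sharing weights into account, and the cq-batch queue and all usage numbers here are hypothetical.

```python
def relative_usage(used: int, nominal: int) -> float:
    """Consumption relative to the queue's guaranteed entitlement."""
    return used / nominal


def admission_order(queues: dict[str, tuple[int, int]]) -> list[str]:
    """Queues sorted from least to most consumption relative to nominal quota."""
    return sorted(queues, key=lambda name: relative_usage(*queues[name]))


def preemption_target(queues: dict[str, tuple[int, int]]) -> str:
    """The queue consuming the most relative to its entitlement."""
    return max(queues, key=lambda name: relative_usage(*queues[name]))


# (used GPUs, nominal GPUs) per ClusterQueue
queues = {"cq-research": (12, 8), "cq-production": (8, 16), "cq-batch": (4, 4)}
print(admission_order(queues))
print(preemption_target(queues))
```

Here cq-research is borrowing beyond its quota, so it is last in line for new admissions and first in line to give resources back.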

The analogy here is a shared bank account with guaranteed minimum balances per account holder, but where unused balances can be temporarily lent to others — and the system automatically claws back those loans when the original account holder needs their guaranteed amount.

Priority-Based Preemption Within Queues

Kueue supports two queueing strategies for ordering pending workloads: StrictFIFO (strict first-in, first-out, so a workload that cannot be admitted blocks those behind it) and BestEffortFIFO (first-in, first-out, except that a workload that cannot be admitted does not block newer workloads that fit). On top of this, Workload Priority Classes provide a numerical priority that takes precedence over arrival order [Source: https://medium.com/google-cloud/mastering-workloads-in-kubernetes-with-kueue-part-1-queues-and-priorities-339257d8b4ba].

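The reclaimWithinCohort field in a ClusterQueue’s preemption settings takes one of three values. It governs what Kueue does when a pending workload fits within its own queue’s nominal quota but that quota is currently being borrowed by other queues in the cohort [Source: https://kueue.sigs.k8s.io/docs/concepts/preemption/]:

| reclaimWithinCohort | Behavior |
| --- | --- |
| Never | Do not preempt any workload in the cohort |
| LowerPriority | Preempt borrowing workloads in the cohort that have lower priority than the pending one |
| Any | Preempt borrowing workloads in the cohort regardless of priority |

Two sibling fields, withinClusterQueue and borrowWithinCohort, control preemption inside the queue itself and preemption while borrowing.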

The LessThanInitialShare policy adds a fairness guard: preemption is only allowed if the preempting queue’s share (including its new workload) would be strictly less than the target queue’s current share. This prevents aggressive queues from using high-priority preemption to permanently monopolize resources [Source: https://kueue.sigs.k8s.io/docs/concepts/preemption/].

Kueue’s design also prevents preemption loops — scenarios where workload A preempts B, then B preempts A, cycling indefinitely. The implementation detects and breaks such cycles to ensure stable allocation.

Integration with Jobs, RayJobs, and Training Operator

Kueue integrates natively with Kubernetes batch/v1 Job, but also supports custom resources through its framework: RayJob (KubeRay), PyTorchJob and TFJob (Training Operator), and JobSet. In each case, the integration wraps the custom resource in a Kueue Workload and manages its spec.suspend field — suspending the job while queued and releasing it when admitted [Source: https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.html].

# RayJob submitted via Kueue
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llm-training-run
  namespace: team-research
  labels:
    kueue.x-k8s.io/queue-name: lq-research
spec:
  suspend: true   # Kueue controls this field
  rayClusterSpec:
    headGroupSpec:
      ...
    workerGroupSpecs:
      - replicas: 4
        template:
          spec:
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: "2"

Key Takeaway: Kueue extends the default Kubernetes scheduler with admission control, quota enforcement, and fair-sharing without replacing it. Teams submit jobs to namespaced LocalQueues; administrators control resource pools through ClusterQueues; cohorts enable burst borrowing and preemption-based fairness across tenant boundaries.


Section 3: Volcano Batch Scheduler

Gang Scheduling and the Deadlock Problem

Consider a distributed training job that needs 16 GPUs across 4 nodes, run as 16 single-GPU worker pods. The default Kubernetes scheduler will happily place 12 pods on available nodes and leave the remaining 4 pending until capacity appears. The 12 placed pods hold their GPU allocations while waiting for their peers. If other jobs in the cluster hold partial allocations in the same way, the result is a deadlock: everyone waits, nobody makes progress, and GPU utilization collapses.

Gang scheduling solves this with an all-or-nothing guarantee. A job’s pod group is admitted only when all requested resources are simultaneously available. If 16 GPUs are not free at once, the entire job waits as a unit — but crucially, it does not hold any resources while waiting, leaving capacity available for jobs that can fit [Source: https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/].
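The difference between the two behaviors can be reproduced in a toy simulation. The sketch below assumes one GPU per pod and interleaved pod arrival for the non-gang case; job names and sizes are hypothetical, and neither function reflects how any real scheduler is implemented.

```python
def place_pod_by_pod(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Interleaved per-pod placement: each job grabs GPUs one pod at a time."""
    placed = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and placed[name] < needed:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


def place_gang(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """All-or-nothing: a job receives every GPU it asked for, or none."""
    placed = {}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
        else:
            placed[name] = 0  # waits as a unit, holding nothing
    return placed


def runnable(placed: dict[str, int], jobs: dict[str, int]) -> list[str]:
    """Jobs whose full set of pods has been placed."""
    return [name for name, needed in jobs.items() if placed[name] == needed]


jobs = {"job-a": 16, "job-b": 12}
partial = place_pod_by_pod(20, jobs)
gang = place_gang(20, jobs)
print(partial, runnable(partial, jobs))
print(gang, runnable(gang, jobs))
```

In the first case all 20 GPUs are held but no job can start; in the second, job-a runs on 16 GPUs and job-b waits without holding anything, leaving 4 GPUs free for smaller work.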

Volcano implements gang scheduling through two objects: Queue (analogous to Kueue’s ClusterQueue) and PodGroup (a group of pods that must be co-scheduled). Volcano runs as its own scheduler alongside the default one: pods that set schedulerName: volcano bypass the default Kubernetes scheduler and are placed by Volcano itself [Source: https://volcano.sh/en/].

Figure 6.4: Partial allocation deadlock vs. gang scheduling all-or-nothing admission

flowchart LR
    subgraph Deadlock["Without Gang Scheduling: DEADLOCK"]
        direction TB
        DA["Job A\n(needs 16 GPUs)"]
        DB["Job B\n(needs 12 GPUs)"]
        DN["Cluster: 20 GPUs free"]
        DA -->|"12 pods placed\n(holds 12 GPUs)"| DN
        DB -->|"8 pods placed\n(holds 8 GPUs)"| DN
        DSTALL["Remaining pods\nPENDING forever\nGPU utilization collapses"]
        DN --> DSTALL
    end

    subgraph GangSched["With Gang Scheduling: ALL-OR-NOTHING"]
        direction TB
        GA["Job A PodGroup\n(minMember: 16)"]
        GB["Job B PodGroup\n(minMember: 12)"]
        GN["Cluster: 20 GPUs free"]
        GW["Job B waits as unit\n(holds 0 GPUs while queued)"]
        GA -->|"16 GPUs available\n→ all 16 pods placed"| GN
        GB -->|"12 GPUs not free\nat once → wait"| GW
        GRUN["Job A runs\nJob B queued cleanly\nno deadlock"]
        GN --> GRUN
    end

    style DSTALL fill:#b91c1c,color:#fff
    style GRUN fill:#065f46,color:#fff
    style GW fill:#92400e,color:#fff

Queue and PodGroup Gang Scheduling

# A Volcano Queue with capacity and weight
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-queue
spec:
  weight: 1
  capability:
    nvidia.com/gpu: "32"
---
# A PodGroup requiring all 16 pods before any start
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-ddp-run
  namespace: team-research
spec:
  minMember: 16
  queue: research-queue
  priorityClassName: training-high

When the Training Operator is configured to use Volcano for gang scheduling, a submitted PyTorchJob gets a PodGroup created for it automatically. Volcano monitors the group and holds all pods in Pending state until minMember pods can be simultaneously scheduled. Once the threshold is met, all pods are placed at once [Source: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html].

Fair-Share, DRF, and Proportion Scheduling Plugins

Volcano uses a plugin architecture. Multiple scheduling plugins compose to produce a final decision. Key plugins for AI workloads:

| Plugin | Behavior |
| --- | --- |
| proportion | Allocates cluster capacity in proportion to queue weights |
| gang | Enforces all-or-nothing PodGroup admission |
| binpack | Packs pods densely onto fewer nodes to leave whole nodes free |
| drf | Dominant Resource Fairness: maximizes fairness across multi-resource requests |
| priority | Respects PriorityClass values when ordering jobs within a queue |
| nodeorder | Scores nodes based on resource fit, NUMA locality, and topology |

Dominant Resource Fairness (DRF) is particularly valuable for AI clusters. When different jobs consume different mixes of resources (one job is GPU-heavy, another is memory-heavy), simple quota counting is unfair. DRF identifies each job’s dominant resource (the one it uses proportionally most) and equalizes fairness across that dimension. A pure GPU training job and a high-memory data-preprocessing job are both treated fairly, even though they consume resources in completely different ratios [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].
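The dominant-share calculation is short enough to show directly. In the sketch below, the cluster capacity and both job demands are hypothetical numbers chosen for illustration; the function is not Volcano’s implementation of the drf plugin.

```python
def dominant_share(demand: dict[str, float],
                   capacity: dict[str, float]) -> tuple[str, float]:
    """Return the resource a job uses proportionally most, and that fraction."""
    shares = {resource: demand[resource] / capacity[resource] for resource in demand}
    dominant = max(shares, key=shares.get)
    return dominant, shares[dominant]


# Hypothetical cluster capacity and job demands.
capacity = {"gpu": 32, "memory_gib": 4096}
training = {"gpu": 8, "memory_gib": 256}
preprocessing = {"gpu": 0, "memory_gib": 1024}

print(dominant_share(training, capacity))
print(dominant_share(preprocessing, capacity))
```

Both jobs come out with a dominant share of 0.25, one on GPUs and the other on memory, so DRF treats them as consuming equal portions of the cluster even though a GPU-only quota would count the preprocessing job as free.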

Preemption and Reclaim Actions

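Volcano distinguishes between two corrective scheduling actions: preempt, which evicts lower-priority tasks so that higher-priority jobs in the same queue can run, and reclaim, which takes resources back from a queue running above its deserved share so that another queue can reach its own.

These actions fire as part of Volcano’s scheduling cycle, governed by configurable action lists in the scheduler configuration. Reclaim ensures that a burst of activity from one team does not permanently suppress other teams’ throughput [Source: https://volcano.sh/en/].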

Kueue + Volcano: Better Together

The recommended pattern for large-scale GPU clusters is to combine both systems [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig]:

| Responsibility | Tool |
| --- | --- |
| Admission control and quota enforcement | Kueue |
| Burst borrowing across tenant boundaries | Kueue (cohorts) |
| Strict gang semantics (all-or-nothing) | Volcano |
| MPI/PyTorch/TensorFlow operator integration | Volcano |
| DRF and multi-resource fairness | Volcano |

Kueue controls whether a job is admitted to the cluster at all. Once admitted, Volcano handles the pod-level scheduling decisions. This layered architecture means teams benefit from Kueue’s quota visibility and borrowing logic while still getting Volcano’s battle-tested gang scheduling guarantees for distributed training.

Key Takeaway: Volcano solves the distributed training deadlock problem through PodGroup gang scheduling and provides a plugin-based architecture for DRF fairness, dense bin-packing, and priority preemption. For production clusters, pairing Kueue (admission control) with Volcano (gang scheduling) provides both quota governance and deadlock-free distributed job execution.


Section 4: Cluster Bin-Packing and Cost Optimization

GPU Utilization Monitoring and Right-Sizing

Before optimizing, you need visibility. A cluster can look busy, with many pods scheduled, while actual GPU compute utilization sits below 30%. The NVIDIA Data Center GPU Manager (DCGM) exposes per-GPU metrics including SM utilization, memory utilization, and NVLink bandwidth. The DCGM exporter publishes these metrics for Prometheus to scrape, and Grafana dashboards visualize them.

Common utilization anti-patterns and their fixes:

| Anti-Pattern | Symptom | Fix |
| --- | --- | --- |
| Oversized resource requests | GPU allocated, SM utilization < 10% | Right-size with DCGM data; use MIG for small jobs |
| Partial allocation deadlock | Jobs pending despite free GPUs | Enable Volcano gang scheduling |
| Single-tenant monopoly | One namespace consumes all GPUs | Kueue ClusterQueue quotas |
| Node fragmentation | Multiple nodes partially used | Enable Volcano binpack plugin |

Right-sizing is the process of adjusting resources.limits.nvidia.com/gpu in job specs to match actual consumption. For jobs that need only a fraction of a GPU (interactive notebooks, small inference endpoints), GPU time-slicing or NVIDIA MIG (Multi-Instance GPU) allows multiple workloads to share one physical GPU [Source: https://developers.redhat.com/articles/2025/05/22/improve-gpu-utilization-kueue-openshift-ai].

MIG partitions a single A100 or H100 into isolated slices with dedicated memory and compute. An A100 80GB can be split into up to seven 1g.10gb MIG instances, each appearing as an independent GPU to Kubernetes. This lets seven small training jobs or inference workloads run simultaneously on what would otherwise be a single-tenant device [Source: https://debugg.ai/resources/kubernetes-gpu-scheduling-2025-kueue-volcano-mig].

Spot and Preemptible Instances for Training Workloads

GPU instances are among the most expensive compute resources in any cloud. Spot (AWS) and preemptible (GCP) instances offer the same hardware at discounts of up to 70%, with the tradeoff that the cloud provider can reclaim them with short notice [Source: https://cloudnativenow.com/contributed-content/the-ultimate-guide-to-gpu-scaling-with-karpenter/].

Training workloads — especially long-running experiments — are well-suited for spot if they implement checkpointing. A job that saves model state every 30 minutes loses at most 30 minutes of work when a spot node is reclaimed. Inference workloads serving live traffic are less suitable, though batch inference pipelines with job-level retry logic handle spot interruptions gracefully.
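A checkpoint-and-resume loop can be sketched with nothing but the standard library. The example below is a toy: a step counter stands in for real model state, a local JSON file stands in for durable object storage, and the interrupt_at parameter is a hypothetical hook that simulates the spot node being reclaimed.

```python
import json
import os
import tempfile


def train(total_steps, checkpoint_path, checkpoint_every, interrupt_at=None):
    """Toy training loop that resumes from the last checkpoint if one exists."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return step  # simulate the node being reclaimed mid-run
        step += 1  # stand-in for one real training step
        if step % checkpoint_every == 0:
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp_path, checkpoint_path)  # atomic swap avoids torn files
    return step


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    reached = train(100, path, checkpoint_every=30, interrupt_at=75)
    with open(path) as f:
        saved = json.load(f)["step"]
    print("interrupted at", reached, "last checkpoint", saved, "lost", reached - saved)
    print("resumed run finished at", train(100, path, checkpoint_every=30))
```

The work lost on interruption is always less than one checkpoint interval, so the interval is a direct trade-off between checkpoint I/O cost and the amount of recomputation after a reclaim.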

Key spot strategy considerations:

| Factor | Recommendation |
| --- | --- |
| Instance diversity | Request multiple GPU instance families (G4dn, G5, P4) to widen the spot pool |
| Availability Zone spread | Request across all AZs; each AZ+instance type is a separate capacity pool |
| Checkpointing | Save model state to durable storage (S3/GCS) at regular intervals |
| Interruption handling | Use node termination handlers to drain pods gracefully before reclaim |
| Fallback | Configure on-demand as fallback for critical jobs that cannot checkpoint |

Cluster Autoscaler Configuration for GPU Node Pools

Cluster Autoscaler (CA) watches for unschedulable pods and scales node groups up or down based on pre-configured node group definitions. For GPU clusters, CA requires a separate node group per GPU type (one for A100 nodes, one for T4 nodes, etc.) and scales each group independently.

CA works well for homogeneous GPU pools but struggles with diversity. When a job needs a specific GPU model not available in any current node group, CA cannot provision it on the fly — the node group must already be defined [Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/].

Scale-down is conservative by default: CA will not remove a node until it has been underutilized for a configurable period (default 10 minutes), which matters for GPU node costs.
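The scale-down rule reduces to a two-part test, sketched below. This is a simplification under stated assumptions: the function and parameter names are hypothetical, utilization is taken as requested over allocatable resources, and the real autoscaler additionally checks that the node’s pods can be rescheduled elsewhere and are allowed to be evicted.

```python
def scale_down_candidate(requested: float, allocatable: float,
                         unneeded_minutes: float,
                         utilization_threshold: float = 0.5,
                         unneeded_time_minutes: float = 10.0) -> bool:
    """True if a node has stayed under the threshold for the full waiting period."""
    utilization = requested / allocatable
    return (utilization < utilization_threshold
            and unneeded_minutes >= unneeded_time_minutes)


# An 8-GPU node in three hypothetical states.
print(scale_down_candidate(requested=2, allocatable=8, unneeded_minutes=12))
print(scale_down_candidate(requested=2, allocatable=8, unneeded_minutes=3))
print(scale_down_candidate(requested=6, allocatable=8, unneeded_minutes=30))
```

On expensive GPU nodes, every minute of the waiting period is paid for, so shortening it saves money at the cost of more node churn when load fluctuates.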

# Pod annotation (set on the pod, not the node group): CA will not evict this pod,
# so it will also not scale down the node the pod is running on
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
# Cluster Autoscaler command-line flags (container args of the CA Deployment):
--scale-down-utilization-threshold=0.5   # node becomes a removal candidate if < 50% utilized
--scale-down-delay-after-add=5m          # wait 5 min after scale-up before evaluating scale-down
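
The interaction between the utilization threshold, the underutilization period, and the post-scale-up delay can be modelled in a few lines. This is a deliberately simplified sketch, not the Cluster Autoscaler's implementation: the real component also checks whether the node's pods can be rescheduled elsewhere, and the NodeSample type and first_removable_minute function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class NodeSample:
    minute: int          # minutes since the node was added
    utilization: float   # requested / allocatable, 0.0 to 1.0

def first_removable_minute(samples, threshold=0.5, unneeded_minutes=10, delay_after_add=5):
    """Return the first minute at which the node could be removed, or None."""
    below_since = None
    for s in samples:
        if s.minute < delay_after_add:
            continue  # scale-down is not evaluated yet
        if s.utilization >= threshold:
            below_since = None  # a busy sample resets the clock
            continue
        if below_since is None:
            below_since = s.minute
        if s.minute - below_since >= unneeded_minutes:
            return s.minute
    return None

# A node that idles from the start, and one with a single busy spike at minute 12
idle = [NodeSample(m, 0.1) for m in range(31)]
spiky = [NodeSample(m, 0.9 if m == 12 else 0.1) for m in range(31)]

print(first_removable_minute(idle))   # 15
print(first_removable_minute(spiky))  # 23
```

Even a brief spike restarts the underutilization clock, which is why an expensive GPU node can linger well past the nominal 10 minutes.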

Karpenter for Just-in-Time GPU Provisioning

Karpenter is a Kubernetes node provisioner that replaces the legacy Cluster Autoscaler with a fundamentally different model. Rather than scaling fixed node groups, Karpenter watches for pending pods, reads their resource requirements and constraints directly, and provisions exactly the right node type in real time — typically under 60 seconds [Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/].

The core Karpenter object is a NodePool, which defines a flexible set of constraints that Karpenter uses when selecting instance types:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn", "g5", "p4d"]      # multiple families = wider spot pool
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]       # spot first, on-demand fallback
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]   # all AZs
      nodeClassRef:
        group: karpenter.k8s.aws              # karpenter.sh/v1 expects group, not apiVersion
        kind: EC2NodeClass
        name: gpu-nodeclass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name (WhenUnderutilized in v1beta1); removes idle and underused nodes
    consolidateAfter: 30s

Consolidation is Karpenter’s most powerful cost feature. It continuously evaluates whether running pods could be packed onto fewer nodes. If a 4-GPU node is running only one 1-GPU pod, Karpenter will drain that node and move the pod to an existing node with spare capacity, then terminate the now-empty node. This happens automatically without administrator intervention [Source: https://cloudnativenow.com/contributed-content/the-ultimate-guide-to-gpu-scaling-with-karpenter/].
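
The consolidation check described above can be approximated with a first-fit-decreasing pass over GPU requests. This is a hedged sketch of the idea only: Karpenter's actual decision also weighs instance price, scheduling constraints, and disruption budgets, and the function and node names here are hypothetical.

```python
def can_consolidate(candidate_pods, other_nodes_free_gpus):
    """Return a pod -> node placement if every pod on the candidate node fits elsewhere, else None.

    candidate_pods: {pod_name: gpus_requested} for the node we would like to remove
    other_nodes_free_gpus: {node_name: free_gpus} for the remaining nodes
    """
    free = dict(other_nodes_free_gpus)  # copy so the caller's view is not mutated
    placement = {}
    # Place the largest requests first; small pods fit into leftover gaps more easily
    for pod, gpus in sorted(candidate_pods.items(), key=lambda kv: -kv[1]):
        target = next((node for node, spare in free.items() if spare >= gpus), None)
        if target is None:
            return None  # at least one pod has nowhere to go; keep the node
        free[target] -= gpus
        placement[pod] = target
    return placement

# The 4-GPU node from the text runs a single 1-GPU pod
print(can_consolidate({"trainer-small": 1}, {"node-b": 0, "node-c": 2}))
# {'trainer-small': 'node-c'}
```

A None result means the node stays; a placement means the node can be drained and terminated once the pods have moved.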

Organizations adopting Karpenter with Spot instances have documented cost reductions of up to 35% [Source: https://appinventiv.com/blog/kubernetes-cost-optimization-success-story/].

The contrast with the legacy Cluster Autoscaler is significant, as Figure 6.5 and the comparison table that follows it show:

Figure 6.5: Karpenter just-in-time provisioning and consolidation flow

flowchart TD
    subgraph Trigger["Trigger: Unschedulable Pod"]
        POD["Pending GPU Pod\n(requests: 4x A10G,\nspot preferred)"]
    end

    subgraph KarpenterLogic["Karpenter Decision Engine"]
        KA["Read pod requirements\n& constraints from spec"]
        KB["Select optimal instance type\n(g5.12xlarge, 4x A10G, spot)"]
        KC["Provision node\nin target AZ (<60s)"]
    end

    subgraph Consolidation["Ongoing Consolidation Loop"]
        CL["Monitor running nodes\nfor underutilization"]
        CM["Can pods fit on fewer nodes?"]
        CN["Drain underutilized node\nMigrate pods"]
        CO["Terminate empty node\n(cost eliminated)"]
    end

    subgraph Compare["Legacy Cluster Autoscaler"]
        CA1["Watches unschedulable pods"]
        CA2["Scale up matching\npre-defined node group\n(3–5 min)"]
        CA3["Scale down after\nutilization threshold\n(10 min default delay)"]
    end

    POD --> KA --> KB --> KC
    KC --> CL --> CM
    CM -->|"yes"| CN --> CO
    CM -->|"no"| CL

    CA1 --> CA2 --> CA3

    style KC fill:#065f46,color:#fff
    style CO fill:#1d4ed8,color:#fff
    style CA2 fill:#7c3aed,color:#fff
    style CA3 fill:#6b7280,color:#fff

| Capability | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Node selection | Fixed node groups | Dynamic, constraint-based |
| Provisioning speed | 3–5 minutes | Under 60 seconds |
| Instance diversity | Per-group definition | Flexible NodePool requirements |
| Consolidation | Basic scale-down only | Active bin-packing with pod migration |
| Spot integration | Via node group configuration | First-class, automatic fallback |
| Availability zone awareness | Group-level | Per-pending-pod |

A Complete Cost Optimization Architecture

Combining the techniques from this section produces a layered cost optimization stack:

┌─────────────────────────────────────────────────────────────┐
│  Job Submission Layer                                       │
│  Kueue: quota enforcement, fair-sharing, burst borrowing   │
├─────────────────────────────────────────────────────────────┤
│  Gang Scheduling Layer                                      │
│  Volcano: all-or-nothing admission, DRF fairness           │
├─────────────────────────────────────────────────────────────┤
│  GPU Sharing Layer                                          │
│  NVIDIA MIG / Time-slicing: sub-GPU allocation for small   │
│  jobs; eliminates single-job GPU monopoly                  │
├─────────────────────────────────────────────────────────────┤
│  Node Provisioning Layer                                    │
│  Karpenter: just-in-time spot provisioning, consolidation  │
└─────────────────────────────────────────────────────────────┘

Each layer addresses a distinct waste vector: quota enforcement prevents monopoly, gang scheduling eliminates deadlock idle time, GPU sharing maximizes per-device utilization, and just-in-time provisioning ensures you are never paying for nodes with no work to do.

Key Takeaway: GPU cost optimization is a layered problem. Right-sizing with DCGM metrics, MIG partitioning for small workloads, spot instances for fault-tolerant training jobs, and Karpenter’s just-in-time provisioning with active consolidation together deliver the largest reductions in per-experiment compute cost while maintaining throughput for high-priority workloads.


Chapter Summary

This chapter has examined the full scheduling stack for production AI workloads on Kubernetes, from pod-level priority controls to cluster-level cost automation.

Kubernetes scheduling primitives — PriorityClasses, taints, tolerations, and node affinity — form the foundation. They ensure GPU hardware is reserved for appropriate workloads, define preemption ordering across production and research tiers, and co-locate distributed training pods on topologically optimal nodes.

Kueue introduces admission control at the job level. Rather than allowing all submitted jobs to compete for GPUs immediately, Kueue holds jobs in namespaced LocalQueues backed by ClusterQueue resource pools. Cohorts enable burst borrowing between teams. Fair sharing scoring ensures no single team permanently dominates shared capacity. Three preemption policies give platform administrators fine-grained control over how high-priority jobs displace running lower-priority ones.

Volcano addresses the distributed training deadlock problem with gang scheduling via PodGroups. Its plugin architecture supports DRF fairness, dense bin-packing, and deep integration with ML framework operators. For large-scale clusters, the recommended pattern combines Kueue for admission control with Volcano for gang-semantic pod placement.

Cluster optimization begins with visibility through DCGM GPU metrics, proceeds to right-sizing and MIG partitioning for small workloads, and uses spot instances for fault-tolerant training jobs. Karpenter replaces the legacy Cluster Autoscaler with just-in-time, constraint-based node provisioning and active consolidation — automatically eliminating idle GPU nodes and achieving up to 35% cost reduction in documented real-world deployments.


Key Terms

| Term | Definition |
| --- | --- |
| Kueue | A Kubernetes-native job queuing system that manages workload admission via spec.suspend, enforcing quotas and fair-sharing without replacing the default scheduler |
| Volcano | A batch scheduling system for Kubernetes that replaces the default scheduler with gang scheduling, DRF fairness, and deep ML framework operator integration |
| Gang scheduling | An all-or-nothing scheduling strategy where all pods in a group must be simultaneously placeable before any are started, preventing distributed job deadlock |
| PriorityClass | A Kubernetes object assigning a numerical priority value to pods; higher-priority pods preempt lower-priority pods when node resources are exhausted |
| Preemption | The eviction of a running lower-priority pod to free resources for a higher-priority pending pod |
| Fair-share | A scheduling policy that distributes resources proportionally to entitlements over time, giving underserved queues preference and targeting overserved queues for preemption |
| DRF | Dominant Resource Fairness: a multi-resource fairness algorithm that equalizes each competing job's share of its dominant resource (the resource of which it consumes the largest fraction) |
| Bin-packing | A scheduling strategy that concentrates pods onto the fewest possible nodes to leave entire nodes free or to maximize per-node utilization |
| Karpenter | A Kubernetes node provisioner that dynamically provisions the right node type for pending pods in under 60 seconds, with active consolidation to eliminate idle nodes |
| Cluster autoscaler | The legacy Kubernetes node scaling component that adjusts the size of pre-configured node groups based on unschedulable pod counts and utilization thresholds |
| Taint | A node property that repels pods unless those pods carry a matching toleration |
| Toleration | A pod property that allows it to be scheduled onto nodes with a matching taint |

Chapter 7: MLOps and ML Pipelines on Kubernetes

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Getting a machine learning model from a data scientist’s laptop into production is famously hard. Training a model once is a craft project; training it reliably, repeatedly, and with full auditability is an engineering discipline. MLOps is that discipline — and Kubernetes has become its natural home.

Think of an ML pipeline as a factory assembly line. Raw materials (data) enter one end; finished products (deployed models) exit the other. Each workstation on the line is a container doing a discrete job — preprocessing, training, evaluation, registration, deployment. Kubernetes is the factory floor: it schedules the workstations, manages the conveyor belts between them, and restarts any station that breaks down. The tools in this chapter — Kubeflow Pipelines, Argo Workflows, MLflow, Feast, and ArgoCD — are the machines and quality-control systems that make that factory hum.


Section 1: ML Pipeline Orchestration

An ML pipeline is a directed acyclic graph (DAG) of computational steps. Each node in the graph is a containerized task; edges carry data artifacts or parameter values from one task to the next. Orchestrating that graph reliably on Kubernetes is the job of pipeline engines.
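
The DAG structure determines which steps may run and when. As a minimal sketch, assuming a hypothetical three-step pipeline, the Python standard library's graphlib module can derive a valid execution order and reject cyclic graphs; a real pipeline engine does the same bookkeeping before scheduling containers.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on (hypothetical pipeline)
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"preprocess", "train"},
}

def execution_waves(dag):
    """Group steps into waves; all steps within one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves

print(execution_waves(pipeline))  # [['preprocess'], ['train'], ['evaluate']]
```

Steps with no edge between them land in the same wave, which is where an orchestrator finds its parallelism.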

Kubeflow Pipelines v2 Architecture and SDK

Kubeflow is an end-to-end MLOps platform purpose-built for Kubernetes. At its core sits Kubeflow Pipelines (KFP), which provides a Python SDK and domain-specific language (DSL) for defining pipelines, plus backend and frontend services for running and scheduling them on any Kubernetes cluster [Source: https://www.kubeflow.org/docs/components/pipelines/legacy-v1/introduction/].

KFP v2 introduced several significant improvements over v1:

| Feature | KFP v1 | KFP v2 |
| --- | --- | --- |
| Component decorator | @component (limited) | @dsl.component and @dsl.container_component |
| Intermediate representation | Argo YAML | Backend-agnostic IR |
| Artifact visibility | Hidden implementation detail | First-class DAG nodes |
| Nested pipelines | Not supported | Pipelines as pipeline components |
| Single component execution | Full pipeline required | Run individual components |

The @dsl.component decorator is the primary authoring tool. It converts a plain Python function into a self-contained pipeline step that KFP can containerize and schedule. The @dsl.container_component decorator is used when you need precise control over the container command — for example, when wrapping a non-Python tool or a shell script [Source: https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2].

Worked Example — A Three-Step Training Pipeline:

from kfp import dsl
from kfp.dsl import Dataset, Model, Output, Input

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def preprocess(raw_data_path: str, dataset: Output[Dataset]):
    import pandas as pd
    df = pd.read_csv(raw_data_path).dropna()
    df.to_csv(dataset.path, index=False)

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def train(dataset: Input[Dataset], model: Output[Model], n_estimators: int = 100):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    import joblib
    df = pd.read_csv(dataset.path)
    X, y = df.drop("label", axis=1), df["label"]
    clf = RandomForestClassifier(n_estimators=n_estimators)
    clf.fit(X, y)
    joblib.dump(clf, model.path)

@dsl.component(base_image="python:3.11-slim", packages_to_install=["pandas", "scikit-learn"])
def evaluate(dataset: Input[Dataset], model: Input[Model]) -> float:
    import pandas as pd
    from sklearn.metrics import accuracy_score
    import joblib
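    # For brevity this scores on the same data used for training;
    # a real pipeline would pass a held-out split here.
    df = pd.read_csv(dataset.path)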
    X, y = df.drop("label", axis=1), df["label"]
    clf = joblib.load(model.path)
    return accuracy_score(y, clf.predict(X))

@dsl.pipeline(name="random-forest-pipeline")
def rf_pipeline(data_path: str, n_estimators: int = 100):
    prep = preprocess(raw_data_path=data_path)
    fit  = train(dataset=prep.outputs["dataset"], n_estimators=n_estimators)
    evaluate(dataset=prep.outputs["dataset"], model=fit.outputs["model"])

Notice how the Dataset and Model types are declared as Output and Input parameters. KFP v2 surfaces these as first-class artifact nodes in the pipeline graph visualization, so engineers can see exactly which model artifact came from which training run [Source: https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2].

Figure 7.1: KFP v2 Three-Step Pipeline DAG with Artifact Flow

flowchart LR
    A([raw_data_path\nstring param]) --> B

    subgraph Pipeline["random-forest-pipeline DAG"]
        B["preprocess()\n@dsl.component"]
        C["train()\n@dsl.component"]
        D["evaluate()\n@dsl.component"]
    end

    B -- "Dataset artifact" --> C
    B -- "Dataset artifact" --> D
    C -- "Model artifact" --> D

    B --> E[(Artifact Store\nMinIO / GCS)]
    C --> E
    D --> F([float\naccuracy score])

    style B fill:#4a90d9,color:#fff,stroke:#2c6fad
    style C fill:#4a90d9,color:#fff,stroke:#2c6fad
    style D fill:#4a90d9,color:#fff,stroke:#2c6fad
    style E fill:#f5a623,color:#fff,stroke:#c47f00
    style Pipeline fill:#f0f4ff,stroke:#9ab0d9

Pipeline Caching is a standout productivity feature. When a component’s inputs — both parameter values and artifact content hashes — match a previous execution, KFP reuses the cached output rather than re-running the step. This is invaluable during iterative development: change the n_estimators parameter, and only the train and evaluate steps re-execute; the preprocess step hits the cache and returns instantly.
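
The caching behaviour can be illustrated with a small in-memory model. This is a sketch of the idea only, not KFP's implementation: KFP's real cache fingerprint covers more than this (for example the component's own definition), and cache_key and StepCache are hypothetical names.

```python
import hashlib
import json

def cache_key(component: str, params: dict, input_artifact_hashes: dict) -> str:
    """Fingerprint a step from its name, parameter values, and input artifact hashes."""
    payload = json.dumps(
        {"component": component, "params": params, "inputs": input_artifact_hashes},
        sort_keys=True,  # stable ordering so equal inputs always hash the same
    )
    return hashlib.sha256(payload.encode()).hexdigest()

class StepCache:
    def __init__(self):
        self._store = {}
        self.executed = []  # names of steps that actually ran (cache misses)

    def run(self, component, fn, params, input_artifact_hashes):
        key = cache_key(component, params, input_artifact_hashes)
        if key not in self._store:
            self._store[key] = fn(**params)
            self.executed.append(component)
        return self._store[key]

# Run the pipeline twice, changing only n_estimators on the second run
cache = StepCache()
for n in (100, 200):
    data = cache.run("preprocess", lambda data_path: f"dataset({data_path})",
                     {"data_path": "raw.csv"}, {})
    cache.run("train", lambda n_estimators: f"model(n={n_estimators})",
              {"n_estimators": n}, {"dataset": data})

print(cache.executed)  # ['preprocess', 'train', 'train']
```

The preprocess step runs once, while train runs twice because its parameter changed between runs.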

Argo Workflows for ML DAGs

Kubeflow Pipelines compiles Python pipelines into an intermediate representation that is executed by Argo Workflows as the default backend [Source: https://valohai.com/blog/kubeflow-vs-argo/]. Argo can also be used directly, without KFP, when you need fine-grained control over Kubernetes-native features.

Argo’s value proposition for ML is straightforward: every step is a container, and Kubernetes is the scheduler. Teams working in multiple languages or wrapping legacy tools often prefer writing Argo YAML directly rather than using the KFP Python SDK. Argo’s native capabilities relevant to ML include DAG-based workflow execution, parameter passing between steps, artifact management with S3/GCS integration, and configurable retry and error-handling policies [Source: https://argo-workflows.readthedocs.io/en/latest/use-cases/machine-learning/].

Worked Example — Argo DAG Workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-training-
spec:
  entrypoint: ml-dag
  templates:
    - name: ml-dag
      dag:
        tasks:
          - name: preprocess
            template: run-step
            arguments:
              parameters: [{name: script, value: "preprocess.py"}]
          - name: train
            template: run-step
            dependencies: [preprocess]
            arguments:
              parameters: [{name: script, value: "train.py"}]
          - name: evaluate
            template: run-step
            dependencies: [train]
            arguments:
              parameters: [{name: script, value: "evaluate.py"}]
    - name: run-step
      inputs:
        parameters:
          - name: script
      container:
        image: my-ml-image:latest
        command: [python, "{{inputs.parameters.script}}"]
        env:
          - name: DATA_BUCKET
            value: s3://my-training-data

Tekton Pipelines for ML CI/CD

Tekton occupies a different niche than KFP or raw Argo. Where KFP focuses on ML-specific orchestration (artifact lineage, experiment tracking integration), Tekton is a general-purpose CI/CD framework on Kubernetes. Its strength in MLOps is in the CI/CD layer: building container images for pipeline steps, running linting and unit tests before training, and triggering downstream deployments after model validation. Tekton’s Pipeline and Task resources integrate naturally with tools like ArgoCD to create a full GitOps deployment chain from code commit to production serving.

Pipeline Parameterization and Caching

Effective pipelines treat all tunable values as explicit parameters — never as hardcoded constants buried in code. This enables:

Both KFP and Argo store per-run parameter values in their metadata backends, providing a natural audit trail alongside the artifact lineage graph.

Key Takeaway: Kubeflow Pipelines provides a Python-first authoring experience that compiles to backend-agnostic IR executed by Argo Workflows. KFP v2’s first-class artifact support, nested pipelines, and per-step caching dramatically reduce iteration time and improve reproducibility.


Section 2: Experiment Tracking and Model Registry

MLflow on Kubernetes for Experiment Logging

Every training run produces metrics, parameters, and artifacts. Without a system to capture them, teams rapidly lose track of which configuration produced which result — the equivalent of a chemistry lab with no lab notebooks. MLflow is the most widely adopted open-source solution for this problem [Source: https://www.truefoundry.com/blog/mlops-tools].

Deploying MLflow as a production-ready service on Kubernetes uses a standard three-tier architecture:

┌─────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                          │
│                                                              │
│  ┌──────────────────┐    ┌──────────────────┐               │
│  │  MLflow Tracking  │    │   PostgreSQL      │               │
│  │  Server           │───►│   (backend store) │               │
│  │  (Deployment)     │    │   (StatefulSet)   │               │
│  └────────┬─────────┘    └──────────────────┘               │
│           │                                                   │
│           │              ┌──────────────────┐               │
│           └─────────────►│   MinIO / S3     │               │
│                          │  (artifact store) │               │
│                          └──────────────────┘               │
└─────────────────────────────────────────────────────────────┘

Figure 7.2: MLflow Production Architecture on Kubernetes

flowchart TB
    subgraph Cluster["Kubernetes Cluster (mlops namespace)"]
        TServer["MLflow Tracking Server\nDeployment (2 replicas)"]
        PG["PostgreSQL\nStatefulSet\n(backend store)"]
        S3["MinIO / S3\n(artifact store)"]
        Ing["Ingress / Service\n(port 5000)"]
    end

    Client["Training Pod\nmlflow.log_metric(...)"]
    UI["Data Scientist\nBrowser"]

    Client -->|"REST API\nlogs params & metrics"| TServer
    Client -->|"uploads model\nartifacts directly"| S3
    TServer -->|"persists run\nmetadata"| PG
    TServer -->|"resolves artifact\nURIs"| S3
    UI -->|"HTTP"| Ing
    Ing --> TServer

    style TServer fill:#4a90d9,color:#fff,stroke:#2c6fad
    style PG fill:#7b68ee,color:#fff,stroke:#5a4fcf
    style S3 fill:#f5a623,color:#fff,stroke:#c47f00
    style Ing fill:#50c878,color:#fff,stroke:#2e8b57
    style Cluster fill:#f0f4ff,stroke:#9ab0d9

The deployment involves four groups of Kubernetes objects [Source: https://medium.com/@nsalexamy/deploying-mlflow-on-kubernetes-a-production-ready-guide-171d8aa2ed30]:

| Object | Purpose |
| --- | --- |
| Deployment | Runs the MLflow tracking server as a scalable pod |
| Service + Ingress | Exposes the UI and REST API externally |
| StatefulSet (PostgreSQL) | Persists run metadata and model registry entries |
| Secret / ConfigMap | Stores database credentials and S3 endpoint configuration |

Training code running anywhere on the cluster (or off-cluster) logs to the tracking server via the MLflow client SDK:

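import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and model so the snippet is self-contained; substitute your own
X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=200, max_depth=8).fit(X, y)

mlflow.set_tracking_uri("http://mlflow-service.mlops.svc.cluster.local:5000")
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("accuracy", 0.942)   # illustrative values; log your real evaluation results
    mlflow.log_metric("f1_score", 0.917)
    mlflow.sklearn.log_model(clf, "model")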

Model Versioning and Artifact Management

Logging a model to MLflow is only half the story. The MLflow Model Registry provides the governance layer: a versioned catalog of every model artifact with structured promotion stages and team-level controls [Source: https://oneuptime.com/blog/post/2026-02-09-mlflow-model-registry-kubernetes/view].

The registry uses semantic aliases to communicate a model’s production status. Rather than hard-coding version numbers in serving infrastructure, downstream systems reference aliases:

| Alias | Meaning |
| --- | --- |
| champion | The model currently serving production traffic |
| challenger | A candidate model undergoing A/B testing |
| shadow | A model receiving traffic copies for offline evaluation |
| archived | A retired model retained for audit purposes |

Promotion between stages is a deliberate, audited action — not an automatic side-effect of a training run. This separation of concerns is critical: a model registry tracks versions, metadata, and deployment status, whereas raw object storage holds files but provides no context, making rollbacks and dependency tracking prohibitively difficult [Source: https://oneuptime.com/blog/post/2026-02-09-mlflow-model-registry-kubernetes/view].
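
The alias indirection can be sketched with an in-memory stand-in for a registry. The ModelRegistry class and the artifact URIs below are hypothetical and exist only to show the pattern; MLflow 2.x exposes the equivalent operations through MlflowClient.set_registered_model_alias and models:/<name>@<alias> URIs, so check the documentation for the version you run.

```python
class ModelRegistry:
    """In-memory stand-in for a model registry with alias indirection."""

    def __init__(self):
        self.versions = {}   # version number -> artifact URI
        self.aliases = {}    # alias -> version number
        self.audit_log = []  # (alias, previous_version, new_version) tuples

    def register(self, artifact_uri: str) -> int:
        version = len(self.versions) + 1
        self.versions[version] = artifact_uri
        return version

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        # Record every move so promotions remain auditable
        self.audit_log.append((alias, self.aliases.get(alias), version))
        self.aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """What serving code calls: it never sees a version number."""
        return self.versions[self.aliases[alias]]

registry = ModelRegistry()
v1 = registry.register("s3://models/fraud/1")  # hypothetical URIs
v2 = registry.register("s3://models/fraud/2")
registry.set_alias("champion", v1)
registry.set_alias("challenger", v2)
registry.set_alias("champion", v2)  # deliberate promotion; serving code is unchanged

print(registry.resolve("champion"))  # s3://models/fraud/2
```

Rolling back is the same operation in reverse: point the champion alias at the earlier version again.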

Figure 7.3: Model Registry Promotion Lifecycle

stateDiagram-v2
    [*] --> Registered : mlflow.register_model()

    Registered --> Challenger : Manual promotion\n(team review)
    Challenger --> Shadow : A/B test approved\n(offline eval passed)
    Shadow --> Champion : Shadow traffic\nvalidation passed

    Champion --> Archived : New champion\npromoted

    Challenger --> Archived : Validation\nfailed / rejected
    Shadow --> Challenger : Shadow metrics\ninsufficient

    note right of Champion
        Serving production traffic
        Referenced by alias "champion"
    end note
    note right of Challenger
        Undergoing A/B test
        Referenced by alias "challenger"
    end note
    note right of Shadow
        Receiving traffic copies
        offline evaluation only
    end note

Weights & Biases and Neptune Integration Patterns

For teams needing richer visualization, real-time collaboration, or experiment comparison across large sweeps, managed services like Weights & Biases (W&B) and Neptune extend the MLflow pattern. Both integrate with Kubernetes workloads via environment variables and lightweight SDK calls:

import wandb
wandb.init(project="fraud-detection", config={"n_estimators": 200})
wandb.log({"accuracy": 0.942, "f1": 0.917})

The key architectural decision is whether experiment metadata leaves the cluster (managed SaaS) or stays within it (self-hosted MLflow). Self-hosted keeps data governance simple; managed SaaS reduces operational burden. A common split is self-hosted MLflow for regulated data and SaaS tooling for exploratory research.

Key Takeaway: MLflow on Kubernetes provides experiment tracking, model versioning, and artifact storage in a single platform. Using semantic aliases like “champion” and “challenger” decouples deployment automation from specific version numbers and enables safe canary promotions.


Section 3: Feature Stores on Kubernetes

The Training-Serving Skew Problem

Imagine a data scientist building a fraud detection model. During training, they compute a feature — “transactions in the last 24 hours” — using a batch SQL query over historical data. In production, the same feature must be computed in real time from a live event stream. If the training computation and the serving computation diverge even slightly (different time windows, different null handling), the model’s performance degrades in ways that are extremely difficult to diagnose. This is called training-serving skew, and it is one of the most common sources of silent model degradation in production.

A feature store solves this by providing a single, versioned definition of every feature that is shared between training and serving pipelines.
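
A small example makes the skew concrete. The sketch below, with hypothetical function names and made-up timestamps, computes "transactions in the last 24 hours" two ways: a rolling window, and a divergent reimplementation that counts the calendar day instead. Both look reasonable in isolation, yet they disagree on the same data.

```python
from datetime import datetime, timedelta

def transactions_24h(event_times, as_of):
    """Shared definition: count events in the 24 hours up to and including as_of."""
    window_start = as_of - timedelta(hours=24)
    return sum(1 for t in event_times if window_start < t <= as_of)

def skewed_transactions_24h(event_times, as_of):
    """Divergent reimplementation: counts the current calendar day, not a rolling window."""
    return sum(1 for t in event_times if t.date() == as_of.date())

events = [
    datetime(2024, 1, 1, 23, 0),
    datetime(2024, 1, 2, 1, 0),
    datetime(2024, 1, 2, 9, 0),
]
as_of = datetime(2024, 1, 2, 10, 0)

print(transactions_24h(events, as_of))         # 3
print(skewed_transactions_24h(events, as_of))  # 2
```

If training used the first function and serving the second, the model would see systematically lower counts in production than it was trained on. Calling one shared definition from both paths removes that class of bug.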

Feast Deployment on Kubernetes

Feast is the most widely adopted open-source feature store, and it integrates natively with the Kubeflow ecosystem [Source: https://www.kubeflow.org/docs/external-add-ons/feast/introduction/]. Feast’s architecture separates two concerns with distinct storage backends [Source: https://docs.feast.dev]:

| Component | Storage Backend | Use Case |
| --- | --- | --- |
| Offline store | PostgreSQL, BigQuery, Snowflake | Historical feature retrieval for model training |
| Online store | Redis | Low-latency feature lookup during model serving |
| Registry | PostgreSQL | Feature definitions, versioning, and metadata |

A production Kubernetes deployment of Feast uses the following objects [Source: https://oneuptime.com/blog/post/2026-02-09-feast-feature-store-kubernetes/view]:

# feast-server Deployment — 3 replicas for high availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feast-server
  namespace: mlops
spec:
  replicas: 3
  selector:
    matchLabels:
      app: feast-server
  template:
    metadata:
      labels:
        app: feast-server
    spec:
      containers:
        - name: feast-server
          image: my-feast-image:latest
          ports:
            - containerPort: 6566
          env:
            - name: FEAST_REGISTRY_PATH
              value: postgresql://feast-db:5432/feast
---
# PostgreSQL for registry — StatefulSet with PVC
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: feast-db
  namespace: mlops
spec:
  serviceName: feast-db
  replicas: 1
  selector:
    matchLabels:
      app: feast-db
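  template:
    metadata:
      labels:
        app: feast-db               # must match spec.selector.matchLabels
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          env:
            - name: POSTGRES_PASSWORD          # the official image will not initialize without a password
              valueFrom:
                secretKeyRef:
                  name: feast-db-credentials   # hypothetical Secret name
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data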
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

Figure 7.4: Feast Dual-Path Feature Architecture

flowchart TB
    Def["Feature Definitions\n(Python FeatureView\nin Git)"]

    subgraph Offline["Offline Path (Training)"]
        OStore["Offline Store\nPostgreSQL / BigQuery"]
        TrainJob["Training Pipeline\nget_historical_features()"]
    end

    subgraph Online["Online Path (Serving)"]
        Redis["Online Store\nRedis"]
        InfSvc["Inference Service\nget_online_features()"]
    end

    Mat["Materialization Job\n(Kubernetes CronJob)"]
    Reg["Feast Registry\nPostgreSQL\n(feature metadata)"]

    Def --> Reg
    Reg --> OStore
    Reg --> Redis
    OStore --> TrainJob
    OStore --> Mat
    Mat -->|"pushes computed\nfeatures"| Redis
    Redis --> InfSvc

    TrainJob -->|"trains model"| Model["Trained Model"]
    InfSvc -->|"scores request"| Pred["Prediction"]

    style Offline fill:#e8f4e8,stroke:#5a9a5a
    style Online fill:#e8f0ff,stroke:#5a6aad
    style Mat fill:#fff3cd,stroke:#c47f00
    style Reg fill:#f5f5f5,stroke:#888

Online and Offline Feature Serving

The workflow is as follows:

  1. Feature definitions are written as Python FeatureView objects and committed to Git
  2. Materialization runs on a schedule (or event trigger) to push computed features from the offline store into Redis for online serving
  3. Training pipelines retrieve historical point-in-time correct features from the offline store via get_historical_features
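  4. Serving infrastructure retrieves the same features at inference time via get_online_features, using identical definitions

from datetime import datetime
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes feature_store.yaml in the current directory
# event_timestamp drives the point-in-time join (illustrative values)
entity_df = pd.DataFrame({"user_id": ["u-12345"], "event_timestamp": [datetime(2024, 1, 1)]})

# Training: retrieve historical features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:transactions_24h", "user_stats:avg_amount_7d"]
).to_df()

# Serving: retrieve online features (identical feature names)
online_features = store.get_online_features(
    features=["user_stats:transactions_24h", "user_stats:avg_amount_7d"],
    entity_rows=[{"user_id": "u-12345"}]
).to_dict()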

The user_stats:transactions_24h feature definition is the same object in both calls — that is the guarantee Feast provides.
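
"Point-in-time correct" means a training row may only see feature values that were already known at that row's event timestamp; anything later would leak future information into training. A minimal sketch of that lookup, assuming integer timestamps and a hypothetical point_in_time_lookup helper (a real feature store also handles details such as TTLs and multiple entities):

```python
from bisect import bisect_right

def point_in_time_lookup(feature_rows, as_of):
    """Return the latest value recorded at or before as_of, or None.

    feature_rows: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, as_of)  # number of rows with timestamp <= as_of
    return feature_rows[idx - 1][1] if idx else None

# Hypothetical history of transactions_24h for one user: (hour, value)
history = [(1, 2), (5, 4), (9, 7)]

print(point_in_time_lookup(history, 6))  # 4
print(point_in_time_lookup(history, 0))  # None
```

A label observed at hour 6 is joined with the value 4 recorded at hour 5, never with the value 7 that only became known at hour 9.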

Feature Engineering Pipelines

Feast’s default in-process materialization engine does not scale well for large data volumes. Kubernetes provides a natural scaling mechanism: run materialization as Kubernetes Jobs, processing multiple time-range partitions in parallel [Source: https://docs.feast.dev/how-to-guides/running-feast-in-production]. The Bytewax materialization engine provides a production-grade implementation of this pattern, distributing feature computation across pods while writing results to the online store.
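
The partitioning step of that pattern is simple to sketch. The helper below is hypothetical and only splits a time range into windows; in a real setup each window would become the arguments of one Kubernetes Job, and how those Jobs are launched is outside this sketch.

```python
from datetime import datetime, timedelta

def partition_time_range(start, end, step):
    """Split [start, end) into consecutive half-open windows of at most `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)  # the final window may be shorter
        windows.append((cursor, window_end))
        cursor = window_end
    return windows

windows = partition_time_range(
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 2, 6, 0),
    timedelta(hours=12),
)
for w_start, w_end in windows:
    print(w_start.isoformat(), "->", w_end.isoformat())
```

Half-open windows avoid materializing a boundary timestamp twice when partitions are processed independently.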

Key Takeaway: Feast eliminates training-serving skew by providing a single versioned feature definition shared between batch training and real-time serving. On Kubernetes, a 3-replica Feast server backed by Redis and PostgreSQL gives a production-ready, horizontally scalable feature serving layer.


Section 4: CI/CD for Machine Learning

Traditional software CI/CD pipelines build code, run tests, and deploy binaries. ML CI/CD must do all of that, and also: retrain models on new data, validate statistical model quality before promotion, manage large binary artifacts (model weights), and handle the fact that “tests passing” does not guarantee a model will perform well on the next day’s data distribution.

GitOps Model Deployment with ArgoCD or Flux

GitOps is a deployment philosophy where Git is the single source of truth for both application configuration and infrastructure state. A GitOps controller (ArgoCD or Flux) watches a Git repository and continuously reconciles the cluster state to match what is declared there [Source: https://dev.to/apprecode/mlops-best-practices-10-practical-practices-teams-actually-use-h77].
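
The reconciliation loop at the heart of GitOps reduces to diffing desired state against live state and applying the difference. The sketch below uses plain dictionaries and hypothetical resource names; real controllers such as ArgoCD and Flux work on Kubernetes manifests and make deletion (pruning) configurable rather than unconditional.

```python
def diff_state(desired: dict, live: dict) -> dict:
    """Return the actions needed to make `live` match `desired`."""
    return {
        "create": sorted(k for k in desired if k not in live),
        "update": sorted(k for k in desired if k in live and desired[k] != live[k]),
        "delete": sorted(k for k in live if k not in desired),
    }

def reconcile(desired: dict, live: dict) -> dict:
    """Apply the diff and return the new live state; the declared state always wins."""
    actions = diff_state(desired, live)
    new_live = {k: v for k, v in live.items() if k not in actions["delete"]}
    for name in actions["create"] + actions["update"]:
        new_live[name] = desired[name]
    return new_live

# Desired state as declared in Git versus what is running (hypothetical resources)
desired = {"inference-service": {"modelVersion": "2"}, "serving-config": {"replicas": 3}}
live = {"inference-service": {"modelVersion": "1"}, "debug-pod": {}}

print(diff_state(desired, live))
# {'create': ['serving-config'], 'update': ['inference-service'], 'delete': ['debug-pod']}
```

Running reconcile repeatedly is safe: once live matches desired, the diff is empty and nothing changes.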

For ML deployments, the pattern works as follows:

Developer pushes model config change to Git
           │
           ▼
┌─────────────────────┐
│  Git Repository      │
│  (infra/ directory)  │
│  - deployment.yaml   │
│  - serving.yaml      │
│  - model-config.yaml │
└──────────┬──────────┘
           │  ArgoCD watches for drift
           ▼
┌─────────────────────┐
│  ArgoCD              │
│  (reconciliation     │
│   controller)        │
└──────────┬──────────┘
           │  applies changes
           ▼
┌─────────────────────┐
│  Kubernetes Cluster  │
│  - KServe Inference  │
│    Service           │
│  - model version     │
│    updated           │
└─────────────────────┘

Figure 7.5: GitOps ML Deployment Flow with ArgoCD

flowchart TD
    Dev["ML Engineer\npushes model config"] --> Git

    subgraph GitRepo["Git Repository (infra/)"]
        Git["deployment.yaml\nserving.yaml\nmodel-config.yaml"]
    end

    Git -->|"ArgoCD detects\nconfiguration drift"| Argo

    subgraph ArgoCD["ArgoCD Controller"]
        Argo["Reconciliation Loop\n(watches for drift)"]
        Gate["Validation Gate\naccuracy ≥ threshold?"]
    end

    Argo --> Gate
    Gate -->|"PASSED"| Apply
    Gate -->|"FAILED\nexit non-zero"| Block["Promotion Blocked\nAlert sent to team"]

    subgraph K8s["Kubernetes Cluster"]
        Apply["Apply Manifests"]
        KServe["KServe InferenceService\n(new model version live)"]
    end

    Apply --> KServe

    style GitRepo fill:#f9f3e8,stroke:#c8a84b
    style ArgoCD fill:#e8f0ff,stroke:#5a6aad
    style K8s fill:#e8f4e8,stroke:#5a9a5a
    style Block fill:#ffe8e8,stroke:#c94040,color:#8b0000
    style Gate fill:#fff3cd,stroke:#c47f00

A typical GitOps pipeline for ML runs four stages in order: lint, validate, plan, then apply [Source: https://www.thirstysprout.com/post/mlops-best-practices]. Kustomize can layer environment-specific overlays (staging vs production model versions, replica counts, resource limits) on top of a shared base configuration, all managed declaratively in version control [Source: https://dev.to/apprecode/mlops-best-practices-10-practical-practices-teams-actually-use-h77].

Infrastructure as Code (IaC) is the enabling foundation. Because training, serving, and fine-tuning jobs have wildly different resource profiles — a training job might need 8 GPUs for 4 hours; an inference deployment needs 2 CPUs and 4 GB RAM indefinitely — IaC captures these differences explicitly and prevents the “environment drift” where staging and production slowly diverge [Source: https://kubeify.com/blog/the-role-of-kubernetes-in-mlops].

Automated Model Validation Gates

A validation gate is a pipeline step that blocks promotion unless a model meets defined quality thresholds. Gates typically check:

| Gate Type | Example Criterion | Action on Failure |
| --- | --- | --- |
| Accuracy threshold | Accuracy ≥ 0.90 on held-out test set | Block promotion, alert team |
| Regression check | F1 score ≥ 95% of current champion’s F1 | Block promotion |
| Data quality check | Input feature distributions within 2σ of training distribution | Block promotion |
| Latency check | p99 inference latency ≤ 100ms under load | Block promotion |
| Bias/fairness audit | Equal opportunity difference ≤ 0.05 | Block promotion |

In an Argo Workflows or Tekton pipeline, these gates are ordinary container steps that exit non-zero on failure, causing the workflow to halt and the downstream deployment step never to execute.

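# Validation gate component (KFP)
from kfp import dsl


@dsl.component(packages_to_install=["mlflow"])
def validate_model(model_uri: str, accuracy_threshold: float = 0.90) -> str:
    import mlflow  # imported here because KFP runs the function body in its own container

    client = mlflow.MlflowClient()
    # Assumes a "runs:/<run_id>/<artifact_path>" URI, so element 1 is the run ID
    run = client.get_run(model_uri.split("/")[1])
    accuracy = float(run.data.metrics["accuracy"])
    if accuracy < accuracy_threshold:
        # An uncaught exception fails this step, so downstream steps never run
        raise ValueError(
            f"Model accuracy {accuracy:.3f} below threshold {accuracy_threshold}"
        )
    return "PASSED"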
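
The gate table above can also be evaluated outside KFP, for example in a plain container step for Argo Workflows or Tekton. The following is a minimal sketch, not a prescribed implementation: the metric dictionary keys (`accuracy`, `f1`, `p99_ms`) and the function names are hypothetical, and only three of the five gates are covered.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(candidate: dict, champion: dict,
                   min_accuracy: float = 0.90,
                   min_f1_ratio: float = 0.95,
                   max_p99_ms: float = 100.0) -> list:
    """Evaluate the accuracy, regression, and latency gates from the table.

    The dictionary keys are illustrative assumptions, not a standard schema.
    """
    f1_floor = min_f1_ratio * champion["f1"]
    return [
        GateResult("accuracy", candidate["accuracy"] >= min_accuracy,
                   f"{candidate['accuracy']:.3f} vs floor {min_accuracy:.3f}"),
        GateResult("regression", candidate["f1"] >= f1_floor,
                   f"{candidate['f1']:.3f} vs floor {f1_floor:.3f}"),
        GateResult("latency", candidate["p99_ms"] <= max_p99_ms,
                   f"{candidate['p99_ms']:.1f} ms vs ceiling {max_p99_ms:.1f} ms"),
    ]


def exit_code(results: list) -> int:
    """Return 0 only if every gate passed; any failure yields 1."""
    return 0 if all(r.passed for r in results) else 1
```

A container entrypoint would typically finish with `sys.exit(exit_code(results))`, so that a failed gate halts the workflow as described above.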

Container Image Management for ML Workloads

ML container images have several characteristics that complicate standard image build practices:

Best practices for managing ML images on Kubernetes:

| Practice | Implementation |
| --- | --- |
| Multi-stage builds | Separate build-time dependencies from runtime image |
| Pinned base images | FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime (never latest) |
| Image layer caching | Structure Dockerfile to copy requirements before source code |
| Image registry | Harbor or ECR with vulnerability scanning enabled |
| Image signing | Cosign signatures verified at admission time |
| Shared base images | One organization-wide CUDA base; per-project layers on top |

Reproducibility with Pinned Dependencies and Deterministic Builds

The worst-case scenario in ML production is a training result you cannot reproduce — a model in production that you cannot recreate if you need to retrain from scratch. Reproducibility requires discipline at four layers:

Layer 4: Random seeds         — set torch.manual_seed(), numpy.random.seed()
Layer 3: Data versioning      — DVC or Delta Lake snapshot for training datasets
Layer 2: Code versioning      — Git SHA pinned in pipeline metadata
Layer 1: Environment          — Docker image digest (not tag) pinned in pipeline
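
One way to make the four layers concrete is to record them together and derive a single fingerprint for each training run. The sketch below is illustrative only: `RunManifest` and its field names are hypothetical, and the values you store would come from your own pipeline metadata.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the four reproducibility layers for one run."""
    image_digest: str   # Layer 1: environment (image digest, not tag)
    git_sha: str        # Layer 2: code version
    data_version: str   # Layer 3: dataset snapshot identifier
    seed: int           # Layer 4: random seed passed to the training code

    def fingerprint(self) -> str:
        """Stable SHA-256 over all four layers; changes if any layer changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Two runs with the same fingerprint were launched with the same environment, code, data, and seed; a differing fingerprint shows that at least one layer changed.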

Git tags and Docker image tags are mutable — someone can push a new image to the same tag. Image digests (sha256:...) are content-addressed and immutable. Pinning pipeline steps to image digests rather than tags is the single most impactful reproducibility practice available [Source: https://learn.microsoft.com/en-us/azure/aks/best-practices-ml-ops].

A complete reproducibility workflow in KFP pins the image digest at compile time:

from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output


@dsl.component(
    base_image="my-registry.io/ml-base@sha256:a3f9b2c1..."  # digest, not tag
)
def train(dataset: Input[Dataset], model: Output[Model]):
    ...

The compiled pipeline IR records the exact image digest, so re-running a pipeline from its compiled IR three months later uses precisely the same environment — regardless of what has been pushed to the registry since.
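
A pipeline lint step can enforce digest pinning mechanically. The following is a minimal sketch under the assumption that image references are available as plain strings; the function names are hypothetical, and the check only looks for a trailing `@sha256:` digest of 64 hex characters.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference is pinned by content digest rather than only by tag."""
    return _DIGEST_RE.search(image_ref) is not None


def unpinned_images(image_refs: list) -> list:
    """Return the references a lint step should reject."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]
```

Running `unpinned_images` over every `base_image` in a pipeline and failing the build on a non-empty result keeps mutable tags out of compiled pipelines.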

Key Takeaway: GitOps with ArgoCD makes ML deployments declarative, auditable, and automatically reconciled. Combining validation gates, pinned image digests, data versioning, and Git-tracked configuration is the foundation of genuinely reproducible ML production systems.


Chapter Summary

This chapter traced the full MLOps lifecycle on Kubernetes, from raw data through deployed model. The key architectural pattern is a layered system of concerns:

  1. Orchestration (KFP / Argo Workflows) — schedules and coordinates containerized pipeline steps as DAGs, with artifact lineage and caching
  2. Experiment tracking (MLflow) — captures every run’s parameters, metrics, and artifacts; the model registry governs promotion via semantic aliases
  3. Feature management (Feast) — eliminates training-serving skew with a single versioned feature definition shared between batch and real-time paths
  4. Deployment automation (ArgoCD / GitOps) — makes cluster state a function of Git state, with validation gates that block unsafe promotions

The thread connecting all four layers is reproducibility. Pinned image digests, versioned data, explicit parameters, and structured governance (RBAC, audit trails, policy-as-code) together ensure that every model in production can be explained, audited, and recreated.


Key Terms

| Term | Definition |
| --- | --- |
| Kubeflow Pipelines | An MLOps platform component providing a Python SDK and execution backend for defining and running ML pipelines as DAGs on Kubernetes |
| Argo Workflows | A Kubernetes-native workflow engine that executes DAGs of containerized steps; the default execution backend for Kubeflow Pipelines |
| MLflow | An open-source platform for experiment tracking, model registry, and artifact management, deployable as a Kubernetes service |
| Model registry | A versioned catalog of trained model artifacts with promotion stages, metadata, audit trails, and access controls |
| Feature store | A data system that provides a single, versioned definition of ML features shared between training (offline) and serving (online) pipelines |
| Feast | An open-source feature store that provides offline storage for training and online storage (Redis) for low-latency serving, with native Kubeflow integration |
| GitOps | A deployment philosophy where Git is the single source of truth for cluster configuration, with an automated controller reconciling live state to match the repository |
| ArgoCD | A Kubernetes-native GitOps continuous delivery controller that watches Git repositories and reconciles cluster state to the declared configuration |
| CI/CD | Continuous Integration / Continuous Delivery; the practice of automating build, test, and deployment stages in a software delivery pipeline |
| Experiment tracking | The systematic recording of training run parameters, metrics, and output artifacts to enable comparison, reproducibility, and governance |
| Reproducibility | The ability to recreate an identical ML training result at a later time, requiring pinned environments, versioned data, explicit parameters, and fixed random seeds |
| Tekton | A Kubernetes-native CI/CD framework using Task and Pipeline CRDs; used in ML contexts primarily for image building and deployment triggering |

Chapter 8: Networking and Security for AI Workloads

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Running AI workloads at scale on Kubernetes is not just a scheduling problem — it is simultaneously a networking problem and a security problem. A large language model training run distributes gradient updates across dozens or hundreds of GPU nodes every few seconds. If those collective communication operations crawl over a conventional Ethernet stack, the GPU utilization collapses and the training job takes weeks instead of days. At the same time, AI clusters aggregate enormous amounts of sensitive intellectual property: proprietary model weights, licensed training datasets, inference API keys, and the PII-laden outputs of fine-tuning pipelines. A misconfigured RBAC binding or an overly permissive network policy can expose all of it.

This chapter addresses both sides of that equation. The first half covers the specialist networking technologies — RDMA, GPUDirect, SR-IOV, and Multus — that allow GPU collective communication to operate at wire speed. The second half covers the security controls — NetworkPolicy, service mesh, RBAC, Pod Security Standards, and secrets management — that protect the cluster and its data.


Section 1: High-Performance Networking for Distributed Training

Ordinary TCP/IP networking was designed for flexibility and correctness, not for the microsecond-latency, near-zero-CPU-overhead requirements of distributed deep learning. When a training job executes an AllReduce collective across 256 GPUs, each participating process must exchange gradient tensors with its peers hundreds of times per training step. The bottleneck is not the GPU computation — it is the time the gradients spend waiting in CPU buffers and kernel queues.

Think of the difference between shipping packages through a postal sorting center versus loading them directly onto a dedicated freight train. TCP/IP is the sorting center: flexible, universal, but slow and full of hand-offs. High-performance AI networking is the freight train: faster, more direct, with far fewer touch points.

RDMA and InfiniBand with Kubernetes

Remote Direct Memory Access (RDMA) allows a network adapter to read from and write to the memory of a remote host without involving the remote CPU. This eliminates the copy-to-kernel-buffer and context-switch overhead that dominates conventional networking at high message rates.

InfiniBand is the dominant physical transport for RDMA in HPC and AI data centers. It provides link-level reliability, congestion management, and switch fabrics rated at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port, with end-to-end latency measured in single-digit microseconds.

To expose InfiniBand or RoCE (RDMA over Converged Ethernet) devices to Kubernetes pods, you use the NVIDIA Network Operator. The Network Operator manages the full driver and plugin lifecycle [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:

Install the GPU Operator alongside the Network Operator. The two operators cooperate to wire GPUDirect RDMA end-to-end. Three installation strategies exist depending on how MOFED drivers are managed [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:

# Strategy 1: DMA-BUF — Network Operator manages MOFED drivers (recommended)
helm install gpu-operator nvidia/gpu-operator \
  --version 26.3.0 \
  --namespace gpu-operator \
  --set driver.rdma.useHostMofed=false

# Strategy 2: Host MOFED — drivers pre-installed on the node
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.useHostMofed=true

# Strategy 3: Legacy nvidia-peermem (for older MOFED versions)
helm install gpu-operator nvidia/gpu-operator \
  --set driver.rdma.enabled=true

Pods request InfiniBand resources through the standard Kubernetes resource request mechanism. Setting rdma/ib: 1 acts as a scheduling boolean that ensures the pod lands on a node with InfiniBand hardware [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html]:

resources:
  requests:
    nvidia.com/gpu: 8
    rdma/ib: 1
  limits:
    nvidia.com/gpu: 8
    rdma/ib: 1

Table 8.1 — RDMA Transport Options

| Transport | Physical Medium | Typical Bandwidth | Latency | Notes |
| --- | --- | --- | --- | --- |
| InfiniBand HDR | InfiniBand cable | 200 Gb/s per port | ~1 µs | Industry standard for HPC/AI |
| InfiniBand NDR | InfiniBand cable | 400 Gb/s per port | <1 µs | Latest generation |
| RoCEv2 | 25/100 GbE | 100 Gb/s per port | 2–5 µs | Runs on existing Ethernet switches |
| iWARP | Standard Ethernet | 25–100 Gb/s | 5–10 µs | Software-friendly, higher latency |

GPUDirect RDMA for GPU-to-GPU Communication

GPUDirect RDMA extends RDMA into the GPU memory address space. Without GPUDirect, gradient data on GPU A must travel: GPU memory → CPU pinned buffer → NIC → network → NIC → CPU pinned buffer → GPU memory. With GPUDirect, the NIC can DMA directly into and out of the GPU’s BAR memory region, eliminating both CPU copies [Source: https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/].

The practical impact is significant: collective communication libraries like NCCL (NVIDIA Collective Communications Library) can achieve 90–95% of theoretical InfiniBand bandwidth when GPUDirect RDMA is enabled, versus 40–60% through a CPU-mediated path.
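
To see what those efficiency figures mean for a training step, the sketch below estimates the bandwidth-bound time of one AllReduce. It assumes a ring AllReduce, in which each rank sends roughly 2(N-1)/N times the payload, and it ignores latency and compute overlap; NCCL may choose other algorithms, so treat the result as a rough estimate. The function names are hypothetical.

```python
def effective_gbps(line_rate_gbps: float, efficiency: float) -> float:
    """Usable bandwidth given a link rate and an achieved fraction of it."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return line_rate_gbps * efficiency


def ring_allreduce_seconds(payload_bytes: float, world_size: int,
                           bandwidth_gbps: float) -> float:
    """Bandwidth-only estimate for one ring AllReduce (latency ignored)."""
    if world_size < 2:
        return 0.0  # nothing to exchange with a single rank
    # Each rank transmits 2 * (N - 1) / N times the payload in a ring AllReduce.
    bytes_on_wire = 2 * (world_size - 1) / world_size * payload_bytes
    return bytes_on_wire * 8 / (bandwidth_gbps * 1e9)
```

For a 1 GB gradient payload across 8 ranks on a 200 Gb/s HDR link, this gives about 0.14 s at 50% efficiency and about 0.078 s at 90% efficiency, a difference that is paid on every synchronization.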

Without GPUDirect RDMA:
  GPU VRAM → [PCIe] → CPU DRAM → [PCIe] → NIC → Network → NIC → [PCIe] → CPU DRAM → [PCIe] → GPU VRAM

With GPUDirect RDMA:
  GPU BAR Memory → [PCIe P2P] → NIC → Network → NIC → [PCIe P2P] → GPU BAR Memory

Figure 8.1: GPUDirect RDMA eliminates CPU-mediated copies from the gradient communication path

flowchart LR
    subgraph "Without GPUDirect RDMA"
        direction LR
        A1["GPU VRAM\n(Node A)"] -->|PCIe| B1["CPU DRAM\n(Node A)"]
        B1 -->|PCIe| C1["NIC\n(Node A)"]
        C1 -->|Network| C2["NIC\n(Node B)"]
        C2 -->|PCIe| B2["CPU DRAM\n(Node B)"]
        B2 -->|PCIe| A2["GPU VRAM\n(Node B)"]
    end

    subgraph "With GPUDirect RDMA"
        direction LR
        D1["GPU BAR Memory\n(Node A)"] -->|PCIe P2P| E1["NIC\n(Node A)"]
        E1 -->|InfiniBand| E2["NIC\n(Node B)"]
        E2 -->|PCIe P2P| D2["GPU BAR Memory\n(Node B)"]
    end

The prerequisite software stack on each node is: NVIDIA GPU driver, MOFED/DOCA-OFED, and the nvidia-peermem kernel module (which creates the P2P DMA mapping between the GPU and the NIC). Between them, the two operators manage all three automatically: the GPU Operator handles the GPU driver and nvidia-peermem, and the Network Operator handles MOFED/DOCA-OFED [Source: https://developers.redhat.com/articles/2025/04/29/accelerate-model-training-openshift-ai-nvidia-gpudirect-rdma].

To validate GPUDirect RDMA performance after deployment, run the perftest benchmark from within a pod that holds both GPU and rdma/ib resources [Source: https://kubernetes.recipes/recipes/networking/validate-gpudirect-rdma-performance/]:

# On receiver pod
ib_write_bw --use_cuda=0 -d mlx5_0

# On sender pod (replace <receiver-ip> with the receiver pod's InfiniBand IP)
ib_write_bw --use_cuda=0 -d mlx5_0 <receiver-ip>

Bandwidth results above 180 Gb/s on HDR InfiniBand indicate GPUDirect RDMA is functioning correctly.

SR-IOV for High-Bandwidth Network Interfaces

SR-IOV (Single Root I/O Virtualization) is a PCIe specification that allows a single physical NIC to be partitioned into multiple lightweight Virtual Functions (VFs). Each VF is a near-full hardware NIC with its own DMA engine, queues, and interrupt lines. When a VF is assigned exclusively to a pod, that pod communicates with the NIC with near-native performance and hardware-enforced isolation — no other pod can interfere with its DMA traffic.

The analogy here is a highway with dedicated express lanes. The main NIC is the highway, and each SR-IOV VF is a dedicated lane that a specific vehicle (pod) has reserved. Traffic in one lane does not block or spy on traffic in another.

Deploying SR-IOV in Kubernetes requires three components working together [Source: https://docs.rke2.io/networking/multus_sriov]:

  1. SR-IOV CNI plugin — moves a VF into the pod network namespace
  2. SR-IOV device plugin — advertises VF resources to the scheduler
  3. SR-IOV Network Operator — automates VF creation and node configuration

Node labeling triggers automatic VF discovery [Source: https://docs.rke2.io/networking/multus_sriov]:

kubectl label node gpu-node-01 \
  feature.node.kubernetes.io/network-sriov.capable=true

After labeling, the sriov-network-config-daemon DaemonSet deploys a discovery pod to each labeled node and populates a SriovNetworkNodeState custom resource with the available interfaces and their capabilities.

# Example SriovNetworkNodeState snippet (auto-populated by operator)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: gpu-node-01
spec:
  interfaces:
  - name: ens3f0
    numVfs: 8
    pciAddress: "0000:18:00.0"
    deviceType: netdevice

Multus CNI for Secondary Network Attachments

Standard Kubernetes pods have exactly one network interface: the one created by the cluster’s primary CNI (Calico, Cilium, Flannel). For AI training pods, this is a problem: you want management traffic (Kubernetes API calls, health probes) to travel over the primary network while gradient communication travels over a dedicated high-speed InfiniBand or SR-IOV network with different routing, QoS, and security policies.

Multus CNI solves this by acting as a meta-CNI plugin multiplexer [Source: https://kubeops.net/blog/exploring-multus-an-advanced-networking-solution-for-kubernetes]. It intercepts the pod network setup call, creates the primary interface using the cluster’s default CNI, and then creates additional interfaces using any other CNI plugins you specify.

Pod network interfaces after Multus:
  eth0  → primary CNI (Calico)          → Kubernetes service network
  net1  → SR-IOV VF (sriov-cni)         → InfiniBand / RoCE fabric
  net2  → macvlan                        → storage network

Figure 8.2: Multus CNI attaches multiple network interfaces to a single GPU training pod

flowchart TB
    subgraph "GPU Training Pod"
        P["pytorch-trainer\ncontainer"]
        ETH0["eth0"]
        NET1["net1"]
        NET2["net2"]
        P --- ETH0
        P --- NET1
        P --- NET2
    end

    subgraph "Multus CNI (meta-plugin)"
        M["Multus\nCNI"]
        CNI1["Primary CNI\n(Calico)"]
        CNI2["SR-IOV CNI\n(sriov-cni)"]
        CNI3["macvlan CNI"]
        M --> CNI1
        M --> CNI2
        M --> CNI3
    end

    ETH0 -->|"created by"| CNI1
    NET1 -->|"created by"| CNI2
    NET2 -->|"created by"| CNI3

    CNI1 --> SVC["Kubernetes\nService Network"]
    CNI2 --> IB["InfiniBand /\nRoCE Fabric\n(NCCL AllReduce)"]
    CNI3 --> STORE["Storage\nNetwork"]

The Network Operator uses Multus internally to attach secondary RDMA interfaces to GPU training pods. It does this through two CRDs: NicClusterPolicy (which configures the desired Network Operator state) and HostDeviceNetwork (which creates the Multus NetworkAttachmentDefinition) [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html].

Pods reference the secondary network attachment via an annotation:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-network
spec:
  containers:
  - name: pytorch-trainer
    resources:
      requests:
        nvidia.com/gpu: 8
        rdma/ib: 1

Table 8.2 — CNI Plugin Performance Comparison

| CNI Plugin | Performance | Hardware Requirement | Use Case |
| --- | --- | --- | --- |
| Calico / Cilium (primary) | Standard | None | Management, services |
| macvlan | Near-native | Standard NIC | Storage, data ingestion |
| IPvlan | Near-native | Standard NIC | Storage (no promiscuous mode) |
| SR-IOV + sriov-cni | Native (hardware-isolated) | SR-IOV capable NIC | AI collective comms |
| InfiniBand (via rdma-cni) | >200 Gb/s, <2 µs | ConnectX NIC | Distributed training NCCL |

Key Takeaway: High-performance AI networking requires a layered approach. Multus CNI gives training pods a secondary high-speed network interface alongside the standard cluster network. SR-IOV provides hardware-isolated near-native performance via Virtual Functions. GPUDirect RDMA eliminates CPU copies by routing gradient data directly between GPU memory and the NIC. Together, these three technologies allow NCCL collective operations to run at close to theoretical InfiniBand wire speed.


Section 2: Network Policies and Isolation

High-speed networking for training is a capability concern. Network isolation is a security concern. In a shared AI cluster, different teams run workloads with different data sensitivity levels: a research team might be fine-tuning on public datasets while a compliance team fine-tunes on HIPAA-regulated patient records in the same cluster. Without explicit network isolation, those two workloads can communicate freely at the pod level.

Kubernetes NetworkPolicy for AI Namespace Isolation

Kubernetes NetworkPolicy resources are enforced by the CNI plugin (Calico, Cilium, or Antrea are common enforcers) and operate at Layer 3/4. The most important first step is a default-deny policy applied to every AI namespace [Source: https://www.tigera.io/blog/securing-ai-workloads-in-kubernetes-why-traditional-network-security-isnt-enough/]:

# default-deny-all.yaml — apply to every AI namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-training
spec:
  podSelector: {}        # matches all pods in namespace
  policyTypes:
  - Ingress
  - Egress

After applying the default deny, add explicit allow rules only for what each workload legitimately needs [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:

# allow-training-comms.yaml — training pods may reach data storage and model registry
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-training-comms
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      role: trainer
  policyTypes:
  - Egress
  egress:
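  - to:
    - namespaceSelector:
        matchLabels:
          # label set automatically on every namespace since Kubernetes 1.21
          kubernetes.io/metadata.name: data-storage
    ports:
    - port: 443
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: model-registry
    ports:
    - port: 5000
  - ports:                    # allow DNS (UDP, plus TCP for large responses)
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP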

Training pods should have no internet egress — there is no legitimate reason for a gradient synchronization process to reach the public internet, and allowing it creates an exfiltration path [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1].
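
The interaction between the default-deny policy and the allow rules can be reasoned about with a small model. The sketch below is a deliberately simplified evaluator, not how a CNI plugin implements enforcement: it only models pod label selectors, destination namespaces, ports, and protocols, and it ignores `policyTypes`, `ipBlock`, and pod-level peers. It encodes the default-deny policy plus the data-storage, model-registry, and UDP DNS allowances from the manifests above; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EgressRule:
    port: int
    protocol: str = "TCP"
    namespace: Optional[str] = None  # None means "any destination"


@dataclass
class Policy:
    name: str
    pod_labels: dict  # an empty dict selects every pod in the namespace
    egress: list = field(default_factory=list)


def selects(policy: Policy, pod_labels: dict) -> bool:
    """True if every label in the policy's selector is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in policy.pod_labels.items())


def egress_allowed(policies, pod_labels, dest_namespace, port, protocol="TCP"):
    """Allow rules are additive; a selected pod with no matching rule is denied."""
    selecting = [p for p in policies if selects(p, pod_labels)]
    if not selecting:
        return True  # pods selected by no policy are not isolated
    return any(
        rule.port == port
        and rule.protocol == protocol
        and rule.namespace in (None, dest_namespace)
        for policy in selecting
        for rule in policy.egress
    )


POLICIES = [
    Policy("default-deny-all", pod_labels={}),
    Policy(
        "allow-training-comms",
        pod_labels={"role": "trainer"},
        egress=[
            EgressRule(443, namespace="data-storage"),
            EgressRule(5000, namespace="model-registry"),
            EgressRule(53, protocol="UDP"),
        ],
    ),
]
```

With these policies, a pod labeled `role: trainer` can reach `data-storage` on port 443, while the same request to an external destination (modeled here as `dest_namespace=None`) is denied because no rule matches it.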

Figure 8.3: Default-deny NetworkPolicy with explicit allow rules limits AI namespace egress to approved endpoints only

flowchart LR
    subgraph NS["Namespace: ai-training"]
        TP["Training Pod\n(role: trainer)"]
        DP["Default-Deny\nNetworkPolicy\n(all pods)"]
        AP["Allow-Training\nNetworkPolicy\n(role: trainer)"]
    end

    subgraph DS["Namespace: data-storage"]
        OB["Object Storage\n:443"]
    end

    subgraph MR["Namespace: model-registry"]
        REG["Model Registry\n:5000"]
    end

    INET["Public Internet"]
    DNS["CoreDNS :53"]

    DP -. "blocks all\ningress & egress\nby default" .-> TP

    TP -->|"allowed by\nallow-training-comms"| OB
    TP -->|"allowed by\nallow-training-comms"| REG
    TP -->|"allowed by\nallow-training-comms"| DNS
    TP -. "BLOCKED\n(no rule)" .-> INET

For advanced enforcement beyond what standard NetworkPolicy supports, Calico’s egress gateway feature routes all outbound traffic from an AI namespace through dedicated gateway pods. Those gateways become the single chokepoint where traffic logging, deep inspection, and outbound filtering happen [Source: https://www.tigera.io/blog/securing-ai-workloads-in-kubernetes-why-traditional-network-security-isnt-enough/]. Cilium provides comparable functionality through its eBPF-based CiliumNetworkPolicy CRD, which also supports Layer 7 HTTP/gRPC filtering: Layer 3/4 rules are enforced by eBPF programs in the kernel, while Layer 7 rules are handed to a node-local Envoy proxy.

Service Mesh (Istio, Linkerd) for Inference Traffic Management

For inference workloads — where model servers respond to external or internal API calls — a Kubernetes NetworkPolicy alone is often insufficient. It cannot route based on HTTP headers, enforce per-route traffic weights, or produce per-service latency histograms. A service mesh adds those capabilities.

A service mesh injects a sidecar proxy (Envoy in Istio’s case, a Linkerd-specific proxy in Linkerd’s case) into every pod. The sidecar transparently intercepts all inbound and outbound TCP connections and provides mTLS identity verification, traffic routing, and observability.

For AI inference specifically, the traffic management features are high-value. Imagine rolling out a new version of a Llama-3 fine-tune: with an Istio VirtualService, you can shift 5% of inference traffic to the new model while monitoring error rates before committing to a full cutover — all without redeploying pods.

# Istio VirtualService: canary inference rollout
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
  namespace: ai-inference
spec:
  hosts:
  - model-server
  http:
  - route:
    - destination:
        host: model-server
        subset: v2-candidate
      weight: 5
    - destination:
        host: model-server
        subset: v1-stable
      weight: 95
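
The effect of a 5/95 split can be checked with a quick simulation before touching the cluster. The sketch below is an illustration of weighted routing, not Istio's implementation: it treats weights as percentages that must total 100, matching the manifest above, and the function names are hypothetical.

```python
import random

ROUTES = {"v2-candidate": 5, "v1-stable": 95}


def validate_weights(routes: dict) -> None:
    """Reject negative weights and splits that do not total 100."""
    if any(weight < 0 for weight in routes.values()):
        raise ValueError("weights must be non-negative")
    total = sum(routes.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")


def pick_subset(routes: dict, rng: random.Random) -> str:
    """Choose a destination subset with probability proportional to its weight."""
    subsets = list(routes)
    weights = [routes[s] for s in subsets]
    return rng.choices(subsets, weights=weights, k=1)[0]


def simulate(routes: dict, requests: int, seed: int = 0) -> dict:
    """Count how many simulated requests land on each subset."""
    validate_weights(routes)
    rng = random.Random(seed)
    counts = {subset: 0 for subset in routes}
    for _ in range(requests):
        counts[pick_subset(routes, rng)] += 1
    return counts
```

Calling `simulate(ROUTES, 10_000)` shows roughly one request in twenty reaching the candidate, which helps when sizing how long a canary must run before its error rate is statistically meaningful.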

mTLS for Secure Inter-Pod Communication

Mutual TLS (mTLS) means both ends of a connection present and verify certificates. Without mTLS, a compromised pod inside the cluster can impersonate any service it can reach at the network level. With mTLS, each pod’s identity is cryptographically bound to its Kubernetes service account, and connections between pods that cannot present a valid certificate are rejected.

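In Istio, mTLS can be enforced for an entire namespace with a single PeerAuthentication policy; applying the same policy in Istio’s root namespace (istio-system by default) extends it to the whole mesh [Source: https://www.appsecengineer.com/blog/securing-ai-ml-workloads-in-kubernetes-practical-strategies-for-2025]: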

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: enforce-mtls
  namespace: ai-inference
spec:
  mtls:
    mode: STRICT

In STRICT mode, Istio rejects any unencrypted or unauthenticated connection to pods in the namespace. This means that even if an attacker compromises a node and tries to make direct TCP connections to model servers, the connections are rejected at the sidecar proxy layer before the model server process ever sees them.
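
For intuition about what the sidecar enforces, the server-side half of mutual TLS can be expressed with Python's standard `ssl` module. This is only an analogy under stated assumptions: Istio's Envoy sidecars handle certificates themselves, and the file path parameters here are hypothetical placeholders that are not loaded unless supplied.

```python
import ssl
from typing import Optional


def strict_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that refuses clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is what makes the TLS "mutual": the client must also present a certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA that signs workload certificates
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # this server's identity
    return ctx
```

A server wrapped with this context fails the handshake for any peer that cannot present a certificate signed by the trusted CA, which is the behavior STRICT mode applies at the sidecar.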

Figure 8.4: Istio service mesh enforces mTLS between pods — each sidecar proxy validates peer identity before traffic reaches the application

sequenceDiagram
    participant C as Client Pod<br/>(Envoy Sidecar)
    participant S as Model Server Pod<br/>(Envoy Sidecar)
    participant MS as Model Server<br/>Application

    Note over C,S: mTLS Handshake (STRICT mode)
    C->>S: ClientHello + Client Certificate<br/>(bound to Kubernetes ServiceAccount)
    S->>C: ServerHello + Server Certificate<br/>(bound to Kubernetes ServiceAccount)
    C->>S: Verify server cert → OK
    S->>C: Verify client cert → OK
    Note over C,S: Encrypted mTLS tunnel established

    C->>S: Inference request (encrypted)
    S->>MS: Forwarded request (plaintext, localhost)
    MS->>S: Inference response
    S->>C: Response (encrypted)

    Note over C,S: Attacker with no valid cert is<br/>rejected at sidecar before<br/>reaching MS

Key Takeaway: Network isolation for AI workloads is a defense-in-depth problem. Start with default-deny NetworkPolicies in every namespace and add only the egress rules each workload needs. Introduce a service mesh for inference workloads to gain mTLS identity verification, traffic routing, and observability. Never allow training pods to initiate connections to the public internet.


Section 3: Security Best Practices

RBAC for Multi-Team AI Clusters

RBAC misconfigurations account for over 35% of Kubernetes breaches, with attackers exploiting role bindings rather than breaking into systems [Source: https://dasroot.net/posts/2026/01/kubernetes-security-rbac-policies-scanning/]. The single most common mistake in ML platforms is granting cluster-admin to training job service accounts because it eliminates permission errors during development.

RBAC in Kubernetes is composed of four object types [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:

| Object | Scope | Purpose |
| --- | --- | --- |
| Role | Namespace | Defines permissions within one namespace |
| ClusterRole | Cluster-wide | Defines permissions across all namespaces |
| RoleBinding | Namespace | Binds a Role or ClusterRole to subjects within a namespace |
| ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects at cluster scope |

A well-structured multi-team AI cluster uses namespace-scoped roles that match what each job type actually needs. A training job needs to read ConfigMaps (for hyperparameters), read Secrets (for data access credentials), and read Pod status (to discover and monitor its workers). It does not need to list nodes, modify RBAC policies, or access other namespaces.

Figure 8.5: RBAC object relationships — a RoleBinding binds a Role to a ServiceAccount, granting only the permissions that role specifies

flowchart LR
    SA["ServiceAccount\ntraining-sa\n(namespace: ai-training)"]
    RB["RoleBinding\ntraining-job-binding\n(namespace: ai-training)"]
    R["Role\ntraining-job-role\n(namespace: ai-training)"]

    subgraph PERMS["Permissions Granted"]
        P1["configmaps → get, list"]
        P2["secrets → get"]
        P3["pods → get, list, watch"]
    end

    subgraph DENIED["Denied (no rule = no access)"]
        D1["nodes → any"]
        D2["roles → any"]
        D3["other namespaces → any"]
    end

    SA -->|"subject of"| RB
    RB -->|"references"| R
    R --> P1
    R --> P2
    R --> P3
    R -. "does NOT grant" .-> D1
    R -. "does NOT grant" .-> D2
    R -. "does NOT grant" .-> D3

# Least-privilege role for a PyTorch training job
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-role
  namespace: ai-training
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-binding
  namespace: ai-training
subjects:
- kind: ServiceAccount
  name: training-sa
  namespace: ai-training
roleRef:
  kind: Role
  name: training-job-role
  apiGroup: rbac.authorization.k8s.io
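
RBAC is purely additive: a request is allowed only if some rule matches its API group, resource, and verb. The sketch below models that matching for the Role above. It is a simplified illustration that ignores `resourceNames`, subresources, and non-resource URLs; `is_allowed` is a hypothetical helper, not a Kubernetes API.

```python
# Rules copied from the training-job-role manifest above.
ROLE_RULES = [
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
]


def _matches(allowed: list, value: str) -> bool:
    """A rule entry matches on exact value or the '*' wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules: list, verb: str, resource: str, api_group: str = "") -> bool:
    """True if at least one rule grants the verb on the resource; otherwise denied."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )
```

On a live cluster, the equivalent check is `kubectl auth can-i get secrets --as=system:serviceaccount:ai-training:training-sa -n ai-training`, which evaluates the real bindings rather than this model.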

Table 8.3 — RBAC Roles for Common AI Cluster Personas

| Persona | Scope | Allowed Verbs | Denied |
| --- | --- | --- | --- |
| Data scientist | Own namespace | create/delete Jobs, read Logs | create Roles, access other namespaces |
| ML engineer | Own namespace + model-registry | create Deployments, push to registry | delete Namespaces, modify ClusterRoles |
| Training job SA | Own namespace | get ConfigMap/Secret, list Pods | everything else |
| Inference job SA | Own namespace | get Secret, create/delete Pods | create Roles, access PVCs in other NS |
| Platform admin | Cluster-wide | all | None (this role should use MFA + audit logging) |

Pod Security Standards and Admission Controllers

Pod Security Policies were deprecated in Kubernetes 1.21 and removed in 1.25. The replacement is Pod Security Admission (PSA), which enforces one of three built-in Pod Security Standards profiles at the namespace level [Source: https://dasroot.net/posts/2026/01/kubernetes-security-rbac-pod-security-standards-policy-engines/]:

| Profile | Description | When to Use |
| --- | --- | --- |
| privileged | No restrictions | Only for system namespaces (e.g., kube-system) |
| baseline | Prevents known privilege escalations | Default for AI training namespaces |
| restricted | Enforces hardened configuration | Inference namespaces, multi-tenant environments |

Apply PSA enforcement via namespace labels:

# Enforce Baseline on AI training namespace
kubectl label namespace ai-training \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

# Enforce Restricted on AI inference namespace
kubectl label namespace ai-inference \
  pod-security.kubernetes.io/enforce=restricted

The warn and audit modes are valuable during migration: they surface violations in API server responses and audit logs without blocking pod creation, so you can see what needs to change before switching enforce to restricted.

For more granular policy enforcement beyond the three built-in profiles, policy engines like OPA/Gatekeeper or Kyverno act as validating admission webhooks. They can enforce custom rules such as “all pods in AI namespaces must have resource limits defined” or “no training pod may mount the host filesystem.”
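As an illustration of the first rule, the Python sketch below shows the check such a policy performs on an incoming pod object. Real policies are written in Rego (Gatekeeper) or YAML (Kyverno); the function name `missing_limits` and the sample pod are hypothetical and only model the logic.

```python
def missing_limits(pod: dict) -> list:
    """Return the names of containers that declare no resource limits."""
    spec = pod.get("spec", {})
    # Init containers consume node resources too, so check them as well.
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    return [c["name"] for c in containers
            if not c.get("resources", {}).get("limits")]


# Hypothetical pod: the trainer sets limits, the sidecar does not.
pod = {
    "spec": {
        "containers": [
            {"name": "trainer",
             "resources": {"limits": {"nvidia.com/gpu": 1, "memory": "64Gi"}}},
            {"name": "log-shipper"},
        ]
    }
}

violations = missing_limits(pod)
print(violations)  # ['log-shipper']
```

An admission webhook would reject the request whenever this list is non-empty and return the container names in the denial message.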

GPU nodes require an additional taint to prevent non-GPU workloads from landing on them. This is both a resource protection measure and a cryptomining defense [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]:

kubectl taint nodes gpu-node-01 nvidia.com/gpu=true:NoSchedule

Only pods with the corresponding toleration will be scheduled onto GPU nodes, ensuring that expensive GPU capacity is reserved for legitimate AI workloads.
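To make the taint and toleration interaction concrete, the following Python sketch models the matching rule for the `nvidia.com/gpu=true:NoSchedule` taint applied above. It is a simplified model, not the scheduler's implementation: it ignores `tolerationSeconds` and only considers `NoSchedule` taints, and the function names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    # An empty key is only valid with Exists and then matches every key.
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
training_pod = [{"key": "nvidia.com/gpu", "operator": "Exists",
                 "effect": "NoSchedule"}]
web_pod = []  # no tolerations at all

print(schedulable(training_pod, [gpu_taint]))  # True
print(schedulable(web_pod, [gpu_taint]))       # False
```

Note that a toleration only permits scheduling onto the tainted node; it does not attract the pod there. Steering training pods onto GPU nodes still relies on the GPU resource request or a node selector.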

Image Scanning for ML Container Vulnerabilities

ML containers are unusually large — a PyTorch training image including CUDA, cuDNN, and common Python libraries can easily reach 20–30 GB. More surface area means more CVEs. Common vulnerability sources in ML images include:

Integrate image scanning into the CI/CD pipeline using tools such as Trivy, Grype, or Snyk Container, and fail the build when an image has critical or high-severity CVEs. At admission time, a complementary control is to reject pods whose images do not come from the registry where scanned images are published:

# Kyverno policy: require images to come from a trusted registry
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [ai-training, ai-inference]
    validate:
      message: "Images must come from the internal registry."
      pattern:
        spec:
          containers:
          - image: "registry.internal.company.com/*"

Secrets Management for API Keys and Model Credentials

AI workloads commonly need credentials for: cloud object storage (S3, GCS), model registries (HuggingFace Hub, internal Harbor), inference API keys (OpenAI, Anthropic), and database connections for experiment tracking.

The most important rule is: mount secrets as files, not environment variables [Source: https://dev.to/young_gao/kubernetes-security-hardening-for-production-ai-workloads-in-2026-5dk1]. Environment variables are visible in the process’s /proc/<pid>/environ, which any process running as the same user can read. A file mounted into the pod’s filesystem via volumeMounts is accessible only through the file path and can be given restrictive file permissions (0400).

# Correct: mount secret as a file
volumes:
- name: s3-credentials
  secret:
    secretName: s3-access-keys
    defaultMode: 0400
containers:
- name: trainer
  volumeMounts:
  - name: s3-credentials
    mountPath: /run/secrets/s3
    readOnly: true

# Incorrect: secret as environment variable (avoid)
env:
- name: AWS_SECRET_ACCESS_KEY
  valueFrom:
    secretKeyRef:
      name: s3-access-keys
      key: secret_key
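On the application side, reading a file-mounted secret takes only a few lines. The sketch below assumes the `mountPath` from the manifest above and a key named `secret_key`; the helper name `read_secret` is hypothetical.

```python
from pathlib import Path

# mountPath from the volumeMounts entry above; each Secret key is one file.
SECRET_DIR = Path("/run/secrets/s3")


def read_secret(key: str, secret_dir: Path = SECRET_DIR) -> str:
    """Read one key of a mounted Secret and strip the trailing newline."""
    # Read from disk on every call instead of caching at import time, so a
    # rotated value is picked up once the kubelet refreshes the volume.
    return (secret_dir / key).read_text(encoding="utf-8").strip()
```

A training script would call `read_secret("secret_key")` when it builds its storage client, rather than reading `os.environ`. Volumes mounted with `subPath` do not receive updates, so avoid `subPath` if you rely on rotation.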

For secrets that change frequently or require audit trails, integrate an external secrets manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) using the External Secrets Operator or Vault Agent Injector. These systems provide dynamic secret generation (short-lived credentials), centralized access policies, and immutable audit logs of every secret access event.

Key Takeaway: Security hardening for AI clusters is a three-layer problem. At the identity layer, enforce least-privilege RBAC using namespace-scoped roles for every service account — never cluster-admin for training jobs. At the pod layer, apply Baseline or Restricted Pod Security Standards and taint GPU nodes to prevent resource theft. At the secrets layer, always mount credentials as files and consider external secrets managers for audit trail requirements.


Section 4: Data Security and Compliance

AI workloads present unique compliance challenges. Training datasets may contain regulated data (patient records, financial transactions, personally identifiable information). Model weights represent millions of dollars of compute investment and are themselves sensitive IP. Inference endpoints may process user queries that are subject to data residency requirements.

Encryption at Rest and in Transit for Training Data

Encryption at rest means that the raw bytes on disk are unintelligible without the encryption key. In Kubernetes, this applies at two levels:

  1. etcd encryption: Kubernetes Secrets and other sensitive API objects stored in etcd should be encrypted at the API server. The aescbc and aesgcm providers use keys stored locally in the configuration file, while the kms provider uses envelope encryption with keys held in an external KMS and is the better choice when one is available. Configure the provider list in the file passed to the API server’s --encryption-provider-config flag.

  2. Persistent volume encryption: Training datasets stored on PersistentVolumes should use volume-level encryption. Most cloud providers offer this natively (AWS EBS encryption, GCP CMEK for Persistent Disk/Filestore). For on-premises clusters, LUKS-encrypted block devices or encrypted NFS volumes provide equivalent protection.
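For the etcd level, the file referenced by --encryption-provider-config has a small, fixed structure. The Python sketch below generates one for the aescbc provider with a freshly generated 32-byte key. It is an illustration only: the key name `key1` is arbitrary, the output is JSON (which a YAML parser also accepts), and a production cluster would more likely point at a kms provider instead of storing a key in the file.

```python
import base64
import json
import secrets


def encryption_config(key_name: str = "key1") -> dict:
    """Build an EncryptionConfiguration that encrypts Secrets with aescbc."""
    # aescbc expects a base64-encoded 32-byte key.
    key = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [{
            "resources": ["secrets"],
            "providers": [
                # The first provider in the list is used for new writes.
                {"aescbc": {"keys": [{"name": key_name, "secret": key}]}},
                # identity keeps objects written before encryption readable.
                {"identity": {}},
            ],
        }],
    }


config = encryption_config()
print(json.dumps(config, indent=2))
```

Existing Secrets are only encrypted when they are next written, so after enabling the configuration they need to be rewritten once for the change to take effect on stored data.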

Encryption in transit means TLS on all network paths carrying training data:

Training data flow with encryption:
  Object Storage (TLS) → Data loader pod (mTLS, Istio) → GPU training pod
  Gradient sync (RDMA: not covered by TLS; needs link-layer or IP-layer encryption if required)

Note that RDMA-based collective communication has a subtlety: RDMA traffic bypasses the kernel network stack and user-space TLS libraries, so the TLS and mTLS controls above do not apply to it. NCCL does not encrypt collective traffic itself, and features such as SHARP (in-network reduction) and PXN (a NCCL routing optimization) improve performance rather than confidentiality. For workloads that require encrypted gradient communication (uncommon in on-premises HPC clusters, more relevant in multi-tenant clouds), encryption has to come from the network layer, for example IPsec or MACsec offloaded to the NIC where the hardware and fabric support it [Source: https://www.appsecengineer.com/blog/securing-ai-ml-workloads-in-kubernetes-practical-strategies-for-2025].

Data Access Auditing and Lineage Tracking

Knowing that data is encrypted is not enough for most compliance frameworks — you also need to know who accessed what data, when, and for what purpose.

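Kubernetes audit logging provides a baseline: every API request, including Secret reads, PVC creation and deletion, and pod creation, can be recorded in the API server audit log. Configure an audit policy that names the sensitive resources explicitly. Log Secrets at the Metadata level, because RequestResponse would write the secret values themselves into the audit log:

# audit-policy.yaml (excerpt)
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Metadata only: request and response bodies would contain secret data
- level: Metadata
  resources:
  - group: ""
    resources: [secrets]
- level: RequestResponse
  resources:
  - group: ""
    resources: [persistentvolumeclaims]
- level: Metadata
  resources:
  - group: ""
    resources: [pods, services]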
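Once audit events are flowing, a short script can answer the question "who read which Secret, and when". The sketch below assumes the log backend writes one JSON event per line and uses field names from the `audit.k8s.io/v1` Event schema (`stage`, `verb`, `user.username`, `objectRef`, `requestReceivedTimestamp`). The sample events and the function name `secret_reads` are invented for illustration.

```python
import json


def secret_reads(audit_lines, namespace):
    """Yield (timestamp, user, secret name) for Secret reads in a namespace."""
    for line in audit_lines:
        event = json.loads(line)
        # Each request produces several stages; count it once, on completion.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if (ref.get("resource") == "secrets"
                and ref.get("namespace") == namespace
                and event.get("verb") in ("get", "list", "watch")):
            yield (event.get("requestReceivedTimestamp"),
                   event.get("user", {}).get("username"),
                   ref.get("name"))


# Hypothetical audit log lines.
sample = [
    json.dumps({"stage": "ResponseComplete", "verb": "get",
                "user": {"username": "system:serviceaccount:ai-training:training-sa"},
                "objectRef": {"resource": "secrets", "namespace": "ai-training",
                              "name": "s3-access-keys"},
                "requestReceivedTimestamp": "2025-01-01T12:00:00.000000Z"}),
    json.dumps({"stage": "ResponseComplete", "verb": "create",
                "user": {"username": "alice"},
                "objectRef": {"resource": "pods", "namespace": "ai-training",
                              "name": "trainer-0"},
                "requestReceivedTimestamp": "2025-01-01T12:00:05.000000Z"}),
]

reads = list(secret_reads(sample, "ai-training"))
for ts, user, name in reads:
    print(ts, user, name)
```

The same tuples can be joined against experiment-tracking records by pod or service account name to build the lineage chain described next.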

For AI-specific data lineage — tracking which training dataset version produced which model checkpoint — tools like MLflow, DVC (Data Version Control), and Kubeflow Pipelines provide metadata tracking at the ML workflow level. Integrating these with the Kubernetes audit log creates an end-to-end chain: API audit shows that pod X accessed PVC Y at time T, and the MLflow experiment record shows that pod X produced model checkpoint Z.

Compliance Frameworks for AI Workloads

Table 8.4 — Compliance Framework Requirements Mapping

| Framework | Key Requirement | Kubernetes Implementation |
| --- | --- | --- |
| SOC 2 Type II | Logical access controls, audit logging | RBAC least-privilege + API server audit logs |
| HIPAA | PHI encryption at rest and in transit, access controls | PV encryption + TLS + RBAC + Secrets management |
| GDPR | Data minimization, right to erasure, access logging | Namespace isolation + audit logs + data retention policies |
| FedRAMP | FIPS 140-2 encryption, continuous monitoring | FIPS-mode etcd encryption + Falco runtime monitoring |
| NIST AI RMF | Model documentation, risk assessment, bias monitoring | MLflow experiment tracking + OPA policy-as-code |

The most practically useful framing for a Kubernetes AI cluster is to treat compliance as a set of Kubernetes namespaces with different security boundaries. A HIPAA-scoped namespace for fine-tuning on patient data gets: enforce=restricted PSA, a dedicated node pool with encrypted root volumes, default-deny NetworkPolicy with egress only to approved storage endpoints, mandatory mTLS via Istio PeerAuthentication, and API server audit logs exported to an immutable log store.

Runtime threat detection using Falco adds a final defensive layer. Falco typically runs as a DaemonSet, observes the Linux syscall stream of every container on its node through a kernel driver or eBPF probe, and can alert on behaviors that indicate compromise: spawning a shell inside a model server container, attempting to read /etc/shadow, or making outbound connections on unusual ports [Source: https://www.anantacloud.com/post/kubernetes-security-in-2026-modern-threats-and-how-to-defend-against-them].

Key Takeaway: Data security for AI workloads requires protecting data at every layer: encryption at rest for PersistentVolumes and etcd, encryption in transit via TLS and mTLS, and comprehensive access auditing through Kubernetes audit logs. Map your compliance obligations (SOC 2, HIPAA, GDPR) to specific Kubernetes controls and enforce them programmatically via admission policies and namespace labels — manual processes drift over time, but policy-as-code does not.


Chapter Summary

This chapter covered the two critical infrastructure concerns that determine whether an AI Kubernetes cluster is both capable and trustworthy: high-performance networking and security hardening.

On the networking side, you learned how RDMA and GPUDirect RDMA eliminate CPU-mediated copies from the gradient communication path, how InfiniBand and RoCE provide the underlying transport at hundreds of gigabits per second, and how the NVIDIA Network Operator manages the full driver and plugin lifecycle required to expose these capabilities to Kubernetes pods. You saw how Multus CNI gives training pods secondary network interfaces that separate management traffic from the high-speed collective communication fabric, and how SR-IOV VFs provide hardware-isolated near-native performance for each pod.

On the security side, you learned how default-deny NetworkPolicies create an explicit-allowlist model for pod communication, how service meshes like Istio add mTLS, traffic management, and observability for inference workloads, how RBAC least-privilege principles prevent training jobs from becoming lateral movement vectors, and how Pod Security Standards enforce hardened pod configurations cluster-wide. You saw how secrets must be mounted as files rather than environment variables, and how compliance requirements like HIPAA and SOC 2 map to specific Kubernetes controls.

These two concerns — performance and security — are not in tension. A cluster with properly configured RDMA networking can also have properly enforced network policies, because the high-speed data plane and the management plane are on separate network interfaces managed by Multus. A cluster with strict RBAC can still give training jobs the precise permissions they need through well-scoped namespace roles. The goal is a cluster where GPUs run as fast as the physics allows and where the data, models, and access paths are defended in depth.


Key Terms

| Term | Definition |
| --- | --- |
| RDMA | Remote Direct Memory Access: a networking technique that allows a NIC to read/write remote host memory without involving the remote CPU, eliminating kernel and CPU copy overhead |
| InfiniBand | A high-speed, low-latency interconnect fabric used in HPC and AI clusters, providing 200–400+ Gb/s per port and single-digit microsecond latency |
| GPUDirect | NVIDIA technology enabling direct DMA between GPU memory and a third-party device (e.g., a NIC) over PCIe, used to bypass CPU copies in distributed training |
| SR-IOV | Single Root I/O Virtualization: a PCIe specification that partitions a single physical NIC into multiple hardware-isolated Virtual Functions (VFs), each assignable to a pod for near-native performance |
| Multus | A Kubernetes meta-CNI plugin that enables pods to have multiple network interfaces by delegating to multiple CNI plugins simultaneously |
| NetworkPolicy | A Kubernetes resource that defines Layer 3/4 rules controlling which pods can communicate with each other and with external endpoints; enforced by the CNI plugin |
| Service mesh | An infrastructure layer (e.g., Istio, Linkerd) that uses sidecar proxies to add mTLS, traffic management, observability, and policy enforcement to inter-pod communication transparently |
| mTLS | Mutual TLS: a TLS handshake in which both the client and server present and verify certificates, ensuring cryptographic identity verification in both directions |
| RBAC | Role-Based Access Control: Kubernetes’s authorization mechanism, using Roles/ClusterRoles and RoleBindings/ClusterRoleBindings to grant permissions to users, groups, and service accounts |
| Pod Security Standards | Three built-in Kubernetes security profiles (Privileged, Baseline, Restricted) enforced via Pod Security Admission at the namespace level, replacing the removed PodSecurityPolicy |
| Admission controller | A Kubernetes API server plugin (built-in or webhook-based) that intercepts API requests and can validate, mutate, or reject them before objects are persisted; used to enforce security policies |

Chapter 9: Monitoring, Observability, and Troubleshooting

Running AI workloads on Kubernetes introduces a new category of operational concern that traditional application monitoring was never designed to handle. A web service going down produces an error; a GPU silently throttling its clock speed costs you a week of training time without a single alert. Effective observability for AI on Kubernetes means watching three interlocking layers simultaneously: the hardware (GPUs and their interconnects), the platform (Kubernetes scheduling and resource allocation), and the application (model loss curves, inference latency, and prediction quality). This chapter builds out that full-stack observability picture, from deploying DCGM Exporter on every GPU node to writing Prometheus alert rules that catch training stalls before the team notices at standup.


Learning Objectives

By the end of this chapter, you will be able to:


Section 1: GPU Monitoring with DCGM and Prometheus

The Three-Layer Observability Model

Think of AI workload observability as a hospital monitoring a patient. The pulse oximeter on the finger measures oxygen saturation — the equivalent of GPU utilization. The blood pressure cuff measures systemic health — the equivalent of GPU memory pressure. And the ECG monitors electrical activity — the equivalent of Xid error counts from the GPU driver. No single reading tells the full story; you need all three layers reading simultaneously to catch problems early. [Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]

The three observability layers for AI on Kubernetes are:

| Layer | What It Measures | Key Tools |
| --- | --- | --- |
| Infrastructure | GPU utilization, memory, temperature, power, ECC errors | DCGM Exporter, node_exporter |
| Platform | Pod health, scheduling events, resource quotas, job status | kube-state-metrics, Kubernetes Events |
| Application | Training loss, learning rate, inference latency, throughput | Custom Prometheus metrics, MLflow, W&B |

[Source: https://www.wiz.io/academy/ai-security/ai-ml-kubernetes-best-practices]

Figure 9.1: Three-Layer Observability Model for AI Workloads on Kubernetes

graph TD
    subgraph APP["Application Layer"]
        A1[Training Loss]
        A2[Inference Latency]
        A3[Gradient Norm]
        A4[Samples/sec]
    end
    subgraph PLAT["Platform Layer"]
        P1[Pod Health]
        P2[Scheduling Events]
        P3[Resource Quotas]
        P4[Job Status]
    end
    subgraph INFRA["Infrastructure Layer"]
        I1[GPU Utilization]
        I2[GPU Memory]
        I3[Temperature]
        I4[ECC Errors / Xid]
    end

    APP -->|Custom Prometheus metrics| PROM[Prometheus]
    PLAT -->|kube-state-metrics + Events| PROM
    INFRA -->|DCGM Exporter| PROM
    PROM --> GRAFANA[Grafana Dashboards]
    PROM --> ALERTMGR[Alertmanager]

NVIDIA DCGM Exporter Deployment

DCGM (Data Center GPU Manager) is NVIDIA’s suite for managing and monitoring GPUs at scale. The DCGM Exporter component uses the DCGM Go API to collect GPU telemetry and expose it at an HTTP /metrics endpoint that Prometheus can scrape. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html]

DCGM Exporter is deployed as a Kubernetes DaemonSet, meaning one pod runs on every GPU-equipped node in the cluster. This is the correct topology: GPU metrics are node-local and cannot be collected from a central location without per-node agents. [Source: https://github.com/NVIDIA/dcgm-exporter]

Installation uses the official Helm chart:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator-resources --create-namespace

Verify the DaemonSet is running on every GPU node:

kubectl get daemonset -n gpu-operator-resources
kubectl get pods -n gpu-operator-resources -l app=dcgm-exporter -o wide

The exporter pod requires elevated privileges to communicate with the DCGM host engine running on each node. On hardened clusters that enforce Pod Security Admission or OPA Gatekeeper policies, you will need to create a corresponding exception for the DCGM Exporter namespace. [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]

Figure 9.2: DCGM Exporter DaemonSet — Per-Node Deployment and Prometheus Scrape Path

graph LR
    subgraph NODE1["GPU Node 1"]
        GPU1A[GPU 0]
        GPU1B[GPU 1]
        DCGM1[DCGM Host Engine]
        EXP1[DCGM Exporter Pod\n:9400/metrics]
        GPU1A --> DCGM1
        GPU1B --> DCGM1
        DCGM1 --> EXP1
    end
    subgraph NODE2["GPU Node 2"]
        GPU2A[GPU 0]
        GPU2B[GPU 1]
        DCGM2[DCGM Host Engine]
        EXP2[DCGM Exporter Pod\n:9400/metrics]
        GPU2A --> DCGM2
        GPU2B --> DCGM2
        DCGM2 --> EXP2
    end
    SM[ServiceMonitor CR] -->|scrape interval: 15s| EXP1
    SM -->|scrape interval: 15s| EXP2
    EXP1 -->|metrics| PROM[Prometheus]
    EXP2 -->|metrics| PROM
    PROM --> GRAFANA[Grafana\nDashboard 12239]

Key GPU Metrics

The following table covers the most operationally significant metrics exposed by DCGM Exporter:

| Metric Name | Description | Alert Threshold |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Alert if < 20% for sustained training |
| DCGM_FI_DEV_MEM_COPY_UTIL | GPU memory bandwidth utilization (%) | Informational |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | Alert if > 95% of total |
| DCGM_FI_DEV_GPU_TEMP | GPU core temperature (°C) | Alert if > 83°C |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) | Alert if > rated TDP |
| DCGM_FI_DEV_SM_CLOCK | Streaming Multiprocessor clock (MHz) | Alert on sustained drop during training |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Volatile single-bit ECC errors | Alert if non-zero sustained |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Volatile double-bit ECC errors | Alert immediately on any occurrence |
| DCGM_FI_DEV_XID_ERRORS | Code of the most recent Xid error (hardware/driver fault) | Alert on any non-zero value |

[Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]

Double-bit ECC errors (DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) are particularly critical: they indicate uncorrectable GPU memory faults and should immediately trigger node cordoning to prevent additional training jobs from landing on the affected hardware. [Source: https://medium.com/@penkow/tracking-gpu-usage-in-k8s-with-prometheus-and-dcgm-a-complete-guide-7c8590809d7c]
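As a sketch of how that signal could feed automation, the Python below scans Prometheus exposition text (what the exporter serves at /metrics) and lists the GPUs reporting double-bit errors. The sample text is invented, and the `gpu` and `Hostname` label names are assumed to match what your DCGM Exporter version emits; verify them against real output before relying on this.

```python
import re

# Matches sample lines such as: METRIC{gpu="1",Hostname="node-b"} 2
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_dbe(metrics_text: str) -> list:
    """Return (hostname, gpu index) pairs reporting double-bit ECC errors."""
    affected = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        # Comment lines (# HELP, # TYPE) and other metrics are skipped here.
        if not match or match["name"] != "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL":
            continue
        labels = dict(LABEL_RE.findall(match["labels"]))
        if float(match["value"]) > 0:
            affected.append((labels.get("Hostname"), labels.get("gpu")))
    return affected


# Hypothetical scrape output for one node with two GPUs.
sample = """
# TYPE DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0",Hostname="gpu-node-01"} 0
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="1",Hostname="gpu-node-01"} 2
"""

print(gpus_with_dbe(sample))  # [('gpu-node-01', '1')]
```

A remediation job could pass each returned hostname to `kubectl cordon`, though in practice this is usually driven by an Alertmanager webhook rather than by polling the exporter directly.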

Prometheus ServiceMonitor Configuration

With the kube-prometheus-stack Helm chart providing the Prometheus Operator, you configure metric scraping declaratively using a ServiceMonitor custom resource. This tells Prometheus how to discover and scrape the DCGM Exporter service: [Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus operator's label selector
spec:
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics

The release: kube-prometheus-stack label in the ServiceMonitor metadata is a common stumbling block: the Prometheus Operator only picks up ServiceMonitor resources matching its own serviceMonitorSelector. If your GPU metrics do not appear in Prometheus, inspect the Prometheus Operator logs and verify the label alignment.

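The ServiceMonitor resource type only exists once the Prometheus Operator is installed, so deploy the kube-prometheus-stack before applying the ServiceMonitor above: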

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Grafana Dashboard

NVIDIA publishes an official pre-built Grafana dashboard for DCGM Exporter, Dashboard ID 12239 [Source: https://grafana.com/grafana/dashboards/12239]. Import it into Grafana with:

# Port-forward Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

# Then in the Grafana UI: Dashboards → Import → enter 12239

The dashboard provides utilization heatmaps across all GPUs in the cluster, memory pressure time series, temperature trend lines, and power draw per GPU. For multi-node clusters, the node selector dropdown allows drilling down to individual hosts.

Key Takeaway: DCGM Exporter runs as a DaemonSet on every GPU node, exposing hardware telemetry at /metrics. A ServiceMonitor custom resource connects it to Prometheus, and the official Grafana dashboard ID 12239 provides out-of-the-box GPU cluster visibility. Start with ECC double-bit errors and Xid errors as your most critical alert signals.


Section 2: Training and Inference Observability

Custom Metrics for Training Workloads

Infrastructure metrics tell you whether the GPU is busy; they do not tell you whether the model is converging. A GPU can run at 95% utilization while training a model whose loss has plateaued due to a learning rate misconfiguration — you would never know from DCGM metrics alone. Application-layer metrics close this gap. [Source: https://www.mirantis.com/blog/ai-workloads-management-and-best-practices/]

The most impactful training metrics to instrument are:

| Metric | What It Reveals | Alert Condition |
| --- | --- | --- |
| Training loss (per step) | Model convergence direction | Loss not decreasing over N steps = stall |
| Validation loss (per epoch) | Overfitting detection | Val loss diverging from train loss |
| Samples per second | GPU throughput efficiency | > 20% drop vs. baseline |
| Gradient norm | Exploding/vanishing gradient health | Norm > 100 or < 1e-7 |
| Learning rate | Warmup and decay schedule | Verify against expected schedule |
| Checkpoint write success | Recovery point availability | Alert on write failure |

[Source: https://aws.amazon.com/blogs/containers/part-2-observing-and-scaling-mlops-infrastructure-on-amazon-eks/]

Expose these metrics from your training code using the Prometheus Python client. Here is a minimal instrumentation pattern for a PyTorch training loop:

import os
import time

from prometheus_client import Gauge, start_http_server

# Start the metrics server on port 8080 (one per training process)
start_http_server(8080)

# Define metrics
training_loss = Gauge('training_loss', 'Current training loss',
                      ['job_name', 'rank'])
samples_per_sec = Gauge('training_samples_per_second',
                        'Training throughput', ['job_name', 'rank'])
gradient_norm = Gauge('training_gradient_norm',
                      'L2 norm of gradient', ['job_name', 'rank'])
# Heartbeat for stall alerts: Unix time of the last completed step
last_step_ts = Gauge('training_last_step_timestamp_seconds',
                     'Unix time of the last completed training step',
                     ['job_name', 'rank'])

job_name = os.environ.get('JOB_NAME', 'unknown')
rank = os.environ.get('RANK', '0')
labels = dict(job_name=job_name, rank=rank)

# model, dataloader, and batch_size come from your existing training script
for step, batch in enumerate(dataloader):
    step_start = time.perf_counter()
    # ... forward pass, backward pass (produces `loss`) ...

    training_loss.labels(**labels).set(loss.item())

    # Global gradient norm: L2 norm over all per-parameter gradient norms
    total_norm = sum(p.grad.norm(2).item() ** 2
                     for p in model.parameters()
                     if p.grad is not None) ** 0.5
    gradient_norm.labels(**labels).set(total_norm)

    # Throughput: samples in this batch / elapsed seconds
    step_time_seconds = time.perf_counter() - step_start
    samples_per_sec.labels(**labels).set(batch_size / step_time_seconds)
    last_step_ts.labels(**labels).set(time.time())

Figure 9.3: Training Observability Pipeline — From Code to Alerting

sequenceDiagram
    participant TR as Training Pod
    participant PM as prometheus_client<br/>(port 8080)
    participant PROM as Prometheus
    participant AM as Alertmanager
    participant GRAF as Grafana

    TR->>PM: training_loss.set(loss_val)
    TR->>PM: gradient_norm.set(total_norm)
    TR->>PM: training_last_step_timestamp.set(time.time())
    PROM-->>PM: scrape /metrics every 15s
    PM-->>PROM: gauge values
    PROM->>PROM: evaluate alert rules
    alt TrainingJobStalled: no step for 10m
        PROM->>AM: fire alert
        AM->>AM: route to on-call
    end
    PROM-->>GRAF: PromQL queries
    GRAF-->>GRAF: render loss curves, throughput panels

For Prometheus to scrape this endpoint, expose the port in the pod spec and create a corresponding ServiceMonitor (or a PodMonitor if the training pods have no Service):

# Pod spec excerpt
ports:
- name: metrics
  containerPort: 8080
  protocol: TCP
env:
- name: JOB_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['job-name']
- name: RANK
  valueFrom:
    fieldRef:
      fieldPath: metadata.annotations['rank']

Inference Observability: Latency Distributions and SLOs

Inference observability follows a different pattern from training. Where training favors throughput — how many samples processed per second — inference favors latency: how quickly does a single request return? For user-facing inference APIs, tail latency (p99) matters far more than mean latency, because it defines the worst experience a user receives. [Source: https://www.datadoghq.com/blog/ml-model-monitoring-in-production-best-practices/]

The standard inference metric set:

| Metric | Description | Why It Matters |
| --- | --- | --- |
| p50 latency | Median request latency | Baseline serving speed |
| p95 latency | 95th percentile latency | Typical worst-case experience |
| p99 latency | 99th percentile latency | SLO boundary for most services |
| Requests per second (RPS) | Throughput capacity | Capacity planning, cost per prediction |
| Error rate | Fraction of failed inferences | Quality signal, downstream impact |
| Queue depth | Pending requests in inference queue | Scaling trigger |
| Time to First Token (TTFT) | Latency until first token returned | Critical for streaming LLM UX |
| Time Per Output Token (TPOT) | Generation throughput for generative models | LLM serving efficiency |

[Source: https://dzone.com/articles/ai-ml-kubernetes-mlflow-kserve-vllm]

Frameworks like vLLM and TorchServe expose many of these metrics natively in Prometheus format. For custom inference servers, the prometheus_client histogram type is the right abstraction: histograms let Prometheus estimate any percentile at query time, with accuracy bounded by the bucket boundaries you choose:

from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_request_duration_seconds',
    'End-to-end inference latency',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# In the inference handler:
with inference_latency.labels(
    model_name='bert-classifier',
    model_version='v2'
).time():
    result = model.predict(inputs)

Query p99 latency in PromQL:

histogram_quantile(
  0.99,
  sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, model_name)
)
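To see why bucket boundaries matter for this query, the following Python sketch reimplements the linear interpolation that `histogram_quantile` applies to classic cumulative buckets. It is a simplified illustration, not Prometheus source code, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# 100 requests: 50 under 100 ms, 90 under 500 ms, 99 under 1 s.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p95 = histogram_quantile(0.95, sample)
p99 = histogram_quantile(0.99, sample)
print(round(p95, 3))  # 0.778
print(round(p99, 3))  # 1.0
```

The p95 estimate of roughly 0.78 s comes purely from interpolating between the 0.5 s and 1.0 s bounds. If your SLO threshold sits inside a wide bucket, add a bucket boundary at or near the threshold so the estimate is meaningful there.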

Distributed Training Collective Communication Monitoring

In multi-node distributed training, collective communication operations (AllReduce, AllGather, Broadcast) are the synchronization points where all GPU ranks must meet before proceeding. A single slow or dead GPU causes every other rank to stall waiting for the collective to complete.

Monitor collective communication health by tracking:

The torch.distributed.monitored_barrier() function adds an explicit timeout to barrier synchronization and reports which rank failed to arrive instead of hanging indefinitely. It is implemented only for the Gloo backend, so jobs that use NCCL for collectives need a separate Gloo process group for this check. [Source: https://medium.com/@devaru.ai/debugging-nccl-errors-in-distributed-training-a-comprehensive-guide-28df87512a34]

Figure 9.4: Distributed Training Collective Communication and Straggler Detection

graph TD
    subgraph RANKS["AllReduce Collective — 4 Ranks"]
        R0[Rank 0\nNode A, GPU 0]
        R1[Rank 1\nNode A, GPU 1]
        R2[Rank 2\nNode B, GPU 0]
        R3[Rank 3 STRAGGLER\nNode B, GPU 1]
    end

    R0 -->|gradient shard| RING[NCCL Ring / Tree]
    R1 -->|gradient shard| RING
    R2 -->|gradient shard| RING
    R3 -.->|DELAYED / MISSING| RING

    RING -->|60s timeout| MB[monitored_barrier]
    MB -->|timeout fires| ERR["RuntimeError:\nRank 3 failed to arrive"]
    ERR --> DIAG[Inspect DCGM Xid on Node B GPU 1]

    style R3 fill:#ffcccc,stroke:#cc0000
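    style ERR fill:#fff3cd,stroke:#856404

import datetime
import torch.distributed as dist

# Replace dist.barrier() with monitored_barrier during debugging.
# monitored_barrier is Gloo-only, so use a side Gloo group when the
# default process group uses NCCL.
debug_group = dist.new_group(backend="gloo")

# All ranks must reach this point within 60 seconds or an error is raised
dist.monitored_barrier(group=debug_group,
                       timeout=datetime.timedelta(seconds=60))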

Key Takeaway: Infrastructure metrics alone cannot diagnose AI workload problems. Instrument training loops with Prometheus gauges for loss, throughput, and gradient norms. Use Prometheus histograms for inference latency to enable p95/p99 quantile queries. For distributed training, monitored_barrier() is the single most useful debugging tool for identifying stragglers in collective communication.


Section 3: Alerting and SLOs for AI Workloads

Designing Alerts That Matter

A useful mental model for AI workload alerting: alerts should page someone who can act on them immediately. Alerts about metric symptoms (high GPU temperature) without actionable context (which node, which job, what to do) cause alert fatigue and get ignored. Every alert should answer three questions: what is broken, where is it broken, and what is the immediate remediation step.

Keep a concise set of service-level objectives for each inference service and tune alert rules to fire only when an SLO is threatened. [Source: https://www.finops.org/wg/scaling-kubernetes-for-ai-ml-workloads-with-finops/]
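One way to quantify "threatened" is to track how much of the allowed failure budget a service has used in the current window. The Python sketch below is a minimal illustration under assumed numbers (a 99% latency SLO and hypothetical request counts); the function name `budget_consumed` is not from any library.

```python
def budget_consumed(total: int, bad: int, slo_target: float) -> float:
    """Fraction of the error budget used; above 1.0 the SLO is breached."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    # With a 99% target, 1% of requests are allowed to miss the objective.
    allowed_bad = (1.0 - slo_target) * total
    return bad / allowed_bad


# Hypothetical window: 100,000 requests, 400 slower than the latency target.
used = budget_consumed(total=100_000, bad=400, slo_target=0.99)
print(round(used, 2))  # 0.4
```

An alert tied to this ratio pages only when the budget is being spent faster than the window allows, rather than on every brief latency spike.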

GPU Failure and Xid Error Alerts

Xid errors are the GPU driver’s fault reports, written to the kernel log (visible with dmesg) and surfaced through DCGM. Common Xid codes for AI workloads: [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]

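| Xid Code | Meaning | Recommended Action |
| --- | --- | --- |
| Xid 13 | Graphics Engine Exception (general) | Investigate workload, restart if persistent |
| Xid 31 | GPU memory page fault | Check for out-of-bounds memory access in CUDA code |
| Xid 48 | Double-bit ECC error (DBE) | Cordon node immediately, replace GPU |
| Xid 63 | ECC page retirement or row remapping event recorded | Reset the GPU at the next opportunity; schedule replacement if remapping events recur or fail (Xid 64) |
| Xid 79 | GPU has fallen off the bus | Node reboot required, hardware failure likely |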
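When an alert tells you to check the kernel log, a small parser can turn raw lines into a triage hint. The sketch below assumes the common `NVRM: Xid (PCI:...): <code>, ...` line shape; the exact format can vary by driver version, the sample line is invented, and the action strings simply restate a few rows of the table above.

```python
import re

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., <message>"
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")

# Subset of the table above, keyed by Xid code.
XID_ACTIONS = {
    13: "Investigate workload, restart if persistent",
    31: "Check for out-of-bounds memory access in CUDA code",
    48: "Cordon node immediately, replace GPU",
    79: "Node reboot required, hardware failure likely",
}


def triage(line: str):
    """Return (pci address, Xid code, suggested action) or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match["code"])
    action = XID_ACTIONS.get(code, "Look up the code in NVIDIA's Xid documentation")
    return match["pci"], code, action


# Hypothetical kernel log line.
line = "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus."
result = triage(line)
print(result)
```

The PCI address in the result identifies which physical GPU on the node reported the fault, which is what a hardware replacement ticket needs.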

Prometheus alert rules for GPU health:

groups:
- name: gpu.rules
  rules:
  - alert: GPUXidErrorDetected
    # This metric is a gauge holding the most recent Xid code, not a counter
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "GPU Xid error on {{ $labels.instance }}"
      description: >
        GPU {{ $labels.gpu }} on node {{ $labels.instance }} last reported
        Xid code {{ $value }}. Check dmesg for details.

  - alert: GPUDoubleBitECCError
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "GPU double-bit ECC error — cordon node {{ $labels.instance }}"
      description: >
        Uncorrectable GPU memory error detected. Node {{ $labels.instance }}
        should be cordoned immediately to prevent further job scheduling.

  - alert: GPUTemperatureCritical
    expr: DCGM_FI_DEV_GPU_TEMP > 83
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU thermal limit exceeded on {{ $labels.instance }}"
      description: >
        GPU {{ $labels.gpu }} temperature is {{ $value }}°C. Performance
        throttling is likely. Check datacenter cooling.

[Source: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html]

OOM Detection and Memory Pressure Alerts

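Kubernetes reports a host-memory OOM kill as a container terminated with reason OOMKilled and exit code 137 (128 + SIGKILL). Exhausting GPU memory does not produce this exit code; it surfaces as a CUDA out-of-memory error inside the framework. Combine the Kubernetes signal with GPU memory pressure from DCGM to cover both cases: [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]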

  - alert: GPUMemoryPressure
    expr: >
      (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU memory > 95% on {{ $labels.instance }}"
      description: >
        GPU {{ $labels.gpu }} is using {{ $value | humanizePercentage }}
        of framebuffer memory. OOM kill risk is high.

  - alert: PodOOMKilled
    expr: >
      kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Pod OOM killed: {{ $labels.pod }}"
      description: >
        Container {{ $labels.container }} in pod {{ $labels.pod }} was
        terminated by OOM killer. Consider reducing batch size or enabling
        gradient checkpointing.

Training Stall Detection

A training stall is one of the most expensive failure modes in AI infrastructure: the cluster continues consuming expensive GPU-hours while producing no useful model updates. Detect stalls by alerting on the absence of training progress rather than the presence of an error: [Source: https://aws.amazon.com/blogs/containers/part-2-observing-and-scaling-mlops-infrastructure-on-amazon-eks/]

  - alert: TrainingJobStalled
    expr: >
      (time() - training_last_step_timestamp_seconds) > 600
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Training job {{ $labels.job_name }} stalled"
      description: >
        No training steps recorded for job {{ $labels.job_name }} in the
        last 10 minutes. Possible causes: NCCL hang, dead GPU, data loading
        deadlock. Inspect pod logs immediately.

The training_last_step_timestamp_seconds metric must be emitted by your training code:

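import os
import time

from prometheus_client import Gauge, start_http_server

# Assumption: the job name arrives via a JOB_NAME environment variable.
job_name = os.environ.get("JOB_NAME", "unknown")

last_step_ts = Gauge('training_last_step_timestamp_seconds',
                     'Unix timestamp of last completed training step',
                     ['job_name'])

# Expose /metrics for Prometheus to scrape (port 8000 is an arbitrary choice)
start_http_server(8000)

# Updated at the end of each training step
last_step_ts.labels(job_name=job_name).set(time.time())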

SLO Definitions for Inference Services

Well-defined SLOs translate business requirements into concrete alert thresholds. KEDA (Kubernetes Event-Driven Autoscaler) can use these same Prometheus metrics as scaling signals, automatically adding inference replicas when the p99 SLO is at risk. [Source: https://www.finops.org/wg/scaling-kubernetes-for-ai-ml-workloads-with-finops/]

A representative SLO table for an LLM inference service:

| SLO | Objective | Prometheus Query | Alert if |
| --- | --- | --- | --- |
| p99 latency | < 2 seconds | histogram_quantile(0.99, ...) | > 2s for 5m |
| p95 latency | < 500ms | histogram_quantile(0.95, ...) | > 500ms for 5m |
| Availability | > 99.9% | 1 - (error_rate / total_rate) | < 99.9% over 1h |
| TTFT | < 300ms | histogram_quantile(0.95, ttft_seconds_bucket) | > 300ms for 5m |
| Error rate | < 0.1% | rate(inference_errors_total[5m]) | > 0.1% for 5m |
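The latency rows depend on how `histogram_quantile` turns cumulative bucket counts into a percentile estimate. The sketch below is a simplified re-implementation for illustration only, not Prometheus's own code: it assumes non-negative bucket bounds and classic (non-native) histograms, and it shows the linear interpolation inside the bucket that contains the target rank.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` labels of a Prometheus histogram.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Example: 100 requests, cumulative counts per latency bucket (seconds).
example = [(0.5, 90), (1.0, 95), (2.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, example)
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries; placing a boundary at each SLO threshold (here 0.5 s and 2 s) keeps the alert comparison meaningful.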

A KEDA ScaledObject that scales inference replicas when p99 latency exceeds 1.5 seconds:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-server-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 2
  maxReplicaCount: 20
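  triggers:
  - type: prometheus
    # Value compares the query result to the threshold directly. The default
    # (AverageValue) divides the result across replicas, which does not suit a latency metric.
    metricType: Value
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      metricName: inference_p99_latency
      query: >
        histogram_quantile(0.99,
          sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))
      threshold: "1.5"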

[Source: https://aws.amazon.com/blogs/containers/part-2-observing-and-scaling-mlops-infrastructure-on-amazon-eks/]

Key Takeaway: Alert on SLO breaches, not just metric thresholds. The most critical GPU alerts are double-bit ECC errors (page immediately, cordon node) and Xid errors (investigate promptly). Training stall detection requires application-level metrics — infrastructure metrics cannot distinguish a stalled job from one doing useful work.


Section 4: Troubleshooting Common Issues

The following section is structured as a troubleshooting playbook. Each subsection describes a symptom, its root causes, and a diagnostic sequence to isolate and resolve it.

GPU Not Detected or Driver Mismatch

Symptom: Pod runs but uses CPU only; training throughput is ~100x lower than expected; no GPU resources shown in pod description.

Root Causes:

  1. Missing nvidia.com/gpu resource request in the pod spec — the NVIDIA device plugin will not allocate GPUs to pods that do not request them
  2. NVIDIA Container Runtime not configured as the default runtime on the node
  3. NVIDIA device plugin DaemonSet not running on the target node
  4. Node driver version mismatch with the container’s CUDA version

Diagnostic Sequence:

# 1. Check pod spec for GPU resource request
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'

# 2. Verify device plugin is running on the target node
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide

# 3. Check node capacity includes GPUs
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
# Expected: "nvidia.com/gpu": "8"

# 4. Inspect pod events for scheduling failures
kubectl describe pod <pod-name> | grep -A 10 Events

# 5. Check CUDA version compatibility inside the container
kubectl exec -it <pod-name> -- nvidia-smi

A particularly dangerous failure mode: containers started without a GPU resource request appear to function but run CPU-only computation at roughly 1/100th the expected speed. Silent failures of this type have burned weeks of researcher time before being caught. [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]

Figure 9.5: GPU Not Detected — Diagnostic Decision Tree

flowchart TD
    START([Pod running but slow?]) --> CHECK1{GPU in pod spec\nresources.limits?}
    CHECK1 -->|No| FIX1[Add nvidia.com/gpu: 1\nto resources.limits and requests]
    CHECK1 -->|Yes| CHECK2{Device plugin pod\nrunning on target node?}
    CHECK2 -->|No| FIX2[kubectl apply device plugin DaemonSet\nor restart it]
    CHECK2 -->|Yes| CHECK3{Node capacity shows\nnvidia.com/gpu?}
    CHECK3 -->|No| FIX3[Check NVIDIA driver install\nand node reboot]
    CHECK3 -->|Yes| CHECK4{Pod events show\nscheduling error?}
    CHECK4 -->|Yes| FIX4[Add toleration for\nnvidia.com/gpu NoSchedule taint]
    CHECK4 -->|No| CHECK5{nvidia-smi works\ninside container?}
    CHECK5 -->|No| FIX5[Fix NVIDIA Container Runtime\nas default runtime on node]
    CHECK5 -->|Yes| DONE([GPU detected — investigate\nCUDA version compatibility])

    style FIX1 fill:#d4edda,stroke:#28a745
    style FIX2 fill:#d4edda,stroke:#28a745
    style FIX3 fill:#d4edda,stroke:#28a745
    style FIX4 fill:#d4edda,stroke:#28a745
    style FIX5 fill:#d4edda,stroke:#28a745

The fix in the pod spec:

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

[Source: https://docs.cloud.google.com/kubernetes-engine/docs/troubleshooting/gpus]
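Because this failure is silent, it can help to check pod specs before submission. The sketch below is one possible pre-flight check, assuming the pod is available as a dictionary shaped like the output of `kubectl get pod <pod-name> -o json`; the function name `containers_missing_gpu` is hypothetical.

```python
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"


def containers_missing_gpu(pod: Dict[str, Any]) -> List[str]:
    """Return names of containers that do not set a positive nvidia.com/gpu limit."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        try:
            gpus = int(limits.get(GPU_RESOURCE, 0))
        except (TypeError, ValueError):
            gpus = 0  # treat unparsable quantities as "no GPU requested"
        if gpus < 1:
            missing.append(container.get("name", "<unnamed>"))
    return missing
```

Sidecar containers legitimately have no GPU limit, so a caller would typically flag a pod only when its main training container appears in the returned list.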

NCCL Communication Failures in Distributed Training

Symptom: Distributed training hangs, then fails when the collective timeout expires (1800 seconds by default in older PyTorch releases; recent releases default to 10 minutes for the NCCL backend) with a message like NCCL error: unhandled cuda error, NCCL version 2.x.x.

NCCL is the NVIDIA Collective Communications Library used by PyTorch DDP, DeepSpeed, and Megatron-LM for inter-GPU communication. When NCCL hangs, the entire training job stalls: all ranks wait for a collective operation that one rank cannot complete. [Source: https://medium.com/@devaru.ai/debugging-nccl-errors-in-distributed-training-a-comprehensive-guide-28df87512a34]

Root Causes and Diagnostics:

| Cause | Diagnostic Signal | Fix |
| --- | --- | --- |
| Wrong network interface | NCCL logs show wrong IP prefix | Set NCCL_SOCKET_IFNAME=eth0 (or correct interface) |
| Clock skew > 1ms between nodes | NCCL timeout errors, NTP logs show drift | Synchronize NTP/chrony across all nodes |
| Mismatched tensor shapes | Stack trace at AllReduce/AllGather call | Audit model code for data-dependent tensor shapes |
| Dead GPU (silent failure) | One rank stops progressing | Check DCGM Xid metrics, nvidia-smi on affected node |
| Incorrect NCCL_NTHREADS or buffer size | Bandwidth degradation | Use nccl-tests to validate baseline bandwidth |

Step 1: Enable NCCL debug logging. This is always the first diagnostic action for NCCL failures: [Source: https://forums.developer.nvidia.com/t/some-nccl-operations-have-failed-or-timed-out-due-to-the-asynchronous-nature-of-cuda-kernels/283010]

# In the training pod spec environment variables
env:
- name: NCCL_DEBUG
  value: "INFO"
- name: NCCL_ASYNC_ERROR_HANDLING
  value: "1"
- name: NCCL_SOCKET_IFNAME
  value: "eth0"   # replace with your high-speed network interface

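NCCL_DEBUG=INFO reveals topology detection (whether NVLink, InfiniBand, or TCP is selected), ring/tree construction, and which ranks are communicating with which. NCCL_ASYNC_ERROR_HANDLING=1 is read by PyTorch rather than by NCCL itself (recent PyTorch releases name it TORCH_NCCL_ASYNC_ERROR_HANDLING); it makes a rank abort when a collective fails or times out, so the job exits with an error instead of hanging.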

Step 2: Run nccl-tests to validate baseline connectivity before launching training. This catches network configuration issues before they waste training time: [Source: https://kubernetes.recipes/recipes/troubleshooting/validate-gpu-topology-nccl/]

# Deploy nccl-tests as a job on two nodes
kubectl apply -f nccl-tests-job.yaml

# Check all-reduce bandwidth
kubectl logs -l job-name=nccl-tests | grep "Avg bus bandwidth"
# Expected: >100 GB/s for NVLink, >25 GB/s for InfiniBand, >10 GB/s for 100GbE
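The bandwidth check can be automated as a gate before a long training run. The sketch below assumes the nccl-tests summary line has the form `# Avg bus bandwidth    : <number>` and reuses the thresholds quoted above; `EXPECTED_MIN_GBPS`, `parse_avg_bus_bandwidth`, and `bandwidth_ok` are hypothetical names.

```python
import re
from typing import Optional

# Minimum expected bus bandwidth in GB/s, taken from the thresholds above.
EXPECTED_MIN_GBPS = {"nvlink": 100.0, "infiniband": 25.0, "100gbe": 10.0}

BANDWIDTH_PATTERN = re.compile(r"Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_avg_bus_bandwidth(log_text: str) -> Optional[float]:
    """Extract the average bus bandwidth (GB/s) from nccl-tests output, if present."""
    match = BANDWIDTH_PATTERN.search(log_text)
    return float(match.group(1)) if match else None


def bandwidth_ok(log_text: str, interconnect: str) -> bool:
    """Return True if the measured bandwidth exceeds the expected minimum."""
    measured = parse_avg_bus_bandwidth(log_text)
    if measured is None:
        raise ValueError("no 'Avg bus bandwidth' line found in log output")
    return measured > EXPECTED_MIN_GBPS[interconnect]
```

A launcher script could pipe the `kubectl logs` output into `bandwidth_ok` and refuse to start training when it returns False.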

Step 3: Check GPU topology. Poor physical topology (GPUs on different PCIe switches communicating across the CPU) explains low NCCL bandwidth without any software bug: [Source: https://kubernetes.recipes/recipes/troubleshooting/validate-gpu-topology-nccl/]

kubectl exec -it <pod-name> -- nvidia-smi topo -m
# Output shows NV# (NVLink, # = number of links), PIX (single PCIe bridge), SYS (across NUMA nodes) connections
# NV# > PIX > SYS for bandwidth

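Step 4: Identify the straggler rank. If the training job hangs rather than fails immediately, use monitored_barrier() to identify which rank is not arriving at the synchronization point. monitored_barrier() is implemented only for the Gloo backend, so a job that uses NCCL needs a separate Gloo process group for this check: [Source: https://github.com/huggingface/accelerate/issues/314]

import datetime
import logging

import torch.distributed as dist

logger = logging.getLogger(__name__)

# Assumes dist.init_process_group() has already been called.
# new_group() is itself a collective call: every rank must execute it.
debug_group = dist.new_group(backend="gloo")


def check_barrier(step: int) -> None:
    """Log and raise if any rank fails to reach the barrier within 60 seconds."""
    logger.info("Rank %d reaching barrier at step %d", dist.get_rank(), step)
    try:
        dist.monitored_barrier(
            group=debug_group,
            timeout=datetime.timedelta(seconds=60),
            wait_all_ranks=True  # collects all failed ranks, not just the first
        )
    except RuntimeError as e:
        logger.error("Rank %d barrier failed: %s", dist.get_rank(), e)
        raise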

Pod Scheduling Failures Due to Resource Constraints

Symptom: Pods remain in Pending state indefinitely. kubectl describe pod shows events like 0/12 nodes are available: 12 Insufficient nvidia.com/gpu.

Diagnostic Sequence:

# Check overall GPU resource availability across the cluster
# (-A 10 so the nvidia.com/gpu row, listed after cpu and memory, is included)
kubectl describe nodes | grep -A 10 "Allocated resources"

# Check for GPU nodes that are cordoned (SchedulingDisabled) or NotReady
kubectl get nodes -l nvidia.com/gpu.present=true | grep -E 'NotReady|SchedulingDisabled'

# Check for existing pods holding GPU resources
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null)
       | {name: .metadata.name, ns: .metadata.namespace, gpus: .spec.containers[].resources.limits["nvidia.com/gpu"]}'

# Check if the pod has correct tolerations for GPU node taints
kubectl get node <gpu-node> -o jsonpath='{.spec.taints}'

GPU nodes commonly carry a NoSchedule taint to prevent non-GPU workloads from consuming the expensive hardware. If your training pod is missing the matching toleration, it will never be scheduled:

# Required toleration for GPU nodes with NoSchedule taint
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

[Source: https://collabnix.com/distributed-training-on-kubernetes-best-practices-implementation/]
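The matching rule between taints and tolerations can be sketched in a few lines. This is a simplified model of the documented Kubernetes semantics, written for illustration rather than taken from the scheduler; `tolerates` and `untolerated_taints` are hypothetical names, and taints and tolerations are plain dictionaries as they appear in `kubectl ... -o json` output.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only valid with Exists and matches every taint key.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the NoSchedule/NoExecute taints that no toleration matches."""
    blocking = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking if not any(tolerates(tol, t) for tol in tolerations)]
```

An empty result means the node's taints do not block the pod; any remaining entry names the exact taint that still needs a toleration.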

Performance Degradation and Thermal Throttling

Symptom: Training throughput (samples/sec) gradually decreases over the course of a long run, with no errors logged. The model is training, but slower and slower.

Thermal throttling is a silent performance killer. When GPU temperature exceeds the thermal design point (typically 83°C for data center GPUs), the hardware automatically reduces the SM clock frequency to stay within thermal limits. A GPU throttling from 1.95 GHz down to 1.2 GHz loses approximately 38% of compute throughput on compute-bound work, enough to extend a week-long training run by several days (more than four if the throttling persists for the entire run). [Source: https://introl.com/blog/troubleshooting-gpu-clusters-common-issues-resolution-playbook]
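The arithmetic behind these figures is easy to reproduce. The sketch below assumes throughput scales linearly with SM clock, which holds only for compute-bound kernels, and that the throttled clock persists for the entire run; it is therefore an upper bound, and a run that throttles only part of the time loses proportionally less. The function name `throttling_impact` is hypothetical.

```python
from typing import Tuple


def throttling_impact(nominal_ghz: float, throttled_ghz: float,
                      run_days: float) -> Tuple[float, float]:
    """Return (fractional throughput loss, extra days) for a fully throttled run.

    Assumes throughput is proportional to SM clock frequency.
    """
    ratio = throttled_ghz / nominal_ghz
    loss = 1.0 - ratio
    extra_days = run_days / ratio - run_days
    return loss, extra_days


loss, extra_days = throttling_impact(1.95, 1.2, 7.0)
print(f"throughput loss: {loss:.1%}, extra days if sustained: {extra_days:.1f}")
# prints: throughput loss: 38.5%, extra days if sustained: 4.4
```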

Detection: Correlate DCGM_FI_DEV_GPU_TEMP with DCGM_FI_DEV_SM_CLOCK in a Grafana panel. A drop in SM clock correlated with rising temperature is the definitive signature of thermal throttling:

# In a Grafana panel using two Y-axes:
# Left axis (°C):
DCGM_FI_DEV_GPU_TEMP{instance="gpu-node-01"}

# Right axis (MHz):
DCGM_FI_DEV_SM_CLOCK{instance="gpu-node-01"}

Common Causes and Remediation:

| Cause | Signal | Remediation |
| --- | --- | --- |
| Inadequate datacenter cooling | All GPUs on a rack throttling simultaneously | Escalate to datacenter operations |
| Fan failure on specific GPU | Single GPU throttling, others nominal | Replace GPU; alert on temperature outlier vs. rack peers |
| Thermal paste degradation | Gradual onset over months | Server hardware maintenance |
| High ambient temperature (power density) | Correlated with high room temperature | Redistribute workloads, add cooling capacity |

Add a Prometheus alert that fires when an SM clock drop is correlated with elevated temperature, giving early warning before throughput impact becomes obvious:

  - alert: GPUThermalThrottling
    # Temperature is the left-hand operand so that {{ $value }} is the temperature;
    # on(instance, gpu) is needed because the clock ratio keeps only those two labels.
    expr: >
      (DCGM_FI_DEV_GPU_TEMP > 78)
      and on(instance, gpu)
      ((DCGM_FI_DEV_SM_CLOCK / on(instance, gpu) DCGM_FI_DEV_SM_CLOCK offset 30m) < 0.85)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU thermal throttling suspected on {{ $labels.instance }}"
      description: >
        SM clock on GPU {{ $labels.gpu }} has dropped > 15% from 30 minutes ago
        while temperature is {{ $value }}°C. Training throughput is degraded.

[Source: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/]

Key Takeaway: The most insidious AI workload failures are silent ones — CPU-only execution masquerading as GPU training, NCCL hangs that never produce error messages, and thermal throttling that degrades throughput without triggering any alert. Build observability that detects absence of progress, not just presence of errors. NCCL_DEBUG=INFO and monitored_barrier() are your two most powerful debugging tools for distributed training failures.


Chapter Summary

Monitoring AI workloads on Kubernetes requires observability at three layers: hardware (DCGM Exporter metrics), platform (Kubernetes scheduling and resource allocation), and application (training loss, inference latency). The foundational monitoring stack combines the kube-prometheus-stack Helm chart with DCGM Exporter deployed as a DaemonSet on every GPU node, connected via a ServiceMonitor custom resource.

Training workloads need custom Prometheus metrics — loss, throughput, gradient norm — because infrastructure metrics alone cannot distinguish a converging model from a stalled one. Inference services require latency histograms to support p50/p95/p99 SLO tracking, with KEDA enabling SLO-driven autoscaling.

Alert design should prioritize actionable signals: double-bit ECC errors and Xid 79 events require immediate node cordoning; training stall detection requires application-level last_step_timestamp metrics; thermal throttling detection requires correlating SM clock drops with temperature rises.

Troubleshooting follows a systematic playbook: verify GPU resource requests before anything else; use NCCL_DEBUG=INFO and NCCL_ASYNC_ERROR_HANDLING=1 as the first step for distributed training hangs; run nccl-tests to validate inter-node bandwidth before large training runs; and correlate SM clock with temperature to catch thermal throttling before it wastes GPU-hours.


Key Terms

| Term | Definition |
| --- | --- |
| DCGM | NVIDIA Data Center GPU Manager: a suite of tools for managing and monitoring NVIDIA GPUs in data center and Kubernetes environments |
| DCGM Exporter | A Go-based exporter, typically deployed as a Kubernetes DaemonSet, that exposes DCGM GPU metrics at an HTTP /metrics endpoint for Prometheus scraping |
| Prometheus | Open-source time-series database and monitoring system that scrapes and stores metrics from instrumented services |
| Grafana | Open-source visualization platform used to build dashboards from Prometheus metrics and other data sources |
| ServiceMonitor | A Prometheus Operator custom resource that declaratively configures how Prometheus discovers and scrapes metric endpoints |
| Xid error | An error report from the NVIDIA GPU driver, written to the kernel log (visible via dmesg) and surfaced through DCGM; different Xid codes map to specific hardware, driver, or application failure modes |
| NCCL | NVIDIA Collective Communications Library: the inter-GPU communication backend used by PyTorch DDP, DeepSpeed, and Megatron-LM for distributed training |
| OOM | Out of Memory: a condition where a process attempts to allocate more memory than is available; in Kubernetes, the kernel OOM killer terminates the offending container, which exits with code 137 |
| SLO | Service Level Objective: a measurable target for service quality (e.g., p99 latency < 2 seconds) used to define acceptable operating bounds and trigger alerts |
| p99 latency | The 99th percentile of request latency: the response time below which 99% of requests fall; only the slowest 1% of requests take longer |
| Thermal throttling | An automatic GPU protection mechanism that reduces SM clock frequency when the GPU temperature exceeds its thermal design point, degrading compute throughput silently |
| kube-prometheus-stack | A Helm chart bundling Prometheus Operator, Grafana, Alertmanager, kube-state-metrics, and node_exporter into a complete Kubernetes monitoring stack |
| KEDA | Kubernetes Event-Driven Autoscaler: extends Kubernetes HPA to support scaling based on custom metrics including Prometheus queries |
| ECC | Error-Correcting Code: a memory protection mechanism on data center GPUs; single-bit errors (SBE) are corrected in hardware, while double-bit errors (DBE) are uncorrectable and indicate hardware failure |
| Gradient norm | The L2 norm of the model’s gradient vector; used to detect exploding gradients (norm >> 1) or vanishing gradients (norm ≈ 0) during training |

Chapter 10: Production Patterns and Scaling AI Platforms

Learning Objectives

By the end of this chapter, you will be able to:


Introduction

Running a proof-of-concept AI model on a single Kubernetes cluster is relatively straightforward. Running a production AI platform — one that serves dozens of data science teams, survives regional outages, recovers from disasters, and scales to hundreds of GPUs — is a fundamentally different engineering challenge.

This chapter brings together the threads from earlier chapters and examines the operational patterns that distinguish a toy ML cluster from a platform that an enterprise can rely on. We will explore how to share resources fairly between teams, how to extend capacity across cloud boundaries, and how to build the kind of resilience that lets platform engineers sleep soundly at night. We will also look ahead at the next generation of Kubernetes primitives — Dynamic Resource Allocation, LeaderWorkerSet, JobSet, and emerging accelerator support — that are reshaping what is possible.

Think of this chapter as the architectural blueprint review that happens before a building opens to the public: the foundation has been laid, the systems are in place, but now we must ensure the structure is safe, equitable, and built to last.


10.1 Multi-Tenant AI Platform Design

The Tenant Problem

Imagine a university research computing center. Every faculty lab needs GPU time, every PhD student needs to run experiments, and everyone believes their deadline is the most important one. Without structure, the researcher with the most aggressive scripts wins — and everyone else waits. The Kubernetes equivalent of this situation is called the noisy neighbor problem, and it is the central challenge of multi-tenancy.

Multi-tenancy on Kubernetes is defined along a spectrum. Soft multi-tenancy assumes tenants within the same organization are cooperative; the goal is fairness, not adversarial isolation. Hard multi-tenancy treats tenants as potentially hostile — guarding against data exfiltration, privilege escalation, and denial-of-service attacks [Source: https://kubernetes.io/docs/concepts/security/multi-tenancy/]. Most enterprise AI platforms fall somewhere between these poles: internal teams are trusted, but resource boundaries are strict.

Namespace-Based Isolation vs. Virtual Clusters

The first design decision is the isolation boundary itself.

Namespace-based isolation is the Kubernetes default. Each team or project gets one or more namespaces, and Kubernetes scopes many objects to namespaces: RBAC Roles, NetworkPolicies, ResourceQuotas, and LimitRanges. This approach is operationally simple, requires no additional tooling, and scales well for organizations where teams share a common Kubernetes API version.

Its limitation is that namespaces are a logical boundary, not a physical one. Teams share the same API server and the same control plane. Critically, Custom Resource Definitions (CRDs) are cluster-scoped, which means if the computer-vision team installs a custom operator at version v1alpha1 and the NLP team needs the same CRD at v1beta1, there is a conflict with no easy resolution [Source: https://www.vcluster.com/blog/isolating-workloads-multitenant-gpu-cluster].

Virtual clusters (exemplified by tools like vCluster) solve this by giving each tenant a fully isolated Kubernetes API server — etcd, controller manager, and all — running as pods inside the host cluster. Tenants can install arbitrary CRDs and operators without affecting one another. The host cluster still owns the actual compute (Nodes, persistent volumes), so the hardware footprint is shared, but the control plane experience is fully isolated [Source: https://www.vcluster.com/guides/gpu-cluster-to-ai-factory-multi-tenant-infrastructure].

The trade-off table below summarizes the decision:

| Dimension | Namespace Isolation | Virtual Clusters |
| --- | --- | --- |
| Operational complexity | Low | Medium |
| CRD isolation | No (cluster-scoped) | Yes (per-vcluster) |
| Control plane overhead | None | One vcluster pod set per tenant |
| Security boundary | Logical | Near-physical |
| Recommended for | Internal teams, shared tooling | External tenants, CRD conflicts |
| Tooling examples | Native Kubernetes | vCluster, Kamaji |

Figure 10.1: Multi-Tenancy Isolation Models — Namespace Boundaries vs. Virtual Clusters

graph TB
    subgraph HostCluster["Host Kubernetes Cluster"]
        subgraph NS["Namespace Isolation"]
            NS_A["Namespace: team-a\n(RBAC, ResourceQuota,\nNetworkPolicy)"]
            NS_B["Namespace: team-b\n(RBAC, ResourceQuota,\nNetworkPolicy)"]
            SharedAPI["Shared API Server\n& Control Plane"]
            SharedCRDs["Shared CRDs\n(cluster-scoped)"]
            NS_A --- SharedAPI
            NS_B --- SharedAPI
            SharedAPI --- SharedCRDs
        end

        subgraph VC["Virtual Cluster Isolation"]
            VCA["vCluster: team-c\n(own API server, etcd,\ncontrollers, CRDs)"]
            VCB["vCluster: team-d\n(own API server, etcd,\ncontrollers, CRDs)"]
        end

        SharedNodes["Shared Physical Nodes\n(GPUs, Storage)"]
    end

    NS_A --> SharedNodes
    NS_B --> SharedNodes
    VCA --> SharedNodes
    VCB --> SharedNodes

    style NS fill:#dbeafe,stroke:#3b82f6
    style VC fill:#dcfce7,stroke:#22c55e
    style SharedNodes fill:#fef9c3,stroke:#eab308

Resource Governance with Kueue and Quotas

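Isolation defines who can access resources; governance defines how much they can use. Kubernetes provides two native primitives: ResourceQuota, which caps the aggregate resources (including extended resources such as requests.nvidia.com/gpu) that a namespace may consume, and LimitRange, which sets default and maximum requests and limits for individual pods and containers.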

For more sophisticated batch scheduling, where jobs wait in a queue until quota becomes available instead of being rejected, the Kueue project (a Kubernetes SIG Scheduling subproject) provides ClusterQueue and LocalQueue objects. A ClusterQueue defines a pool of resources with nominal quotas and borrowing limits; teams submit jobs to a namespaced LocalQueue that points at a ClusterQueue, and unused quota can be lent to other ClusterQueues in the same cohort. This models the university computing center analogy well: each department has a baseline allocation, but idle capacity flows to whoever needs it.
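The interaction between nominal quota and borrowing can be made concrete with a toy model. The sketch below is a deliberately simplified admission check and is not Kueue's actual algorithm or API: it ignores preemption, priorities, and resource flavors, and `QueueQuota` and `can_admit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class QueueQuota:
    """Simplified stand-in for one team's GPU quota (not a Kueue API object)."""
    nominal: int          # GPUs guaranteed to the team
    borrowing_limit: int  # extra GPUs the team may borrow from idle peers
    used: int = 0         # GPUs currently held by the team's running jobs


def can_admit(team: str, request: int, quotas: Dict[str, QueueQuota]) -> bool:
    """Decide whether `team` may start a job that needs `request` GPUs."""
    quota = quotas[team]
    needed_total = quota.used + request
    if needed_total <= quota.nominal:
        return True  # fits inside the team's own quota
    borrow = needed_total - quota.nominal
    if borrow > quota.borrowing_limit:
        return False
    peers = [q for name, q in quotas.items() if name != team]
    # Peers lend only the unused part of their nominal quota,
    # minus whatever other teams have already borrowed.
    idle = sum(max(p.nominal - p.used, 0) for p in peers)
    already_borrowed = sum(max(p.used - p.nominal, 0) for p in peers)
    return borrow <= idle - already_borrowed
```

In this model a team that has exhausted its baseline can still start work while neighbors are idle, and the same request is refused (it would queue) once the neighbors reclaim their share.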

For GPU-intensive workloads, NVIDIA Multi-Instance GPU (MIG) adds a hardware layer to this governance stack. MIG partitions a physical A100 or H100 into isolated instances with dedicated memory and compute slices. An operator can expose a 1g.10gb MIG profile to small fine-tuning jobs and a 4g.40gb profile to larger training runs, ensuring physical isolation between workloads sharing the same card [Source: https://apxml.com/courses/advanced-ai-infrastructure-design-optimization/chapter-3-advanced-kubernetes-orchestration/multi-tenancy-kubernetes-ml].

Self-Service Interfaces for Data Science Teams

Platform teams should not be ticket-queue gatekeepers. The most effective AI platforms provide self-service interfaces that let data scientists provision compute without filing a support ticket.

Common patterns include:

Platform Engineering Patterns for ML Platforms

The platform engineering discipline applies software engineering practices to internal developer platforms. For ML, this means treating the AI platform as a product with internal customers.

Key patterns include:

Figure 10.2: Kueue Resource Governance — From ClusterQueue to Workload Scheduling

flowchart TD
    Admin["Cluster Admin\ncreates ClusterQueue\n(nominal quota + borrowing limits)"]
    CQ["ClusterQueue\n'ml-platform-pool'\nnvidia.com/gpu: 32 nominal\nnvidia.com/gpu: 48 max-burst"]
    LQ_A["LocalQueue\nteam-a\n(draws from ClusterQueue)"]
    LQ_B["LocalQueue\nteam-b\n(draws from ClusterQueue)"]
    LQ_C["LocalQueue\nteam-c\n(draws from ClusterQueue)"]
    Job_A1["Training Job A1\n(8 GPUs)"]
    Job_A2["Training Job A2\n(8 GPUs) — QUEUED"]
    Job_B1["Training Job B1\n(16 GPUs)"]
    Job_C1["Experiment C1\n(4 GPUs)"]
    Scheduler["Kueue Scheduler\nfair-share + borrowing\npreemption if needed"]
    Nodes["GPU Node Pool\n(32 physical GPUs)"]

    Admin --> CQ
    CQ --> LQ_A
    CQ --> LQ_B
    CQ --> LQ_C
    LQ_A --> Job_A1
    LQ_A --> Job_A2
    LQ_B --> Job_B1
    LQ_C --> Job_C1
    Job_A1 --> Scheduler
    Job_B1 --> Scheduler
    Job_C1 --> Scheduler
    Scheduler --> Nodes

    style Job_A2 fill:#fee2e2,stroke:#ef4444
    style Scheduler fill:#ede9fe,stroke:#8b5cf6
    style Nodes fill:#fef9c3,stroke:#eab308

Key Takeaway: Multi-tenant AI platforms require layered governance — namespace or virtual-cluster isolation for boundaries, ResourceQuotas and Kueue for fair resource sharing, NVIDIA MIG for hardware-level partitioning, and self-service tooling so platform engineers scale their impact without becoming a bottleneck.


10.2 Multi-Cluster and Hybrid Architectures

Why One Cluster Is Not Enough

A single Kubernetes cluster, no matter how large, has fundamental limitations for enterprise AI platforms:

  1. Blast radius: A cluster-level failure (control plane outage, network partition) takes down all workloads simultaneously.
  2. Geographic constraints: Regulations may require data and compute to remain in specific regions. A European financial institution cannot train on customer data in a US cluster.
  3. Cost optimization: On-premises hardware provides lower per-GPU cost for steady-state workloads, but provisioning enough on-prem capacity to handle peak demand wastes money when that peak is infrequent.
  4. Technology heterogeneity: Training jobs may need A100s available on-premises, while serving workloads benefit from cloud-managed inference endpoints with global load balancing.

Training On-Prem, Serving in Cloud

A common and economically rational pattern separates the training and serving lifecycles:

The synchronization bridge is the model registry: a versioned artifact store (MLflow, Weights & Biases, or a simple OCI registry) that both sides can read from. A training job publishes model:v2.1.0 to the registry; a cloud deployment pipeline detects the new version and rolls it out.

Figure 10.3: Hybrid Training/Serving Architecture — On-Prem Training, Cloud Serving

flowchart LR
    subgraph OnPrem["On-Premises Cluster"]
        direction TB
        RawData["Raw Training Data\n(NVMe / NFS)"]
        TrainingJob["Distributed Training Job\n(PyTorch DDP / FSDP)\nA100 / H100 GPUs"]
        Checkpoint["Model Checkpoint\n(periodic saves)"]
        RawData --> TrainingJob
        TrainingJob --> Checkpoint
    end

    subgraph Registry["Model Registry\n(Object Storage: S3 / GCS)"]
        ModelArtifact["model:v2.1.0\n(immutable, versioned)"]
    end

    subgraph Cloud["Cloud Cluster"]
        direction TB
        CDPipeline["CD Pipeline\n(detects new version)"]
        InferenceDeployment["Inference Deployment\n(vLLM / Triton)\nauto-scaling"]
        Users["End Users\n(global load balancer)"]
        CDPipeline --> InferenceDeployment
        InferenceDeployment --> Users
    end

    Checkpoint -->|"push artifact"| ModelArtifact
    ModelArtifact -->|"pull on new version"| CDPipeline

    style OnPrem fill:#dbeafe,stroke:#3b82f6
    style Registry fill:#fef9c3,stroke:#eab308
    style Cloud fill:#dcfce7,stroke:#22c55e

Burst-to-Cloud for Peak Training Demand

Not every organization can afford enough on-premises GPU capacity for their peak training demand. The burst-to-cloud pattern addresses this: on-premises clusters handle the baseline, and cloud clusters absorb spikes.

Think of it like a town’s water supply: the municipal reservoir (on-prem cluster) handles daily demand, but during a drought or a heat wave, the town draws from an interconnected regional reservoir (cloud cluster) to meet peak demand. When the surge passes, the connection is released.

Implementing burst-to-cloud on Kubernetes typically involves:

  1. Cluster federation or virtual workspaces: Tools like Liqo, Open Cluster Management (OCM), or Karmada federate multiple clusters behind a single control plane, allowing a workload scheduler to place jobs on whichever cluster has available capacity.
  2. Cost-aware scheduling: The federation layer monitors spot instance pricing in the cloud and prefers on-prem when utilization is below a threshold, falling back to cloud spot instances when on-prem is saturated.
  3. Data locality considerations: Large training datasets should not cross expensive WAN links on every job. Burst-to-cloud works best when data is replicated to cloud object storage in advance, or when the burst workloads are less data-intensive (hyperparameter sweeps, evaluation runs).
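The three considerations above can be combined into a simple placement policy. The sketch below is purely illustrative: the 0.85 utilization threshold, the `JobRequest` type, and the `choose_cluster` function are assumptions made for this example, not features of Liqo, OCM, or Karmada.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    gpus: int
    dataset_in_cloud: bool  # True if the data is already replicated to cloud storage


def choose_cluster(job: JobRequest, onprem_free_gpus: int,
                   onprem_utilization: float, burst_threshold: float = 0.85) -> str:
    """Return 'on-prem', 'cloud', or 'queue' for a job (illustrative policy)."""
    fits_onprem = job.gpus <= onprem_free_gpus
    # Prefer on-prem while it has headroom, or when the data cannot move cheaply.
    if fits_onprem and (onprem_utilization < burst_threshold or not job.dataset_in_cloud):
        return "on-prem"
    # Burst only when the dataset is already in cloud object storage.
    if job.dataset_in_cloud:
        return "cloud"
    # Data is not replicated: avoid the WAN transfer and wait for on-prem capacity.
    return "queue"
```

A production policy would also weigh spot pricing and job priority; the point here is only the ordering of the checks: capacity first, then data locality.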

Multi-Cluster Federation for GPU Pools

For organizations operating at true scale (think large tech companies or national research labs), a single physical cluster cannot hold enough GPU nodes, and the management overhead of federation becomes justified. Multi-cluster federation creates a logical GPU pool that spans physical boundaries.

The federation control plane maintains a global view of resource availability. When a data scientist submits a 1,000-GPU training job, the scheduler can place it across two or three clusters transparently. From the user’s perspective, the job was submitted and eventually completed; the cross-cluster placement was invisible.

The main challenges are:

Cross-Cluster Model and Data Synchronization

Two classes of data must move across cluster boundaries:

| Data Type | Direction | Tooling |
| --- | --- | --- |
| Training datasets | On-prem → Cloud (for burst jobs) | Rclone, AWS DataSync, object storage replication |
| Model checkpoints | Training cluster → Model registry | MLflow, W&B, OCI artifact push |
| Serving model artifacts | Registry → Serving cluster | CD pipeline, image pull, PVC snapshot |
| Experiment metadata | All clusters → Central tracking | MLflow tracking server, remote backend |
| Cluster state backups | All clusters → Remote object store | Velero, Kasten K10 |

The key insight is that model artifacts are immutable and versioned — once a model version is published, it does not change. This immutability makes cross-cluster synchronization straightforward compared to synchronizing mutable databases.

Key Takeaway: Multi-cluster and hybrid architectures let organizations optimize cost (on-prem for steady-state, cloud for bursts), meet data governance requirements, and eliminate single points of failure. The coordination layer — federation, model registries, and data replication pipelines — is what makes the multi-cluster boundary transparent to data science teams.


10.3 Production Hardening and Reliability

The Reliability Contract

Production AI platforms carry an implicit contract: training jobs should not be interrupted without warning, inference services should not drop requests during maintenance, and the platform should recover gracefully from failures that are inevitable at scale. This section covers the mechanisms that fulfill that contract.

Pod Disruption Budgets for Inference Services

When a node is drained for maintenance — whether for a kernel update, hardware replacement, or cluster upgrade — Kubernetes evicts the pods running on that node. For stateless web services, eviction and rescheduling is fast and relatively painless. For a high-traffic inference service with strict latency SLAs, evicting all replicas simultaneously is catastrophic.

PodDisruptionBudgets (PDBs) constrain the number of pods that can be voluntarily disrupted at any time. A PDB is the platform’s commitment, enforced through the Eviction API that kubectl drain uses: “you may only evict pods from this Deployment if at least N replicas remain available.”

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2          # At least 2 replicas must stay up during disruption
  selector:
    matchLabels:
      app: llm-inference

For a three-replica inference Deployment, this PDB ensures that a rolling node drain can only remove one pod at a time, keeping the service at two-thirds capacity throughout maintenance rather than going dark entirely.

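The inverse parameter, maxUnavailable, is useful for large deployments where a fixed number or percentage of disrupted pods is acceptable. Both fields accept either an absolute count or a percentage string such as "25%", and a single PDB may set only one of them: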

spec:
  maxUnavailable: 1        # At most 1 pod unavailable at any time
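For the percentage form, a complete manifest might look like the following sketch. The PDB name is hypothetical, and the selector reuses the app: llm-inference label from the earlier example.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb-percent       # hypothetical name
spec:
  # Percentage of the pods matched by the selector that may be
  # unavailable at once. With 40 replicas this allows up to 10
  # voluntary evictions at a time.
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: llm-inference
```

A percentage scales with the Deployment, so the budget does not need to be edited every time the replica count changes.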

PDBs only constrain voluntary disruptions (node drains, cluster upgrades). Involuntary disruptions — node hardware failure, OOM kills — bypass PDBs and require replica counts and topology spread constraints for resilience.
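A minimal sketch of the replica-count and topology-spread side of that resilience, assuming a hypothetical image name and the same app: llm-inference label, could look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      topologySpreadConstraints:
        # Keep replicas on different nodes so that one node failure
        # removes at most one replica.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: llm-inference
      containers:
        - name: server
          image: registry.example.com/llm-inference:1.0   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1
```

Swapping the topologyKey for topology.kubernetes.io/zone spreads replicas across zones instead of nodes, where the cluster labels its nodes with zones.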

Graceful Shutdown and Drain for Long-Running Training

Training jobs present the inverse problem from inference: they are not serving external requests, but they run for hours or days and cannot simply be killed without losing progress.

When a node must be drained and a training pod is running on it, the default behavior is a SIGTERM followed by a SIGKILL after terminationGracePeriodSeconds (30 seconds by default). For a training job mid-epoch, this often means losing all progress since the last checkpoint.

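Production hardening strategies for training jobs include:

  1. Checkpoint-and-resume: The job writes checkpoints to durable storage at regular intervals and resumes from the latest one after rescheduling, so an eviction costs only the work done since the last checkpoint.
  2. Extended grace periods: A larger terminationGracePeriodSeconds gives the training process time to write a final checkpoint after SIGTERM, before the SIGKILL arrives.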
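One way to give a training pod time to checkpoint before it is killed is sketched below. The image, the --resume-from flag, and the PVC name are hypothetical, and the sketch assumes the training process handles SIGTERM by writing a final checkpoint.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune                # hypothetical name
spec:
  backoffLimit: 4                   # allow retries after evictions or failures
  template:
    spec:
      restartPolicy: OnFailure
      # Give the trainer 10 minutes between SIGTERM and SIGKILL
      # to flush a final checkpoint (default is 30 seconds).
      terminationGracePeriodSeconds: 600
      containers:
        - name: trainer
          image: registry.example.com/trainer:1.0        # hypothetical image
          # Hypothetical flag: the trainer is assumed to resume from
          # the newest checkpoint in this directory if one exists.
          args: ["--resume-from", "/checkpoints/latest"]
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: training-checkpoints               # hypothetical PVC
```

The grace period only helps if the checkpoint write fits inside it, so the value should be sized against measured checkpoint times for the largest model the platform trains.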

Disaster Recovery for Model Registries and Pipelines

AI platforms accumulate irreplaceable artifacts: trained model weights, experiment metadata, pipeline definitions, and data preprocessing code. Losing this data can set a team back weeks or months.

A tiered DR strategy for AI platforms:

Tier 1 — Cluster state backups: Tools like Velero, Stash, or Kasten K10 snapshot Kubernetes resource manifests and persistent volume contents to remote object storage (S3, GCS, Azure Blob). These backups enable restoring the entire cluster configuration after a catastrophic failure [Source: https://kubeify.com/blog/2025-02-05-how-to-implement-kubernetes-for-high-availability-and-disaster-recovery/]. The etcd datastore should be snapshotted independently on a regular schedule.
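As a sketch of the Tier 1 pattern, a Velero Schedule can take recurring backups of selected namespaces. The namespace list, schedule, and retention period below are illustrative assumptions, and the example presumes Velero is already installed with a default backup storage location.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: ml-platform-every-6h        # hypothetical name
  namespace: velero
spec:
  schedule: "0 */6 * * *"           # cron syntax: every six hours
  template:
    includedNamespaces:
      - ml-platform                 # hypothetical namespace
    snapshotVolumes: true           # include persistent volume snapshots
    storageLocation: default        # assumed BackupStorageLocation name
    ttl: 720h0m0s                   # keep each backup for 30 days
```

The backup interval sets the worst-case RPO for this tier: with a six-hour schedule, up to six hours of cluster state changes can be lost.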

Tier 2 — Model artifact replication: Model registries backed by object storage benefit from cross-region replication. S3 Cross-Region Replication or GCS multi-region buckets provide near-instant copies of model artifacts in a secondary region with no application-level changes required.

Tier 3 — Active-passive pipeline replication: For organizations with strict RTO (Recovery Time Objective) requirements, an active-passive second cluster can be kept warm with the same pipeline definitions and operator versions. Failover is a matter of switching DNS or load balancer targets rather than rebuilding from scratch.

| DR Tier | What It Protects | Tooling | Typical Recovery Objective |
|---|---|---|---|
| Cluster state | YAML manifests, PVCs | Velero, Kasten K10 | RPO: hours (scheduled backup interval) |
| Model artifacts | Trained weights, checkpoints | S3 CRR, GCS multi-region | RPO: minutes (async replication) |
| Metadata (experiments) | MLflow runs, metrics | MLflow tracking DB backup | RPO: hours |
| Active-passive cluster | Full platform availability | OCM, Karmada, manual DR | RTO: minutes (DNS failover) |

RPO = Recovery Point Objective; how much data can be lost. RTO = Recovery Time Objective; how long recovery takes.

Figure 10.4: Tiered Disaster Recovery Strategy for AI Platforms

flowchart TD
    subgraph Primary["Primary Cluster (Active)"]
        ClusterState["Kubernetes Resources\n(Deployments, CRDs,\nOperators, PVCs)"]
        ModelReg["Model Registry\n(trained weights,\ncheckpoints)"]
        ExpMeta["Experiment Metadata\n(MLflow runs, metrics)"]
        Pipelines["Pipeline Definitions\n& Operator configs"]
    end

    subgraph Tier1["Tier 1 — Cluster State Backup"]
        Velero["Velero / Kasten K10\nsnapshots to object store\nRPO: Hours"]
    end

    subgraph Tier2["Tier 2 — Model Artifact Replication"]
        S3CRR["S3 Cross-Region Replication\n/ GCS Multi-Region\nRPO: Minutes"]
    end

    subgraph Tier3["Tier 3 — Active-Passive Cluster"]
        PassiveCluster["Passive Cluster\n(warm standby)\nDNS failover\nRTO: Minutes"]
    end

    ClusterState -->|"scheduled snapshots"| Velero
    ExpMeta -->|"DB backup"| Velero
    ModelReg -->|"async replication"| S3CRR
    Pipelines -->|"GitOps sync"| PassiveCluster

    S3CRR -->|"available in\nsecondary region"| PassiveCluster

    style Primary fill:#dbeafe,stroke:#3b82f6
    style Tier1 fill:#fef9c3,stroke:#eab308
    style Tier2 fill:#dcfce7,stroke:#22c55e
    style Tier3 fill:#ede9fe,stroke:#8b5cf6

Chaos Engineering for AI Infrastructure

Even well-designed systems fail in unexpected ways. Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they manifest in production [Source: https://www.wiz.io/academy/ai-ml-kubernetes-best-practices].

The analogy is a fire drill: you do not wait for a real fire to discover that the exit is blocked. You schedule a drill, find the blocked exit, and fix it.

For AI infrastructure, relevant chaos experiments include:

Tools like Chaos Mesh and LitmusChaos integrate natively with Kubernetes and provide a library of fault injection experiments that can be scheduled, monitored, and automatically validated.
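As one hedged example of such an experiment, the Chaos Mesh PodChaos resource below kills a single inference pod so that the team can observe whether traffic continues to be served. The namespace and label are assumptions carried over from the earlier PDB example.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-inference-pod      # hypothetical name
  namespace: chaos-mesh
spec:
  action: pod-kill                  # terminate the selected pod
  mode: one                         # pick one matching pod at random
  selector:
    namespaces:
      - inference                   # hypothetical namespace
    labelSelectors:
      app: llm-inference
```

Running this against a staging copy of the service first is a sensible default; the point of the experiment is to confirm that replicas, probes, and load balancing absorb the loss without dropped requests.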

The output of a chaos engineering program is not just discovered bugs — it is resilience confidence: documented proof that the platform handles specific failure modes gracefully, which is invaluable during incident postmortems and compliance audits.

Key Takeaway: Production hardening for AI on Kubernetes combines PodDisruptionBudgets to protect inference availability during maintenance, checkpoint-and-resume patterns to protect training progress during disruptions, tiered disaster recovery for irreplaceable model artifacts, and chaos engineering to validate resilience proactively rather than discovering gaps during real incidents.


10.4 The Future of AI on Kubernetes

Dynamic Resource Allocation (DRA) for GPUs

Every previous chapter in this book used the resources.limits pattern to request GPUs: nvidia.com/gpu: 1. This works, but it is a blunt instrument. The scheduler knows only that a pod needs one GPU — it has no visibility into GPU memory capacity, NVLink connectivity, PCIe topology, or the relationship between GPUs and high-speed NICs. This opacity leads to suboptimal placements, idle hardware, and fragile manual workarounds using node labels and selectors.

Dynamic Resource Allocation (DRA) is the successor to Kubernetes Device Plugins and represents a fundamental redesign of how Kubernetes understands and allocates specialized hardware [Source: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/]. DRA reached stable (GA) status in Kubernetes 1.34 [Source: https://developers.redhat.com/articles/2026/03/25/dynamic-resource-allocation-goes-ga-red-hat-openshift-421-smarter-gpu], and is also GA in Red Hat OpenShift 4.21 as of March 2026.

DRA introduces three new API objects in the resource.k8s.io API group:

| Object | Role | Who Creates It |
|---|---|---|
| ResourceSlice | Advertises hardware available on each node, including device attributes (memory, architecture, vendor capabilities) | DRA driver / node agent |
| DeviceClass | Defines a category of devices that can be requested, e.g., “any NVIDIA H100” or “any GPU with >= 40 GB VRAM” | Cluster administrator or device driver |
| ResourceClaim | A workload’s request for a specific device or combination of devices | User / pipeline |
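To make the DeviceClass role concrete, here is a minimal sketch of a class that matches every device published by one driver. The class name nvidia-gpu matches the name used in this section; the driver name gpu.nvidia.com is an assumption and depends on the DRA driver installed in the cluster.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
    - cel:
        # Match any device advertised by the assumed NVIDIA DRA driver.
        expression: device.driver == "gpu.nvidia.com"
```

Workloads then refer to the class by name and add their own selectors on top, which keeps vendor-specific details out of individual job definitions.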

Attribute-based requests are the headline feature. Instead of requesting nvidia.com/gpu: 1 and hoping the scheduler places the pod on a node with the right GPU, a user can write:

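apiVersion: resource.k8s.io/v1          # GA API version as of Kubernetes 1.34
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:                          # v1 wraps a single-class request in "exactly"
        deviceClassName: nvidia-gpu
        selectors:
        - cel:
            # Capacities are quantities keyed by driver name; "gpu.nvidia.com"
            # is assumed here and depends on the installed DRA driver.
            expression: device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("40Gi"))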

This claim requests any NVIDIA GPU with more than 40 GiB of memory. The scheduler consults the ResourceSlice objects published by DRA drivers on each node and finds a matching device automatically — no manual node selectors required [Source: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation].
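A pod consumes the claim by listing it under spec.resourceClaims and referencing it from the container that needs the device. The sketch below assumes the training-gpu-claim object above exists in the same namespace; the pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod                # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu                     # pod-local name for the claim
      resourceClaimName: training-gpu-claim
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0   # hypothetical image
      resources:
        claims:
          - name: gpu               # grants this container access to the device
```

Note that the container no longer sets nvidia.com/gpu under resources.limits; the claim replaces the count-based request.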

Topology-aware co-scheduling is equally important for distributed training. DRA supports inter-device constraints: a workload can request both a GPU and a high-speed NIC and specify that they must be attached to the same PCIe Root Complex to minimize latency. Previously, this required complex custom scheduler plugins or manual affinity rules; with DRA, it is a first-class declarative expression [Source: https://thenewstack.io/kubernetes-primer-dynamic-resource-allocation-dra-for-gpu-workloads/].
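A claim with such a constraint might be sketched as follows. The rdma-nic device class is hypothetical, and the example assumes both drivers publish the resource.kubernetes.io/pcieRoot attribute for their devices; if either driver does not, the constraint cannot be satisfied.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-nic-pair                # hypothetical name
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-gpu
      - name: nic
        exactly:
          deviceClassName: rdma-nic   # hypothetical NIC device class
    constraints:
      # Both allocated devices must report the same value for this
      # attribute, i.e. sit under the same PCIe root (assumed attribute).
      - requests: ["gpu", "nic"]
        matchAttribute: resource.kubernetes.io/pcieRoot
```

Expressing the pairing in the claim means the scheduler rejects placements that would split the GPU and NIC across PCIe roots, rather than leaving that check to node labels.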

Figure 10.5: Dynamic Resource Allocation (DRA) — Object Model and Scheduling Flow

flowchart TD
    subgraph Drivers["Node-Level DRA Drivers"]
        NVDriver["NVIDIA DRA Driver\n(node agent)"]
        AMDDriver["AMD DRA Driver\n(node agent)"]
    end

    subgraph DRAObjects["DRA API Objects (resource.k8s.io)"]
        RS_NV["ResourceSlice\n(node: gpu-node-01)\ndevice: H100 80GB\nmemory: 80Gi\nnvlink: true"]
        RS_AMD["ResourceSlice\n(node: gpu-node-05)\ndevice: MI300X\nmemory: 192Gi"]
        DC["DeviceClass\n'nvidia-gpu'\n(vendor selector,\ndefault attributes)"]
        RC["ResourceClaim\n(workload request)\nmemory >= 40Gi\nnvlink == true"]
    end

    subgraph Workload["User Workload"]
        Pod["Training Pod\n(references ResourceClaim)"]
    end

    Scheduler["Kubernetes Scheduler\n(DRA-aware)\nmatches Claims to Slices"]

    NVDriver -->|"publishes"| RS_NV
    AMDDriver -->|"publishes"| RS_AMD
    DC -->|"scopes"| RC
    Pod -->|"references"| RC
    RC -->|"evaluated by"| Scheduler
    RS_NV -->|"consulted by"| Scheduler
    RS_AMD -->|"consulted by"| Scheduler
    Scheduler -->|"binds Pod to\ngpu-node-01"| Pod

    style DRAObjects fill:#dbeafe,stroke:#3b82f6
    style Scheduler fill:#ede9fe,stroke:#8b5cf6
    style Workload fill:#dcfce7,stroke:#22c55e

AWS has demonstrated DRA on EKS with EC2 P6e-GB200 instances (NVIDIA GB200 NVL72), and AKS has shipped DRA support for NVIDIA vGPU partitioning in Azure as of March 2026 [Source: https://blog.aks.azure.com/2026/03/06/dra-with-vGPUs-on-aks]. NVIDIA’s NIM Operator also supports DRA for deploying inference microservices [Source: https://docs.nvidia.com/nim-operator/latest/dra.html].

Kubernetes AI/ML Working Group Initiatives

The Kubernetes project formalized the AI/ML Working Group (wg-ai-ml) to coordinate the development of primitives and patterns needed for machine learning workloads. The working group focuses on use cases that cut across SIGs (Special Interest Groups): batch scheduling, accelerator management, topology-aware placement, and job lifecycle management.

Working group outputs feed directly into the APIs described in this chapter. The working group also maintains reference architectures and best practices documentation that platform engineers can use to evaluate whether their cluster configuration aligns with community consensus.

LeaderWorkerSet and JobSet APIs

Two new batch APIs have emerged from the AI/ML working group that address the limitations of plain Kubernetes Job for distributed training:

JobSet (JobSet.jobset.x-k8s.io) is an API for managing groups of related Job objects as a single unit. A distributed training run often needs several coordinated job components: parameter servers, workers, and evaluation processes. Managing these as separate Job objects means writing custom controllers to coordinate their lifecycle. JobSet treats the group as a single logical unit with shared failure policies, indexing, and lifecycle management [Source: https://rafay.co/ai-and-cloud-native-blog/introduction-to-dynamic-resource-allocation-dra-in-kubernetes].
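A minimal JobSet sketch with a single group of four workers is shown below, assuming the JobSet controller is installed. The names and image are hypothetical, and a real run would typically add further replicatedJobs entries for the other components.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: dist-train                  # hypothetical name
spec:
  failurePolicy:
    maxRestarts: 3                  # restart the whole set up to 3 times on failure
  replicatedJobs:
    - name: workers
      replicas: 1
      template:                     # a standard Job template
        spec:
          parallelism: 4
          completions: 4
          completionMode: Indexed   # each pod gets a stable index
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: trainer
                  image: registry.example.com/trainer:1.0   # hypothetical image
                  resources:
                    limits:
                      nvidia.com/gpu: 1
```

The shared failurePolicy is the main difference from four independent Jobs: one failed worker triggers a coordinated restart instead of leaving the rest of the group running against a missing peer.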

LeaderWorkerSet (LeaderWorkerSet.leaderworkerset.x-k8s.io) targets the specific pattern of LLM training and inference, where one process (the leader) coordinates many worker processes in a tightly coupled group. Rather than expressing this as a headless Service plus a StatefulSet (a common but awkward workaround), LeaderWorkerSet provides a dedicated API with first-class concepts for separate leader and worker pod templates, worker group sizing, and restart policies that preserve group cohesion.
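The shape of the API can be sketched as follows, assuming the LeaderWorkerSet controller is installed. The names and images are hypothetical; the example runs two groups of four pods each (one leader plus three workers).

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-serving                 # hypothetical name
spec:
  replicas: 2                       # number of leader/worker groups
  leaderWorkerTemplate:
    size: 4                         # pods per group, including the leader
    # Recreate the whole group if any pod in it restarts, so the
    # group never runs with a mix of old and new members.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
          - name: leader
            image: registry.example.com/llm-leader:1.0    # hypothetical image
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: registry.example.com/llm-worker:1.0    # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1
```

Scaling replicas adds or removes whole groups, which is the unit that matters when one model instance is sharded across several pods.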

Together, JobSet and LeaderWorkerSet reduce the boilerplate required to run distributed AI workloads, moving the community toward standardized, interoperable primitives rather than each MLOps platform reinventing its own coordination layer.

Emerging Hardware: TPUs, AWS Trainium/Inferentia, Intel Gaudi on Kubernetes

NVIDIA GPUs have dominated the first wave of AI on Kubernetes, but the hardware landscape is diversifying rapidly. Each accelerator brings its own Kubernetes integration patterns:

| Hardware | Vendor | Primary Use | Kubernetes Integration |
|---|---|---|---|
| TPU v4/v5 | Google | Large-scale training, inference | GKE TPU node pools, google.com/tpu resource, DRA in development |
| AWS Trainium (Trn1/Trn2) | Amazon | Cost-efficient training | EKS Neuron device plugin, aws.amazon.com/neuron resource |
| AWS Inferentia (Inf2) | Amazon | High-throughput inference | EKS Neuron device plugin, NeuronRT runtime |
| Intel Gaudi 2/3 | Intel | Training, inference (open alternative) | Gaudi device plugin, habana.ai/gaudi resource |
| AMD Instinct (MI300X) | AMD | Training, LLM inference | ROCm device plugin, upstream Device Plugin framework |
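With the device plugin model, the vendor shows up directly in the pod spec as an extended resource name. The sketch below requests one Intel Gaudi device using the habana.ai/gaudi resource from the table; the pod name and image are hypothetical, and the example assumes the Gaudi device plugin is running on the node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-training-pod          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/gaudi-trainer:1.0   # hypothetical image
      resources:
        limits:
          # Extended resource exposed by the vendor device plugin.
          # Moving to another accelerator means changing this key,
          # e.g. aws.amazon.com/neuron or google.com/tpu.
          habana.ai/gaudi: 1
```

Because the resource key is vendor-specific, the same job needs a different manifest per accelerator under this model.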

The common thread is that each new accelerator requires a vendor-specific device plugin (or DRA driver, the modern equivalent) to expose the hardware to the Kubernetes scheduler. DRA’s attribute-based model is architecturally better suited to this diversity: a workload that can run on either an NVIDIA H100 or an AMD MI300X can express that flexibility as a single ResourceClaim rather than requiring separate YAML manifests for each.

For platform engineers, the practical implication is hardware abstraction: by adopting DRA-based resource claims, AI platforms can support multiple accelerator backends without changing job definitions. The data scientist writes “I need a GPU with 80 GB of memory and NVLink”; the platform matches it to whatever hardware is available, whether that is an H100, an MI300X, or a future accelerator not yet on the market.

Key Takeaway: The future of AI on Kubernetes is declarative, hardware-agnostic, and increasingly topology-aware. DRA replaces the blunt-instrument device plugin model with attribute-based scheduling that matches workloads to hardware intelligently. LeaderWorkerSet and JobSet provide first-class APIs for distributed AI job patterns, while expanding hardware diversity — TPUs, Trainium, Gaudi — is pushing the ecosystem toward abstraction layers that make AI platforms portable across accelerator generations.


Chapter Summary

This chapter assembled the production engineering patterns that transform a functional Kubernetes AI cluster into a reliable, scalable platform.

Multi-tenancy requires a deliberate isolation strategy — namespace boundaries for cooperative internal teams, virtual clusters for stronger isolation or CRD diversity — combined with ResourceQuotas, Kueue, and NVIDIA MIG for fair, enforceable resource governance. Self-service tooling and policy enforcement with OPA Gatekeeper or Kyverno turn the platform team from a bottleneck into an enabler.

Multi-cluster and hybrid architectures allow organizations to optimize cost (on-prem steady-state, cloud bursts), satisfy data sovereignty requirements, and eliminate single points of failure. The key engineering investment is the synchronization layer: model registries, data replication pipelines, and federation tooling that make cluster boundaries invisible to data scientists.

Production hardening protects both inference services (PodDisruptionBudgets) and training jobs (checkpoint-and-resume, extended grace periods). Disaster recovery for AI platforms is tiered — cluster state, model artifacts, and pipeline definitions each require different backup strategies and different recovery objectives. Chaos engineering validates that these mechanisms actually work when needed.

The future belongs to DRA — attribute-based, topology-aware hardware scheduling that replaces rigid device counts with declarative capability expressions — alongside dedicated APIs like LeaderWorkerSet and JobSet that make distributed AI workloads first-class Kubernetes citizens. As the hardware landscape diversifies beyond NVIDIA GPUs, the platform patterns established in this chapter provide the foundation for AI infrastructure that can evolve with the technology.


Key Terms

| Term | Definition |
|---|---|
| multi-tenancy | The practice of sharing a Kubernetes cluster (or platform) among multiple teams or organizations, with isolation and resource governance to prevent interference |
| virtual cluster | A fully isolated Kubernetes control plane (API server, etcd, controllers) running as pods inside a host cluster, providing per-tenant CRD isolation |
| PodDisruptionBudget | A Kubernetes policy object that constrains the number of pods that can be voluntarily evicted simultaneously, protecting service availability during node maintenance |
| DRA (Dynamic Resource Allocation) | A Kubernetes API framework (GA in 1.34) that allows workloads to request specialized hardware based on device attributes rather than simple counts, succeeding the Device Plugin model |
| LeaderWorkerSet | A Kubernetes API for managing tightly coupled leader/worker process groups in LLM training and inference, providing first-class lifecycle management for multi-process AI workloads |
| JobSet | A Kubernetes batch API for managing groups of related Job objects as a single logical unit, with shared failure policies and lifecycle management for distributed training |
| hybrid cloud | An architecture that combines on-premises infrastructure with public cloud resources, allowing workloads to be placed based on cost, latency, data governance, or availability requirements |
| burst-to-cloud | A scaling pattern where on-premises clusters handle steady-state workloads and cloud clusters absorb peak demand spikes, with federation or pipeline tooling managing workload placement |
| chaos engineering | The discipline of deliberately injecting controlled failures into a system to discover and remediate weaknesses before they occur in uncontrolled production incidents |
| platform engineering | The practice of treating internal developer infrastructure as a product, building self-service capabilities and golden paths that allow application teams to move faster with less platform team involvement |
| ResourceSlice | A DRA API object published by hardware drivers that advertises available devices on each node, including detailed attributes like memory capacity and architecture version |
| DeviceClass | A DRA API object that defines a category of requestable devices in a cluster, allowing cluster admins to specify what devices workloads can request and under what conditions |
| ResourceClaim | A DRA API object representing a workload’s request for one or more devices, matched against ResourceSlices by the Kubernetes scheduler |
| noisy neighbor | A multi-tenancy problem where one tenant’s resource-intensive workload degrades performance for other tenants sharing the same physical or logical infrastructure |
| Kueue | A CNCF project providing Kubernetes-native batch job queueing with ClusterQueue and LocalQueue objects, enabling fair resource sharing, borrowing, and preemption across teams |