What is GPU Scheduling in Kubernetes?
GPU scheduling in Kubernetes manages the allocation and orchestration of GPU resources across cluster nodes for machine learning, AI, and compute-intensive workloads. It involves resource discovery, allocation policies, multi-tenancy, and optimizing GPU utilization across distributed workloads.
Key Challenge: GPUs are expensive devices that Kubernetes allocates whole and exclusively by default, so sophisticated scheduling (sharing, topology awareness, quotas) is required to maximize utilization while preserving workload isolation and performance guarantees.
GPU Cluster Optimization Calculator
Example scenario: 4 nodes with 8 GPUs per node.
Cluster metrics for this example:
- • Total GPUs: 32
- • Effective GPUs: 29
- • Wasted GPUs: 3
- • GPU Utilization: 85%
- • Scheduling Efficiency: 92%
- • Typical Workload Duration: hours to days
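How the calculator combines these numbers is not spelled out; a minimal sketch of one plausible derivation, assuming effective GPUs = total GPUs × scheduling efficiency (truncated) and wasted GPUs = total − effective:
# Hypothetical reconstruction of the calculator's arithmetic (the formula is an assumption)
NODES=4
GPUS_PER_NODE=8
SCHED_EFFICIENCY=0.92                      # scheduling efficiency from the example above
TOTAL=$((NODES * GPUS_PER_NODE))           # 32
EFFECTIVE=$(awk -v t="$TOTAL" -v e="$SCHED_EFFICIENCY" 'BEGIN { printf "%d", t * e }')   # 29
WASTED=$((TOTAL - EFFECTIVE))              # 3
echo "total=$TOTAL effective=$EFFECTIVE wasted=$WASTED"
With these inputs the sketch reproduces the 32 total, 29 effective, and 3 wasted GPUs shown above.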
NVIDIA GPU Operator & Device Plugin
GPU Operator Components
- • Device Plugin: GPU resource discovery and allocation
- • Driver Container: NVIDIA driver installation and management
- • Container Runtime: nvidia-container-runtime integration
- • Node Feature Discovery: Hardware topology detection
- • GPU Feature Discovery: GPU capabilities enumeration
- • DCGM Exporter: GPU monitoring and metrics collection
Scheduling Features
- • Resource Requests: GPU count and memory allocation
- • Multi-GPU Support: Single and multi-GPU workloads
- • GPU Sharing: Time-slicing and MIG partitioning (see the MIG example after this list)
- • Topology Awareness: PCIe, NVLink, and NUMA optimization
- • Health Monitoring: GPU failure detection and recovery
- • Mixed Workloads: Training and inference co-location
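The sharing modes above change what a pod asks for: with time-slicing, pods still request nvidia.com/gpu and simply share the physical device, while with MIG in the device plugin's "mixed" strategy each profile is exposed under its own resource name. Below is a minimal sketch of a pod requesting one MIG slice, assuming A100-class nodes with the 1g.5gb profile advertised as nvidia.com/mig-1g.5gb; the pod name and image tag are illustrative.
# Sketch: request a single MIG 1g.5gb slice (assumes MIG strategy "mixed")
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]          # lists the single MIG device visible to the pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1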
GPU Operator Installation & Configuration
Install NVIDIA GPU Operator
# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Install GPU Operator with custom configuration
helm install gpu-operator nvidia/gpu-operator \
--namespace nvidia-system --create-namespace \
--set operator.defaultRuntime=containerd \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set dcgmExporter.enabled=true \
--set nodeFeatureDiscovery.enabled=true \
--set gfd.enabled=true \
--set migManager.enabled=true \
--set devicePlugin.config.name="time-slicing-config" \
--set devicePlugin.config.default="any"
# Verify installation
kubectl get pods -n nvidia-system
kubectl get nodes -l nvidia.com/gpu.present=true
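A common smoke test after installation is to run nvidia-smi in a throwaway pod that requests one GPU; if it prints the device, then discovery, driver, and container runtime are all wired up. A minimal sketch (the CUDA image tag is illustrative):
# gpu-smoke-test.yaml -- verify that a pod can be scheduled onto a GPU and see it
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1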
GPU Time-Slicing Configuration
# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4          # Number of virtual GPUs per physical GPU
---
# ClusterPolicy to enable time-slicing
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: "time-slicing-config"
      default: "any"
  dcgmExporter:
    enabled: true
    config:
      name: "console-plugin-nvidia-gpu"
  gfd:
    enabled: true
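Once the device plugin picks up this config, every physical GPU is advertised replicas times, so an 8-GPU node reports 32 allocatable nvidia.com/gpu. A quick check (the node name is a placeholder):
# Confirm the inflated GPU count on a time-sliced node
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
Keep in mind that time-sliced replicas share GPU memory and provide no isolation between pods, so time-slicing suits bursty inference and notebooks better than large training jobs.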
GPU Workload Examples
# Single GPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        # The device plugin exposes only the allocated GPU to the container, so
        # NVIDIA_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES do not need to be set manually.
        # torchrun replaces the deprecated torch.distributed.launch entry point.
        command: ["torchrun"]
        args: ["--nproc_per_node=1", "train.py", "--epochs=100"]
        volumeMounts:
        - name: dataset-volume
          mountPath: /data
        - name: model-output
          mountPath: /models
      volumes:
      - name: dataset-volume
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-output-pvc
      nodeSelector:
        nvidia.com/gpu.product: "Tesla-V100-SXM2-32GB"
---
# Multi-GPU Distributed Training
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completions: 4               # one completion per worker
  parallelism: 4               # 4 worker pods
  completionMode: Indexed      # gives each pod a stable completion index for its rank
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: distributed-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 2          # 2 GPUs per pod
            memory: "64Gi"
          requests:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: MASTER_ADDR
          value: "distributed-training-master"
        - name: MASTER_PORT
          value: "29500"
        - name: WORLD_SIZE
          value: "8"                   # Total GPUs across all pods
        - name: NCCL_DEBUG
          value: "WARN"
        - name: POD_INDEX              # completion index, exposed via the downward API
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command: ["torchrun"]
        args:
        - "--nproc_per_node=2"
        - "--nnodes=4"
        - "--node_rank=$(POD_INDEX)"
        - "--master_addr=$(MASTER_ADDR)"
        - "--master_port=$(MASTER_PORT)"
        - "distributed_train.py"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: job-name
                  operator: In
                  values: ["distributed-training"]
              topologyKey: kubernetes.io/hostname
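A quick way to confirm that the four workers landed on separate GPU nodes and received their GPUs (the pod name is a placeholder):
# Verify worker placement and GPU allocation
kubectl get pods -l job-name=distributed-training -o wide
kubectl exec <one-worker-pod> -- nvidia-smi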
Advanced GPU Scheduling Patterns
GPU Topology-Aware Scheduling
# NodeAffinity for GPU topology optimization
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-gpu-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-gpu-inference
  template:
    metadata:
      labels:
        app: multi-gpu-inference
    spec:
      containers:
      - name: inference-server
        image: nvcr.io/nvidia/tritonserver:22.12-py3
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "128Gi"
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"             # the four GPUs allocated to this container
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Require nodes with NVLink-connected GPUs
              - key: nvidia.com/gpu.product
                operator: In
                values: ["Tesla-V100-SXM2-32GB", "A100-SXM4-80GB"]
              # Only nodes with more than 4 GPUs (e.g. 8-GPU systems)
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["4"]
        podAntiAffinity:
          # Spread replicas across different nodes for fault tolerance
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["multi-gpu-inference"]
              topologyKey: kubernetes.io/hostname
---
# Custom GPU scheduler profile with priorities
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: gpu-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
            weight: 80               # Prioritize resource fit
          - name: NodeAffinity
            weight: 60               # Consider GPU topology
        filter:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated     # Prefer nodes with more available GPUs
            resources:
            - name: nvidia.com/gpu
              weight: 100
            - name: memory
              weight: 50
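The ConfigMap above only holds the configuration; it still has to be mounted into a second kube-scheduler deployment running with this file. Workloads then opt in per pod via spec.schedulerName. A minimal sketch (pod name and image are illustrative):
# Pod that opts into the custom scheduler profile
apiVersion: v1
kind: Pod
metadata:
  name: gpu-batch-task
spec:
  schedulerName: gpu-scheduler        # must match the profile's schedulerName
  restartPolicy: Never
  containers:
  - name: worker
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1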
GPU Resource Quotas & Limits
# Namespace ResourceQuota for GPU resources
# (only the requests.* prefix is supported for extended resources such as nvidia.com/gpu)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "20"    # Max 20 GPUs requested across the namespace
    persistentvolumeclaims: "10"     # Max 10 PVCs for datasets
    requests.memory: "400Gi"         # Max memory for GPU workloads
    requests.cpu: "200"              # Max CPU cores
---
# LimitRange for individual containers
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-team
spec:
  limits:
  - default:
      nvidia.com/gpu: "1"
      memory: "16Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"            # Max 8 GPUs per container
      memory: "256Gi"
      cpu: "64"
    min:
      nvidia.com/gpu: "1"            # Min 1 GPU per container
      memory: "4Gi"
      cpu: "1"
    type: Container
---
# NetworkPolicy for GPU workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-policy
  namespace: ml-team
spec:
  podSelector:
    matchLabels:
      gpu-enabled: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ml-team
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080      # Model serving port
    - protocol: TCP
      port: 6006      # TensorBoard port
  egress:
  - to: []            # Allow all egress destinations for data fetching
    ports:
    - protocol: UDP
      port: 53        # DNS (required to resolve external hosts)
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 6379      # Redis
    - protocol: TCP
      port: 5432      # PostgreSQL
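With the quota and limits in place, consumption can be inspected per namespace; the describe output shows used versus hard GPU counts and the defaults applied to new containers:
# Inspect GPU quota consumption and default limits in ml-team
kubectl describe resourcequota gpu-quota -n ml-team
kubectl describe limitrange gpu-limits -n ml-team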
Real-World GPU Kubernetes Implementations
OpenAI GPU Infrastructure
- • 10,000+ NVIDIA A100s across multiple Kubernetes clusters
- • Custom scheduler for multi-month training jobs
- • Topology-aware placement for optimal NVLink utilization
- • Automated fault tolerance with checkpointing
- • Powers GPT-4 and DALL-E training workloads
Google GKE Autopilot GPU
- • Automatic GPU node provisioning and scaling
- • Multi-Instance GPU (MIG) support for A100s
- • Integrated with Vertex AI for ML pipelines
- • Time-sharing for improved GPU utilization
- • Serves Google's internal ML training at scale
Uber ML Platform
- • Kubernetes-based Horovod for distributed training
- • Dynamic GPU allocation based on workload priority
- • Multi-tenant GPU sharing across 100+ teams
- • Custom metrics for GPU cost optimization
- • 85% average GPU utilization across fleet
Netflix GPU Inference
- • Kubernetes-native GPU inference serving
- • Auto-scaling based on inference request load
- • Mixed GPU/CPU workloads for cost optimization
- • Supports 500M+ daily recommendations
- • A/B testing infrastructure on GPU clusters
GPU Kubernetes Best Practices
✅ Do
- Use NVIDIA GPU Operator for automated GPU lifecycle management
- Implement GPU time-slicing for improved utilization
- Monitor GPU metrics with DCGM and Prometheus
- Use topology-aware scheduling for multi-GPU workloads
- Implement proper resource quotas and limits
- Plan for GPU node failures with checkpointing
❌ Don't
- Schedule GPU workloads without resource requests/limits
- Ignore GPU topology for multi-GPU training jobs
- Run GPU workloads without proper monitoring
- Mix incompatible CUDA versions in the same cluster
- Forget to implement GPU node taints and tolerations (see the sketch after this list)
- Over-provision GPU resources without utilization analysis
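On the taints point above: GPU nodes are commonly tainted so that only pods carrying a matching toleration can land on them. A minimal sketch with an illustrative taint key/value; adapt it to your own convention, and make sure the operator's daemonsets (device plugin, DCGM exporter) also tolerate the taint, or the GPUs will never be advertised.
# Taint GPU nodes so only tolerating pods schedule onto them (illustrative key/value)
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule

# Pod with the matching toleration and a GPU request
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-gpu-pod
spec:
  restartPolicy: Never
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1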