
GPU Scheduling in Kubernetes

Master GPU scheduling in Kubernetes: NVIDIA GPU Operator, multi-GPU workloads, resource optimization, and ML infrastructure.

45 min read · Advanced

What is GPU Scheduling in Kubernetes?

GPU scheduling in Kubernetes manages the allocation and orchestration of GPU resources across cluster nodes for machine learning, AI, and compute-intensive workloads. It involves resource discovery, allocation policies, multi-tenancy, and optimizing GPU utilization across distributed workloads.

Key Challenge: GPUs are expensive and, by default, allocated to a single container at a time. Maximizing utilization while preserving workload isolation and performance guarantees therefore requires deliberate scheduling, including sharing mechanisms such as time-slicing and MIG covered below.

Example Cluster Metrics (4 nodes × 8 GPUs per node)

  • Total GPUs: 32
  • Effective GPUs: 29
  • Wasted GPUs: 3
  • GPU Utilization: 85%
  • Scheduling Efficiency: 92%
  • Workload Duration: Hours-Days

NVIDIA GPU Operator & Device Plugin

GPU Operator Components

  • Device Plugin: GPU resource discovery and allocation
  • Driver Container: NVIDIA driver installation and management
  • Container Runtime: nvidia-container-runtime integration
  • Node Feature Discovery: Hardware topology detection
  • GPU Feature Discovery: GPU capabilities enumeration
  • DCGM Exporter: GPU monitoring and metrics collection

Scheduling Features

  • Resource Requests: GPU count and memory allocation
  • Multi-GPU Support: Single and multi-GPU workloads
  • GPU Sharing: Time-slicing and MIG partitioning (see the MIG sketch after this list)
  • Topology Awareness: PCIe, NVLink, and NUMA optimization
  • Health Monitoring: GPU failure detection and recovery
  • Mixed Workloads: Training and inference co-location
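
GPU sharing is listed above, but only time-slicing is configured later on this page. The manifest below is a minimal sketch of the MIG variant: a pod requesting a single MIG slice rather than a whole GPU. It assumes the MIG manager has partitioned an A100 with the 1g.5gb profile and that the device plugin exposes MIG devices as named resources; the exact resource name (here nvidia.com/mig-1g.5gb) depends on the GPU model and the profile applied.

# Hypothetical pod consuming one MIG slice instead of a full GPU
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]   # Lists the single MIG device visible to the pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1    # MIG slice resource name (profile-dependent)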

GPU Operator Installation & Configuration

Install NVIDIA GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install GPU Operator with custom configuration
helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-system --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set toolkit.enabled=true \
  --set driver.enabled=true \
  --set dcgmExporter.enabled=true \
  --set nodeFeatureDiscovery.enabled=true \
  --set gfd.enabled=true \
  --set migManager.enabled=true \
  --set devicePlugin.config.name="time-slicing-config" \
  --set devicePlugin.config.default="any"

# Verify installation
kubectl get pods -n nvidia-system
kubectl get nodes -l nvidia.com/gpu.present=true
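
Once the operator pods are running, it is worth confirming what the device plugin and GPU Feature Discovery actually advertise on each node. The commands below are a small verification sketch; replace <gpu-node> with one of your GPU node names.

# Confirm the node advertises nvidia.com/gpu as an allocatable resource
kubectl describe node <gpu-node> | grep -A 8 "Allocatable"

# Inspect the nvidia.com/* labels published by GPU Feature Discovery
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep nvidia.com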

GPU Time-Slicing Configuration

# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Number of virtual GPUs per physical GPU
          
---
# ClusterPolicy to enable time-slicing
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: "time-slicing-config"
      default: "any"
  dcgmExporter:
    enabled: true
    config:
      name: "console-plugin-nvidia-gpu"
  gfd:
    enabled: true
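
As a sanity check (a sketch assuming an 8-GPU node and the 4-replica configuration above), the allocatable GPU count on the node should multiply once the device plugin reloads the time-slicing config:

# Apply the config and confirm the inflated GPU count
kubectl apply -f gpu-time-slicing-config.yaml
# With replicas: 4, an 8-GPU node should now report 32 allocatable nvidia.com/gpu
kubectl describe node <gpu-node> | grep nvidia.com/gpu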

GPU Workload Examples

# Single GPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        # The device plugin injects NVIDIA_VISIBLE_DEVICES for the allocated GPU;
        # overriding it with "all" would bypass GPU isolation, so it is left unset.
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        command: ["python", "-m", "torch.distributed.launch"]
        args: ["--nproc_per_node=1", "train.py", "--epochs=100"]
        volumeMounts:
        - name: dataset-volume
          mountPath: /data
        - name: model-output
          mountPath: /models
      volumes:
      - name: dataset-volume
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-output-pvc
      nodeSelector:
        nvidia.com/gpu.product: "Tesla-V100-SXM2-32GB"
        
---
# Multi-GPU Distributed Training
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completions: 4
  parallelism: 4           # 4 worker pods
  completionMode: Indexed  # gives each pod a stable index used as its node rank
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: distributed-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 2  # 2 GPUs per pod
            memory: "64Gi"
          requests:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: MASTER_ADDR
          value: "distributed-training-master"
        - name: MASTER_PORT
          value: "29500"
        - name: WORLD_SIZE
          value: "8"  # Total GPUs across all pods
        - name: NCCL_DEBUG
          value: "WARN"
        # Node rank comes from the Indexed Job's completion index (0-3)
        - name: NODE_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command: ["python", "-m", "torch.distributed.launch"]
        args: [
          "--nproc_per_node=2",
          "--nnodes=4",
          "--node_rank=$(NODE_RANK)",
          "distributed_train.py"
        ]
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: job-name
                  operator: In
                  values: ["distributed-training"]
              topologyKey: kubernetes.io/hostname
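
To confirm that each worker received its two GPUs, nvidia-smi can be run inside one of the Job's pods; the pod name below is a placeholder.

# List the GPUs visible inside a worker pod (two devices expected)
kubectl exec -it <distributed-training-pod> -- nvidia-smi -L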

Advanced GPU Scheduling Patterns

GPU Topology-Aware Scheduling

# NodeAffinity for GPU topology optimization
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-gpu-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-gpu-inference
  template:
    metadata:
      labels:
        app: multi-gpu-inference
    spec:
      containers:
      - name: inference-server
        image: nvcr.io/nvidia/tritonserver:22.12-py3
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "128Gi"
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Require nodes with NVLink-connected GPUs
              - key: nvidia.com/gpu.product
                operator: In
                values: ["Tesla-V100-SXM2-32GB", "A100-SXM4-80GB"]
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["4"]
        podAntiAffinity:
          # Spread across different nodes for fault tolerance
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["multi-gpu-inference"]
              topologyKey: kubernetes.io/hostname

---
# Custom GPU Scheduler with priorities
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: gpu-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
            weight: 80  # Prioritize resource fit
          - name: NodeAffinity
            weight: 60  # Consider GPU topology
        filter:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated  # Prefer nodes with more available GPUs
            resources:
            - name: nvidia.com/gpu
              weight: 100
            - name: memory
              weight: 50
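
The ConfigMap above only stores the scheduler configuration; it assumes a second kube-scheduler instance is deployed that mounts this profile. Workloads then opt in by name via spec.schedulerName, as in this minimal sketch:

# Pod scheduled by the custom "gpu-scheduler" profile defined above
apiVersion: v1
kind: Pod
metadata:
  name: gpu-batch-task
spec:
  schedulerName: gpu-scheduler   # Must match the profile's schedulerName
  restartPolicy: Never
  containers:
  - name: task
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1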

GPU Resource Quotas & Limits

# Namespace ResourceQuota for GPU resources
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "20"    # Max 20 GPUs requested (extended resources only support the requests. prefix)
    persistentvolumeclaims: "10"     # Max 10 PVCs for datasets
    requests.memory: "400Gi"         # Max memory for GPU workloads
    requests.cpu: "200"              # Max CPU cores

---
# LimitRange for individual pods
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-team
spec:
  limits:
  - default:
      nvidia.com/gpu: "1"
      memory: "16Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"      # Max 8 GPUs per pod
      memory: "256Gi"
      cpu: "64"
    min:
      nvidia.com/gpu: "1"      # Min 1 GPU per pod
      memory: "4Gi"
      cpu: "1"
    type: Container

---
# NetworkPolicy for GPU workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-policy
  namespace: ml-team
spec:
  podSelector:
    matchLabels:
      gpu-enabled: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ml-team
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080  # Model serving port
    - protocol: TCP
      port: 6006  # TensorBoard port
  egress:
  - to: []  # Allow all egress for data fetching
    ports:
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 6379  # Redis
    - protocol: TCP
      port: 5432  # PostgreSQL
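
To see how much of the quota and limits the namespace is actually consuming, the standard describe commands are enough:

# Current GPU, CPU, and memory consumption against the quota
kubectl describe resourcequota gpu-quota -n ml-team

# Per-container defaults and bounds enforced by the LimitRange
kubectl describe limitrange gpu-limits -n ml-team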

Real-World GPU Kubernetes Implementations

OpenAI GPU Infrastructure

  • 10,000+ NVIDIA A100s across multiple Kubernetes clusters
  • Custom scheduler for multi-month training jobs
  • Topology-aware placement for optimal NVLink utilization
  • Automated fault tolerance with checkpointing
  • Powers GPT-4 and DALL-E training workloads

Google GKE Autopilot GPU

  • Automatic GPU node provisioning and scaling
  • Multi-Instance GPU (MIG) support for A100s
  • Integrated with Vertex AI for ML pipelines
  • Time-sharing for improved GPU utilization
  • Serves Google's internal ML training at scale

Uber ML Platform

  • Kubernetes-based Horovod for distributed training
  • Dynamic GPU allocation based on workload priority
  • Multi-tenant GPU sharing across 100+ teams
  • Custom metrics for GPU cost optimization
  • 85% average GPU utilization across fleet

Netflix GPU Inference

  • Kubernetes-native GPU inference serving
  • Auto-scaling based on inference request load
  • Mixed GPU/CPU workloads for cost optimization
  • Supports 500M+ daily recommendations
  • A/B testing infrastructure on GPU clusters

GPU Kubernetes Best Practices

✅ Do

  • Use NVIDIA GPU Operator for automated GPU lifecycle management
  • Implement GPU time-slicing for improved utilization
  • Monitor GPU metrics with DCGM and Prometheus
  • Use topology-aware scheduling for multi-GPU workloads
  • Implement proper resource quotas and limits
  • Plan for GPU node failures with checkpointing

❌ Don't

  • Schedule GPU workloads without resource requests/limits
  • Ignore GPU topology for multi-GPU training jobs
  • Run GPU workloads without proper monitoring
  • Mix incompatible CUDA versions in the same cluster
  • Forget to implement GPU node taints and tolerations (see the sketch below)
  • Over-provision GPU resources without utilization analysis
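
One way to implement the taints-and-tolerations point above is sketched below. The taint key and value are a common convention rather than anything mandated by Kubernetes or the GPU Operator; adjust them to your cluster's standards.

# Taint GPU nodes so that only pods tolerating the taint can be scheduled there
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule

# Corresponding toleration to add to the pod spec of GPU workloads
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule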