
GPU Scheduling in Kubernetes

Master GPU scheduling in Kubernetes: NVIDIA GPU Operator, multi-GPU workloads, resource optimization, and ML infrastructure.

45 min read · Advanced

What is GPU Scheduling in Kubernetes?

GPU scheduling in Kubernetes manages the allocation and orchestration of GPU resources across cluster nodes for machine learning, AI, and compute-intensive workloads. It involves resource discovery, allocation policies, multi-tenancy, and optimizing GPU utilization across distributed workloads.

Key Challenge: GPUs are expensive and, by default, allocated to a single container at a time. Maximizing utilization while preserving workload isolation and performance guarantees therefore requires deliberate scheduling, including sharing mechanisms such as time-slicing and MIG covered below.

Example Cluster Metrics (4 nodes × 8 GPUs per node)

  • Total GPUs: 32
  • Effective GPUs: 29
  • Wasted GPUs: 3
  • GPU Utilization: 85%
  • Scheduling Efficiency: 92%
  • Workload Duration: Hours-Days

NVIDIA GPU Operator & Device Plugin

GPU Operator Components

  • Device Plugin: GPU resource discovery and allocation
  • Driver Container: NVIDIA driver installation and management
  • Container Runtime: nvidia-container-runtime integration
  • Node Feature Discovery: Hardware topology detection
  • GPU Feature Discovery: GPU capabilities enumeration
  • DCGM Exporter: GPU monitoring and metrics collection

Scheduling Features

  • Resource Requests: GPU count and memory allocation
  • Multi-GPU Support: Single and multi-GPU workloads
  • GPU Sharing: Time-slicing and MIG partitioning (see the MIG sketch after this list)
  • Topology Awareness: PCIe, NVLink, and NUMA optimization
  • Health Monitoring: GPU failure detection and recovery
  • Mixed Workloads: Training and inference co-location
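
GPU sharing is listed above, but only time-slicing is configured later on this page. The manifest below is a minimal sketch of the MIG variant: a pod requesting a single MIG slice rather than a whole GPU. It assumes the MIG manager has partitioned an A100 with the 1g.5gb profile and that the device plugin exposes MIG devices as named resources; the exact resource name (here nvidia.com/mig-1g.5gb) depends on the GPU model and the profile applied.

# Hypothetical pod consuming one MIG slice instead of a full GPU
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]   # Lists the single MIG device visible to the pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1    # MIG slice resource name (profile-dependent)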

GPU Operator Installation & Configuration

Install NVIDIA GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install GPU Operator with custom configuration
helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-system --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set toolkit.enabled=true \
  --set driver.enabled=true \
  --set dcgmExporter.enabled=true \
  --set nodeFeatureDiscovery.enabled=true \
  --set gfd.enabled=true \
  --set migManager.enabled=true \
  --set devicePlugin.config.name="time-slicing-config" \
  --set devicePlugin.config.default="any"

# Verify installation
kubectl get pods -n nvidia-system
kubectl get nodes -l nvidia.com/gpu.present=true
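
Once the operator pods are running, it is worth confirming what the device plugin and GPU Feature Discovery actually advertise on each node. The commands below are a small verification sketch; replace <gpu-node> with one of your GPU node names.

# Confirm the node advertises nvidia.com/gpu as an allocatable resource
kubectl describe node <gpu-node> | grep -A 8 "Allocatable"

# Inspect the nvidia.com/* labels published by GPU Feature Discovery
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep nvidia.com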

GPU Time-Slicing Configuration

# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Number of virtual GPUs per physical GPU
          
---
# ClusterPolicy to enable time-slicing
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: "time-slicing-config"
      default: "any"
  dcgmExporter:
    enabled: true
    config:
      name: "console-plugin-nvidia-gpu"
  gfd:
    enabled: true
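
As a sanity check (a sketch assuming an 8-GPU node and the 4-replica configuration above), the allocatable GPU count on the node should multiply once the device plugin reloads the time-slicing config:

# Apply the config and confirm the inflated GPU count
kubectl apply -f gpu-time-slicing-config.yaml
# With replicas: 4, an 8-GPU node should now report 32 allocatable nvidia.com/gpu
kubectl describe node <gpu-node> | grep nvidia.com/gpu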

GPU Workload Examples

# Single GPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        # The device plugin injects NVIDIA_VISIBLE_DEVICES for the allocated GPU;
        # overriding it with "all" would bypass GPU isolation, so it is left unset.
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        command: ["python", "-m", "torch.distributed.launch"]
        args: ["--nproc_per_node=1", "train.py", "--epochs=100"]
        volumeMounts:
        - name: dataset-volume
          mountPath: /data
        - name: model-output
          mountPath: /models
      volumes:
      - name: dataset-volume
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-output-pvc
      nodeSelector:
        nvidia.com/gpu.product: "Tesla-V100-SXM2-32GB"
        
---
# Multi-GPU Distributed Training
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completions: 4
  parallelism: 4           # 4 worker pods
  completionMode: Indexed  # gives each pod a stable index used as its node rank
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: distributed-trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
        resources:
          limits:
            nvidia.com/gpu: 2  # 2 GPUs per pod
            memory: "64Gi"
          requests:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: MASTER_ADDR
          value: "distributed-training-master"
        - name: MASTER_PORT
          value: "29500"
        - name: WORLD_SIZE
          value: "8"  # Total GPUs across all pods
        - name: NCCL_DEBUG
          value: "WARN"
        # Node rank comes from the Indexed Job's completion index (0-3)
        - name: NODE_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command: ["python", "-m", "torch.distributed.launch"]
        args: [
          "--nproc_per_node=2",
          "--nnodes=4",
          "--node_rank=$(NODE_RANK)",
          "distributed_train.py"
        ]
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: job-name
                  operator: In
                  values: ["distributed-training"]
              topologyKey: kubernetes.io/hostname
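
To confirm that each worker received its two GPUs, nvidia-smi can be run inside one of the Job's pods; the pod name below is a placeholder.

# List the GPUs visible inside a worker pod (two devices expected)
kubectl exec -it <distributed-training-pod> -- nvidia-smi -L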

Advanced GPU Scheduling Patterns

GPU Topology-Aware Scheduling

# NodeAffinity for GPU topology optimization
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-gpu-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: multi-gpu-inference
  template:
    metadata:
      labels:
        app: multi-gpu-inference
    spec:
      containers:
      - name: inference-server
        image: nvcr.io/nvidia/tritonserver:22.12-py3
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "128Gi"
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Require nodes with NVLink-connected GPUs
              - key: nvidia.com/gpu.product
                operator: In
                values: ["Tesla-V100-SXM2-32GB", "A100-SXM4-80GB"]
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["4"]
        podAntiAffinity:
          # Spread across different nodes for fault tolerance
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["multi-gpu-inference"]
              topologyKey: kubernetes.io/hostname

---
# Custom GPU Scheduler with priorities
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: gpu-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
            weight: 80  # Prioritize resource fit
          - name: NodeAffinity
            weight: 60  # Consider GPU topology
        filter:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated  # Prefer nodes with more available GPUs
            resources:
            - name: nvidia.com/gpu
              weight: 100
            - name: memory
              weight: 50
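
The ConfigMap above only stores the scheduler configuration; it assumes a second kube-scheduler instance is deployed that mounts this profile. Workloads then opt in by name via spec.schedulerName, as in this minimal sketch:

# Pod scheduled by the custom "gpu-scheduler" profile defined above
apiVersion: v1
kind: Pod
metadata:
  name: gpu-batch-task
spec:
  schedulerName: gpu-scheduler   # Must match the profile's schedulerName
  restartPolicy: Never
  containers:
  - name: task
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1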

GPU Resource Quotas & Limits

# Namespace ResourceQuota for GPU resources
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "20"    # Max 20 GPUs requested (extended resources only support the requests. prefix)
    persistentvolumeclaims: "10"     # Max 10 PVCs for datasets
    requests.memory: "400Gi"         # Max memory for GPU workloads
    requests.cpu: "200"              # Max CPU cores

---
# LimitRange for individual pods
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-team
spec:
  limits:
  - default:
      nvidia.com/gpu: "1"
      memory: "16Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"      # Max 8 GPUs per pod
      memory: "256Gi"
      cpu: "64"
    min:
      nvidia.com/gpu: "1"      # Min 1 GPU per pod
      memory: "4Gi"
      cpu: "1"
    type: Container

---
# NetworkPolicy for GPU workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-policy
  namespace: ml-team
spec:
  podSelector:
    matchLabels:
      gpu-enabled: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ml-team
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080  # Model serving port
    - protocol: TCP
      port: 6006  # TensorBoard port
  egress:
  - to: []  # Allow all egress for data fetching
    ports:
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 6379  # Redis
    - protocol: TCP
      port: 5432  # PostgreSQL
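
To see how much of the quota and limits the namespace is actually consuming, the standard describe commands are enough:

# Current GPU, CPU, and memory consumption against the quota
kubectl describe resourcequota gpu-quota -n ml-team

# Per-container defaults and bounds enforced by the LimitRange
kubectl describe limitrange gpu-limits -n ml-team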

Real-World GPU Kubernetes Implementations

OpenAI GPU Infrastructure

  • 10,000+ NVIDIA A100s across multiple Kubernetes clusters
  • Custom scheduler for multi-month training jobs
  • Topology-aware placement for optimal NVLink utilization
  • Automated fault tolerance with checkpointing
  • Powers GPT-4 and DALL-E training workloads

Google GKE Autopilot GPU

  • Automatic GPU node provisioning and scaling
  • Multi-Instance GPU (MIG) support for A100s
  • Integrated with Vertex AI for ML pipelines
  • Time-sharing for improved GPU utilization
  • Serves Google's internal ML training at scale

Uber ML Platform

  • Kubernetes-based Horovod for distributed training
  • Dynamic GPU allocation based on workload priority
  • Multi-tenant GPU sharing across 100+ teams
  • Custom metrics for GPU cost optimization
  • 85% average GPU utilization across fleet

Netflix GPU Inference

  • Kubernetes-native GPU inference serving
  • Auto-scaling based on inference request load
  • Mixed GPU/CPU workloads for cost optimization
  • Supports 500M+ daily recommendations
  • A/B testing infrastructure on GPU clusters

GPU Kubernetes Best Practices

✅ Do

  • Use NVIDIA GPU Operator for automated GPU lifecycle management
  • Implement GPU time-slicing for improved utilization
  • Monitor GPU metrics with DCGM and Prometheus
  • Use topology-aware scheduling for multi-GPU workloads
  • Implement proper resource quotas and limits
  • Plan for GPU node failures with checkpointing

❌ Don't

  • Schedule GPU workloads without resource requests/limits
  • Ignore GPU topology for multi-GPU training jobs
  • Run GPU workloads without proper monitoring
  • Mix incompatible CUDA versions in the same cluster
  • Forget to implement GPU node taints and tolerations (see the sketch below)
  • Over-provision GPU resources without utilization analysis
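
One way to implement the taints-and-tolerations point above is sketched below. The taint key and value are a common convention rather than anything mandated by Kubernetes or the GPU Operator; adjust them to your cluster's standards.

# Taint GPU nodes so that only pods tolerating the taint can be scheduled there
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule

# Corresponding toleration to add to the pod spec of GPU workloads
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule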