How It Helps Users
It gives Kubernetes a simple, lightweight way to schedule GPU workloads while leaving drivers, the container runtime and configuration fully under your control.
Ideal For:
Testing, small workloads and simple setups on small to medium-sized clusters.
How It Works
The NVIDIA device plugin is deployed as a DaemonSet, so it runs on every GPU-equipped node in the Kubernetes cluster. The plugin exposes the nvidia.com/gpu resource, which Kubernetes uses to schedule pods based on their GPU requirements.
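Once the plugin is running, the GPU count shows up in each node's Capacity and Allocatable sections; a quick way to confirm (replace <gpu-node> with one of your node names):
kubectl describe node <gpu-node> | grep nvidia.com/gpu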
Key Points:
- Runs as a DaemonSet on every GPU node.
- Exposes the nvidia.com/gpu resource so the scheduler can place GPU workloads.
- Expects the NVIDIA driver and container runtime to already be installed on each node.
Setup Steps:
1. Install the NVIDIA driver on the node and verify it with nvidia-smi:
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.135/NVIDIA-Linux-x86_64-550.135.run
chmod +x NVIDIA-Linux-x86_64-*.run
sudo ./NVIDIA-Linux-x86_64-*.run
nvidia-smi
2. Install the NVIDIA Container Toolkit (nvidia-container-toolkit) on the node. For more information, follow the official documentation.
Configure the NVIDIA Container Toolkit repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Install the toolkit packages:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Configure containerd to use the NVIDIA runtime and restart it:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
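The nvidia-ctk command rewrites containerd's configuration (by default /etc/containerd/config.toml) to register the nvidia runtime; a quick sanity check after the restart:
grep -A 2 'nvidia' /etc/containerd/config.toml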
3. Deploy the NVIDIA Device Plugin with Helm:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.17.1
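If the chart installed cleanly, the plugin's DaemonSet pods should be running on every GPU node:
kubectl get daemonsets,pods -n nvidia-device-plugin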
Bingo! We’ve successfully completed the setup of the NVIDIA Device Plugin.
You can now utilize GPU resources by specifying the appropriate resource limits in your pod manifest file.
For example, the following manifest runs nvidia-smi in a pod that requests a single GPU through the nvidia.com/gpu limit (the pod name and CUDA image are illustrative):
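cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once the pod completes, its logs should show the nvidia-smi output:
kubectl logs pod/gpu-test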
Crucial Consideration:
The NVIDIA GPU Operator is a Kubernetes Operator that provides a comprehensive, automated solution for managing GPU resources in Kubernetes. It installs and manages not only the NVIDIA Device Plugin but also the NVIDIA GPU drivers, the container runtime and monitoring tools (such as DCGM). The Operator is designed for production environments that need full lifecycle management for GPUs.
How It Helps Users
The Operator removes the manual toil of GPU management: drivers, the container runtime, the device plugin and monitoring are installed, updated and health-checked automatically.
Ideal For:
Large-scale, production-grade clusters with heavy GPU workloads that need automated lifecycle management.
How It Works
The GPU Operator simplifies the entire process of GPU management:
- It installs the correct NVIDIA driver on every GPU node.
- It configures the container runtime and deploys the device plugin.
- It sets up monitoring through DCGM and keeps all of these components updated and healthy.
The GPU Operator also uses Kubernetes CRDs to expose more information and manage GPU resources beyond just scheduling.
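For example, the Operator's cluster-wide configuration lives in a ClusterPolicy custom resource; you can inspect the CRDs it registers (the exact set varies by Operator version):
kubectl get crds | grep nvidia.com
kubectl get clusterpolicies.nvidia.com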
Key Points:
- Installed and managed via Helm as a Kubernetes Operator.
- Manages the full GPU stack: driver, container runtime, device plugin and DCGM monitoring.
- Provides automatic updates, health checks and self-healing.
Install the GPU Operator with Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v25.3.0
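The chart deploys several operand pods (driver installer, container toolkit, device plugin, DCGM exporter). Give them a few minutes, then confirm everything is Running:
kubectl get pods -n gpu-operator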
To verify the setup end to end, apply a small test pod that requests one GPU. The manifest below is a sketch using NVIDIA's CUDA vector-add sample image; the exact tag may differ in your environment:
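cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once the pod completes, the logs should report "Test PASSED":
kubectl logs pod/cuda-vectoradd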
Crucial Consideration:
| Feature | NVIDIA Device Plugin | NVIDIA GPU Operator |
| --- | --- | --- |
| Setup | Manual (drivers, runtime, etc.) | Automated (drivers, runtime, plugin, monitoring) |
| Lifecycle Management | No automated updates/management | Automatic updates, monitoring and self-healing |
| Target Use Case | Smaller clusters or simpler GPU workloads | Large clusters, production-grade environments |
| Components Managed | Only the device plugin | Device plugin, drivers, runtime, monitoring |
| Configuration | Simple configuration | More complex but comprehensive setup |
| Monitoring | Requires manual setup | Includes built-in monitoring (DCGM) |
| Self-healing | No | Yes, automatically monitors and fixes issues |
| Resource Overhead | Low | Higher (due to the added components) |
A more detailed, feature-by-feature comparison:

| Feature | NVIDIA Device Plugin | NVIDIA GPU Operator |
| --- | --- | --- |
| Automated Driver & Runtime Management | No; drivers and runtime must be installed and updated manually | Yes; the GPU Operator installs and manages GPU drivers and container runtimes automatically |
| Installation Complexity | Simple, manual installation of the device plugin | More complex, but automated and comprehensive (via Helm) |
| Component Management | Only exposes GPU resources to Kubernetes | Manages the entire GPU lifecycle: drivers, runtime, device plugin and monitoring |
| Monitoring Support | Needs to be set up separately | Built-in GPU metrics collection (DCGM) with monitoring capabilities |
| Self-Healing & Maintenance | None built in; manual intervention required | Yes; automatic health checks and recovery from failures |
| Scalability | Works well for small to medium-sized clusters | Designed for large-scale, production-grade environments |
| Use Case | Ideal for testing, small workloads or simple setups | Ideal for production clusters with heavy GPU workloads |
| Resource Overhead | Low; only basic setup required | Higher, due to additional components for monitoring, lifecycle management, etc. |
| User Control | Complete control over setup and configuration | Automation, but less flexibility for manual configuration |
Use the NVIDIA Device Plugin if:
- You run a small to medium-sized cluster or relatively simple GPU workloads.
- You want a lightweight setup with low resource overhead.
- You are comfortable installing and updating drivers and the container runtime yourself, and want full control over the configuration.

Use the NVIDIA GPU Operator if:
- You run large, production-grade clusters with heavy GPU workloads.
- You want automated lifecycle management of drivers, runtime, device plugin and monitoring.
- You need built-in monitoring (DCGM), health checks and self-healing without manual intervention.
I hope this blog helps you decide between the NVIDIA Device Plugin and the NVIDIA GPU Operator.