Nvidia GPU Support#
To use Nvidia GPU in the cluster the nvidia-container-runtime and runc are needed. To get the two components it suffices to add the following to the configuration
virtualisation.docker = {
enable = true;
enableNvidia = true;
};
environment.systemPackages = with pkgs; [ docker runc ];
Note, using docker here is a workaround, it will install nvidia-container-runtime and that will cause it to be accessible via /run/current-system/sw/bin/nvidia-container-runtime, currently its not directly accessible in nixpkgs.
You now need to create a new file in /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl with the following
{{ template "base" . }}
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/run/current-system/sw/bin/nvidia-container-runtime"
Update: As of 12/03/2024 It appears that the last two lines above are added by default, and if the two lines are present (as shown above) it will refuse to start the server. You will need to remove the two lines from that point onward.
Note here we are pointing the nvidia runtime to "/run/current-system/sw/bin/nvidia-container-runtime".
Now apply the following runtime class to k3s cluster:
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
labels:
app.kubernetes.io/component: gpu-operator
name: nvidia
Following k8s-device-plugin install the helm chart with runtimeClassName: nvidia set. In order to passthrough the nvidia card into the container, your deployments spec must contain
runtimeClassName: nvidia
# for each container
env:
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
to test its working exec onto a pod and run nvidia-smi. For more configurability of nvidia related matters in k3s look in k3s-docs.