# Nvidia GPU Support
> Note: this article assumes `services.k3s.enable = true;` is already set.

## Enable the Nvidia driver
```nix
hardware.nvidia = {
  open = true; # requires a Turing or newer GPU; set to false for older cards
  package = config.boot.kernelPackages.nvidiaPackages.stable; # change to match your kernel
  nvidiaSettings = true;
};

# Hack for getting the nvidia driver recognized
services.xserver = {
  enable = false;
  videoDrivers = [ "nvidia" ];
};

nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
  "nvidia-x11"
  "nvidia-settings"
];
```

Also, enable the Nvidia container toolkit:

```nix
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;

environment.systemPackages = with pkgs; [
  nvidia-container-toolkit
];
```

Rebuild your NixOS configuration.
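
For reference, a typical rebuild invocation looks like the following (adjust to your own workflow; the flake hostname `myhost` below is a placeholder):

```shell
# rebuild and activate the new system configuration
sudo nixos-rebuild switch
# or, if the configuration is a flake:
# sudo nixos-rebuild switch --flake .#myhost
```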

### Verify that the GPU is accessible

Use the following command to ensure the GPU is accessible:

```
nvidia-smi
```

If there is an error in the output, a reboot may be required for the driver to be assigned to the GPU.

Additionally, `lspci -k` can be used to ensure the driver has been assigned to the GPU:

```
# lspci -k | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
```

## Configure k3s

You now need to create a new file at `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` with the following content:

```toml
# ...
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"
```

Now apply the following runtime class to the k3s cluster:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
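
If the manifest above is saved to a file, it can be applied with `kubectl` (the filename `nvidia-runtime-class.yaml` is arbitrary):

```shell
# register the nvidia RuntimeClass with the cluster
kubectl apply -f nvidia-runtime-class.yaml
```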

Restart k3s:
```
systemctl restart k3s.service
```

Ensure that the Nvidia runtime is detected by k3s:

```
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Apply the DaemonSet from the [generic-cdi-plugin README](https://github.com/OlfillasOdikno/generic-cdi-plugin):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: generic-cdi-plugin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-cdi-plugin-daemonset
  namespace: generic-cdi-plugin
spec:
  selector:
    matchLabels:
      name: generic-cdi-plugin
  template:
    metadata:
      labels:
        name: generic-cdi-plugin
        app.kubernetes.io/component: generic-cdi-plugin
        app.kubernetes.io/name: generic-cdi-plugin
    spec:
      containers:
        - image: ghcr.io/olfillasodikno/generic-cdi-plugin:main
          name: generic-cdi-plugin
          command:
            - /generic-cdi-plugin
            - /var/run/cdi/nvidia-container-toolkit.json
          imagePullPolicy: Always
          securityContext:
            privileged: true
          tty: true
          volumeMounts:
            - name: kubelet
              mountPath: /var/lib/kubelet
            - name: nvidia-container-toolkit
              mountPath: /var/run/cdi/nvidia-container-toolkit.json
      volumes:
        - name: kubelet
          hostPath:
            path: /var/lib/kubelet
        - name: nvidia-container-toolkit
          hostPath:
            path: /var/run/cdi/nvidia-container-toolkit.json
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "nixos-nvidia-cdi"
                    operator: In
                    values:
                      - "enabled"
```

Apply the following node label (replace `#CHANGEME` with your node name):

```yaml
kind: Node
apiVersion: v1
metadata:
  name: #CHANGEME
  labels:
    nixos-nvidia-cdi: enabled
```
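
The same label can also be applied imperatively with `kubectl`, instead of editing the Node manifest (replace `CHANGEME` with your node name):

```shell
# label the node so the generic-cdi-plugin DaemonSet schedules onto it
kubectl label node CHANGEME nixos-nvidia-cdi=enabled
```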

Now, GPU-enabled pods can be run with this configuration:

```yaml
spec:
  runtimeClassName: nvidia
  containers:
    - resources:
        requests:
          nvidia.com/gpu-all: "1"
        limits:
          nvidia.com/gpu-all: "1"
```

### Test pod

This is a complete pod configuration for reference/testing:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: default
spec:
  runtimeClassName: nvidia # <- THIS FOR GPU
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.6.3-base-ubuntu22.04
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources: # <- THIS FOR GPU
        requests:
          nvidia.com/gpu-all: "1"
        limits:
          nvidia.com/gpu-all: "1"
```

Once the pod is running, use the following command to test that the GPU was detected:

```
kubectl exec -n default -it pod/gpu-test -- nvidia-smi
```

If successful, the output will look like the following:

```
Thu Sep 25 04:17:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:01:00.0  On |                  N/A |
|  0%   36C    P8             10W / 190W  |     104MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```