k3s: update nvidia docs

+221 -23
pkgs/applications/networking/cluster/k3s/docs/examples/NVIDIA.md
# Nvidia GPU Support

> Note: this article assumes `services.k3s.enable = true;` is already set.

## Enable the Nvidia driver

```
hardware.nvidia = {
  open = true;
  package = config.boot.kernelPackages.nvidiaPackages.stable; # change to match your kernel
  nvidiaSettings = true;
};

# Hack for getting the nvidia driver recognized
services.xserver = {
  enable = false;
  videoDrivers = [ "nvidia" ];
};

nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
  "nvidia-x11"
  "nvidia-settings"
];
```

Also, enable the Nvidia container toolkit:

```
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;

environment.systemPackages = with pkgs; [
  nvidia-container-toolkit
];
```

Rebuild your NixOS configuration.

### Verify that the GPU is accessible

Use the following command to ensure the GPU is accessible:

```
nvidia-smi
```

If there is an error in the output, a reboot may be required for the driver to be assigned to the GPU.
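For reference, the rebuild mentioned above is the standard NixOS switch (shown with `switch`; `nixos-rebuild boot` is the alternative if you prefer the change to apply only on the next reboot):

```shell
# Rebuild and activate the new system generation (run as root).
# Even after a successful switch, a reboot may still be needed
# before the nvidia kernel module binds to the GPU.
sudo nixos-rebuild switch
```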
Additionally, `lspci -k` can be used to ensure the driver has been assigned to the GPU:

```
# lspci -k | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
```

## Configure k3s

You now need to create a new file in `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` with the following content:

```
···
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"
```

Now apply the following runtime class to the k3s cluster:

```yaml
···
  name: nvidia
```
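One way to apply the RuntimeClass manifest above is via a file (the filename `nvidia-runtime-class.yaml` is only an example):

```shell
# Apply the RuntimeClass manifest saved from the snippet above:
kubectl apply -f nvidia-runtime-class.yaml

# Confirm the runtime class is registered:
kubectl get runtimeclass nvidia
```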
Restart k3s:

```
systemctl restart k3s.service
```

Ensure that the Nvidia runtime is detected by k3s:

```
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Apply the DaemonSet from the [generic-cdi-plugin README](https://github.com/OlfillasOdikno/generic-cdi-plugin):

```
apiVersion: v1
kind: Namespace
metadata:
  name: generic-cdi-plugin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-cdi-plugin-daemonset
  namespace: generic-cdi-plugin
spec:
  selector:
    matchLabels:
      name: generic-cdi-plugin
  template:
    metadata:
      labels:
        name: generic-cdi-plugin
        app.kubernetes.io/component: generic-cdi-plugin
        app.kubernetes.io/name: generic-cdi-plugin
    spec:
      containers:
        - image: ghcr.io/olfillasodikno/generic-cdi-plugin:main
          name: generic-cdi-plugin
          command:
            - /generic-cdi-plugin
            - /var/run/cdi/nvidia-container-toolkit.json
          imagePullPolicy: Always
          securityContext:
            privileged: true
          tty: true
          volumeMounts:
            - name: kubelet
              mountPath: /var/lib/kubelet
            - name: nvidia-container-toolkit
              mountPath: /var/run/cdi/nvidia-container-toolkit.json
      volumes:
        - name: kubelet
          hostPath:
            path: /var/lib/kubelet
        - name: nvidia-container-toolkit
          hostPath:
            path: /var/run/cdi/nvidia-container-toolkit.json
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "nixos-nvidia-cdi"
                    operator: In
                    values:
                      - "enabled"
```

Apply the following node label (replace `#CHANGEME` with your node name):

```
kind: Node
apiVersion: v1
metadata:
  name: #CHANGEME
  labels:
    nixos-nvidia-cdi: enabled
```

Now GPU-enabled pods can be run with a spec like this:

```
spec:
  runtimeClassName: nvidia
  containers:
    - resources:
        requests:
          nvidia.com/gpu-all: "1"
        limits:
          nvidia.com/gpu-all: "1"
```

### Test pod

This is a complete pod configuration for reference/testing:

```
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: default
spec:
  runtimeClassName: nvidia # <- THIS FOR GPU
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.6.3-base-ubuntu22.04
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources: # <- THIS FOR GPU
        requests:
          nvidia.com/gpu-all: "1"
        limits:
          nvidia.com/gpu-all: "1"
```

Once the pod is running, use the following command to test that the GPU was detected:

```
kubectl exec -n default -it pod/gpu-test -- nvidia-smi
```

If successful, the output will look like the following:

```
Thu Sep 25 04:17:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:01:00.0  On |                  N/A |
|  0%   36C    P8             10W / 190W  |     104MiB /  6144MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
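When testing is finished, the `gpu-test` pod can be cleaned up:

```shell
# Remove the test pod created above.
kubectl delete pod -n default gpu-test
```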