.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example would be sending traffic to a number of queues that is not
       a multiple of 8. Because the maximum radix is limited to 8 in 9-layer
       topology, the 9th queue has a different parent than the rest, and it's
       given more bandwidth credits. This causes a problem when the system is
       sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from the 9th queue

       To address this need, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change,
       performance is equal across queues because all queues can be assigned
       to the same parent in the tree. The obvious drawback of this solution
       is a lower configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use 5-layer topology,
       use a value of 5. For example:

       | $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       | $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers

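
The radix effect described for ``tx_scheduling_layers`` comes down to simple
arithmetic. The following is a minimal sketch, not driver code:
``parents_needed`` is a hypothetical helper, and it assumes a simplified model
in which every parent node accepts at most ``radix`` children.

```shell
#!/bin/sh
# Hypothetical helper (not part of devlink or the ice driver): number of
# parent nodes needed to hold a given number of queues when each parent
# accepts at most "radix" children. This is a simplified model.
parents_needed() {
    queues=$1
    radix=$2
    # Integer ceiling division: ceil(queues / radix)
    echo $(( (queues + radix - 1) / radix ))
}

# 9 queues with radix 8 (9-layer topology): the 9th queue needs a second parent
parents_needed 9 8
# 9 queues with radix 512 (5-layer topology): all queues fit under one parent
parents_needed 9 512
```

Under this model the first call prints 2 and the second prints 1, which matches
the observation that the 9th queue ends up under a different parent only in the
9-layer topology.
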
.. list-table:: Driver specific parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Description
   * - ``local_forwarding``
     - runtime
     - Controls loopback behavior by tuning scheduler bandwidth.
       It impacts all kinds of functions: physical, virtual and
       subfunctions.
       Supported values are:

       ``enabled`` - loopback traffic is allowed on this port

       ``disabled`` - loopback traffic is not allowed on this port

       ``prioritized`` - loopback traffic is prioritized on this port

       The default value of the ``local_forwarding`` parameter is ``enabled``.
       ``prioritized`` makes it possible to adjust the loopback traffic rate
       to increase the capacity of one port at the cost of another. The user
       needs to disable local forwarding on one of the ports in order to have
       increased capacity on the ``prioritized`` port.

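As an illustration of the ``local_forwarding`` values, the sketch below prints
the ``devlink dev param set`` commands for a prioritized/disabled pair of
ports. ``local_forwarding_cmd`` is a hypothetical dry-run helper and the PCI
addresses are placeholders; the helper only prints the command and does not
run devlink.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the devlink command that would set
# local_forwarding on a device, after checking the value against the three
# values listed in the table above. It does not run devlink itself.
local_forwarding_cmd() {
    dev=$1
    value=$2
    case "$value" in
        enabled|disabled|prioritized) ;;
        *)
            echo "unsupported local_forwarding value: $value" >&2
            return 1
            ;;
    esac
    echo "devlink dev param set $dev name local_forwarding value $value cmode runtime"
}

# Prioritize loopback on one port and disable it on another.
# Both PCI addresses are placeholders.
local_forwarding_cmd pci/0000:16:00.0 prioritized
local_forwarding_cmd pci/0000:16:00.1 disabled
```

The printed lines can be reviewed before they are run by hand on a system that
actually has the device.
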
Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
   :widths: 5 5 5 90

   * - Name
     - Type
     - Example
     - Description
   * - ``board.id``
     - fixed
     - K65390-000
     - The Product Board Assembly (PBA) identifier of the board.
   * - ``cgu.id``
     - fixed
     - 36
     - The Clock Generation Unit (CGU) hardware revision identifier.
   * - ``fw.mgmt``
     - running
     - 2.1.7
     - 3-digit version number of the management firmware running on the
       Embedded Management Processor of the device. It controls the PHY,
       link, access to device resources, etc. Intel documentation refers to
       this as the EMP firmware.
   * - ``fw.mgmt.api``
     - running
     - 1.5.1
     - 3-digit version number (major.minor.patch) of the API exported over
       the AdminQ by the management firmware. Used by the driver to
       identify what commands are supported. Historical versions of the
       kernel only displayed a 2-digit version number (major.minor).
   * - ``fw.mgmt.build``
     - running
     - 0x305d955f
     - Unique identifier of the source for the management firmware.
   * - ``fw.undi``
     - running
     - 1.2581.0
     - Version of the Option ROM containing the UEFI driver. The version is
       reported in ``major.minor.patch`` format. The major version is
       incremented whenever a major breaking change occurs, or when the
       minor version would overflow. The minor version is incremented for
       non-breaking changes and reset to 1 when the major version is
       incremented. The patch version is normally 0 but is incremented when
       a fix is delivered as a patch against an older base Option ROM.
   * - ``fw.psid.api``
     - running
     - 0.80
     - Version defining the format of the flash contents.
   * - ``fw.bundle_id``
     - running
     - 0x80002ec0
     - Unique identifier of the firmware image file that was loaded onto
       the device. Also referred to as the EETRACK identifier of the NVM.
   * - ``fw.app.name``
     - running
     - ICE OS Default Package
     - The name of the DDP package that is active in the device. The DDP
       package is loaded by the driver during initialization. Each
       variation of the DDP package has a unique name.
   * - ``fw.app``
     - running
     - 1.3.1.0
     - The version of the DDP package that is active in the device. Note
       that both the name (as reported by ``fw.app.name``) and version are
       required to uniquely identify the package.
   * - ``fw.app.bundle_id``
     - running
     - 0xc0000001
     - Unique identifier for the DDP package loaded in the device. Also
       referred to as the DDP Track ID. Can be used to uniquely identify
       the specific DDP package.
   * - ``fw.netlist``
     - running
     - 1.1.2000-6.7.0
     - The version of the netlist module. This module defines the device's
       Ethernet capabilities and default settings, and is used by the
       management firmware as part of managing link and device
       connectivity.
   * - ``fw.netlist.build``
     - running
     - 0xee16ced7
     - The first 4 bytes of the hash of the netlist module contents.
   * - ``fw.cgu``
     - running
     - 8032.16973825.6021
     - The version of Clock Generation Unit (CGU). Format:
       <CGU type>.<configuration version>.<firmware version>.

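The ``fw.cgu`` format from the table can be taken apart with plain shell. This
is a small sketch: ``split_cgu_version`` is a hypothetical helper, and the
input is the example value from the table rather than output captured from a
device.

```shell
#!/bin/sh
# Hypothetical helper: split an fw.cgu version string into the three fields
# named in the table above:
# <CGU type>.<configuration version>.<firmware version>
split_cgu_version() {
    # Split on dots; IFS is changed only for this read
    IFS=. read -r cgu_type cfg_ver fw_ver <<EOF
$1
EOF
    echo "type=$cgu_type config=$cfg_ver firmware=$fw_ver"
}

# Example value taken from the table above
split_cgu_version 8032.16973825.6021
```

On a real system the version string would come from the ``fw.cgu`` line
reported by ``devlink dev info``.
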
Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.

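The accepted overwrite combinations can be summarized in a short sketch.
``flash_cmd`` is a hypothetical dry-run helper and ``image.bin`` is a
placeholder file name; the helper mirrors the rule above by refusing
``identifiers`` without ``settings``, and it only prints the
``devlink dev flash`` command instead of running it.

```shell
#!/bin/sh
# Hypothetical dry-run helper: builds a "devlink dev flash" command line for
# the overwrite combinations listed above and rejects "identifiers" on its own.
# Arguments: device, image file, then zero or more of: settings identifiers
flash_cmd() {
    dev=$1
    file=$2
    shift 2
    settings=0
    identifiers=0
    for section in "$@"; do
        case "$section" in
            settings) settings=1 ;;
            identifiers) identifiers=1 ;;
            *) echo "unknown overwrite section: $section" >&2; return 1 ;;
        esac
    done
    # Overwriting only identifiers is not supported by the hardware
    if [ "$identifiers" -eq 1 ] && [ "$settings" -eq 0 ]; then
        echo "identifiers cannot be overwritten without settings" >&2
        return 1
    fi
    cmd="devlink dev flash $dev file $file"
    [ "$settings" -eq 1 ] && cmd="$cmd overwrite settings"
    [ "$identifiers" -eq 1 ] && cmd="$cmd overwrite identifiers"
    echo "$cmd"
}

# No overwrite mask: settings and identifiers are preserved
flash_cmd pci/0000:01:00.0 image.bin
# Overwrite everything with the contents of the image
flash_cmd pci/0000:01:00.0 image.bin settings identifiers
```

The driver performs this check itself; the helper merely makes the rule visible
before a command is issued.
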
Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device-specific Embedded
Management Processor reset, which requests the device to reset and reload the
EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
   :widths: 15 85

   * - Name
     - Description
   * - ``nvm-flash``
     - The contents of the entire flash chip, sometimes referred to as
       the device's Non Volatile Memory.
   * - ``shadow-ram``
     - The contents of the Shadow RAM, which is loaded from the beginning
       of the flash. Although the contents are primarily from the flash,
       this area also contains data generated during device boot which is
       not stored in flash.
   * - ``device-caps``
     - The contents of the device firmware's capabilities buffer. Useful to
       determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API, which allows Hierarchical
QoS to be offloaded to the hardware. It enables the user to group Virtual
Functions in a tree structure and assign the supported parameters
``tx_share``, ``tx_max``, ``tx_priority`` and ``tx_weight`` to each node in
the tree. Effectively, the user gains the ability to control how much
bandwidth is allocated to each VF group. This is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of a new traffic class. The driver will prevent DCB
or ADQ configuration if the user has started making any changes to the nodes
using the devlink-rate API. To configure those features, a driver reload is
necessary. Correspondingly, if ADQ or DCB is configured, the driver won't
export the hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported but before any
changes are made.

This feature is also dependent on switchdev being enabled in the system.
It's required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
   :widths: 15 85

   * - Name
     - Description
   * - ``tx_max``
     - maximum bandwidth to be consumed by the tree Node. Rate Limit is
       an absolute number specifying the maximum number of bytes a Node may
       consume during the course of one second. Rate limit guarantees
       that a link will not oversaturate the receiver on the remote end
       and also enforces an SLA between the subscriber and network
       provider.
   * - ``tx_share``
     - minimum bandwidth allocated to a tree node when it is not blocked.
       It specifies an absolute BW. While ``tx_max`` defines the maximum
       bandwidth the node may consume, ``tx_share`` marks the committed BW
       for the Node.
   * - ``tx_priority``
     - allows for usage of strict priority arbiter among siblings. This
       arbitration scheme attempts to schedule nodes based on their
       priority as long as the nodes remain within their bandwidth limit.
       Range 0-7. Nodes with priority 7 have the highest priority and are
       selected first, while nodes with priority 0 have the lowest
       priority. Nodes that have the same priority are treated equally.
   * - ``tx_weight``
     - allows for usage of Weighted Fair Queuing arbitration scheme among
       siblings. This arbitration scheme can be used simultaneously with
       the strict priority. Range 1-200. Only relative values matter for
       arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

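To make "only relative values matter" concrete, the sketch below computes the
ideal share of each sibling within one WFQ subgroup. It is a simplified model
rather than a description of the hardware arbiter: ``wfq_shares`` is a
hypothetical helper, and it assumes that all siblings have the same
``tx_priority``, always have traffic to send, and are not capped by ``tx_max``.

```shell
#!/bin/sh
# Simplified model, not driver code: given the tx_weight values of siblings
# that share the same tx_priority, print each sibling's ideal share of the
# bandwidth available to that group, as a whole percentage (rounded down).
wfq_shares() {
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    for w in "$@"; do
        echo "weight $w -> $((100 * w / total))%"
    done
}

# Two siblings with weights 5 and 15: only the 1:3 ratio matters
wfq_shares 5 15
```

Weights of 10 and 30 give the same 25%/75% split, since the ratio between the
weights is unchanged.
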
.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps

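
The ranges given in the attribute table can be checked before a command is
issued. The following is a sketch only: ``rate_set_cmd`` is a hypothetical
dry-run helper that validates ``tx_priority`` (0-7) and ``tx_weight`` (1-200)
and prints the matching ``devlink port function rate set`` command without
running it.

```shell
#!/bin/sh
# Hypothetical dry-run helper: validates tx_priority (0-7) and tx_weight
# (1-200) against the ranges in the attribute table and prints the command.
rate_set_cmd() {
    port=$1
    attr=$2
    value=$3
    case "$attr" in
        tx_priority) min=0; max=7 ;;
        tx_weight) min=1; max=200 ;;
        *) echo "unsupported attribute: $attr" >&2; return 1 ;;
    esac
    # Accept only non-negative decimal integers
    case "$value" in
        ''|*[!0-9]*) echo "$attr must be a number" >&2; return 1 ;;
    esac
    if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
        echo "$attr must be in range $min-$max" >&2
        return 1
    fi
    echo "devlink port function rate set $port $attr $value"
}

# Give the second VF the highest strict priority among its siblings
rate_set_cmd pci/0000:4b:00.0/2 tx_priority 7
```

A weight of 0 or a priority of 8 is rejected by the helper, matching the ranges
documented above.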