.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting, so that the
user knows when something bad has happened to a PCI device. It aims to:

  * Provide debug information along with the alert.
  * Enable self-healing.
  * Provide a way to gather all the debugging information needed when a
    problem requires vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

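As an illustration only, the relationship between a reporter and its optional
callbacks can be modelled in plain user-space C. This is a minimal sketch and
not the kernel API: ``struct reporter_ops``, ``struct reporter``,
``reporter_recover()``, ``reporter_diagnose()`` and the ``tx`` example are
hypothetical names chosen for this example.

.. code-block:: c

   #include <stddef.h>

   /* Hypothetical model of the callbacks a driver may supply per reporter. */
   struct reporter_ops {
           const char *name;             /* error/health type, e.g. "tx" */
           int (*recover)(void *priv);   /* optional recovery procedure */
           int (*diagnose)(void *priv);  /* optional diagnostics procedure */
           int (*dump)(void *priv);      /* optional object dump procedure */
   };

   /* Hypothetical reporter instance: the ops plus the driver's context. */
   struct reporter {
           const struct reporter_ops *ops;
           void *priv;
   };

   /* Run the recovery callback, or return -1 if the driver gave none. */
   static int reporter_recover(struct reporter *r)
   {
           if (!r->ops->recover)
                   return -1;
           return r->ops->recover(r->priv);
   }

   /* Run the diagnostics callback, or return -1 if the driver gave none. */
   static int reporter_diagnose(struct reporter *r)
   {
           if (!r->ops->diagnose)
                   return -1;
           return r->ops->diagnose(r->priv);
   }

   /* Example driver side: a "tx" reporter that only knows how to recover. */
   static int tx_recover(void *priv)
   {
           int *resets = priv;

           (*resets)++;                  /* stand-in for resetting a tx queue */
           return 0;
   }

   static const struct reporter_ops tx_ops = {
           .name = "tx",
           .recover = tx_recover,
   };

In this model a second part of the driver would define its own ops structure
(for example with only a dump callback) and register it as a separate reporter.
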
Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer.
  * Health status and statistics are updated for the reporter instance.
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored).
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

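The actions above can be sketched as a small user-space C model. It is a
simplified illustration under stated assumptions, not the kernel
implementation: ``struct reporter_state`` and ``handle_report()`` are
hypothetical names, time is a plain millisecond counter that never goes
backwards, and every recovery attempt is assumed to succeed. The order of the
steps follows the list above and is not taken from the kernel sources.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical model of the state kept for one reporter instance. */
   struct reporter_state {
           bool healthy;
           uint64_t error_count;
           uint64_t recover_count;
           bool dump_stored;             /* only a single dump is kept */
           bool auto_recover;            /* auto-recovery configuration */
           uint64_t grace_period_ms;
           uint64_t last_recover_ms;     /* time of the last recovery */
   };

   /* Handle one error report; returns true if recovery was attempted. */
   static bool handle_report(struct reporter_state *r, uint64_t now_ms)
   {
           /* 1. A trace event would be logged here (omitted in this model). */

           /* 2. Update health status and statistics. */
           r->healthy = false;
           r->error_count++;

           /* 3. Take and store a dump only if none is stored already. */
           if (!r->dump_stored)
                   r->dump_stored = true;

           /* 4. Attempt auto-recovery only if it is enabled ... */
           if (!r->auto_recover)
                   return false;

           /* ... and the grace period since the last recovery has passed. */
           if (r->recover_count > 0 &&
               now_ms - r->last_recover_ms < r->grace_period_ms)
                   return false;

           r->recover_count++;
           r->last_recover_ms = now_ms;
           r->healthy = true;            /* assumption: recovery succeeded */
           return true;
   }

With a grace period of 500 ms in this model, a report arriving 200 ms after a
recovery leaves the reporter in the error state, while a report arriving
600 ms after it triggers a new recovery attempt.
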
User Interface
==============

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves the current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

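The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a small
user-space C model. This is a sketch only: ``struct dump_slot``, ``dump_get()``
and ``dump_clear()`` are hypothetical names, and the dump content is a
placeholder string rather than real reporter output.

.. code-block:: c

   #include <stdbool.h>
   #include <stdio.h>
   #include <string.h>

   /* Hypothetical model of the single dump slot kept per reporter. */
   struct dump_slot {
           bool stored;
           char data[64];
           unsigned int generated;       /* number of dumps generated so far */
   };

   /* Return the stored dump; generate and store a new one if none exists. */
   static const char *dump_get(struct dump_slot *s)
   {
           if (!s->stored) {
                   s->generated++;
                   snprintf(s->data, sizeof(s->data), "dump #%u", s->generated);
                   s->stored = true;
           }
           return s->data;
   }

   /* Drop the stored dump so that the next dump_get() generates a new one. */
   static void dump_clear(struct dump_slot *s)
   {
           s->stored = false;
           memset(s->data, 0, sizeof(s->data));
   }

In this model, repeated calls to ``dump_get()`` return the same dump until
``dump_clear()`` is called.
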
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+