
Merge tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
"Enumeration changes:

- Evaluate PCI Boot Configuration _DSM to learn if firmware wants us
to preserve its resource assignments (Benjamin Herrenschmidt)

- Simplify resource distribution (Nicholas Johnson)

- Decode 32 GT/s link speed (Gustavo Pimentel)

Virtualization:

- Fix incorrect caching of VF config space size (Alex Williamson)

- Fix VF driver probing sysfs knobs (Alex Williamson)

Peer-to-peer DMA:

- Fix dma_virt_ops check (Logan Gunthorpe)

Altera host bridge driver:

- Allow building as module (Ley Foon Tan)

Armada 8K host bridge driver:

- Add PHYs support (Miquel Raynal)

DesignWare host bridge driver:

- Export APIs to support removable loadable module (Vidya Sagar)

- Enable Relaxed Ordering erratum workaround only on Tegra20 &
Tegra30 (Vidya Sagar)

Hyper-V host bridge driver:

- Fix use-after-free in eject (Dexuan Cui)

Mobiveil host bridge driver:

- Clean up and fix many issues, including non-identity-mapped
windows, 64-bit windows, multi-MSI, class code, INTx clearing (Hou
Zhiqiang)

Qualcomm host bridge driver:

- Use clk bulk API for 2.4.0 controllers (Bjorn Andersson)

- Add QCS404 support (Bjorn Andersson)

- Assert PERST for at least 100ms (Niklas Cassel)

R-Car host bridge driver:

- Add r8a774a1 DT support (Biju Das)

Tegra host bridge driver:

- Add support for Gen2, opportunistic UpdateFC and ACK (PCIe protocol
details), AER, GPIO-based PERST# (Manikanta Maddireddy)

- Fix many issues, including power-on failure cases, interrupt
masking in suspend, UPHY settings, AFI dynamic clock gating,
pending DLL transactions (Manikanta Maddireddy)

Xilinx host bridge driver:

- Fix NWL Multi-MSI programming (Bharat Kumar Gogada)

Endpoint support:

- Fix 64bit BAR support (Alan Mikhak)

- Fix pcitest build issues (Alan Mikhak, Andy Shevchenko)

Bug fixes:

- Fix NVIDIA GPU multi-function power dependencies (Abhishek Sahu)

- Fix NVIDIA GPU HDA enablement issue (Lukas Wunner)

- Ignore lockdep for sysfs "remove" (Marek Vasut)

Misc:

- Convert docs to reST (Changbin Du, Mauro Carvalho Chehab)"

* tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (107 commits)
PCI: Enable NVIDIA HDA controllers
tools: PCI: Fix installation when `make tools/pci_install`
PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB
PCI: Fix typos and whitespace errors
PCI: mobiveil: Fix INTx interrupt clearing in mobiveil_pcie_isr()
PCI: mobiveil: Fix infinite-loop in the INTx handling function
PCI: mobiveil: Move PCIe PIO enablement out of inbound window routine
PCI: mobiveil: Add upper 32-bit PCI base address setup in inbound window
PCI: mobiveil: Add upper 32-bit CPU base address setup in outbound window
PCI: mobiveil: Mask out hardcoded bits in inbound/outbound windows setup
PCI: mobiveil: Clear the control fields before updating it
PCI: mobiveil: Add configured inbound windows counter
PCI: mobiveil: Fix the valid check for inbound and outbound windows
PCI: mobiveil: Clean-up program_{ib/ob}_windows()
PCI: mobiveil: Remove an unnecessary return value check
PCI: mobiveil: Fix error return values
PCI: mobiveil: Refactor the MEM/IO outbound window initialization
PCI: mobiveil: Make some register updates more readable
PCI: mobiveil: Reformat the code for readability
dt-bindings: PCI: mobiveil: Change gpio_slave and apb_csr to optional
...

+10932 -9424
+1 -1
Documentation/ABI/testing/sysfs-class-powercap
@@ -5,7 +5,7 @@
 Description:
 		The powercap/ class sub directory belongs to the power cap
 		subsystem. Refer to
-		Documentation/power/powercap/powercap.txt for details.
+		Documentation/power/powercap/powercap.rst for details.
 
 What:		/sys/class/powercap/<control type>
 Date:		September 2013
-270
Documentation/PCI/MSI-HOWTO.txt
		The MSI Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
			10/03/2003
	Revised Feb 12, 2004 by Martine Silbermann
		email: Martine.Silbermann@hp.com
	Revised Jun 25, 2004 by Tom L Nguyen
	Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
		Copyright 2003, 2008 Intel Corporation

1. About this guide

This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how
to change your driver to use MSI or MSI-X, and some basic diagnostics to
try if a device doesn't support MSIs.


2. What are MSIs?

A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.

The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
capability was also introduced with PCI 3.0. It supports more interrupts
per device than MSI and allows interrupts to be independently configured.

Devices may support both MSI and MSI-X, but only one can be enabled at
a time.


3. Why use MSIs?

There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.

Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to reduced performance for the system as
a whole. MSIs are never shared, so this problem cannot arise.

When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges). In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt. PCI transaction ordering rules require that all the data
arrive in memory before the value may be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.

PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case. With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose. One possible design gives
infrequent conditions (such as errors) their own interrupt, which allows
the driver to handle the normal interrupt handling path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.


4. How to use MSIs

PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts.

4.1 Include kernel support for MSIs

To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures,
and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option.

4.2 Using MSI

Most of the hard work is done for the driver in the PCI layer. The driver
simply has to request that the PCI layer set up the MSI capability for this
device.

To automatically use MSI or MSI-X interrupt vectors, use the following
function:

  int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
		unsigned int max_vecs, unsigned int flags);

which allocates up to max_vecs interrupt vectors for a PCI device. It
returns the number of vectors allocated or a negative error. If the device
has a requirement for a minimum number of vectors, the driver can pass a
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
if it can't meet the minimum number of vectors.

The flags argument is used to specify which type of interrupt can be used
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.

To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
vectors, use the following function:

  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);

Any allocated resources should be freed before removing the device using
the following function:

  void pci_free_irq_vectors(struct pci_dev *dev);

If a device supports both MSI-X and MSI capabilities, this API will use the
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
a maximum of 32 interrupts (and must be a power of two). In addition, the
MSI interrupt vectors must be allocated consecutively, so the system might
not be able to allocate as many vectors for MSI as it could for MSI-X. On
some platforms, MSI interrupts must all be targeted at the same set of CPUs
whereas MSI-X interrupts can all be targeted at different CPUs.

If a device supports neither MSI-X nor MSI, it will fall back to a single
legacy IRQ vector.

The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
as possible, likely up to the limit supported by the device. If nvec is
larger than the number supported by the device it will automatically be
capped to the supported limit, so there is no need to query the number of
vectors supported beforehand:

  nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES);
  if (nvec < 0)
	goto out_err;

If a driver is unable or unwilling to deal with a variable number of MSI
interrupts, it can request a particular number of interrupts by passing that
number to the pci_alloc_irq_vectors() function as both 'min_vecs' and
'max_vecs' parameters:

  ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
  if (ret < 0)
	goto out_err;

The most notorious example of the request type described above is enabling
the single MSI mode for a device. It could be done by passing two 1s as
'min_vecs' and 'max_vecs':

  ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
  if (ret < 0)
	goto out_err;

Some devices might not support using legacy line interrupts, in which case
the driver can specify that only MSI or MSI-X is acceptable:

  nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
  if (nvec < 0)
	goto out_err;

4.3 Legacy APIs

The following old APIs to enable and disable MSI or MSI-X interrupts should
not be used in new code:

  pci_enable_msi()		/* deprecated */
  pci_disable_msi()		/* deprecated */
  pci_enable_msix_range()	/* deprecated */
  pci_enable_msix_exact()	/* deprecated */
  pci_disable_msix()		/* deprecated */

Additionally, there are APIs to provide the number of supported MSI or MSI-X
vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
number of vectors. If you have a legitimate special use case for the count
of vectors we might have to revisit that decision and add a
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.

4.4 Considerations when using MSIs

4.4.1 Spinlocks

Most device drivers have a per-device spinlock which is taken in the
interrupt handler. With pin-based interrupts or a single MSI, it is not
necessary to disable interrupts (Linux guarantees the same interrupt will
not be re-entered). If a device uses multiple interrupts, the driver
must disable interrupts while the lock is held. If the device sends
a different interrupt, the driver will deadlock trying to recursively
acquire the spinlock. Such deadlocks can be avoided by using
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
and acquire the lock (see Documentation/kernel-hacking/locking.rst).

4.5 How to tell whether MSI/MSI-X is enabled on a device

Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled).


5. MSI quirks

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

  1. globally
  2. on all devices behind a specific bridge
  3. on a single device

5.1. Disabling MSIs globally

Some host chipsets simply don't support MSIs properly. If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table. In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves. The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices. It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

5.2. Disabling MSIs below a bridge

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the HyperTransport chipsets such
as the nVidia nForce and ServerWorks HT2000). As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing:

  echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (eg
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1. Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

5.3. Disabling MSIs on a single device

Some devices are known to have faulty MSI implementations. Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk. Some drivers have an option to disable use
of MSI. While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

5.4. Finding why MSIs are disabled on a device

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device. Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine. You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device. Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
or disabled (0). If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.

It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
-198
Documentation/PCI/PCIEBUS-HOWTO.txt
		The PCI Express Port Bus Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
			11/03/2004

1. About this guide

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.

2. Copyright 2004 Intel Corporation

3. What is the PCI Express Port Bus Driver

A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port. The switch's Downstream Port is bridging from
the switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

4. Why use the PCI Express Port Bus Driver?

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

To enable multiple service drivers running simultaneously requires
having a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:

  - Allow multiple service drivers to run simultaneously on
    a PCI-PCI Bridge Port device.

  - Allow service drivers to be implemented in an independent
    staged approach.

  - Allow one service driver to run on multiple PCI-PCI Bridge
    Port devices.

  - Manage and distribute resources of a PCI-PCI Bridge Port
    device to requested service drivers.

5. Configuring the PCI Express Port Bus Driver vs. Service Drivers

5.1 Including the PCI Express Port Bus Driver Support into the Kernel

Including the PCI Express Port Bus driver depends on whether PCI
Express support is included in the kernel config. The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

5.2 Enabling Service Driver Support

PCI device drivers are implemented based on the Linux Device Driver Model.
All service drivers are PCI device drivers. As discussed above, it is
impossible to load any service driver once the kernel has loaded the
PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver
Model requires some minimal changes to existing service drivers that
impose no impact on the functionality of existing service drivers.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
sections 5.2.1 & 5.2.2). It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.

5.2.1 pcie_port_service_register

  int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init. Note that after a service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

5.2.2 pcie_port_service_unregister

  void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It's always called by a service driver when a
module exits.

5.2.3 Sample Code

Below is sample service driver code to initialize the port service
driver data structure.

  static struct pcie_port_service_id service_id[] = { {
	.vendor = PCI_ANY_ID,
	.device = PCI_ANY_ID,
	.port_type = PCIE_RC_PORT,
	.service_type = PCIE_PORT_SERVICE_AER,
	}, { /* end: all zeroes */ }
  };

  static struct pcie_port_service_driver root_aerdrv = {
	.name		= (char *)device_name,
	.id_table	= &service_id[0],

	.probe		= aerdrv_load,
	.remove		= aerdrv_unload,

	.suspend	= aerdrv_suspend,
	.resume		= aerdrv_resume,
  };

Below is sample code for registering/unregistering a service
driver.

  static int __init aerdrv_service_init(void)
  {
	int retval = 0;

	retval = pcie_port_service_register(&root_aerdrv);
	if (!retval) {
		/*
		 * FIX ME
		 */
	}
	return retval;
  }

  static void __exit aerdrv_service_exit(void)
  {
	pcie_port_service_unregister(&root_aerdrv);
  }

  module_init(aerdrv_service_init);
  module_exit(aerdrv_service_exit);

6. Possible Resource Conflicts

Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, below are a few possible resource
conflicts with proposed solutions.

6.1 MSI and MSI-X Vector Resource

Once MSI or MSI-X interrupts are enabled on a device, they stay in this
mode until they are disabled again. Since service drivers of the same
PCI-PCI Bridge port share the same physical device, if an individual
service driver enables or disables MSI/MSI-X mode it may result in
unpredictable behavior.

To avoid this situation, service drivers are not permitted to
switch interrupt mode on their device. The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers. Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver. Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq. In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.

6.3 PCI Memory/IO Mapped Regions

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port. In all cases the
registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.

6.4 PCI Config Registers

Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which the Root Control register and Device Control register are shared
between PME and AER. This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.
+192
Documentation/PCI/acpi-info.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======================================== 4 + ACPI considerations for PCI host bridges 5 + ======================================== 6 + 7 + The general rule is that the ACPI namespace should describe everything the 8 + OS might use unless there's another way for the OS to find it [1, 2]. 9 + 10 + For example, there's no standard hardware mechanism for enumerating PCI 11 + host bridges, so the ACPI namespace must describe each host bridge, the 12 + method for accessing PCI config space below it, the address space windows 13 + the host bridge forwards to PCI (using _CRS), and the routing of legacy 14 + INTx interrupts (using _PRT). 15 + 16 + PCI devices, which are below the host bridge, generally do not need to be 17 + described via ACPI. The OS can discover them via the standard PCI 18 + enumeration mechanism, using config accesses to discover and identify 19 + devices and read and size their BARs. However, ACPI may describe PCI 20 + devices if it provides power management or hotplug functionality for them 21 + or if the device has INTx interrupts connected by platform interrupt 22 + controllers and a _PRT is needed to describe those connections. 23 + 24 + ACPI resource description is done via _CRS objects of devices in the ACPI 25 + namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read 26 + _CRS and figure out what resource is being consumed even if it doesn't have 27 + a driver for the device [3].  That's important because it means an old OS 28 + can work correctly even on a system with new devices unknown to the OS. 29 + The new devices might not do anything, but the OS can at least make sure no 30 + resources conflict with them. 31 + 32 + Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for 33 + reserving address space. The static tables are for things the OS needs to 34 + know early in boot, before it can parse the ACPI namespace. 
If a new table 35 + is defined, an old OS needs to operate correctly even though it ignores the 36 + table. _CRS allows that because it is generic and understood by the old 37 + OS; a static table does not. 38 + 39 + If the OS is expected to manage a non-discoverable device described via 40 + ACPI, that device will have a specific _HID/_CID that tells the OS what 41 + driver to bind to it, and the _CRS tells the OS and the driver where the 42 + device's registers are. 43 + 44 + PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should 45 + describe all the address space they consume.  This includes all the windows 46 + they forward down to the PCI bus, as well as registers of the host bridge 47 + itself that are not forwarded to PCI.  The host bridge registers include 48 + things like secondary/subordinate bus registers that determine the bus 49 + range below the bridge, window registers that describe the apertures, etc. 50 + These are all device-specific, non-architected things, so the only way a 51 + PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain 52 + the device-specific details.  The host bridge registers also include ECAM 53 + space, since it is consumed by the host bridge. 54 + 55 + ACPI defines a Consumer/Producer bit to distinguish the bridge registers 56 + ("Consumer") from the bridge apertures ("Producer") [4, 5], but early 57 + BIOSes didn't use that bit correctly. The result is that the current ACPI 58 + spec defines Consumer/Producer only for the Extended Address Space 59 + descriptors; the bit should be ignored in the older QWord/DWord/Word 60 + Address Space descriptors. Consequently, OSes have to assume all 61 + QWord/DWord/Word descriptors are windows. 62 + 63 + Prior to the addition of Extended Address Space descriptors, the failure of 64 + Consumer/Producer meant there was no way to describe bridge registers in 65 + the PNP0A03/PNP0A08 device itself. 
The workaround was to describe the bridge registers (including ECAM space) in
PNP0C02 catch-all devices [6]. With the exception of ECAM, the bridge register
space is device-specific anyway, so the generic PNP0A03/PNP0A08 driver
(pci_root.c) has no need to know about it.

New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
although a strict interpretation of [6] might prohibit this. Old x86 and
ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.

PNP0C02 "motherboard" devices are basically a catch-all. There's no
programming model for them other than "don't use these resources for
anything else." So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.

The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
unless there's a standard firmware interface for config access, e.g., the
ia64 SAL interface [7]. A host bridge consumes ECAM memory address space
and converts memory accesses into PCI configuration accesses. The spec
defines the ECAM address space layout and functionality; only the base of
the address space is device-specific. An ACPI OS learns the base address
from either the static MCFG table or a _CBA method in the PNP0A03 device.

The MCFG table must describe the ECAM space of non-hot-pluggable host
bridges [8]. Since MCFG is a static table and can't be updated by hotplug,
a _CBA method in the PNP0A03 device describes the ECAM space of a
hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base
address always corresponds to bus 0, even if the bus range below the bridge
(which is reported via _CRS) doesn't start at 0.


[1] ACPI 6.2, sec 6.1:
    For any device that is on a non-enumerable type of bus (for example, an
    ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
    system firmware must supply an _HID object ... for each device to
    enable OSPM to do that.

[2] ACPI 6.2, sec 3.7:
    The OS enumerates motherboard devices simply by reading through the
    ACPI Namespace looking for devices with hardware IDs.

    Each device enumerated by ACPI includes ACPI-defined objects in the
    ACPI Namespace that report the hardware resources the device could
    occupy [_PRS], an object that reports the resources that are currently
    used by the device [_CRS], and objects for configuring those resources
    [_SRS]. The information is used by the Plug and Play OS (OSPM) to
    configure the devices.

[3] ACPI 6.2, sec 6.2:
    OSPM uses device configuration objects to configure hardware resources
    for devices enumerated via ACPI. Device configuration objects provide
    information about current and possible resource requirements, the
    relationship between shared resources, and methods for configuring
    hardware resources.

    When OSPM enumerates a device, it calls _PRS to determine the resource
    requirements of the device. It may also call _CRS to find the current
    resource settings for the device. Using this information, the Plug and
    Play system determines what resources the device should consume and
    sets those resources by calling the device's _SRS control method.

    In ACPI, devices can consume resources (for example, legacy keyboards),
    provide resources (for example, a proprietary PCI bridge), or do both.
    Unless otherwise specified, resources for a device are assumed to be
    taken from the nearest matching resource above the device in the device
    hierarchy.

[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
    QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
      General Flags: Bit [0] Ignored

    Extended Address Space Descriptor (.4)
      General Flags: Bit [0] Consumer/Producer:

        * 1 – This device consumes this resource
        * 0 – This device produces and consumes this resource

[5] ACPI 6.2, sec 19.6.43:
    ResourceUsage specifies whether the Memory range is consumed by
    this device (ResourceConsumer) or passed on to child devices
    (ResourceProducer). If nothing is specified, then
    ResourceConsumer is assumed.

[6] PCI Firmware 3.2, sec 4.1.2:
    If the operating system does not natively comprehend reserving the
    MMCFG region, the MMCFG region must be reserved by firmware. The
    address range reported in the MCFG table or by _CBA method (see Section
    4.1.3) must be reserved by declaring a motherboard resource. For most
    systems, the motherboard resource would appear at the root of the ACPI
    namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
    the resources in this case should not be claimed in the root PCI bus's
    _CRS. The resources can optionally be returned in Int15 E820 or
    EFIGetMemoryMap as reserved memory but must always be reported through
    ACPI as a motherboard resource.

[7] PCI Express 4.0, sec 7.2.2:
    For systems that are PC-compatible, or that do not implement a
    processor-architecture-specific firmware interface standard that allows
    access to the Configuration Space, the ECAM is required as defined in
    this section.

[8] PCI Firmware 3.2, sec 4.1.2:
    The MCFG table is an ACPI table that is used to communicate the base
    addresses corresponding to the non-hot removable PCI Segment Groups
    range within a PCI Segment Group available to the operating system at
    boot. This is required for the PC-compatible systems.

    The MCFG table is only used to communicate the base addresses
    corresponding to the PCI Segment Groups available to the system at
    boot.

[9] PCI Firmware 3.2, sec 4.1.3:
    The _CBA (Memory mapped Configuration Base Address) control method is
    an optional ACPI object that returns the 64-bit memory mapped
    configuration base address for the hot plug capable host bridge. The
    base address returned by _CBA is processor-relative address. The _CBA
    control method evaluates to an Integer.

    This control method appears under a host bridge object. When the _CBA
    method appears under an active host bridge object, the operating system
    evaluates this structure to identify the memory mapped configuration
    base address corresponding to the PCI Segment Group for the bus number
    range specified in _CRS method. An ACPI name space object that contains
    the _CBA method must also contain a corresponding _SEG method.
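The MCFG/ECAM machinery described above can be observed from userspace on a
running Linux system. The following is a sketch, not part of the original
document: the sysfs/procfs paths are the standard locations, but the output is
entirely platform-specific, and the offset comment assumes a single-segment
MCFG (36-byte ACPI header plus 8 reserved bytes before the first allocation
structure):

```shell
# Dump the raw MCFG table that firmware provided (requires root).
# On a single-segment system the 64-bit ECAM base address starts
# at offset 44 of the table.
hexdump -C /sys/firmware/acpi/tables/MCFG | head

# See how the kernel reserved the ECAM region; on x86 it typically
# appears as a "PCI MMCONFIG" entry inside a PNP0C02-claimed range.
grep -i -e mmconfig -e pnp0c02 /proc/iomem
```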
-187
Documentation/PCI/acpi-info.txt
ACPI considerations for PCI host bridges

The general rule is that the ACPI namespace should describe everything the
OS might use unless there's another way for the OS to find it [1, 2].

For example, there's no standard hardware mechanism for enumerating PCI
host bridges, so the ACPI namespace must describe each host bridge, the
method for accessing PCI config space below it, the address space windows
the host bridge forwards to PCI (using _CRS), and the routing of legacy
INTx interrupts (using _PRT).

PCI devices, which are below the host bridge, generally do not need to be
described via ACPI. The OS can discover them via the standard PCI
enumeration mechanism, using config accesses to discover and identify
devices and read and size their BARs. However, ACPI may describe PCI
devices if it provides power management or hotplug functionality for them
or if the device has INTx interrupts connected by platform interrupt
controllers and a _PRT is needed to describe those connections.

ACPI resource description is done via _CRS objects of devices in the ACPI
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
_CRS and figure out what resource is being consumed even if it doesn't have
a driver for the device [3]. That's important because it means an old OS
can work correctly even on a system with new devices unknown to the OS.
The new devices might not do anything, but the OS can at least make sure no
resources conflict with them.

Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
reserving address space. The static tables are for things the OS needs to
know early in boot, before it can parse the ACPI namespace. If a new table
is defined, an old OS needs to operate correctly even though it ignores the
table. _CRS allows that because it is generic and understood by the old
OS; a static table does not.

If the OS is expected to manage a non-discoverable device described via
ACPI, that device will have a specific _HID/_CID that tells the OS what
driver to bind to it, and the _CRS tells the OS and the driver where the
device's registers are.

PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
describe all the address space they consume. This includes all the windows
they forward down to the PCI bus, as well as registers of the host bridge
itself that are not forwarded to PCI. The host bridge registers include
things like secondary/subordinate bus registers that determine the bus
range below the bridge, window registers that describe the apertures, etc.
These are all device-specific, non-architected things, so the only way a
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
the device-specific details. The host bridge registers also include ECAM
space, since it is consumed by the host bridge.

ACPI defines a Consumer/Producer bit to distinguish the bridge registers
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
BIOSes didn't use that bit correctly. The result is that the current ACPI
spec defines Consumer/Producer only for the Extended Address Space
descriptors; the bit should be ignored in the older QWord/DWord/Word
Address Space descriptors. Consequently, OSes have to assume all
QWord/DWord/Word descriptors are windows.

Prior to the addition of Extended Address Space descriptors, the failure of
Consumer/Producer meant there was no way to describe bridge registers in
the PNP0A03/PNP0A08 device itself. The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
+13
Documentation/PCI/endpoint/index.rst
.. SPDX-License-Identifier: GPL-2.0

======================
PCI Endpoint Framework
======================

.. toctree::
   :maxdepth: 2

   pci-endpoint
   pci-endpoint-cfs
   pci-test-function
   pci-test-howto
+118
Documentation/PCI/endpoint/pci-endpoint-cfs.rst
.. SPDX-License-Identifier: GPL-2.0

=======================================
Configuring PCI Endpoint Using CONFIGFS
=======================================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

The PCI Endpoint Core exposes a configfs entry (pci_ep) to configure the
PCI endpoint function and to bind the endpoint function with the endpoint
controller. (For other mechanisms to configure the PCI Endpoint Function,
refer to [1].)

Mounting configfs
=================

The PCI Endpoint Core layer creates the pci_ep directory in the mounted
configfs directory. configfs can be mounted using the following command::

	mount -t configfs none /sys/kernel/config

Directory Structure
===================

The pci_ep configfs has two directories at its root: *controllers* and
*functions*. Every EPC device present in the system will have an entry in
the *controllers* directory and every EPF driver present in the system
will have an entry in the *functions* directory.
::

	/sys/kernel/config/pci_ep/
		.. controllers/
		.. functions/

Creating EPF Device
===================

Every registered EPF driver will be listed in the *functions* directory.
The entries corresponding to an EPF driver will be created by the EPF core.
::

	/sys/kernel/config/pci_ep/functions/
		.. <EPF Driver1>/
			... <EPF Device 11>/
			... <EPF Device 21>/
		.. <EPF Driver2>/
			... <EPF Device 12>/
			... <EPF Device 22>/

In order to create an <EPF device> of the type probed by <EPF Driver>, the
user has to create a directory inside <EPF DriverN>.

Every <EPF device> directory contains the following entries that can be
used to configure the standard configuration header of the endpoint
function. (These entries are created by the framework when any new
<EPF Device> is created.)
::

	.. <EPF Driver1>/
		... <EPF Device 11>/
			... vendorid
			... deviceid
			... revid
			... progif_code
			... subclass_code
			... baseclass_code
			... cache_line_size
			... subsys_vendor_id
			... subsys_id
			... interrupt_pin

EPC Device
==========

Every registered EPC device will be listed in the *controllers* directory.
The entries corresponding to an EPC device will be created by the EPC core.
::

	/sys/kernel/config/pci_ep/controllers/
		.. <EPC Device1>/
			... <Symlink EPF Device11>/
			... <Symlink EPF Device12>/
			... start
		.. <EPC Device2>/
			... <Symlink EPF Device21>/
			... <Symlink EPF Device22>/
			... start

The <EPC Device> directory will have a list of symbolic links to
<EPF Device>. These symbolic links should be created by the user to
represent the functions present in the endpoint device.

The <EPC Device> directory will also have a *start* field. Once "1" is
written to this field, the endpoint device will be ready to establish the
link with the host. This is usually done after all the EPF devices are
created and linked with the EPC device.
::

	| controllers/
	|	<Directory: EPC name>/
	|		<Symbolic Link: Function>
	|		start
	| functions/
	|	<Directory: EPF driver>/
	|		<Directory: EPF device>/
	|			vendorid
	|			deviceid
	|			revid
	|			progif_code
	|			subclass_code
	|			baseclass_code
	|			cache_line_size
	|			subsys_vendor_id
	|			subsys_id
	|			interrupt_pin
	|			function

[1] :doc:`pci-endpoint`
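The configfs flow described above can be sketched as a shell session. This is
a hypothetical walkthrough: it assumes the pci_epf_test function driver is
loaded and an endpoint controller named 51000000.pcie_ep exists; the driver
name, controller name and the TI vendor/device IDs shown are examples and will
differ on a real system:

```shell
cd /sys/kernel/config/pci_ep

# Create an <EPF device> under the function driver's directory;
# the framework populates the configuration-header entries.
mkdir functions/pci_epf_test/func1

# Program the standard configuration header.
echo 0x104c > functions/pci_epf_test/func1/vendorid
echo 0xb500 > functions/pci_epf_test/func1/deviceid

# Bind the function to the endpoint controller via a symlink...
ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/

# ...then let the controller establish the link with the host.
echo 1 > controllers/51000000.pcie_ep/start
```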
-105
Documentation/PCI/endpoint/pci-endpoint-cfs.txt
+231
Documentation/PCI/endpoint/pci-endpoint.rst
.. SPDX-License-Identifier: GPL-2.0

:Author: Kishon Vijay Abraham I <kishon@ti.com>

This document is a guide to using the PCI Endpoint Framework in order to
create an endpoint controller driver and an endpoint function driver, and
to using the configfs interface to bind the function driver to the
controller driver.

Introduction
============

Linux has a comprehensive PCI subsystem to support PCI controllers that
operate in Root Complex mode. The subsystem has the capability to scan the
PCI bus, assign memory and IRQ resources, load PCI drivers (based on
vendor ID and device ID), and support other services like hot-plug, power
management, advanced error reporting and virtual channels.

However, the PCI controller IP integrated in some SoCs is capable of
operating either in Root Complex mode or Endpoint mode. The PCI Endpoint
Framework adds endpoint mode support to Linux. This helps to run Linux in
an EP system, which can have a wide variety of use cases, from testing and
validation to co-processor acceleration.

PCI Endpoint Core
=================

The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
library, the Endpoint Function library, and the configfs layer to bind the
endpoint function with the endpoint controller.

PCI Endpoint Controller (EPC) Library
-------------------------------------

The EPC library provides APIs to be used by the controller that can operate
in endpoint mode. It also provides APIs to be used by the function
driver/library in order to implement a particular endpoint function.

APIs for the PCI Controller Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI controller driver.

* devm_pci_epc_create()/pci_epc_create()

  The PCI controller driver should implement the following ops:

  * write_header: ops to populate the configuration space header
  * set_bar: ops to configure the BAR
  * clear_bar: ops to reset the BAR
  * alloc_addr_space: ops to allocate address space in the PCI controller
    address space
  * free_addr_space: ops to free the allocated address space
  * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
  * start: ops to start the PCI link
  * stop: ops to stop the PCI link

  The PCI controller driver can then create a new EPC device by invoking
  devm_pci_epc_create()/pci_epc_create().

* devm_pci_epc_destroy()/pci_epc_destroy()

  The PCI controller driver can destroy the EPC device created by either
  devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
  pci_epc_destroy().

* pci_epc_linkup()

  In order to notify all the function devices that the EPC device to which
  they are linked has established a link with the host, the PCI controller
  driver should invoke pci_epc_linkup().

* pci_epc_mem_init()

  Initialize the pci_epc_mem structure used for allocating EPC address
  space.

* pci_epc_mem_exit()

  Clean up the pci_epc_mem structure allocated during pci_epc_mem_init().


APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epc_write_header()

  The PCI endpoint function driver should use pci_epc_write_header() to
  write the standard configuration header to the endpoint controller.

* pci_epc_set_bar()

  The PCI endpoint function driver should use pci_epc_set_bar() to
  configure the Base Address Register in order for the host to assign PCI
  address space. Register space of the function driver is usually
  configured using this API.

* pci_epc_clear_bar()

  The PCI endpoint function driver should use pci_epc_clear_bar() to reset
  the BAR.

* pci_epc_raise_irq()

  The PCI endpoint function driver should use pci_epc_raise_irq() to raise
  a legacy interrupt, MSI or MSI-X interrupt.

* pci_epc_mem_alloc_addr()

  The PCI endpoint function driver should use pci_epc_mem_alloc_addr() to
  allocate a memory address from the EPC address space, which is required
  to access the RC's buffer.

* pci_epc_mem_free_addr()

  The PCI endpoint function driver should use pci_epc_mem_free_addr() to
  free the memory space allocated using pci_epc_mem_alloc_addr().

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPC library. These are used for
binding the EPF device with the EPC device. pci-ep-cfs.c can be used as a
reference for using these APIs.

* pci_epc_get()

  Get a reference to the PCI endpoint controller based on the device name
  of the controller.

* pci_epc_put()

  Release the reference to the PCI endpoint controller obtained using
  pci_epc_get().

* pci_epc_add_epf()

  Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
  can have up to 8 functions according to the specification.

* pci_epc_remove_epf()

  Remove the PCI endpoint function from a PCI endpoint controller.

* pci_epc_start()

  The PCI endpoint function driver should invoke pci_epc_start() once it
  has configured the endpoint function and wants to start the PCI link.

* pci_epc_stop()

  The PCI endpoint function driver should invoke pci_epc_stop() to stop
  the PCI link.


PCI Endpoint Function (EPF) Library
-----------------------------------

The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality.

APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epf_register_driver()

  The PCI Endpoint Function driver should implement the following ops:

  * bind: ops to perform when an EPC device has been bound to an EPF device
  * unbind: ops to perform when the binding between an EPC device and an
    EPF device is lost
  * linkup: ops to perform when the EPC device has established a
    connection with a host system

  The PCI Function driver can then register the PCI EPF driver by using
  pci_epf_register_driver().

* pci_epf_unregister_driver()

  The PCI Function driver can unregister the PCI EPF driver by using
  pci_epf_unregister_driver().

* pci_epf_alloc_space()

  The PCI Function driver can allocate space for a particular BAR using
  pci_epf_alloc_space().

* pci_epf_free_space()

  The PCI Function driver can free the space allocated using
  pci_epf_alloc_space() by invoking pci_epf_free_space().

APIs for the PCI Endpoint Controller Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library.

* pci_epf_linkup()

  The PCI endpoint controller library invokes pci_epf_linkup() when the
  EPC device has established the connection to the host.

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as a reference for using these APIs.

* pci_epf_create()

  Create a new PCI EPF device by passing the name of the PCI EPF device.
  This name will be used to bind the EPF device to an EPF driver.

* pci_epf_destroy()

  Destroy the created PCI EPF device.

* pci_epf_bind()

  pci_epf_bind() should be invoked when the EPF device has been bound to
  an EPC device.

* pci_epf_unbind()

  pci_epf_unbind() should be invoked when the binding between the EPC
  device and the EPF device is lost.
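Whether the registration calls above succeeded can be checked from userspace.
This is a sketch under the assumption that configfs is mounted at the usual
location; output depends entirely on which EPC and EPF drivers are present on
the system:

```shell
# EPC devices registered via [devm_]pci_epc_create() appear under the
# pci_epc class in sysfs.
ls /sys/class/pci_epc/

# EPF drivers registered via pci_epf_register_driver() appear as
# directories under the configfs "functions" entry.
ls /sys/kernel/config/pci_ep/functions/
```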
-215
Documentation/PCI/endpoint/pci-endpoint.txt
··· 1 - PCI ENDPOINT FRAMEWORK 2 - Kishon Vijay Abraham I <kishon@ti.com> 3 - 4 - This document is a guide to use the PCI Endpoint Framework in order to create 5 - endpoint controller driver, endpoint function driver, and using configfs 6 - interface to bind the function driver to the controller driver. 7 - 8 - 1. Introduction 9 - 10 - Linux has a comprehensive PCI subsystem to support PCI controllers that 11 - operates in Root Complex mode. The subsystem has capability to scan PCI bus, 12 - assign memory resources and IRQ resources, load PCI driver (based on 13 - vendor ID, device ID), support other services like hot-plug, power management, 14 - advanced error reporting and virtual channels. 15 - 16 - However the PCI controller IP integrated in some SoCs is capable of operating 17 - either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will 18 - add endpoint mode support in Linux. This will help to run Linux in an 19 - EP system which can have a wide variety of use cases from testing or 20 - validation, co-processor accelerator, etc. 21 - 22 - 2. PCI Endpoint Core 23 - 24 - The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller 25 - library, the Endpoint Function library, and the configfs layer to bind the 26 - endpoint function with the endpoint controller. 27 - 28 - 2.1 PCI Endpoint Controller(EPC) Library 29 - 30 - The EPC library provides APIs to be used by the controller that can operate 31 - in endpoint mode. It also provides APIs to be used by function driver/library 32 - in order to implement a particular endpoint function. 33 - 34 - 2.1.1 APIs for the PCI controller Driver 35 - 36 - This section lists the APIs that the PCI Endpoint core provides to be used 37 - by the PCI controller driver. 
38 - 39 - *) devm_pci_epc_create()/pci_epc_create() 40 - 41 - The PCI controller driver should implement the following ops: 42 - * write_header: ops to populate configuration space header 43 - * set_bar: ops to configure the BAR 44 - * clear_bar: ops to reset the BAR 45 - * alloc_addr_space: ops to allocate in PCI controller address space 46 - * free_addr_space: ops to free the allocated address space 47 - * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt 48 - * start: ops to start the PCI link 49 - * stop: ops to stop the PCI link 50 - 51 - The PCI controller driver can then create a new EPC device by invoking 52 - devm_pci_epc_create()/pci_epc_create(). 53 - 54 - *) devm_pci_epc_destroy()/pci_epc_destroy() 55 - 56 - The PCI controller driver can destroy the EPC device created by either 57 - devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or 58 - pci_epc_destroy(). 59 - 60 - *) pci_epc_linkup() 61 - 62 - In order to notify all the function devices that the EPC device to which 63 - they are linked has established a link with the host, the PCI controller 64 - driver should invoke pci_epc_linkup(). 65 - 66 - *) pci_epc_mem_init() 67 - 68 - Initialize the pci_epc_mem structure used for allocating EPC addr space. 69 - 70 - *) pci_epc_mem_exit() 71 - 72 - Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). 73 - 74 - 2.1.2 APIs for the PCI Endpoint Function Driver 75 - 76 - This section lists the APIs that the PCI Endpoint core provides to be used 77 - by the PCI endpoint function driver. 78 - 79 - *) pci_epc_write_header() 80 - 81 - The PCI endpoint function driver should use pci_epc_write_header() to 82 - write the standard configuration header to the endpoint controller. 83 - 84 - *) pci_epc_set_bar() 85 - 86 - The PCI endpoint function driver should use pci_epc_set_bar() to configure 87 - the Base Address Register in order for the host to assign PCI addr space. 
88 - The register space of the function driver is usually configured 89 - using this API. 90 - 91 - *) pci_epc_clear_bar() 92 - 93 - The PCI endpoint function driver should use pci_epc_clear_bar() to reset 94 - the BAR. 95 - 96 - *) pci_epc_raise_irq() 97 - 98 - The PCI endpoint function driver should use pci_epc_raise_irq() to raise 99 - a legacy interrupt, MSI or MSI-X interrupt. 100 - 101 - *) pci_epc_mem_alloc_addr() 102 - 103 - The PCI endpoint function driver should use pci_epc_mem_alloc_addr() to 104 - allocate a memory address from the EPC address space, which is required to access 105 - the RC's buffer. 106 - 107 - *) pci_epc_mem_free_addr() 108 - 109 - The PCI endpoint function driver should use pci_epc_mem_free_addr() to 110 - free the memory space allocated using pci_epc_mem_alloc_addr(). 111 - 112 - 2.1.3 Other APIs 113 - 114 - There are other APIs provided by the EPC library. These are used for binding 115 - the EPF device with the EPC device. pci-ep-cfs.c can be used as reference for 116 - using these APIs. 117 - 118 - *) pci_epc_get() 119 - 120 - Get a reference to the PCI endpoint controller based on the device name of 121 - the controller. 122 - 123 - *) pci_epc_put() 124 - 125 - Release the reference to the PCI endpoint controller obtained using 126 - pci_epc_get(). 127 - 128 - *) pci_epc_add_epf() 129 - 130 - Add a PCI endpoint function to a PCI endpoint controller. A PCIe device 131 - can have up to 8 functions according to the specification. 132 - 133 - *) pci_epc_remove_epf() 134 - 135 - Remove the PCI endpoint function from the PCI endpoint controller. 136 - 137 - *) pci_epc_start() 138 - 139 - The PCI endpoint function driver should invoke pci_epc_start() once it 140 - has configured the endpoint function and wants to start the PCI link. 141 - 142 - *) pci_epc_stop() 143 - 144 - The PCI endpoint function driver should invoke pci_epc_stop() to stop 145 - the PCI link. 
146 - 147 - 2.2 PCI Endpoint Function (EPF) Library 148 - 149 - The EPF library provides APIs to be used by the function driver and the EPC 150 - library to provide endpoint mode functionality. 151 - 152 - 2.2.1 APIs for the PCI Endpoint Function Driver 153 - 154 - This section lists the APIs that the PCI Endpoint core provides to be used 155 - by the PCI endpoint function driver. 156 - 157 - *) pci_epf_register_driver() 158 - 159 - The PCI Endpoint Function driver should implement the following ops: 160 - * bind: ops to perform when an EPC device has been bound to an EPF device 161 - * unbind: ops to perform when a binding has been lost between an EPC 162 - device and an EPF device 163 - * linkup: ops to perform when the EPC device has established a 164 - connection with a host system 165 - 166 - The PCI Function driver can then register the PCI EPF driver by using 167 - pci_epf_register_driver(). 168 - 169 - *) pci_epf_unregister_driver() 170 - 171 - The PCI Function driver can unregister the PCI EPF driver by using 172 - pci_epf_unregister_driver(). 173 - 174 - *) pci_epf_alloc_space() 175 - 176 - The PCI Function driver can allocate space for a particular BAR using 177 - pci_epf_alloc_space(). 178 - 179 - *) pci_epf_free_space() 180 - 181 - The PCI Function driver can free the allocated space 182 - (using pci_epf_alloc_space()) by invoking pci_epf_free_space(). 183 - 184 - 2.2.2 APIs for the PCI Endpoint Controller Library 185 - This section lists the APIs that the PCI Endpoint core provides to be used 186 - by the PCI endpoint controller library. 187 - 188 - *) pci_epf_linkup() 189 - 190 - The PCI endpoint controller library invokes pci_epf_linkup() when the 191 - EPC device has established the connection to the host. 192 - 193 - 2.2.3 Other APIs 194 - There are other APIs provided by the EPF library. These are used to notify 195 - the function driver when the EPF device is bound to the EPC device. 196 - pci-ep-cfs.c can be used as reference for using these APIs. 
197 - 198 - *) pci_epf_create() 199 - 200 - Create a new PCI EPF device by passing the name of the PCI EPF device. 201 - This name will be used to bind the EPF device to an EPF driver. 202 - 203 - *) pci_epf_destroy() 204 - 205 - Destroy the created PCI EPF device. 206 - 207 - *) pci_epf_bind() 208 - 209 - pci_epf_bind() should be invoked when the EPF device has been bound to 210 - an EPC device. 211 - 212 - *) pci_epf_unbind() 213 - 214 - pci_epf_unbind() should be invoked when the binding between the EPC device 215 - and the EPF device is lost.
+103
Documentation/PCI/endpoint/pci-test-function.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ================= 4 + PCI Test Function 5 + ================= 6 + 7 + :Author: Kishon Vijay Abraham I <kishon@ti.com> 8 + 9 + Traditionally, a PCI RC has been validated by using standard 10 + PCI cards such as Ethernet, USB or SATA PCI cards. 11 + However, with the addition of the EP core in the Linux kernel, it is possible 12 + to configure a PCI controller that can operate in EP mode to work as 13 + a test device. 14 + 15 + The PCI endpoint test device is a virtual device (defined in software) 16 + used to test the endpoint functionality and serve as a sample driver 17 + for other PCI endpoint devices (to use the EP framework). 18 + 19 + The PCI endpoint test device has the following registers: 20 + 21 + 1) PCI_ENDPOINT_TEST_MAGIC 22 + 2) PCI_ENDPOINT_TEST_COMMAND 23 + 3) PCI_ENDPOINT_TEST_STATUS 24 + 4) PCI_ENDPOINT_TEST_SRC_ADDR 25 + 5) PCI_ENDPOINT_TEST_DST_ADDR 26 + 6) PCI_ENDPOINT_TEST_SIZE 27 + 7) PCI_ENDPOINT_TEST_CHECKSUM 28 + 8) PCI_ENDPOINT_TEST_IRQ_TYPE 29 + 9) PCI_ENDPOINT_TEST_IRQ_NUMBER 30 + 31 + * PCI_ENDPOINT_TEST_MAGIC 32 + 33 + This register is used to test BAR0. A known pattern is written to 34 + and read back from the MAGIC register to verify BAR0. 35 + 36 + * PCI_ENDPOINT_TEST_COMMAND 37 + 38 + This register is used by the host driver to indicate the operation 39 + that the endpoint device must perform. 
40 + 41 + ======== ================================================================ 42 + Bitfield Description 43 + ======== ================================================================ 44 + Bit 0 raise legacy IRQ 45 + Bit 1 raise MSI IRQ 46 + Bit 2 raise MSI-X IRQ 47 + Bit 3 read command (read data from RC buffer) 48 + Bit 4 write command (write data to RC buffer) 49 + Bit 5 copy command (copy data from one RC buffer to another RC buffer) 50 + ======== ================================================================ 51 + 52 + * PCI_ENDPOINT_TEST_STATUS 53 + 54 + This register reflects the status of the PCI endpoint device. 55 + 56 + ======== ============================== 57 + Bitfield Description 58 + ======== ============================== 59 + Bit 0 read success 60 + Bit 1 read fail 61 + Bit 2 write success 62 + Bit 3 write fail 63 + Bit 4 copy success 64 + Bit 5 copy fail 65 + Bit 6 IRQ raised 66 + Bit 7 source address is invalid 67 + Bit 8 destination address is invalid 68 + ======== ============================== 69 + 70 + * PCI_ENDPOINT_TEST_SRC_ADDR 71 + 72 + This register contains the source address (RC buffer address) for the 73 + COPY/READ command. 74 + 75 + * PCI_ENDPOINT_TEST_DST_ADDR 76 + 77 + This register contains the destination address (RC buffer address) for 78 + the COPY/WRITE command. 79 + 80 + * PCI_ENDPOINT_TEST_IRQ_TYPE 81 + 82 + This register contains the interrupt type (Legacy/MSI/MSI-X) to be triggered 83 + for the READ/WRITE/COPY and raise IRQ commands. 84 + 85 + Possible types: 86 + 87 + ====== == 88 + Legacy 0 89 + MSI 1 90 + MSI-X 2 91 + ====== == 92 + 93 + * PCI_ENDPOINT_TEST_IRQ_NUMBER 94 + 95 + This register contains the ID of the triggered interrupt. 96 + 97 + Admissible values: 98 + 99 + ====== =========== 100 + Legacy 0 101 + MSI [1 .. 32] 102 + MSI-X [1 .. 2048] 103 + ====== ===========
-87
Documentation/PCI/endpoint/pci-test-function.txt
··· 1 - PCI TEST 2 - Kishon Vijay Abraham I <kishon@ti.com> 3 - 4 - Traditionally PCI RC has always been validated by using standard 5 - PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. 6 - However with the addition of EP-core in linux kernel, it is possible 7 - to configure a PCI controller that can operate in EP mode to work as 8 - a test device. 9 - 10 - The PCI endpoint test device is a virtual device (defined in software) 11 - used to test the endpoint functionality and serve as a sample driver 12 - for other PCI endpoint devices (to use the EP framework). 13 - 14 - The PCI endpoint test device has the following registers: 15 - 16 - 1) PCI_ENDPOINT_TEST_MAGIC 17 - 2) PCI_ENDPOINT_TEST_COMMAND 18 - 3) PCI_ENDPOINT_TEST_STATUS 19 - 4) PCI_ENDPOINT_TEST_SRC_ADDR 20 - 5) PCI_ENDPOINT_TEST_DST_ADDR 21 - 6) PCI_ENDPOINT_TEST_SIZE 22 - 7) PCI_ENDPOINT_TEST_CHECKSUM 23 - 8) PCI_ENDPOINT_TEST_IRQ_TYPE 24 - 9) PCI_ENDPOINT_TEST_IRQ_NUMBER 25 - 26 - *) PCI_ENDPOINT_TEST_MAGIC 27 - 28 - This register will be used to test BAR0. A known pattern will be written 29 - and read back from MAGIC register to verify BAR0. 30 - 31 - *) PCI_ENDPOINT_TEST_COMMAND: 32 - 33 - This register will be used by the host driver to indicate the function 34 - that the endpoint device must perform. 35 - 36 - Bitfield Description: 37 - Bit 0 : raise legacy IRQ 38 - Bit 1 : raise MSI IRQ 39 - Bit 2 : raise MSI-X IRQ 40 - Bit 3 : read command (read data from RC buffer) 41 - Bit 4 : write command (write data to RC buffer) 42 - Bit 5 : copy command (copy data from one RC buffer to another 43 - RC buffer) 44 - 45 - *) PCI_ENDPOINT_TEST_STATUS 46 - 47 - This register reflects the status of the PCI endpoint device. 
48 - 49 - Bitfield Description: 50 - Bit 0 : read success 51 - Bit 1 : read fail 52 - Bit 2 : write success 53 - Bit 3 : write fail 54 - Bit 4 : copy success 55 - Bit 5 : copy fail 56 - Bit 6 : IRQ raised 57 - Bit 7 : source address is invalid 58 - Bit 8 : destination address is invalid 59 - 60 - *) PCI_ENDPOINT_TEST_SRC_ADDR 61 - 62 - This register contains the source address (RC buffer address) for the 63 - COPY/READ command. 64 - 65 - *) PCI_ENDPOINT_TEST_DST_ADDR 66 - 67 - This register contains the destination address (RC buffer address) for 68 - the COPY/WRITE command. 69 - 70 - *) PCI_ENDPOINT_TEST_IRQ_TYPE 71 - 72 - This register contains the interrupt type (Legacy/MSI) triggered 73 - for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. 74 - 75 - Possible types: 76 - - Legacy : 0 77 - - MSI : 1 78 - - MSI-X : 2 79 - 80 - *) PCI_ENDPOINT_TEST_IRQ_NUMBER 81 - 82 - This register contains the triggered ID interrupt. 83 - 84 - Admissible values: 85 - - Legacy : 0 86 - - MSI : [1 .. 32] 87 - - MSI-X : [1 .. 2048]
+235
Documentation/PCI/endpoint/pci-test-howto.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =================== 4 + PCI Test User Guide 5 + =================== 6 + 7 + :Author: Kishon Vijay Abraham I <kishon@ti.com> 8 + 9 + This document is a guide to help users use the pci-epf-test function driver 10 + and the pci_endpoint_test host driver for testing PCI. The steps to 11 + be followed on the host side and the EP side are given below. 12 + 13 + Endpoint Device 14 + =============== 15 + 16 + Endpoint Controller Devices 17 + --------------------------- 18 + 19 + To find the list of endpoint controller devices in the system:: 20 + 21 + # ls /sys/class/pci_epc/ 22 + 51000000.pcie_ep 23 + 24 + If PCI_ENDPOINT_CONFIGFS is enabled:: 25 + 26 + # ls /sys/kernel/config/pci_ep/controllers 27 + 51000000.pcie_ep 28 + 29 + 30 + Endpoint Function Drivers 31 + ------------------------- 32 + 33 + To find the list of endpoint function drivers in the system:: 34 + 35 + # ls /sys/bus/pci-epf/drivers 36 + pci_epf_test 37 + 38 + If PCI_ENDPOINT_CONFIGFS is enabled:: 39 + 40 + # ls /sys/kernel/config/pci_ep/functions 41 + pci_epf_test 42 + 43 + 44 + Creating pci-epf-test Device 45 + ---------------------------- 46 + 47 + A PCI endpoint function device can be created using configfs. To create 48 + a pci-epf-test device, the following commands can be used:: 49 + 50 + # mount -t configfs none /sys/kernel/config 51 + # cd /sys/kernel/config/pci_ep/ 52 + # mkdir functions/pci_epf_test/func1 53 + 54 + The "mkdir func1" above creates the pci-epf-test function device that will 55 + be probed by the pci_epf_test driver. 
56 + 57 + The PCI endpoint framework populates the directory with the following 58 + configurable fields:: 59 + 60 + # ls functions/pci_epf_test/func1 61 + baseclass_code interrupt_pin progif_code subsys_id 62 + cache_line_size msi_interrupts revid subsys_vendorid 63 + deviceid msix_interrupts subclass_code vendorid 64 + 65 + The PCI endpoint function driver populates these entries with default values 66 + when the device is bound to the driver. The pci-epf-test driver populates 67 + vendorid with 0xffff and interrupt_pin with 0x0001:: 68 + 69 + # cat functions/pci_epf_test/func1/vendorid 70 + 0xffff 71 + # cat functions/pci_epf_test/func1/interrupt_pin 72 + 0x0001 73 + 74 + 75 + Configuring pci-epf-test Device 76 + ------------------------------- 77 + 78 + The user can configure the pci-epf-test device using its configfs entries. In order 79 + to change the vendorid and the number of MSI interrupts used by the function 80 + device, the following commands can be used:: 81 + 82 + # echo 0x104c > functions/pci_epf_test/func1/vendorid 83 + # echo 0xb500 > functions/pci_epf_test/func1/deviceid 84 + # echo 16 > functions/pci_epf_test/func1/msi_interrupts 85 + # echo 8 > functions/pci_epf_test/func1/msix_interrupts 86 + 87 + 88 + Binding pci-epf-test Device to EP Controller 89 + -------------------------------------------- 90 + 91 + In order for the endpoint function device to be useful, it has to be bound to 92 + a PCI endpoint controller driver. Use configfs to bind the function 93 + device to one of the controller drivers present in the system:: 94 + 95 + # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ 96 + 97 + Once the above step is completed, the PCI endpoint is ready to establish a link 98 + with the host. 
99 + 100 + 101 + Start the Link 102 + -------------- 103 + 104 + In order for the endpoint device to establish a link with the host, the _start_ 105 + field should be populated with '1':: 106 + 107 + # echo 1 > controllers/51000000.pcie_ep/start 108 + 109 + 110 + RootComplex Device 111 + ================== 112 + 113 + lspci Output 114 + ------------ 115 + 116 + Note that the devices listed here correspond to the values populated in the 117 + "Configuring pci-epf-test Device" section above:: 118 + 119 + 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) 120 + 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 121 + 122 + 123 + Using Endpoint Test Function Device 124 + ----------------------------------- 125 + 126 + pcitest.sh, added in tools/pci/, can be used to run all the default PCI endpoint 127 + tests. To compile this tool, use the following commands:: 128 + 129 + # cd <kernel-dir> 130 + # make -C tools/pci 131 + 132 + or, to compile and install it on your system:: 133 + 134 + # cd <kernel-dir> 135 + # make -C tools/pci install 136 + 137 + The tool and script will be located in <rootfs>/usr/bin/. 138 + 139 + 140 + pcitest.sh Output 141 + ~~~~~~~~~~~~~~~~~ 142 + :: 143 + 144 + # pcitest.sh 145 + BAR tests 146 + 147 + BAR0: OKAY 148 + BAR1: OKAY 149 + BAR2: OKAY 150 + BAR3: OKAY 151 + BAR4: NOT OKAY 152 + BAR5: NOT OKAY 153 + 154 + Interrupt tests 155 + 156 + SET IRQ TYPE TO LEGACY: OKAY 157 + LEGACY IRQ: NOT OKAY 158 + SET IRQ TYPE TO MSI: OKAY 159 + MSI1: OKAY 160 + MSI2: OKAY 161 + MSI3: OKAY 162 + MSI4: OKAY 163 + MSI5: OKAY 164 + MSI6: OKAY 165 + MSI7: OKAY 166 + MSI8: OKAY 167 + MSI9: OKAY 168 + MSI10: OKAY 169 + MSI11: OKAY 170 + MSI12: OKAY 171 + MSI13: OKAY 172 + MSI14: OKAY 173 + MSI15: OKAY 174 + MSI16: OKAY 175 + MSI17: NOT OKAY 176 + MSI18: NOT OKAY 177 + MSI19: NOT OKAY 178 + MSI20: NOT OKAY 179 + MSI21: NOT OKAY 180 + MSI22: NOT OKAY 181 + MSI23: NOT OKAY 182 + MSI24: NOT OKAY 183 + MSI25: NOT OKAY 184 + MSI26: NOT OKAY 185 + MSI27: NOT OKAY 186 + MSI28: 
NOT OKAY 187 + MSI29: NOT OKAY 188 + MSI30: NOT OKAY 189 + MSI31: NOT OKAY 190 + MSI32: NOT OKAY 191 + SET IRQ TYPE TO MSI-X: OKAY 192 + MSI-X1: OKAY 193 + MSI-X2: OKAY 194 + MSI-X3: OKAY 195 + MSI-X4: OKAY 196 + MSI-X5: OKAY 197 + MSI-X6: OKAY 198 + MSI-X7: OKAY 199 + MSI-X8: OKAY 200 + MSI-X9: NOT OKAY 201 + MSI-X10: NOT OKAY 202 + MSI-X11: NOT OKAY 203 + MSI-X12: NOT OKAY 204 + MSI-X13: NOT OKAY 205 + MSI-X14: NOT OKAY 206 + MSI-X15: NOT OKAY 207 + MSI-X16: NOT OKAY 208 + [...] 209 + MSI-X2047: NOT OKAY 210 + MSI-X2048: NOT OKAY 211 + 212 + Read Tests 213 + 214 + SET IRQ TYPE TO MSI: OKAY 215 + READ ( 1 bytes): OKAY 216 + READ ( 1024 bytes): OKAY 217 + READ ( 1025 bytes): OKAY 218 + READ (1024000 bytes): OKAY 219 + READ (1024001 bytes): OKAY 220 + 221 + Write Tests 222 + 223 + WRITE ( 1 bytes): OKAY 224 + WRITE ( 1024 bytes): OKAY 225 + WRITE ( 1025 bytes): OKAY 226 + WRITE (1024000 bytes): OKAY 227 + WRITE (1024001 bytes): OKAY 228 + 229 + Copy Tests 230 + 231 + COPY ( 1 bytes): OKAY 232 + COPY ( 1024 bytes): OKAY 233 + COPY ( 1025 bytes): OKAY 234 + COPY (1024000 bytes): OKAY 235 + COPY (1024001 bytes): OKAY
-206
Documentation/PCI/endpoint/pci-test-howto.txt
··· 1 - PCI TEST USERGUIDE 2 - Kishon Vijay Abraham I <kishon@ti.com> 3 - 4 - This document is a guide to help users use pci-epf-test function driver 5 - and pci_endpoint_test host driver for testing PCI. The list of steps to 6 - be followed in the host side and EP side is given below. 7 - 8 - 1. Endpoint Device 9 - 10 - 1.1 Endpoint Controller Devices 11 - 12 - To find the list of endpoint controller devices in the system: 13 - 14 - # ls /sys/class/pci_epc/ 15 - 51000000.pcie_ep 16 - 17 - If PCI_ENDPOINT_CONFIGFS is enabled 18 - # ls /sys/kernel/config/pci_ep/controllers 19 - 51000000.pcie_ep 20 - 21 - 1.2 Endpoint Function Drivers 22 - 23 - To find the list of endpoint function drivers in the system: 24 - 25 - # ls /sys/bus/pci-epf/drivers 26 - pci_epf_test 27 - 28 - If PCI_ENDPOINT_CONFIGFS is enabled 29 - # ls /sys/kernel/config/pci_ep/functions 30 - pci_epf_test 31 - 32 - 1.3 Creating pci-epf-test Device 33 - 34 - PCI endpoint function device can be created using the configfs. To create 35 - pci-epf-test device, the following commands can be used 36 - 37 - # mount -t configfs none /sys/kernel/config 38 - # cd /sys/kernel/config/pci_ep/ 39 - # mkdir functions/pci_epf_test/func1 40 - 41 - The "mkdir func1" above creates the pci-epf-test function device that will 42 - be probed by pci_epf_test driver. 43 - 44 - The PCI endpoint framework populates the directory with the following 45 - configurable fields. 46 - 47 - # ls functions/pci_epf_test/func1 48 - baseclass_code interrupt_pin progif_code subsys_id 49 - cache_line_size msi_interrupts revid subsys_vendorid 50 - deviceid msix_interrupts subclass_code vendorid 51 - 52 - The PCI endpoint function driver populates these entries with default values 53 - when the device is bound to the driver. 
The pci-epf-test driver populates 54 - vendorid with 0xffff and interrupt_pin with 0x0001 55 - 56 - # cat functions/pci_epf_test/func1/vendorid 57 - 0xffff 58 - # cat functions/pci_epf_test/func1/interrupt_pin 59 - 0x0001 60 - 61 - 1.4 Configuring pci-epf-test Device 62 - 63 - The user can configure the pci-epf-test device using configfs entry. In order 64 - to change the vendorid and the number of MSI interrupts used by the function 65 - device, the following commands can be used. 66 - 67 - # echo 0x104c > functions/pci_epf_test/func1/vendorid 68 - # echo 0xb500 > functions/pci_epf_test/func1/deviceid 69 - # echo 16 > functions/pci_epf_test/func1/msi_interrupts 70 - # echo 8 > functions/pci_epf_test/func1/msix_interrupts 71 - 72 - 1.5 Binding pci-epf-test Device to EP Controller 73 - 74 - In order for the endpoint function device to be useful, it has to be bound to 75 - a PCI endpoint controller driver. Use the configfs to bind the function 76 - device to one of the controller driver present in the system. 77 - 78 - # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ 79 - 80 - Once the above step is completed, the PCI endpoint is ready to establish a link 81 - with the host. 82 - 83 - 1.6 Start the Link 84 - 85 - In order for the endpoint device to establish a link with the host, the _start_ 86 - field should be populated with '1'. 87 - 88 - # echo 1 > controllers/51000000.pcie_ep/start 89 - 90 - 2. RootComplex Device 91 - 92 - 2.1 lspci Output 93 - 94 - Note that the devices listed here correspond to the value populated in 1.4 above 95 - 96 - 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) 97 - 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 98 - 99 - 2.2 Using Endpoint Test function Device 100 - 101 - pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint 102 - tests. 
To compile this tool the following commands should be used: 103 - 104 - # cd <kernel-dir> 105 - # make -C tools/pci 106 - 107 - or if you desire to compile and install in your system: 108 - 109 - # cd <kernel-dir> 110 - # make -C tools/pci install 111 - 112 - The tool and script will be located in <rootfs>/usr/bin/ 113 - 114 - 2.2.1 pcitest.sh Output 115 - # pcitest.sh 116 - BAR tests 117 - 118 - BAR0: OKAY 119 - BAR1: OKAY 120 - BAR2: OKAY 121 - BAR3: OKAY 122 - BAR4: NOT OKAY 123 - BAR5: NOT OKAY 124 - 125 - Interrupt tests 126 - 127 - SET IRQ TYPE TO LEGACY: OKAY 128 - LEGACY IRQ: NOT OKAY 129 - SET IRQ TYPE TO MSI: OKAY 130 - MSI1: OKAY 131 - MSI2: OKAY 132 - MSI3: OKAY 133 - MSI4: OKAY 134 - MSI5: OKAY 135 - MSI6: OKAY 136 - MSI7: OKAY 137 - MSI8: OKAY 138 - MSI9: OKAY 139 - MSI10: OKAY 140 - MSI11: OKAY 141 - MSI12: OKAY 142 - MSI13: OKAY 143 - MSI14: OKAY 144 - MSI15: OKAY 145 - MSI16: OKAY 146 - MSI17: NOT OKAY 147 - MSI18: NOT OKAY 148 - MSI19: NOT OKAY 149 - MSI20: NOT OKAY 150 - MSI21: NOT OKAY 151 - MSI22: NOT OKAY 152 - MSI23: NOT OKAY 153 - MSI24: NOT OKAY 154 - MSI25: NOT OKAY 155 - MSI26: NOT OKAY 156 - MSI27: NOT OKAY 157 - MSI28: NOT OKAY 158 - MSI29: NOT OKAY 159 - MSI30: NOT OKAY 160 - MSI31: NOT OKAY 161 - MSI32: NOT OKAY 162 - SET IRQ TYPE TO MSI-X: OKAY 163 - MSI-X1: OKAY 164 - MSI-X2: OKAY 165 - MSI-X3: OKAY 166 - MSI-X4: OKAY 167 - MSI-X5: OKAY 168 - MSI-X6: OKAY 169 - MSI-X7: OKAY 170 - MSI-X8: OKAY 171 - MSI-X9: NOT OKAY 172 - MSI-X10: NOT OKAY 173 - MSI-X11: NOT OKAY 174 - MSI-X12: NOT OKAY 175 - MSI-X13: NOT OKAY 176 - MSI-X14: NOT OKAY 177 - MSI-X15: NOT OKAY 178 - MSI-X16: NOT OKAY 179 - [...] 
180 - MSI-X2047: NOT OKAY 181 - MSI-X2048: NOT OKAY 182 - 183 - Read Tests 184 - 185 - SET IRQ TYPE TO MSI: OKAY 186 - READ ( 1 bytes): OKAY 187 - READ ( 1024 bytes): OKAY 188 - READ ( 1025 bytes): OKAY 189 - READ (1024000 bytes): OKAY 190 - READ (1024001 bytes): OKAY 191 - 192 - Write Tests 193 - 194 - WRITE ( 1 bytes): OKAY 195 - WRITE ( 1024 bytes): OKAY 196 - WRITE ( 1025 bytes): OKAY 197 - WRITE (1024000 bytes): OKAY 198 - WRITE (1024001 bytes): OKAY 199 - 200 - Copy Tests 201 - 202 - COPY ( 1 bytes): OKAY 203 - COPY ( 1024 bytes): OKAY 204 - COPY ( 1025 bytes): OKAY 205 - COPY (1024000 bytes): OKAY 206 - COPY (1024001 bytes): OKAY
+18
Documentation/PCI/index.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======================= 4 + Linux PCI Bus Subsystem 5 + ======================= 6 + 7 + .. toctree:: 8 + :maxdepth: 2 9 + :numbered: 10 + 11 + pci 12 + pciebus-howto 13 + pci-iov-howto 14 + msi-howto 15 + acpi-info 16 + pci-error-recovery 17 + pcieaer-howto 18 + endpoint/index
+287
Documentation/PCI/msi-howto.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. include:: <isonum.txt> 3 + 4 + ========================== 5 + The MSI Driver Guide HOWTO 6 + ========================== 7 + 8 + :Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox 9 + 10 + :Copyright: 2003, 2008 Intel Corporation 11 + 12 + About this guide 13 + ================ 14 + 15 + This guide describes the basics of Message Signaled Interrupts (MSIs), 16 + the advantages of using MSI over traditional interrupt mechanisms, how 17 + to change your driver to use MSI or MSI-X and some basic diagnostics to 18 + try if a device doesn't support MSIs. 19 + 20 + 21 + What are MSIs? 22 + ============== 23 + 24 + A Message Signaled Interrupt is a write from the device to a special 25 + address which causes an interrupt to be received by the CPU. 26 + 27 + The MSI capability was first specified in PCI 2.2 and was later enhanced 28 + in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X 29 + capability was also introduced with PCI 3.0. It supports more interrupts 30 + per device than MSI and allows interrupts to be independently configured. 31 + 32 + Devices may support both MSI and MSI-X, but only one can be enabled at 33 + a time. 34 + 35 + 36 + Why use MSIs? 37 + ============= 38 + 39 + There are three reasons why using MSIs can give an advantage over 40 + traditional pin-based interrupts. 41 + 42 + Pin-based PCI interrupts are often shared amongst several devices. 43 + To support this, the kernel must call each interrupt handler associated 44 + with an interrupt, which leads to reduced performance for the system as 45 + a whole. MSIs are never shared, so this problem cannot arise. 46 + 47 + When a device writes data to memory, then raises a pin-based interrupt, 48 + it is possible that the interrupt may arrive before all the data has 49 + arrived in memory (this becomes more likely with devices behind PCI-PCI 50 + bridges). 
In order to ensure that all the data has arrived in memory, 51 + the interrupt handler must read a register on the device which raised 52 + the interrupt. PCI transaction ordering rules require that all the data 53 + arrive in memory before the value may be returned from the register. 54 + Using MSIs avoids this problem as the interrupt-generating write cannot 55 + pass the data writes, so by the time the interrupt is raised, the driver 56 + knows that all the data has arrived in memory. 57 + 58 + PCI devices can only support a single pin-based interrupt per function. 59 + Often drivers have to query the device to find out what event has 60 + occurred, slowing down interrupt handling for the common case. With 61 + MSIs, a device can support more interrupts, allowing each interrupt 62 + to be specialised to a different purpose. One possible design gives 63 + infrequent conditions (such as errors) their own interrupt which allows 64 + the driver to handle the normal interrupt handling path more efficiently. 65 + Other possible designs include giving one interrupt to each packet queue 66 + in a network card or each port in a storage controller. 67 + 68 + 69 + How to use MSIs 70 + =============== 71 + 72 + PCI devices are initialised to use pin-based interrupts. The device 73 + driver has to set up the device to use MSI or MSI-X. Not all machines 74 + support MSIs correctly, and for those machines, the APIs described below 75 + will simply fail and the device will continue to use pin-based interrupts. 76 + 77 + Include kernel support for MSIs 78 + ------------------------------- 79 + 80 + To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI 81 + option enabled. This option is only available on some architectures, 82 + and it may depend on some other options also being set. For example, 83 + on x86, you must also enable X86_UP_APIC or SMP in order to see the 84 + CONFIG_PCI_MSI option. 
85 + 86 + Using MSI 87 + --------- 88 + 89 + Most of the hard work is done for the driver in the PCI layer. The driver 90 + simply has to request that the PCI layer set up the MSI capability for this 91 + device. 92 + 93 + To automatically use MSI or MSI-X interrupt vectors, use the following 94 + function:: 95 + 96 + int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, 97 + unsigned int max_vecs, unsigned int flags); 98 + 99 + which allocates up to max_vecs interrupt vectors for a PCI device. It 100 + returns the number of vectors allocated or a negative error. If the device 101 + has a requirement for a minimum number of vectors, the driver can pass a 102 + min_vecs argument set to this limit, and the PCI core will return -ENOSPC 103 + if it can't meet the minimum number of vectors. 104 + 105 + The flags argument is used to specify which type of interrupt can be used 106 + by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX). 107 + A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for 108 + any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set, 109 + pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. 110 + 111 + To get the Linux IRQ number to pass to request_irq() and free_irq() for a 112 + given vector, use the following function:: 113 + 114 + int pci_irq_vector(struct pci_dev *dev, unsigned int nr); 115 + 116 + Any allocated resources should be freed before removing the device using 117 + the following function:: 118 + 119 + void pci_free_irq_vectors(struct pci_dev *dev); 120 + 121 + If a device supports both MSI-X and MSI capabilities, this API will use the 122 + MSI-X facilities in preference to the MSI facilities. MSI-X supports any 123 + number of interrupts between 1 and 2048. In contrast, MSI is restricted to 124 + a maximum of 32 interrupts (and must be a power of two). 
In addition, the 125 + MSI interrupt vectors must be allocated consecutively, so the system might 126 + not be able to allocate as many vectors for MSI as it could for MSI-X. On 127 + some platforms, MSI interrupts must all be targeted at the same set of CPUs 128 + whereas MSI-X interrupts can all be targeted at different CPUs. 129 + 130 + If a device supports neither MSI-X nor MSI, it will fall back to a single 131 + legacy IRQ vector. 132 + 133 + The typical usage of MSI or MSI-X interrupts is to allocate as many vectors 134 + as possible, likely up to the limit supported by the device. If nvec is 135 + larger than the number supported by the device, it will automatically be 136 + capped to the supported limit, so there is no need to query the number of 137 + vectors supported beforehand:: 138 + 139 + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES); 140 + if (nvec < 0) 141 + goto out_err; 142 + 143 + If a driver is unable or unwilling to deal with a variable number of MSI 144 + interrupts, it can request a particular number of interrupts by passing that 145 + number to the pci_alloc_irq_vectors() function as both 'min_vecs' and 146 + 'max_vecs' parameters:: 147 + 148 + ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); 149 + if (ret < 0) 150 + goto out_err; 151 + 152 + The best-known example of the request type described above is enabling 153 + the single MSI mode for a device. 
It could be done by passing two 1s as 154 + 'min_vecs' and 'max_vecs':: 155 + 156 + ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); 157 + if (ret < 0) 158 + goto out_err; 159 + 160 + Some devices might not support using legacy line interrupts, in which case 161 + the driver can specify that only MSI or MSI-X is acceptable:: 162 + 163 + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); 164 + if (nvec < 0) 165 + goto out_err; 166 + 167 + Legacy APIs 168 + ----------- 169 + 170 + The following old APIs to enable and disable MSI or MSI-X interrupts should 171 + not be used in new code:: 172 + 173 + pci_enable_msi() /* deprecated */ 174 + pci_disable_msi() /* deprecated */ 175 + pci_enable_msix_range() /* deprecated */ 176 + pci_enable_msix_exact() /* deprecated */ 177 + pci_disable_msix() /* deprecated */ 178 + 179 + Additionally there are APIs to provide the number of supported MSI or MSI-X 180 + vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these 181 + should be avoided in favor of letting pci_alloc_irq_vectors() cap the 182 + number of vectors. If you have a legitimate special use case for the count 183 + of vectors we might have to revisit that decision and add a 184 + pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. 185 + 186 + Considerations when using MSIs 187 + ------------------------------ 188 + 189 + Spinlocks 190 + ~~~~~~~~~ 191 + 192 + Most device drivers have a per-device spinlock which is taken in the 193 + interrupt handler. With pin-based interrupts or a single MSI, it is not 194 + necessary to disable interrupts (Linux guarantees the same interrupt will 195 + not be re-entered). If a device uses multiple interrupts, the driver 196 + must disable interrupts while the lock is held. If the device sends 197 + a different interrupt, the driver will deadlock trying to recursively 198 + acquire the spinlock. 
Such deadlocks can be avoided by using spin_lock_irqsave() or
spin_lock_irq(), which disable local interrupts and acquire the lock (see
Documentation/kernel-hacking/locking.rst).

How to tell whether MSI/MSI-X is enabled on a device
----------------------------------------------------

Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled).


MSI quirks
==========

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

Disabling MSIs globally
-----------------------

Some host chipsets simply don't support MSIs properly.  If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table.  In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves.  The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices.  It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

Disabling MSIs below a bridge
-----------------------------

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.
Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the HyperTransport chipsets such
as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable MSIs in
configuration space using whatever method you know works, then enable
MSIs on that bridge by doing::

    echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (eg
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1.  Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

Disabling MSIs on a single device
---------------------------------

Some devices are known to have faulty MSI implementations.  Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk.  Some drivers have an option to disable use
of MSI.  While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

Finding why MSIs are disabled on a device
-----------------------------------------

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device.  Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine.  You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device.
Reading /sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are
enabled (1) or disabled (0).  If 0 is found in any of the msi_bus files
belonging to bridges between the PCI root and the device, MSIs are
disabled.

It is also worth checking the device driver to see whether it supports
MSIs.  For example, it may contain calls to pci_alloc_irq_vectors() with
the PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
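The msi_bus check above lends itself to automation.  Below is a small C
sketch that walks a device's sysfs directory up through its parents and
flags any bridge whose msi_bus attribute reads 0; check_msi_path() is our
illustrative helper, not a kernel or distro tool, and it takes the starting
directory as a parameter so it can be tried against any sysfs-like tree:

```c
/* Sketch: walk from a PCI device's sysfs directory up through its parent
 * bridges and report any msi_bus attribute that reads 0 (MSIs disabled
 * below that bridge).  msi_bus itself is the real sysfs attribute; the
 * helper and its return convention are ours. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libgen.h>

/* Returns the number of bridges found with msi_bus == 0, or -1 on error. */
int check_msi_path(const char *devdir)
{
    char path[4096];
    int disabled = 0;

    if (!realpath(devdir, path))    /* resolve the sysfs symlink */
        return -1;

    while (strcmp(path, "/") != 0) {
        char attr[4200];
        snprintf(attr, sizeof(attr), "%s/msi_bus", path);

        FILE *f = fopen(attr, "r");
        if (f) {
            int val = -1;
            if (fscanf(f, "%d", &val) == 1) {
                printf("%s: msi_bus=%d\n", path, val);
                if (val == 0)
                    disabled++;
            }
            fclose(f);
        }
        /* Move to the parent directory, i.e. the upstream bridge. */
        char *parent = dirname(path);
        memmove(path, parent, strlen(parent) + 1);
    }
    return disabled;
}
```

On a real system this would be pointed at an entry under
/sys/bus/pci/devices/ (realpath() resolves the symlink into the
/sys/devices hierarchy, where the parent directories are the bridges).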
Documentation/PCI/pci-error-recovery.rst  (new file, +424 lines)
.. SPDX-License-Identifier: GPL-2.0

==================
PCI Error Recovery
==================


:Authors: - Linas Vepstas <linasvepstas@gmail.com>
          - Richard Lary <rlary@us.ibm.com>
          - Mike Mason <mmlnx@us.ibm.com>


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors.  Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes.  A typical action taken is to disconnect the affected device,
halting all I/O to it.  The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMAs
to "wild" addresses.  Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition.  The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery is performed in several steps.  First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards.  This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages.
Most of the complexity is forced by the need to handle multi-function
devices, that is, devices that have multiple device drivers associated
with them.  In the first stage, each driver is allowed to indicate what
type of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required.  After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system.  If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery.  Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device.  By contrast,
bus errors are easy to manage in the device driver.  Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
===============

Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver.
A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent.  The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form::

    struct pci_error_handlers
    {
        int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
        void (*resume)(struct pci_dev *dev);
    };

The possible channel states are::

    enum pci_channel_state {
        pci_channel_io_normal,       /* I/O channel is in normal state */
        pci_channel_io_frozen,       /* I/O to channel is blocked */
        pci_channel_io_perm_failure, /* PCI card is dead */
    };

Possible return values are::

    enum pci_ers_result {
        PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
        PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
        PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset */
        PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
        PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
    };

A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected().  If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset.  Typically a driver will want to know about
a slot_reset().

The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.
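As a rough illustration of how a driver wires these callbacks together,
here is a compilable userspace sketch.  Only the handler signatures and
result codes mirror the API above; struct pci_dev, the driver callbacks,
and the simulate_recovery() platform loop are simplified stand-ins, not
kernel code:

```c
/* Userspace mock of the error-recovery callback flow described above. */
struct pci_dev { int frozen; };             /* stand-in, not the kernel's */

enum pci_channel_state {
    pci_channel_io_normal,
    pci_channel_io_frozen,
    pci_channel_io_perm_failure,
};

enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

struct pci_error_handlers {
    int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
    int (*mmio_enabled)(struct pci_dev *dev);
    int (*slot_reset)(struct pci_dev *dev);
    void (*resume)(struct pci_dev *dev);
};

static int my_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
    (void)dev;
    if (state == pci_channel_io_perm_failure)
        return PCI_ERS_RESULT_DISCONNECT;   /* STEP 6: device is gone */
    /* Quiesce: stop issuing new I/O, cancel timers, defer requests. */
    return PCI_ERS_RESULT_NEED_RESET;       /* ask for STEP 4 */
}

static int my_slot_reset(struct pci_dev *dev)
{
    /* Device is back in a fresh power-on state: re-init, reload firmware. */
    dev->frozen = 0;
    return PCI_ERS_RESULT_RECOVERED;
}

static void my_resume(struct pci_dev *dev)
{
    (void)dev;  /* STEP 5: restart normal I/O processing here */
}

static const struct pci_error_handlers my_err_handlers = {
    .error_detected = my_error_detected,
    .slot_reset     = my_slot_reset,
    .resume         = my_resume,
    /* .mmio_enabled left NULL: no early recovery, a slot reset is needed */
};

/* Minimal platform-side walk of STEP 1 -> STEP 4 -> STEP 5. */
int simulate_recovery(void)
{
    struct pci_dev dev = { .frozen = 1 };
    int r = my_err_handlers.error_detected(&dev, pci_channel_io_frozen);
    if (r == PCI_ERS_RESULT_DISCONNECT)
        return r;
    r = my_err_handlers.slot_reset(&dev);
    if (r == PCI_ERS_RESULT_RECOVERED)
        my_err_handlers.resume(&dev);
    return r;
}
```

Leaving mmio_enabled NULL here mirrors the "no direct recovery, requires a
slot reset" case discussed above.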
STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc).  The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to clean up, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device.  Within this function and after it returns, the driver
shouldn't do any new I/Os.  Called in task context.  This is sort of a
"quiesce" point.  See the note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

  - PCI_ERS_RESULT_CAN_RECOVER
      Driver returns this if it thinks it might be able to recover
      the HW by just banging IOs or if it wants to be given
      a chance to extract some diagnostic information (see
      mmio_enabled, below).
  - PCI_ERS_RESULT_NEED_RESET
      Driver returns this if it can't recover without a
      slot reset.
  - PCI_ERS_RESULT_DISCONNECT
      Driver returns this if it doesn't want to recover at all.

The next step taken will depend on the result codes returned by the
drivers.
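The platform effectively folds the per-driver answers into one result for
the slot, with the most demanding answer winning.  The combiner below is
an illustrative sketch only (the enum mirrors pci_ers_result above; the
kernel's actual combining code lives elsewhere and differs in details):

```c
/* Illustrative "worst answer wins" combiner for pci_ers_result values. */
enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

/* Higher rank = stronger demand on the platform.  This ranking is our
 * own sketch, not the kernel's. */
static int rank(enum pci_ers_result r)
{
    switch (r) {
    case PCI_ERS_RESULT_NONE:        return 0;
    case PCI_ERS_RESULT_RECOVERED:   return 1;
    case PCI_ERS_RESULT_CAN_RECOVER: return 2;
    case PCI_ERS_RESULT_NEED_RESET:  return 3;
    case PCI_ERS_RESULT_DISCONNECT:  return 4;
    }
    return 0;
}

/* Fold one driver's answer into the running result for the slot. */
enum pci_ers_result merge_result(enum pci_ers_result cur,
                                 enum pci_ers_result new_result)
{
    return rank(new_result) > rank(cur) ? new_result : cur;
}
```

With this precedence, a single PCI_ERS_RESULT_NEED_RESET overrides any
number of PCI_ERS_RESULT_CAN_RECOVER answers, which matches the decision
rules that follow.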
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enabled).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).

.. note::

   The current powerpc implementation assumes that a device driver will
   *not* schedule or semaphore in this routine; the current powerpc
   implementation uses one kernel thread to notify all devices;
   thus, if one device sleeps/schedules, all devices are affected.
   Doing better requires complex multi-threaded logic in the error
   recovery implementation (e.g. waiting for all notification threads
   to "join" before proceeding with recovery.)  This seems excessively
   complex and not worth implementing.

   The current powerpc implementation doesn't much care if the device
   attempts I/O at this point, or not.  I/Os will fail, returning
   a value of 0xff on read, and writes will be dropped.  If more than
   EEH_MAX_FAILS I/Os are attempted to a frozen adapter, EEH
   assumes that the device driver has gone into an infinite loop
   and prints an error to syslog.  A reboot is then required to
   get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call.  IOs are allowed again, but DMA is
not, with some restrictions.
This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations.  This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW.  If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

.. note::

   The following is proposed; no platform implements this yet:
   Proposal: All I/Os should be done _synchronously_ from within
   this callback, errors triggered by them will be returned via
   the normal pci_check_whatever() API, no new error_detected()
   callback will be issued due to an error happening here.  However,
   such an error might cause IOs to be re-blocked for the whole
   segment, and thus invalidate the recovery that other devices
   on the same segment might have done, forcing the whole segment
   into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

  - PCI_ERS_RESULT_RECOVERED
      Driver returns this if it thinks the device is fully
      functional and thinks it is ready to start
      normal driver operations again.  There is no
      guarantee that the driver will actually be
      allowed to proceed, as another driver on the
      same segment might have failed and thus triggered a
      slot reset on platforms that support it.

  - PCI_ERS_RESULT_NEED_RESET
      Driver returns this if it thinks the device is not
      recoverable in its current state and it needs a slot
      reset to proceed.

  - PCI_ERS_RESULT_DISCONNECT
      Same as above.
      Total failure, no recovery even after reset; the driver
      is dead.  (To be defined more precisely)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link.  This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent.  Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental reset (optional).

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BARs and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot reset.

Powerpc fundamental reset is supported by PCI Express cards only
and results in the device's state machines, hardware logic, port states
and configuration registers being initialized to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state".  After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.).  At this point, the driver may assume
that the card is in a fresh state and is fully functional.  The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA.  Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point.  If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset.  If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again.  If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case.  The
device will be considered "dead" in this case.
Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization.  For example, the Symbios sym53c8xx_2
driver performs device init only from PCI function 0::

    +       if (PCI_FUNC(pdev->devfn) == 0)
    +               sym_reset_scsi_bus(np, 0);

Result codes:
  - PCI_ERS_RESULT_DISCONNECT
      Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types::

    +       /* Set EEH reset type to fundamental if required by hba */
    +       if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
    +               pdev->needs_freset = 1;
    +

The platform proceeds either to STEP 5 (Resume Operations) or STEP 6
(Permanent Failure).

.. note::

   The current powerpc implementation does not try a power-cycle
   reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
   However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running.  This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device.
The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst.  It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers.  The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way.  If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, that not all failures are truly "permanent".  Some are
caused by over-heating, some by a poorly seated card.  Many
PCI error events are caused by software bugs, e.g. DMAs to
wild addresses or bogus split transactions due to programming
errors.  See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy.  A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover.  Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts.  If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

- There is no guarantee that interrupt delivery can proceed from any
  device on the segment starting from the error detection and until the
  slot_reset callback is called, at which point interrupts are expected
  to be fully operational.

- There is no guarantee that interrupt delivery is stopped, that is,
  a driver that gets an interrupt after detecting an error, or that detects
  an error within the interrupt handler such that it prevents proper
  ack'ing of the interrupt (and thus removal of the source) should just
  return IRQ_NONE.  It's up to the platform to deal with that
  condition, typically by masking the IRQ source during the duration of
  the error handling.  It is expected that the platform "knows" which
  interrupts are routed to error-management capable slots and can deal
  with temporarily disabling that IRQ number during error processing (this
  isn't terribly complex).  That means some IRQ latency for other devices
  sharing the interrupt, but there is simply no other way.  High end
  platforms aren't supposed to share interrupts between many devices
  anyway :)

.. note::

   Implementation details for the powerpc platform are discussed in
   the file Documentation/powerpc/eeh-pci-error-recovery.txt

As of this writing, there is a growing list of device drivers with
patches implementing error recovery.  Not all of these patches are in
mainline yet.  These may be used as "examples":

- drivers/scsi/ipr
- drivers/scsi/sym53c8xx_2
- drivers/scsi/qla2xxx
- drivers/scsi/lpfc
- drivers/net/bnx2.c
- drivers/net/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge
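The advice in the interrupts note, that a handler which cannot service a
dead device's interrupt should simply report it as not handled, can be
sketched as a compilable userspace mock.  irqreturn_t, the IRQ_* values
and pci_channel_offline() are modeled here with simplified stand-ins so
the example builds outside the kernel:

```c
/* Mock of "report the interrupt as not handled when the device is dead". */
typedef enum { IRQ_NONE, IRQ_HANDLED } irqreturn_t;   /* stand-in */

struct pci_dev { int error_state; };                  /* stand-in */

/* Stand-in for the helper drivers use to test the channel state. */
static int pci_channel_offline(struct pci_dev *pdev)
{
    return pdev->error_state != 0;
}

irqreturn_t my_isr(int irq, void *data)
{
    struct pci_dev *pdev = data;
    (void)irq;

    /* If the slot is isolated, reads return 0xffffffff and the interrupt
     * source cannot be ack'ed; bail out and let the platform mask the IRQ
     * for the duration of error handling. */
    if (pci_channel_offline(pdev))
        return IRQ_NONE;

    /* ... normal interrupt servicing would go here ... */
    return IRQ_HANDLED;
}
```

This keeps a shared interrupt line from being spun on by a driver whose
device has been isolated, matching the platform-policy discussion above.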
Documentation/PCI/pci-error-recovery.txt  (deleted, -413 lines)
··· 1 - 2 - PCI Error Recovery 3 - ------------------ 4 - February 2, 2006 5 - 6 - Current document maintainer: 7 - Linas Vepstas <linasvepstas@gmail.com> 8 - updated by Richard Lary <rlary@us.ibm.com> 9 - and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009 10 - 11 - 12 - Many PCI bus controllers are able to detect a variety of hardware 13 - PCI errors on the bus, such as parity errors on the data and address 14 - buses, as well as SERR and PERR errors. Some of the more advanced 15 - chipsets are able to deal with these errors; these include PCI-E chipsets, 16 - and the PCI-host bridges found on IBM Power4, Power5 and Power6-based 17 - pSeries boxes. A typical action taken is to disconnect the affected device, 18 - halting all I/O to it. The goal of a disconnection is to avoid system 19 - corruption; for example, to halt system memory corruption due to DMA's 20 - to "wild" addresses. Typically, a reconnection mechanism is also 21 - offered, so that the affected PCI device(s) are reset and put back 22 - into working condition. The reset phase requires coordination 23 - between the affected device drivers and the PCI controller chip. 24 - This document describes a generic API for notifying device drivers 25 - of a bus disconnection, and then performing error recovery. 26 - This API is currently implemented in the 2.6.16 and later kernels. 27 - 28 - Reporting and recovery is performed in several steps. First, when 29 - a PCI hardware error has resulted in a bus disconnect, that event 30 - is reported as soon as possible to all affected device drivers, 31 - including multiple instances of a device driver on multi-function 32 - cards. This allows device drivers to avoid deadlocking in spinloops, 33 - waiting for some i/o-space register to change, when it never will. 34 - It also gives the drivers a chance to defer incoming I/O as 35 - needed. 36 - 37 - Next, recovery is performed in several stages. 
Most of the complexity 38 - is forced by the need to handle multi-function devices, that is, 39 - devices that have multiple device drivers associated with them. 40 - In the first stage, each driver is allowed to indicate what type 41 - of reset it desires, the choices being a simple re-enabling of I/O 42 - or requesting a slot reset. 43 - 44 - If any driver requests a slot reset, that is what will be done. 45 - 46 - After a reset and/or a re-enabling of I/O, all drivers are 47 - again notified, so that they may then perform any device setup/config 48 - that may be required. After these have all completed, a final 49 - "resume normal operations" event is sent out. 50 - 51 - The biggest reason for choosing a kernel-based implementation rather 52 - than a user-space implementation was the need to deal with bus 53 - disconnects of PCI devices attached to storage media, and, in particular, 54 - disconnects from devices holding the root file system. If the root 55 - file system is disconnected, a user-space mechanism would have to go 56 - through a large number of contortions to complete recovery. Almost all 57 - of the current Linux file systems are not tolerant of disconnection 58 - from/reconnection to their underlying block device. By contrast, 59 - bus errors are easy to manage in the device driver. Indeed, most 60 - device drivers already handle very similar recovery procedures; 61 - for example, the SCSI-generic layer already provides significant 62 - mechanisms for dealing with SCSI bus errors and SCSI bus resets. 63 - 64 - 65 - Detailed Design 66 - --------------- 67 - Design and implementation details below, based on a chain of 68 - public email discussions with Ben Herrenschmidt, circa 5 April 2005. 69 - 70 - The error recovery API support is exposed to the driver in the form of 71 - a structure of function pointers pointed to by a new field in struct 72 - pci_driver. 
A driver that fails to provide the structure is "non-aware", 73 - and the actual recovery steps taken are platform dependent. The 74 - arch/powerpc implementation will simulate a PCI hotplug remove/add. 75 - 76 - This structure has the form: 77 - struct pci_error_handlers 78 - { 79 - int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); 80 - int (*mmio_enabled)(struct pci_dev *dev); 81 - int (*slot_reset)(struct pci_dev *dev); 82 - void (*resume)(struct pci_dev *dev); 83 - }; 84 - 85 - The possible channel states are: 86 - enum pci_channel_state { 87 - pci_channel_io_normal, /* I/O channel is in normal state */ 88 - pci_channel_io_frozen, /* I/O to channel is blocked */ 89 - pci_channel_io_perm_failure, /* PCI card is dead */ 90 - }; 91 - 92 - Possible return values are: 93 - enum pci_ers_result { 94 - PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ 95 - PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ 96 - PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ 97 - PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ 98 - PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ 99 - }; 100 - 101 - A driver does not have to implement all of these callbacks; however, 102 - if it implements any, it must implement error_detected(). If a callback 103 - is not implemented, the corresponding feature is considered unsupported. 104 - For example, if mmio_enabled() and resume() aren't there, then it 105 - is assumed that the driver is not doing any direct recovery and requires 106 - a slot reset. Typically a driver will want to know about 107 - a slot_reset(). 108 - 109 - The actual steps taken by a platform to recover from a PCI error 110 - event will be platform-dependent, but will follow the general 111 - sequence described below. 
112 - 113 - STEP 0: Error Event 114 - ------------------- 115 - A PCI bus error is detected by the PCI hardware. On powerpc, the slot 116 - is isolated, in that all I/O is blocked: all reads return 0xffffffff, 117 - all writes are ignored. 118 - 119 - 120 - STEP 1: Notification 121 - -------------------- 122 - Platform calls the error_detected() callback on every instance of 123 - every driver affected by the error. 124 - 125 - At this point, the device might not be accessible anymore, depending on 126 - the platform (the slot will be isolated on powerpc). The driver may 127 - already have "noticed" the error because of a failing I/O, but this 128 - is the proper "synchronization point", that is, it gives the driver 129 - a chance to cleanup, waiting for pending stuff (timers, whatever, etc...) 130 - to complete; it can take semaphores, schedule, etc... everything but 131 - touch the device. Within this function and after it returns, the driver 132 - shouldn't do any new IOs. Called in task context. This is sort of a 133 - "quiesce" point. See note about interrupts at the end of this doc. 134 - 135 - All drivers participating in this system must implement this call. 136 - The driver must return one of the following result codes: 137 - - PCI_ERS_RESULT_CAN_RECOVER: 138 - Driver returns this if it thinks it might be able to recover 139 - the HW by just banging IOs or if it wants to be given 140 - a chance to extract some diagnostic information (see 141 - mmio_enable, below). 142 - - PCI_ERS_RESULT_NEED_RESET: 143 - Driver returns this if it can't recover without a 144 - slot reset. 145 - - PCI_ERS_RESULT_DISCONNECT: 146 - Driver returns this if it doesn't want to recover at all. 147 - 148 - The next step taken will depend on the result codes returned by the 149 - drivers. 
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).

>>> The current powerpc implementation assumes that a device driver will
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.

>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If more than
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
>>> assumes that the device driver has gone into an infinite loop
>>> and prints an error to syslog. A reboot is then required to
>>> get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call. IOs are allowed again, but DMA is
not, with some restrictions.
This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations. This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via
>>> the normal pci_check_whatever() API, no new error_detected()
>>> callback will be issued due to an error happening here. However,
>>> such an error might cause IOs to be re-blocked for the whole
>>> segment, and thus invalidate the recovery that other devices
>>> on the same segment might have done, forcing the whole segment
>>> into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

  - PCI_ERS_RESULT_RECOVERED:
	Driver returns this if it thinks the device is fully
	functional and thinks it is ready to start
	normal driver operations again. There is no
	guarantee that the driver will actually be
	allowed to proceed, as another driver on the
	same segment might have failed and thus triggered a
	slot reset on platforms that support it.

  - PCI_ERS_RESULT_NEED_RESET:
	Driver returns this if it thinks the device is not
	recoverable in its current state and it needs a slot
	reset to proceed.

  - PCI_ERS_RESULT_DISCONNECT:
	Same as above.
Total failure, no recovery even after
	reset; driver dead. (To be defined more precisely.)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link. This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent. Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental reset (optional).

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BARs and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and results in the device's state machines, hardware logic, port states and
configuration registers initializing to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state". After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
that the card is in a fresh state and is fully functional. The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.
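The escalation just described (soft reset, then a power-cycle retry, then
permanent failure) can be modeled in self-contained user-space C. The reset
primitives and the two-strikes policy below are hypothetical stand-ins for
whatever a given platform actually implements:

```c
#include <assert.h>
#include <stdbool.h>

enum pci_ers_result {
	PCI_ERS_RESULT_RECOVERED = 1,
	PCI_ERS_RESULT_DISCONNECT,
};

/* Hypothetical driver callback: this device happens to need the
 * stronger power-cycle reset; a soft reset is not enough for it. */
static enum pci_ers_result demo_slot_reset(bool hard_reset_done)
{
	return hard_reset_done ? PCI_ERS_RESULT_RECOVERED
			       : PCI_ERS_RESULT_DISCONNECT;
}

/* Platform-side escalation: soft reset first; on failure, one
 * power-cycle retry; on a second failure, STEP 6 (permanent failure). */
static bool platform_recover(void)
{
	if (demo_slot_reset(false) == PCI_ERS_RESULT_RECOVERED)
		return true;		/* soft reset was enough */
	if (demo_slot_reset(true) == PCI_ERS_RESULT_RECOVERED)
		return true;		/* power cycle saved it */
	return false;			/* give up: permanent failure */
}
```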
Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0:

	+	if (PCI_FUNC(pdev->devfn) == 0)
	+		sym_reset_scsi_bus(np, 0);

Result codes:
  - PCI_ERS_RESULT_DISCONNECT:
	Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types:

	+	/* Set EEH reset type to fundamental if required by hba */
	+	if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
	+		pdev->needs_freset = 1;

The platform proceeds either to STEP 5 (Resume Operations) or STEP 6
(Permanent Failure).

>>> The current powerpc implementation does not try a power-cycle
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running. This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device.
The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst. It should
cancel all pending I/O and refuse all new I/O, returning -EIO to
higher layers. The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way. If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, that not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMAs to
wild addresses or bogus split transactions due to programming
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

  - There is no guarantee that interrupt delivery can proceed from any
    device on the segment starting from the error detection and until the
    slot_reset callback is called, at which point interrupts are expected
    to be fully operational.

  - There is no guarantee that interrupt delivery is stopped, that is,
    a driver that gets an interrupt after detecting an error, or that detects
    an error within the interrupt handler such that it prevents proper
    ack'ing of the interrupt (and thus removal of the source) should just
    return IRQ_NONE. It's up to the platform to deal with that
    condition, typically by masking the IRQ source during the duration of
    the error handling. It is expected that the platform "knows" which
    interrupts are routed to error-management capable slots and can deal
    with temporarily disabling that IRQ number during error processing (this
    isn't terribly complex). That means some IRQ latency for other devices
    sharing the interrupt, but there is simply no other way. High end
    platforms aren't supposed to share interrupts between many devices
    anyway :)

>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt

>>> As of this writing, there is a growing list of device drivers with
>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet.
These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/net/bnx2.c
>>> drivers/net/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge

The End
-------
Documentation/PCI/pci-iov-howto.rst
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

====================================
PCI Express I/O Virtualization Howto
====================================

:Copyright: |copy| 2009 Intel Corporation
:Authors: - Yu Zhao <yu.zhao@intel.com>
          - Donald Dutile <ddutile@redhat.com>

Overview
========

What is SR-IOV
--------------

Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
capability which makes one physical device appear as multiple virtual
devices. The physical device is referred to as the Physical Function (PF),
while the virtual devices are referred to as Virtual Functions (VFs).
Allocation of VFs can be dynamically controlled by the PF via
registers encapsulated in the capability. By default, this feature is
not enabled and the PF behaves as a traditional PCIe device. Once it's
turned on, each VF's PCI configuration space can be accessed by its own
Bus, Device and Function Number (Routing ID), and each VF also has PCI
Memory Space, which is used to map its register set. The VF device driver
operates on the register set so it can be functional and appear as a
real existing PCI device.

User Guide
==========

How can I enable SR-IOV capability
----------------------------------

Multiple methods are available for SR-IOV enablement.
In the first method, the device driver (PF driver) controls the
enabling and disabling of the capability via an API provided by the
SR-IOV core. If the hardware has SR-IOV capability, loading its PF driver
would enable it and all VFs associated with the PF. Some PF drivers require
a module parameter to be set to determine the number of VFs to enable.
In the second method, a write to the sysfs file sriov_numvfs will
enable and disable the VFs associated with a PCIe PF.
This method
enables per-PF VF enable/disable values, versus the first method,
which applies to all PFs of the same device. Additionally, the
PCI SR-IOV core support ensures that enable/disable operations are
valid, to reduce duplication of the same checks in multiple drivers,
e.g., check that numvfs == 0 before enabling VFs, ensure
numvfs <= totalvfs.
The second method is the recommended method for new/future VF devices.

How can I use the Virtual Functions
-----------------------------------

VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF
requires a device driver, the same as a normal PCI device's.

Developer Guide
===============

SR-IOV API
----------

To enable SR-IOV capability:

(a) For the first method, in the driver::

	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);

    'nr_virtfn' is the number of VFs to be enabled.

(b) For the second method, from sysfs::

	echo 'nr_virtfn' > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To disable SR-IOV capability:

(a) For the first method, in the driver::

	void pci_disable_sriov(struct pci_dev *dev);

(b) For the second method, from sysfs::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To enable auto probing of VFs by a compatible driver on the host, run the
command below before enabling SR-IOV capabilities. This is the
default behavior.
::

	echo 1 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

To disable auto probing of VFs by a compatible driver on the host, run the
command below before enabling SR-IOV capabilities. Updating this
entry will not affect VFs which are already probed.
::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

Usage example
-------------

The following piece of code illustrates the usage of the SR-IOV API.
::

	static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
	{
		pci_enable_sriov(dev, NR_VIRTFN);

		...

		return 0;
	}

	static void dev_remove(struct pci_dev *dev)
	{
		pci_disable_sriov(dev);

		...
	}

	static int dev_suspend(struct pci_dev *dev, pm_message_t state)
	{
		...

		return 0;
	}

	static int dev_resume(struct pci_dev *dev)
	{
		...

		return 0;
	}

	static void dev_shutdown(struct pci_dev *dev)
	{
		...
	}

	static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
	{
		if (numvfs > 0) {
			...
			pci_enable_sriov(dev, numvfs);
			...
			return numvfs;
		}
		if (numvfs == 0) {
			...
			pci_disable_sriov(dev);
			...
			return 0;
		}
	}

	static struct pci_driver dev_driver = {
		.name = "SR-IOV Physical Function driver",
		.id_table = dev_id_table,
		.probe = dev_probe,
		.remove = dev_remove,
		.suspend = dev_suspend,
		.resume = dev_resume,
		.shutdown = dev_shutdown,
		.sriov_configure = dev_sriov_configure,
	};
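The numvfs validation that the SR-IOV core performs on a sriov_numvfs write
can be sketched in self-contained user-space C. sriov_numvfs_store_check()
and the demo_pf state are hypothetical stand-ins for the real sysfs store
handler, modeling only the checks named above:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical PF state: how many VFs are enabled now, and the
 * hardware maximum advertised by the SR-IOV capability (TotalVFs). */
struct demo_pf {
	int num_vfs;	/* currently enabled */
	int total_vfs;	/* maximum supported */
};

/* Model of a sriov_numvfs write: range check against totalvfs,
 * and "write 0 first" before changing a non-zero VF count. */
static int sriov_numvfs_store_check(struct demo_pf *pf, int numvfs)
{
	if (numvfs < 0 || numvfs > pf->total_vfs)
		return -ERANGE;
	if (numvfs == pf->num_vfs)
		return 0;			/* no change requested */
	if (numvfs != 0 && pf->num_vfs != 0)
		return -EBUSY;			/* must disable first */
	pf->num_vfs = numvfs;			/* enable or disable */
	return 0;
}
```

Centralizing these checks in the core is exactly what lets individual PF
drivers skip re-implementing them, as the User Guide notes.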
Documentation/PCI/pci.rst
.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that the PCI
support in the Linux kernel is not as trivial as one would wish. This short
paper tries to introduce all potential driver authors to the Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under a Creative Commons License) from:
http://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about the Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.
Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc) parts of the chip
  - Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions, either
completely empty or just returning an appropriate error code, to avoid
lots of ifdefs in the drivers.


pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry. Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below::

	echo "vendor device subvendor subdevice class class_mask driver_data" > \
	/sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional. Users
need pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and classmask fields default to 0
  - driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.


"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

	======	=================================================
	__init	Initialization code. Thrown away after the driver
		initializes.
	__exit	Exit code. Ignored for non-modular drivers.
	======	=================================================

Tips on when/where to use the above attributes:

  - The module_init()/module_exit() functions (and all
    initialization functions called _only_ from these)
    should be marked __init/__exit.

  - Do not mark the struct pci_driver.

  - Do NOT mark a function if you are not sure which mark to use.
    Better to not mark the function than mark the function wrong.


How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

	struct pci_dev *dev = NULL;
	while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev))
		configure_device(dev);

Searching by class ID (iterate in a similar way)::

	pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

	pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe. They increment the reference count on
the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().
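The refcount contract behind that iteration pattern can be modeled in
self-contained user-space C. The demo_pci_dev type, the two-device "bus",
and the get/put helpers are hypothetical stand-ins for struct pci_dev,
pci_get_device() and pci_dev_put(); like the real iterator, the get helper
drops the reference held on the device passed in as the cursor:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: just a refcount. */
struct demo_pci_dev { int refcount; };

static struct demo_pci_dev devices[2];	/* pretend bus with two matches */

static void demo_pci_dev_put(struct demo_pci_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* Like pci_get_device(): returns the next match with its refcount
 * raised, and drops the reference held on 'from'.  A caller that
 * breaks out of the loop early must put the last device itself. */
static struct demo_pci_dev *demo_pci_get_device(struct demo_pci_dev *from)
{
	size_t next = from ? (size_t)(from - devices) + 1 : 0;

	demo_pci_dev_put(from);
	if (next >= sizeof(devices) / sizeof(devices[0]))
		return NULL;
	devices[next].refcount++;
	return &devices[next];
}
```

Running the iteration to completion therefore leaves every refcount
balanced with no explicit put in the loop body, which is why the
`while (dev = pci_get_device(...))` idiom above is leak-free.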
Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc) parts of the chip
  - Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away... but
that will just result in a PCI Bus Master Abort and config reads
will return garbage.)


Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_resources() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   http://lkml.org/lkml/2006/3/2/194


pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register.
It also fixes the latency timer value if 222 + it's set to something bogus by the BIOS. pci_clear_master() will 223 + disable DMA by clearing the bus master bit. 224 + 225 + If the PCI device can use the PCI Memory-Write-Invalidate transaction, 226 + call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval 227 + and also ensures that the cache line size register is set correctly. 228 + Check the return value of pci_set_mwi() as not all architectures 229 + or chip-sets may support Memory-Write-Invalidate. Alternatively, 230 + if Mem-Wr-Inval would be nice to have but is not required, call 231 + pci_try_set_mwi() to have the system do its best effort at enabling 232 + Mem-Wr-Inval. 233 + 234 + 235 + Request MMIO/IOP resources 236 + -------------------------- 237 + Memory (MMIO) and I/O port addresses should NOT be read directly 238 + from the PCI device config space. Use the values in the pci_dev structure 239 + as the PCI "bus address" might have been remapped to a "host physical" 240 + address by the arch/chip-set specific kernel support. 241 + 242 + See Documentation/io-mapping.txt for how to access device registers 243 + or device memory. 244 + 245 + The device driver needs to call pci_request_region() to verify 246 + no other device is already using the same address resource. 247 + Conversely, drivers should call pci_release_region() AFTER 248 + calling pci_disable_device(). 249 + The idea is to prevent two devices colliding on the same address range. 250 + 251 + .. tip:: 252 + See OS BUG comment above. Currently (2.6.19), the driver can only 253 + determine MMIO and IO Port resource availability _after_ calling 254 + pci_enable_device(). 255 + 256 + Generic flavors of pci_request_region() are request_mem_region() 257 + (for MMIO ranges) and request_region() (for IO Port ranges). 258 + Use these for address resources that are not described by "normal" PCI 259 + BARs. 260 + 261 + Also see pci_request_selected_regions() below.
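A minimal probe-time sketch of the above, assuming BAR 0 is an MMIO region; the "my_driver" name is illustrative and error unwinding is abbreviated:

```c
#include <linux/pci.h>

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	void __iomem *regs;
	int err;

	err = pci_enable_device(pdev);
	if (err)
		return err;			/* pci_enable_device() can fail! */

	/* Claim BAR 0 so two drivers can't collide on the range. */
	err = pci_request_region(pdev, 0, "my_driver");
	if (err) {
		pci_disable_device(pdev);
		return err;
	}

	regs = pci_iomap(pdev, 0, 0);		/* map all of BAR 0 */
	if (!regs) {
		pci_disable_device(pdev);
		pci_release_region(pdev, 0);	/* release AFTER disable */
		return -ENOMEM;
	}

	/* ... device-specific setup ... */
	return 0;
}
```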
262 + 263 + 264 + Set the DMA mask size 265 + --------------------- 266 + .. note:: 267 + If anything below doesn't make sense, please refer to 268 + Documentation/DMA-API.txt. This section is just a reminder that 269 + drivers need to indicate DMA capabilities of the device and is not 270 + an authoritative source for DMA interfaces. 271 + 272 + While all drivers should explicitly indicate the DMA capability 273 + (e.g. 32 or 64 bit) of the PCI bus master, devices with more than 274 + 32-bit bus master capability for streaming data need the driver 275 + to "register" this capability by calling pci_set_dma_mask() with 276 + appropriate parameters. In general this allows more efficient DMA 277 + on systems where System RAM exists above 4G _physical_ address. 278 + 279 + Drivers for all PCI-X and PCIe compliant devices must call 280 + pci_set_dma_mask() as they are 64-bit DMA devices. 281 + 282 + Similarly, drivers must also "register" this capability if the device 283 + can directly address "consistent memory" in System RAM above 4G physical 284 + address by calling pci_set_consistent_dma_mask(). 285 + Again, this includes drivers for all PCI-X and PCIe compliant devices. 286 + Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are 287 + 64-bit DMA capable for payload ("streaming") data but not control 288 + ("consistent") data. 289 + 290 + 291 + Setup shared control data 292 + ------------------------- 293 + Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) 294 + memory. See Documentation/DMA-API.txt for a full description of 295 + the DMA APIs. This section is just a reminder that it needs to be done 296 + before enabling DMA on the device. 297 + 298 + 299 + Initialize device registers 300 + --------------------------- 301 + Some drivers will need specific "capability" fields programmed 302 + or other "vendor specific" registers initialized or reset. 303 + E.g. clearing pending interrupts.
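As a hedged example of that last step, a driver might acknowledge stale interrupt state before hooking up its handler; MY_IRQ_STATUS is a made-up, device-specific write-1-to-clear register, not a kernel symbol:

```c
/* Ack anything left pending from before this driver attached, so a
 * shared-IRQ handler is not invoked on stale device state. */
writel(~0U, regs + MY_IRQ_STATUS);	/* write-1-to-clear all bits */
readl(regs + MY_IRQ_STATUS);		/* read back to flush the posted write */
```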
304 + 305 + 306 + Register IRQ handler 307 + -------------------- 308 + While calling request_irq() is the last step described here, 309 + this is often just another intermediate step to initialize a device. 310 + This step can often be deferred until the device is opened for use. 311 + 312 + All interrupt handlers for IRQ lines should be registered with IRQF_SHARED 313 + and use the devid to map IRQs to devices (remember that all PCI IRQ lines 314 + can be shared). 315 + 316 + request_irq() will associate an interrupt handler and device handle 317 + with an interrupt number. Historically interrupt numbers represent 318 + IRQ lines which run from the PCI device to the Interrupt controller. 319 + With MSI and MSI-X (more below) the interrupt number is a CPU "vector". 320 + 321 + request_irq() also enables the interrupt. Make sure the device is 322 + quiesced and does not have any interrupts pending before registering 323 + the interrupt handler. 324 + 325 + MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" 326 + which deliver interrupts to the CPU via a DMA write to a Local APIC. 327 + The fundamental difference between MSI and MSI-X is how multiple 328 + "vectors" get allocated. MSI requires contiguous blocks of vectors 329 + while MSI-X can allocate several individual ones. 330 + 331 + MSI capability can be enabled by calling pci_alloc_irq_vectors() with the 332 + PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This 333 + causes the PCI support to program CPU vector data into the PCI device 334 + capability registers. Many architectures, chip-sets, or BIOSes do NOT 335 + support MSI or MSI-X and a call to pci_alloc_irq_vectors with just 336 + the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always 337 + specify PCI_IRQ_LEGACY as well. 
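A sketch of that allocation pattern (my_handler and the "my_driver" string are placeholders):

```c
int nvec, err;

/* Try MSI-X first, then MSI, then fall back to legacy INTx. */
nvec = pci_alloc_irq_vectors(pdev, 1, 1,
			     PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
if (nvec < 0)
	return nvec;

/* pci_irq_vector() maps vector index 0 to a Linux IRQ number,
 * whichever interrupt type was actually granted. */
err = request_irq(pci_irq_vector(pdev, 0), my_handler,
		  IRQF_SHARED, "my_driver", pdev);
```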
338 + 339 + Drivers that have different interrupt handlers for MSI/MSI-X and 340 + legacy INTx should choose the right one based on the msi_enabled 341 + and msix_enabled flags in the pci_dev structure after calling 342 + pci_alloc_irq_vectors(). 343 + 344 + There are (at least) two really good reasons for using MSI: 345 + 346 + 1) MSI is an exclusive interrupt vector by definition. 347 + This means the interrupt handler doesn't have to verify 348 + its device caused the interrupt. 349 + 350 + 2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed 351 + to be visible to the host CPU(s) when the MSI is delivered. This 352 + is important for both data coherency and avoiding stale control data. 353 + This guarantee allows the driver to omit MMIO reads to flush 354 + the DMA stream. 355 + 356 + See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples 357 + of MSI/MSI-X usage. 358 + 359 + 360 + PCI device shutdown 361 + =================== 362 + 363 + When a PCI device driver is being unloaded, most of the following 364 + steps need to be performed: 365 + 366 + - Disable the device from generating IRQs 367 + - Release the IRQ (free_irq()) 368 + - Stop all DMA activity 369 + - Release DMA buffers (both streaming and consistent) 370 + - Unregister from other subsystems (e.g. scsi or netdev) 371 + - Disable device from responding to MMIO/IO Port addresses 372 + - Release MMIO/IO Port resource(s) 373 + 374 + 375 + Stop IRQs on the device 376 + ----------------------- 377 + How to do this is chip/device specific. If it's not done, it opens 378 + the possibility of a "screaming interrupt" if (and only if) 379 + the IRQ is shared with another device. 380 + 381 + When the shared IRQ handler is "unhooked", the remaining devices 382 + using the same IRQ line will still need the IRQ enabled. Thus if the 383 + "unhooked" device asserts the IRQ line, the system will respond as 384 + though one of the remaining devices had asserted the IRQ line.
Since none 385 + of the other devices will handle the IRQ, the system will "hang" until 386 + it decides the IRQ isn't going to get handled and masks the IRQ (100,000 387 + iterations later). Once the shared IRQ is masked, the remaining devices 388 + will stop functioning properly. Not a nice situation. 389 + 390 + This is another reason to use MSI or MSI-X if it's available. 391 + MSI and MSI-X are defined to be exclusive interrupts and thus 392 + are not susceptible to the "screaming interrupt" problem. 393 + 394 + 395 + Release the IRQ 396 + --------------- 397 + Once the device is quiesced (no more IRQs), one can call free_irq(). 398 + This function will return control once any pending IRQs are handled, 399 + "unhook" the driver's IRQ handler from that IRQ, and finally release 400 + the IRQ if no one else is using it. 401 + 402 + 403 + Stop all DMA activity 404 + --------------------- 405 + It's extremely important to stop all DMA operations BEFORE attempting 406 + to deallocate DMA control data. Failure to do so can result in memory 407 + corruption, hangs, and on some chip-sets a hard crash. 408 + 409 + Stopping DMA after stopping the IRQs can avoid races where the 410 + IRQ handler might restart DMA engines. 411 + 412 + While this step sounds obvious and trivial, several "mature" drivers 413 + didn't get this step right in the past. 414 + 415 + 416 + Release DMA buffers 417 + ------------------- 418 + Once DMA is stopped, clean up streaming DMA first. 419 + I.e. unmap data buffers and return buffers to the "upstream" 420 + owner, if there is one. 421 + 422 + Then clean up "consistent" buffers which contain the control data. 423 + 424 + See Documentation/DMA-API.txt for details on unmapping interfaces. 425 + 426 + 427 + Unregister from other subsystems 428 + -------------------------------- 429 + Most low level PCI device drivers support some other subsystem 430 + like USB, ALSA, SCSI, NetDev, Infiniband, etc.
Make sure your 431 + driver isn't losing resources from that other subsystem. 432 + If this happens, typically the symptom is an Oops (panic) when 433 + the subsystem attempts to call into a driver that has been unloaded. 434 + 435 + 436 + Disable Device from responding to MMIO/IO Port addresses 437 + -------------------------------------------------------- 438 + iounmap() MMIO or IO Port resources and then call pci_disable_device(). 439 + This is the symmetric opposite of pci_enable_device(). 440 + Do not access device registers after calling pci_disable_device(). 441 + 442 + 443 + Release MMIO/IO Port Resource(s) 444 + -------------------------------- 445 + Call pci_release_region() to mark the MMIO or IO Port range as available. 446 + Failure to do so usually results in the inability to reload the driver. 447 + 448 + 449 + How to access PCI config space 450 + ============================== 451 + 452 + You can use `pci_(read|write)_config_(byte|word|dword)` to access the config 453 + space of a device represented by `struct pci_dev *`. All these functions return 454 + 0 when successful or an error code (`PCIBIOS_...`) which can be translated to a 455 + text string by pcibios_strerror(). Most drivers expect that accesses to valid PCI 456 + devices don't fail. 457 + 458 + If you don't have a struct pci_dev available, you can call 459 + `pci_bus_(read|write)_config_(byte|word|dword)` to access a given device 460 + and function on that bus. 461 + 462 + If you access fields in the standard portion of the config header, please 463 + use symbolic names of locations and bits declared in <linux/pci.h>. 464 + 465 + If you need to access Extended PCI Capability registers, call 466 + pci_find_ext_capability() for the particular capability and it will find the 467 + corresponding register block for you.
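For instance, a read-modify-write of a standard header field, using the symbolic names from <linux/pci.h>, plus a capability lookup (sketch only):

```c
u16 cmd, pmcsr;
u8 pm_cap;

pci_read_config_word(pdev, PCI_COMMAND, &cmd);
cmd |= PCI_COMMAND_MEMORY;		/* enable MMIO decoding */
pci_write_config_word(pdev, PCI_COMMAND, cmd);

/* Registers in a capability block are addressed relative to the
 * offset returned by the lookup, e.g. Power Management: */
pm_cap = pci_find_capability(pdev, PCI_CAP_ID_PM);
if (pm_cap)
	pci_read_config_word(pdev, pm_cap + PCI_PM_CTRL, &pmcsr);
```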
468 + 469 + 470 + Other interesting functions 471 + =========================== 472 + 473 + ============================= ================================================ 474 + pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, 475 + bus, and slot/function number. If the device is 476 + found, its reference count is increased. 477 + pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) 478 + pci_find_capability() Find specified capability in device's capability 479 + list. 480 + pci_resource_start() Returns bus start address for a given PCI region 481 + pci_resource_end() Returns bus end address for a given PCI region 482 + pci_resource_len() Returns the byte length of a PCI region 483 + pci_set_drvdata() Set private driver data pointer for a pci_dev 484 + pci_get_drvdata() Return private driver data pointer for a pci_dev 485 + pci_set_mwi() Enable Memory-Write-Invalidate transactions. 486 + pci_clear_mwi() Disable Memory-Write-Invalidate transactions. 487 + ============================= ================================================ 488 + 489 + 490 + Miscellaneous hints 491 + =================== 492 + 493 + When displaying PCI device names to the user (for example when a driver wants 494 + to tell the user what card it has found), please use pci_name(pci_dev). 495 + 496 + Always refer to the PCI devices by a pointer to the pci_dev structure. 497 + All PCI layer functions use this identification and it's the only 498 + reasonable one. Don't use bus/slot/function numbers except for very 499 + special purposes -- on systems with multiple primary buses their semantics 500 + can be pretty complex. 501 + 502 + Don't try to turn on Fast Back to Back writes in your driver. All devices 503 + on the bus need to be capable of doing it, so this is something which needs 504 + to be handled by platform and generic code, not individual drivers.
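For example, the naming hint above boils down to something like:

```c
/* pci_name() yields the canonical "domain:bus:slot.func" form,
 * e.g. "0000:00:1f.6". */
dev_info(&pdev->dev, "claiming device %s\n", pci_name(pdev));
```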
505 + 506 + 507 + Vendor and device identifications 508 + ================================= 509 + 510 + Do not add new device or vendor IDs to include/linux/pci_ids.h unless they 511 + are shared across multiple drivers. You can add private definitions in 512 + your driver if they're helpful, or just use plain hex constants. 513 + 514 + The device IDs are arbitrary hex numbers (vendor controlled) and normally used 515 + only in a single location, the pci_device_id table. 516 + 517 + Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/. 518 + There are mirrors of the pci.ids file at http://pciids.sourceforge.net/ 519 + and https://github.com/pciutils/pciids. 520 + 521 + 522 + Obsolete functions 523 + ================== 524 + 525 + There are several functions which you might come across when trying to 526 + port an old driver to the new PCI interface. They are no longer present 527 + in the kernel as they aren't compatible with hotplug, PCI domains, or 528 + sane locking. 529 + 530 + ================= =========================================== 531 + pci_find_device() Superseded by pci_get_device() 532 + pci_find_subsys() Superseded by pci_get_subsys() 533 + pci_find_slot() Superseded by pci_get_domain_bus_and_slot() 534 + pci_get_slot() Superseded by pci_get_domain_bus_and_slot() 535 + ================= =========================================== 536 + 537 + The alternative is the traditional PCI device driver that walks PCI 538 + device lists. This is still possible but discouraged. 539 + 540 + 541 + MMIO Space and "Write Posting" 542 + ============================== 543 + 544 + Converting a driver from using I/O Port space to using MMIO space 545 + often requires some additional changes. Specifically, "write posting" 546 + needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) 547 + already do this. I/O Port space guarantees write transactions reach the PCI 548 + device before the CPU can continue.
Writes to MMIO space allow the CPU 549 + to continue before the transaction reaches the PCI device. HW weenies 550 + call this "Write Posting" because the write completion is "posted" to 551 + the CPU before the transaction has reached its destination. 552 + 553 + Thus, timing-sensitive code should add readl() where the CPU is 554 + expected to wait before doing other work. The classic "bit banging" 555 + sequence works fine for I/O Port space:: 556 + 557 + for (i = 8; i--; val >>= 1) { 558 + outb(val & 1, ioport_reg); /* write bit */ 559 + udelay(10); 560 + } 561 + 562 + The same sequence for MMIO space should be:: 563 + 564 + for (i = 8; i--; val >>= 1) { 565 + writeb(val & 1, mmio_reg); /* write bit */ 566 + readb(safe_mmio_reg); /* flush posted write */ 567 + udelay(10); 568 + } 569 + 570 + It is important that "safe_mmio_reg" not have any side effects that 571 + interfere with the correct operation of the device. 572 + 573 + Another case to watch out for is when resetting a PCI device. Use PCI 574 + Configuration space reads to flush the writel(). This will gracefully 575 + handle the PCI master abort on all platforms if the PCI device is 576 + expected to not respond to a readl(). Most x86 platforms will allow 577 + MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage 578 + (e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
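A sketch of that reset idiom; MY_RESET_REG and MY_RESET_BIT are made-up device-specific names:

```c
u16 vendor;

writel(MY_RESET_BIT, regs + MY_RESET_REG);
/* Flush the posted write with a config read instead of readl():
 * config space stays accessible while the device resets, and the
 * read won't hard-fail on platforms that crash on an MMIO master
 * abort. */
pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
```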
-636
Documentation/PCI/pci.txt
··· 1 - 2 - How To Write Linux PCI Drivers 3 - 4 - by Martin Mares <mj@ucw.cz> on 07-Feb-2000 5 - updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006 6 - 7 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 8 - The world of PCI is vast and full of (mostly unpleasant) surprises. 9 - Since each CPU architecture implements different chip-sets and PCI devices 10 - have different requirements (erm, "features"), the result is the PCI support 11 - in the Linux kernel is not as trivial as one would wish. This short paper 12 - tries to introduce all potential driver authors to Linux APIs for 13 - PCI device drivers. 14 - 15 - A more complete resource is the third edition of "Linux Device Drivers" 16 - by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. 17 - LDD3 is available for free (under Creative Commons License) from: 18 - 19 - http://lwn.net/Kernel/LDD3/ 20 - 21 - However, keep in mind that all documents are subject to "bit rot". 22 - Refer to the source code if things are not working as described here. 23 - 24 - Please send questions/comments/patches about Linux PCI API to the 25 - "Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list. 26 - 27 - 28 - 29 - 0. Structure of PCI drivers 30 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 31 - PCI drivers "discover" PCI devices in a system via pci_register_driver(). 32 - Actually, it's the other way around. When the PCI generic code discovers 33 - a new device, the driver with a matching "description" will be notified. 34 - Details on this below. 35 - 36 - pci_register_driver() leaves most of the probing for devices to 37 - the PCI layer and supports online insertion/removal of devices [thus 38 - supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. 39 - pci_register_driver() call requires passing in a table of function 40 - pointers and thus dictates the high level structure of a driver. 
41 - 42 - Once the driver knows about a PCI device and takes ownership, the 43 - driver generally needs to perform the following initialization: 44 - 45 - Enable the device 46 - Request MMIO/IOP resources 47 - Set the DMA mask size (for both coherent and streaming DMA) 48 - Allocate and initialize shared control data (pci_allocate_coherent()) 49 - Access device configuration space (if needed) 50 - Register IRQ handler (request_irq()) 51 - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) 52 - Enable DMA/processing engines 53 - 54 - When done using the device, and perhaps the module needs to be unloaded, 55 - the driver needs to take the follow steps: 56 - Disable the device from generating IRQs 57 - Release the IRQ (free_irq()) 58 - Stop all DMA activity 59 - Release DMA buffers (both streaming and coherent) 60 - Unregister from other subsystems (e.g. scsi or netdev) 61 - Release MMIO/IOP resources 62 - Disable the device 63 - 64 - Most of these topics are covered in the following sections. 65 - For the rest look at LDD3 or <linux/pci.h> . 66 - 67 - If the PCI subsystem is not configured (CONFIG_PCI is not set), most of 68 - the PCI functions described below are defined as inline functions either 69 - completely empty or just returning an appropriate error codes to avoid 70 - lots of ifdefs in the drivers. 71 - 72 - 73 - 74 - 1. pci_register_driver() call 75 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 76 - 77 - PCI device drivers call pci_register_driver() during their 78 - initialization with a pointer to a structure describing the driver 79 - (struct pci_driver): 80 - 81 - field name Description 82 - ---------- ------------------------------------------------------ 83 - id_table Pointer to table of device ID's the driver is 84 - interested in. Most drivers should export this 85 - table using MODULE_DEVICE_TABLE(pci,...). 
86 - 87 - probe This probing function gets called (during execution 88 - of pci_register_driver() for already existing 89 - devices or later if a new device gets inserted) for 90 - all PCI devices which match the ID table and are not 91 - "owned" by the other drivers yet. This function gets 92 - passed a "struct pci_dev *" for each device whose 93 - entry in the ID table matches the device. The probe 94 - function returns zero when the driver chooses to 95 - take "ownership" of the device or an error code 96 - (negative number) otherwise. 97 - The probe function always gets called from process 98 - context, so it can sleep. 99 - 100 - remove The remove() function gets called whenever a device 101 - being handled by this driver is removed (either during 102 - deregistration of the driver or when it's manually 103 - pulled out of a hot-pluggable slot). 104 - The remove function always gets called from process 105 - context, so it can sleep. 106 - 107 - suspend Put device into low power state. 108 - suspend_late Put device into low power state. 109 - 110 - resume_early Wake device from low power state. 111 - resume Wake device from low power state. 112 - 113 - (Please see Documentation/power/pci.txt for descriptions 114 - of PCI Power Management and the related functions.) 115 - 116 - shutdown Hook into reboot_notifier_list (kernel/sys.c). 117 - Intended to stop any idling DMA operations. 118 - Useful for enabling wake-on-lan (NIC) or changing 119 - the power state of a device before reboot. 120 - e.g. drivers/net/e100.c. 121 - 122 - err_handler See Documentation/PCI/pci-error-recovery.txt 123 - 124 - 125 - The ID table is an array of struct pci_device_id entries ending with an 126 - all-zero entry. Definitions with static const are generally preferred. 
127 - 128 - Each entry consists of: 129 - 130 - vendor,device Vendor and device ID to match (or PCI_ANY_ID) 131 - 132 - subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) 133 - subdevice, 134 - 135 - class Device class, subclass, and "interface" to match. 136 - See Appendix D of the PCI Local Bus Spec or 137 - include/linux/pci_ids.h for a full list of classes. 138 - Most drivers do not need to specify class/class_mask 139 - as vendor/device is normally sufficient. 140 - 141 - class_mask limit which sub-fields of the class field are compared. 142 - See drivers/scsi/sym53c8xx_2/ for example of usage. 143 - 144 - driver_data Data private to the driver. 145 - Most drivers don't need to use driver_data field. 146 - Best practice is to use driver_data as an index 147 - into a static list of equivalent device types, 148 - instead of using it as a pointer. 149 - 150 - 151 - Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up 152 - a pci_device_id table. 153 - 154 - New PCI IDs may be added to a device driver pci_ids table at runtime 155 - as shown below: 156 - 157 - echo "vendor device subvendor subdevice class class_mask driver_data" > \ 158 - /sys/bus/pci/drivers/{driver}/new_id 159 - 160 - All fields are passed in as hexadecimal values (no leading 0x). 161 - The vendor and device fields are mandatory, the others are optional. Users 162 - need pass only as many optional fields as necessary: 163 - o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) 164 - o class and classmask fields default to 0 165 - o driver_data defaults to 0UL. 166 - 167 - Note that driver_data must match the value used by any of the pci_device_id 168 - entries defined in the driver. This makes the driver_data field mandatory 169 - if all the pci_device_id entries have a non-zero driver_data value. 170 - 171 - Once added, the driver probe routine will be invoked for any unclaimed 172 - PCI devices listed in its (newly updated) pci_ids list. 
173 - 174 - When the driver exits, it just calls pci_unregister_driver() and the PCI layer 175 - automatically calls the remove hook for all devices handled by the driver. 176 - 177 - 178 - 1.1 "Attributes" for driver functions/data 179 - 180 - Please mark the initialization and cleanup functions where appropriate 181 - (the corresponding macros are defined in <linux/init.h>): 182 - 183 - __init Initialization code. Thrown away after the driver 184 - initializes. 185 - __exit Exit code. Ignored for non-modular drivers. 186 - 187 - Tips on when/where to use the above attributes: 188 - o The module_init()/module_exit() functions (and all 189 - initialization functions called _only_ from these) 190 - should be marked __init/__exit. 191 - 192 - o Do not mark the struct pci_driver. 193 - 194 - o Do NOT mark a function if you are not sure which mark to use. 195 - Better to not mark the function than mark the function wrong. 196 - 197 - 198 - 199 - 2. How to find PCI devices manually 200 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 201 - 202 - PCI drivers should have a really good reason for not using the 203 - pci_register_driver() interface to search for PCI devices. 204 - The main reason PCI devices are controlled by multiple drivers 205 - is because one PCI device implements several different HW services. 206 - E.g. combined serial/parallel port/floppy controller. 207 - 208 - A manual search may be performed using the following constructs: 209 - 210 - Searching by vendor and device ID: 211 - 212 - struct pci_dev *dev = NULL; 213 - while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) 214 - configure_device(dev); 215 - 216 - Searching by class ID (iterate in a similar way): 217 - 218 - pci_get_class(CLASS_ID, dev) 219 - 220 - Searching by both vendor/device and subsystem vendor/device ID: 221 - 222 - pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). 
223 - 224 - You can use the constant PCI_ANY_ID as a wildcard replacement for 225 - VENDOR_ID or DEVICE_ID. This allows searching for any device from a 226 - specific vendor, for example. 227 - 228 - These functions are hotplug-safe. They increment the reference count on 229 - the pci_dev that they return. You must eventually (possibly at module unload) 230 - decrement the reference count on these devices by calling pci_dev_put(). 231 - 232 - 233 - 234 - 3. Device Initialization Steps 235 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 236 - 237 - As noted in the introduction, most PCI drivers need the following steps 238 - for device initialization: 239 - 240 - Enable the device 241 - Request MMIO/IOP resources 242 - Set the DMA mask size (for both coherent and streaming DMA) 243 - Allocate and initialize shared control data (pci_allocate_coherent()) 244 - Access device configuration space (if needed) 245 - Register IRQ handler (request_irq()) 246 - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) 247 - Enable DMA/processing engines. 248 - 249 - The driver can access PCI config space registers at any time. 250 - (Well, almost. When running BIST, config space can go away...but 251 - that will just result in a PCI Bus Master Abort and config reads 252 - will return garbage). 253 - 254 - 255 - 3.1 Enable the PCI device 256 - ~~~~~~~~~~~~~~~~~~~~~~~~~ 257 - Before touching any device registers, the driver needs to enable 258 - the PCI device by calling pci_enable_device(). This will: 259 - o wake up the device if it was in suspended state, 260 - o allocate I/O and memory regions of the device (if BIOS did not), 261 - o allocate an IRQ (if BIOS did not). 262 - 263 - NOTE: pci_enable_device() can fail! Check the return value. 264 - 265 - [ OS BUG: we don't check resource allocations before enabling those 266 - resources. The sequence would make more sense if we called 267 - pci_request_resources() before calling pci_enable_device(). 
268 - Currently, the device drivers can't detect the bug when when two 269 - devices have been allocated the same range. This is not a common 270 - problem and unlikely to get fixed soon. 271 - 272 - This has been discussed before but not changed as of 2.6.19: 273 - http://lkml.org/lkml/2006/3/2/194 274 - ] 275 - 276 - pci_set_master() will enable DMA by setting the bus master bit 277 - in the PCI_COMMAND register. It also fixes the latency timer value if 278 - it's set to something bogus by the BIOS. pci_clear_master() will 279 - disable DMA by clearing the bus master bit. 280 - 281 - If the PCI device can use the PCI Memory-Write-Invalidate transaction, 282 - call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval 283 - and also ensures that the cache line size register is set correctly. 284 - Check the return value of pci_set_mwi() as not all architectures 285 - or chip-sets may support Memory-Write-Invalidate. Alternatively, 286 - if Mem-Wr-Inval would be nice to have but is not required, call 287 - pci_try_set_mwi() to have the system do its best effort at enabling 288 - Mem-Wr-Inval. 289 - 290 - 291 - 3.2 Request MMIO/IOP resources 292 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 293 - Memory (MMIO), and I/O port addresses should NOT be read directly 294 - from the PCI device config space. Use the values in the pci_dev structure 295 - as the PCI "bus address" might have been remapped to a "host physical" 296 - address by the arch/chip-set specific kernel support. 297 - 298 - See Documentation/io-mapping.txt for how to access device registers 299 - or device memory. 300 - 301 - The device driver needs to call pci_request_region() to verify 302 - no other device is already using the same address resource. 303 - Conversely, drivers should call pci_release_region() AFTER 304 - calling pci_disable_device(). 305 - The idea is to prevent two devices colliding on the same address range. 306 - 307 - [ See OS BUG comment above. 
Currently (2.6.19), The driver can only 308 - determine MMIO and IO Port resource availability _after_ calling 309 - pci_enable_device(). ] 310 - 311 - Generic flavors of pci_request_region() are request_mem_region() 312 - (for MMIO ranges) and request_region() (for IO Port ranges). 313 - Use these for address resources that are not described by "normal" PCI 314 - BARs. 315 - 316 - Also see pci_request_selected_regions() below. 317 - 318 - 319 - 3.3 Set the DMA mask size 320 - ~~~~~~~~~~~~~~~~~~~~~~~~~ 321 - [ If anything below doesn't make sense, please refer to 322 - Documentation/DMA-API.txt. This section is just a reminder that 323 - drivers need to indicate DMA capabilities of the device and is not 324 - an authoritative source for DMA interfaces. ] 325 - 326 - While all drivers should explicitly indicate the DMA capability 327 - (e.g. 32 or 64 bit) of the PCI bus master, devices with more than 328 - 32-bit bus master capability for streaming data need the driver 329 - to "register" this capability by calling pci_set_dma_mask() with 330 - appropriate parameters. In general this allows more efficient DMA 331 - on systems where System RAM exists above 4G _physical_ address. 332 - 333 - Drivers for all PCI-X and PCIe compliant devices must call 334 - pci_set_dma_mask() as they are 64-bit DMA devices. 335 - 336 - Similarly, drivers must also "register" this capability if the device 337 - can directly address "consistent memory" in System RAM above 4G physical 338 - address by calling pci_set_consistent_dma_mask(). 339 - Again, this includes drivers for all PCI-X and PCIe compliant devices. 340 - Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are 341 - 64-bit DMA capable for payload ("streaming") data but not control 342 - ("consistent") data. 343 - 344 - 345 - 3.4 Setup shared control data 346 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 347 - Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) 348 - memory. 
See Documentation/DMA-API.txt for a full description of 349 - the DMA APIs. This section is just a reminder that it needs to be done 350 - before enabling DMA on the device. 351 - 352 - 353 - 3.5 Initialize device registers 354 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 355 - Some drivers will need specific "capability" fields programmed 356 - or other "vendor specific" register initialized or reset. 357 - E.g. clearing pending interrupts. 358 - 359 - 360 - 3.6 Register IRQ handler 361 - ~~~~~~~~~~~~~~~~~~~~~~~~ 362 - While calling request_irq() is the last step described here, 363 - this is often just another intermediate step to initialize a device. 364 - This step can often be deferred until the device is opened for use. 365 - 366 - All interrupt handlers for IRQ lines should be registered with IRQF_SHARED 367 - and use the devid to map IRQs to devices (remember that all PCI IRQ lines 368 - can be shared). 369 - 370 - request_irq() will associate an interrupt handler and device handle 371 - with an interrupt number. Historically interrupt numbers represent 372 - IRQ lines which run from the PCI device to the Interrupt controller. 373 - With MSI and MSI-X (more below) the interrupt number is a CPU "vector". 374 - 375 - request_irq() also enables the interrupt. Make sure the device is 376 - quiesced and does not have any interrupts pending before registering 377 - the interrupt handler. 378 - 379 - MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" 380 - which deliver interrupts to the CPU via a DMA write to a Local APIC. 381 - The fundamental difference between MSI and MSI-X is how multiple 382 - "vectors" get allocated. MSI requires contiguous blocks of vectors 383 - while MSI-X can allocate several individual ones. 384 - 385 - MSI capability can be enabled by calling pci_alloc_irq_vectors() with the 386 - PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). 
This 387 - causes the PCI support to program CPU vector data into the PCI device 388 - capability registers. Many architectures, chip-sets, or BIOSes do NOT 389 - support MSI or MSI-X and a call to pci_alloc_irq_vectors with just 390 - the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so always try to 391 - specify PCI_IRQ_LEGACY as well. 392 - 393 - Drivers that have different interrupt handlers for MSI/MSI-X and 394 - legacy INTx should choose the right one based on the msi_enabled 395 - and msix_enabled flags in the pci_dev structure after calling 396 - pci_alloc_irq_vectors. 397 - 398 - There are (at least) two really good reasons for using MSI: 399 - 1) MSI is an exclusive interrupt vector by definition. 400 - This means the interrupt handler doesn't have to verify 401 - its device caused the interrupt. 402 - 403 - 2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed 404 - to be visible to the host CPU(s) when the MSI is delivered. This 405 - is important for both data coherency and avoiding stale control data. 406 - This guarantee allows the driver to omit MMIO reads to flush 407 - the DMA stream. 408 - 409 - See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples 410 - of MSI/MSI-X usage. 411 - 412 - 413 - 414 - 4. PCI device shutdown 415 - ~~~~~~~~~~~~~~~~~~~~~~~ 416 - 417 - When a PCI device driver is being unloaded, most of the following 418 - steps need to be performed: 419 - 420 - Disable the device from generating IRQs 421 - Release the IRQ (free_irq()) 422 - Stop all DMA activity 423 - Release DMA buffers (both streaming and consistent) 424 - Unregister from other subsystems (e.g. scsi or netdev) 425 - Disable device from responding to MMIO/IO Port addresses 426 - Release MMIO/IO Port resource(s) 427 - 428 - 429 - 4.1 Stop IRQs on the device 430 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 431 - How to do this is chip/device specific. 
If it's not done, it opens 432 - the possibility of a "screaming interrupt" if (and only if) 433 - the IRQ is shared with another device. 434 - 435 - When the shared IRQ handler is "unhooked", the remaining devices 436 - using the same IRQ line will still need the IRQ enabled. Thus if the 437 - "unhooked" device asserts the IRQ line, the system will respond assuming 438 - one of the remaining devices asserted the IRQ line. Since none 439 - of the other devices will handle the IRQ, the system will "hang" until 440 - it decides the IRQ isn't going to get handled and masks the IRQ (100,000 441 - iterations later). Once the shared IRQ is masked, the remaining devices 442 - will stop functioning properly. Not a nice situation. 443 - 444 - This is another reason to use MSI or MSI-X if it's available. 445 - MSI and MSI-X are defined to be exclusive interrupts and thus 446 - are not susceptible to the "screaming interrupt" problem. 447 - 448 - 449 - 4.2 Release the IRQ 450 - ~~~~~~~~~~~~~~~~~~~ 451 - Once the device is quiesced (no more IRQs), one can call free_irq(). 452 - This function will return control once any pending IRQs are handled, 453 - "unhook" the driver's IRQ handler from that IRQ, and finally release 454 - the IRQ if no one else is using it. 455 - 456 - 457 - 4.3 Stop all DMA activity 458 - ~~~~~~~~~~~~~~~~~~~~~~~~~ 459 - It's extremely important to stop all DMA operations BEFORE attempting 460 - to deallocate DMA control data. Failure to do so can result in memory 461 - corruption, hangs, and on some chip-sets a hard crash. 462 - 463 - Stopping DMA after stopping the IRQs can avoid races where the 464 - IRQ handler might restart DMA engines. 465 - 466 - While this step sounds obvious and trivial, several "mature" drivers 467 - didn't get this step right in the past. 468 - 469 - 470 - 4.4 Release DMA buffers 471 - ~~~~~~~~~~~~~~~~~~~~~~~ 472 - Once DMA is stopped, clean up streaming DMA first. 473 - I.e. 
unmap data buffers and return buffers to the "upstream" 474 - owner, if there is one. 475 - 476 - Then clean up "consistent" buffers which contain the control data. 477 - 478 - See Documentation/DMA-API.txt for details on unmapping interfaces. 479 - 480 - 481 - 4.5 Unregister from other subsystems 482 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 483 - Most low level PCI device drivers support some other subsystem 484 - like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your 485 - driver isn't leaking resources from that other subsystem. 486 - If this happens, typically the symptom is an Oops (panic) when 487 - the subsystem attempts to call into a driver that has been unloaded. 488 - 489 - 490 - 4.6 Disable Device from responding to MMIO/IO Port addresses 491 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 492 - iounmap() MMIO or IO Port resources and then call pci_disable_device(). 493 - This is the symmetric opposite of pci_enable_device(). 494 - Do not access device registers after calling pci_disable_device(). 495 - 496 - 497 - 4.7 Release MMIO/IO Port Resource(s) 498 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 499 - Call pci_release_region() to mark the MMIO or IO Port range as available. 500 - Failure to do so usually results in the inability to reload the driver. 501 - 502 - 503 - 504 - 5. How to access PCI config space 505 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 506 - 507 - You can use pci_(read|write)_config_(byte|word|dword) to access the config 508 - space of a device represented by struct pci_dev *. All these functions return 0 509 - when successful or an error code (PCIBIOS_...) which can be translated to a text 510 - string by pcibios_strerror. Most drivers expect that accesses to valid PCI 511 - devices don't fail. 512 - 513 - If you don't have a struct pci_dev available, you can call 514 - pci_bus_(read|write)_config_(byte|word|dword) to access a given device 515 - and function on that bus. 
 516 - 517 - If you access fields in the standard portion of the config header, please 518 - use symbolic names of locations and bits declared in <linux/pci.h>. 519 - 520 - If you need to access Extended PCI Capability registers, just call 521 - pci_find_capability() for the particular capability and it will find the 522 - corresponding register block for you. 523 - 524 - 525 - 526 - 6. Other interesting functions 527 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 528 - 529 - pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, 530 - bus, and slot number. If the device is 531 - found, its reference count is increased. 532 - pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) 533 - pci_find_capability() Find specified capability in device's capability 534 - list. 535 - pci_resource_start() Returns bus start address for a given PCI region 536 - pci_resource_end() Returns bus end address for a given PCI region 537 - pci_resource_len() Returns the byte length of a PCI region 538 - pci_set_drvdata() Set private driver data pointer for a pci_dev 539 - pci_get_drvdata() Return private driver data pointer for a pci_dev 540 - pci_set_mwi() Enable Memory-Write-Invalidate transactions. 541 - pci_clear_mwi() Disable Memory-Write-Invalidate transactions. 542 - 543 - 544 - 545 - 7. Miscellaneous hints 546 - ~~~~~~~~~~~~~~~~~~~~~~ 547 - 548 - When displaying PCI device names to the user (for example when a driver wants 549 - to tell the user what card it has found), please use pci_name(pci_dev). 550 - 551 - Always refer to the PCI devices by a pointer to the pci_dev structure. 552 - All PCI layer functions use this identification and it's the only 553 - reasonable one. Don't use bus/slot/function numbers except for very 554 - special purposes -- on systems with multiple primary buses their semantics 555 - can be pretty complex. 556 - 557 - Don't try to turn on Fast Back to Back writes in your driver. 
All devices 558 - on the bus need to be capable of doing it, so this is something which needs 559 - to be handled by platform and generic code, not individual drivers. 560 - 561 - 562 - 563 - 8. Vendor and device identifications 564 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 565 - 566 - Do not add new device or vendor IDs to include/linux/pci_ids.h unless they 567 - are shared across multiple drivers. You can add private definitions in 568 - your driver if they're helpful, or just use plain hex constants. 569 - 570 - The device IDs are arbitrary hex numbers (vendor controlled) and normally used 571 - only in a single location, the pci_device_id table. 572 - 573 - Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/. 574 - There are mirrors of the pci.ids file at http://pciids.sourceforge.net/ 575 - and https://github.com/pciutils/pciids. 576 - 577 - 578 - 579 - 9. Obsolete functions 580 - ~~~~~~~~~~~~~~~~~~~~~ 581 - 582 - There are several functions which you might come across when trying to 583 - port an old driver to the new PCI interface. They are no longer present 584 - in the kernel as they aren't compatible with hotplug or PCI domains or 585 - having sane locking. 586 - 587 - pci_find_device() Superseded by pci_get_device() 588 - pci_find_subsys() Superseded by pci_get_subsys() 589 - pci_find_slot() Superseded by pci_get_domain_bus_and_slot() 590 - pci_get_slot() Superseded by pci_get_domain_bus_and_slot() 591 - 592 - 593 - The alternative is the traditional PCI device driver that walks PCI 594 - device lists. This is still possible but discouraged. 595 - 596 - 597 - 598 - 10. MMIO Space and "Write Posting" 599 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 600 - 601 - Converting a driver from using I/O Port space to using MMIO space 602 - often requires some additional changes. Specifically, "write posting" 603 - needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) 604 - already do this. 
I/O Port space guarantees write transactions reach the PCI 605 - device before the CPU can continue. Writes to MMIO space allow the CPU 606 - to continue before the transaction reaches the PCI device. HW weenies 607 - call this "Write Posting" because the write completion is "posted" to 608 - the CPU before the transaction has reached its destination. 609 - 610 - Thus, timing sensitive code should add readl() where the CPU is 611 - expected to wait before doing other work. The classic "bit banging" 612 - sequence works fine for I/O Port space: 613 - 614 - for (i = 8; i--; val >>= 1) { 615 - outb(val & 1, ioport_reg); /* write bit */ 616 - udelay(10); 617 - } 618 - 619 - The same sequence for MMIO space should be: 620 - 621 - for (i = 8; i--; val >>= 1) { 622 - writeb(val & 1, mmio_reg); /* write bit */ 623 - readb(safe_mmio_reg); /* flush posted write */ 624 - udelay(10); 625 - } 626 - 627 - It is important that "safe_mmio_reg" not have any side effects that 628 - interfere with the correct operation of the device. 629 - 630 - Another case to watch out for is when resetting a PCI device. Use PCI 631 - Configuration space reads to flush the writel(). This will gracefully 632 - handle the PCI master abort on all platforms if the PCI device is 633 - expected to not respond to a readl(). Most x86 platforms will allow 634 - MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage 635 - (e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail"). 636 -
+311
Documentation/PCI/pcieaer-howto.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. include:: <isonum.txt> 3 + 4 + =========================================================== 5 + The PCI Express Advanced Error Reporting Driver Guide HOWTO 6 + =========================================================== 7 + 8 + :Authors: - T. Long Nguyen <tom.l.nguyen@intel.com> 9 + - Yanmin Zhang <yanmin.zhang@intel.com> 10 + 11 + :Copyright: |copy| 2006 Intel Corporation 12 + 13 + Overview 14 + ======== 15 + 16 + About this guide 17 + ---------------- 18 + 19 + This guide describes the basics of the PCI Express Advanced Error 20 + Reporting (AER) driver and provides information on how to use it, as 21 + well as how to enable the drivers of endpoint devices to conform with 22 + the PCI Express AER driver. 23 + 24 + 25 + What is the PCI Express AER Driver? 26 + ----------------------------------- 27 + 28 + PCI Express error signaling can occur on the PCI Express link itself 29 + or on behalf of transactions initiated on the link. PCI Express 30 + defines two error reporting paradigms: the baseline capability and 31 + the Advanced Error Reporting capability. The baseline capability is 32 + required of all PCI Express components and provides a minimum defined 33 + set of error reporting requirements. The Advanced Error Reporting 34 + capability is implemented with a PCI Express Advanced Error Reporting 35 + Extended Capability structure and provides more robust error reporting. 36 + 37 + The PCI Express AER driver provides the infrastructure to support the PCI 38 + Express Advanced Error Reporting capability. The PCI Express AER 39 + driver provides three basic functions: 40 + 41 + - Gathers comprehensive error information when errors occur. 42 + - Reports errors to the users. 43 + - Performs error recovery actions. 44 + 45 + The AER driver only attaches to Root Ports that support the PCI Express 46 + AER capability. 
 47 + 48 + 49 + User Guide 50 + ========== 51 + 52 + Include the PCI Express AER Root Driver into the Linux Kernel 53 + ------------------------------------------------------------- 54 + 55 + The PCI Express AER Root driver is a Root Port service driver attached 56 + to the PCI Express Port Bus driver. If a user wants to use it, the driver 57 + has to be compiled. Option CONFIG_PCIEAER supports this capability. It 58 + depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and 59 + CONFIG_PCIEAER=y. 60 + 61 + Load PCI Express AER Root Driver 62 + -------------------------------- 63 + 64 + Some systems have AER support in firmware. Enabling Linux AER support at 65 + the same time the firmware handles AER may result in unpredictable 66 + behavior. Therefore, Linux does not handle AER events unless the firmware 67 + grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 68 + Specification for details regarding _OSC usage. 69 + 70 + AER error output 71 + ---------------- 72 + 73 + When a PCIe AER error is captured, an error message will be output to 74 + the console. If it is a correctable error, it is output as a warning; 75 + otherwise, it is printed as an error, so users can choose a log level 76 + that filters out correctable error messages. 77 + 78 + Below is an example:: 79 + 80 + 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) 81 + 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 82 + 0000:50:00.0: [20] Unsupported Request (First) 83 + 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 84 + 85 + In the example, 'Requester ID' is the ID of the device that sent 86 + the error message to the root port. Please refer to the PCI Express 87 + specifications for the other fields. 
 88 + 89 + AER Statistics / Counters 90 + ------------------------- 91 + 92 + When PCIe AER errors are captured, the counters / statistics are also exposed 93 + in the form of sysfs attributes which are documented at 94 + Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats. 95 + 96 + Developer Guide 97 + =============== 98 + 99 + Enabling AER-aware support requires a software driver to configure 100 + the AER capability structure within its device and to provide callbacks. 101 + 102 + To support AER well, developers first need to understand how AER 103 + works. 104 + 105 + PCI Express errors are classified into two types: correctable errors 106 + and uncorrectable errors. This classification is based on the impact 107 + of those errors, which may result in degraded performance or function 108 + failure. 109 + 110 + Correctable errors have no impact on the functionality of the 111 + interface. The PCI Express protocol can recover without any software 112 + intervention or any loss of data. These errors are detected and 113 + corrected by hardware. Unlike correctable errors, uncorrectable 114 + errors impact functionality of the interface. Uncorrectable errors 115 + can cause a particular transaction or a particular PCI Express link 116 + to be unreliable. Depending on those error conditions, uncorrectable 117 + errors are further classified into non-fatal errors and fatal errors. 118 + Non-fatal errors cause the particular transaction to be unreliable, 119 + but the PCI Express link itself is fully functional. Fatal errors, on 120 + the other hand, cause the link to be unreliable. 121 + 122 + When AER is enabled, a PCI Express device will automatically send an 123 + error message to the PCIe root port above it when the device captures 124 + an error. The Root Port, upon receiving an error reporting message, 125 + internally processes and logs the error message in its PCI Express 126 + capability structure. 
Error information being logged includes storing 127 + the error reporting agent's requestor ID into the Error Source 128 + Identification Registers and setting the error bits of the Root Error 129 + Status Register accordingly. If AER error reporting is enabled in the Root 130 + Error Command Register, the Root Port generates an interrupt when an 131 + error is detected. 132 + 133 + Note that the errors as described above are related to the PCI Express 134 + hierarchy and links. These errors do not include any device specific 135 + errors because device specific errors will still get sent directly to 136 + the device driver. 137 + 138 + Configure the AER capability structure 139 + -------------------------------------- 140 + 141 + AER-aware drivers of PCI Express components need to change the device 142 + control registers to enable AER. They may also change AER registers, 143 + including the mask and severity registers. The helper function 144 + pci_enable_pcie_error_reporting can be used to enable AER. See 145 + the Helper functions section below. 146 + 147 + Provide callbacks 148 + ----------------- 149 + 150 + callback reset_link to reset PCI Express link 151 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 152 + 153 + This callback is used to reset the PCI Express physical link when a 154 + fatal error happens. The Root Port AER service driver provides a 155 + default reset_link function, but different upstream ports might 156 + have different requirements for resetting the PCI Express link, so 157 + upstream ports should provide their own reset_link functions. 158 + 159 + In struct pcie_port_service_driver, a new pointer, reset_link, is 160 + added. 161 + :: 162 + 163 + pci_ers_result_t (*reset_link) (struct pci_dev *dev); 164 + 165 + The section on non-correctable errors below provides more detailed 166 + info on when to call reset_link. 
 167 + 168 + PCI error-recovery callbacks 169 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 170 + 171 + The PCI Express AER Root driver uses error callbacks to coordinate 172 + with downstream device drivers associated with the hierarchy in question 173 + when performing error recovery actions. 174 + 175 + The struct pci_driver data structure has a pointer, err_handler, 176 + pointing to pci_error_handlers, which consists of several callback 177 + function pointers. The AER driver follows the rules defined in 178 + pci-error-recovery.txt except for PCI Express specific parts (e.g. 179 + reset_link). Please refer to pci-error-recovery.txt for detailed 180 + definitions of the callbacks. 181 + 182 + The sections below specify when to call the error callback functions. 183 + 184 + Correctable errors 185 + ~~~~~~~~~~~~~~~~~~ 186 + 187 + Correctable errors have no impact on the functionality of 188 + the interface. The PCI Express protocol can recover without any 189 + software intervention or any loss of data. These errors do not 190 + require any recovery actions. The AER driver clears the device's 191 + correctable error status register accordingly and logs these errors. 192 + 193 + Non-correctable (non-fatal and fatal) errors 194 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 195 + 196 + If an error message indicates a non-fatal error, performing a link reset 197 + at the upstream port is not required. The AER driver calls error_detected(dev, 198 + pci_channel_io_normal) for all drivers associated with the hierarchy in 199 + question. For example:: 200 + 201 + EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort 202 + 203 + If Upstream port A captures an AER error, the hierarchy consists of 204 + Downstream port B and EndPoint. 205 + 206 + A driver may return PCI_ERS_RESULT_CAN_RECOVER, 207 + PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on 208 + whether it can recover; based on these results, the AER driver decides 208 + whether mmio_enabled is called next. 
 209 + 210 + If an error message indicates a fatal error, the kernel will broadcast 211 + error_detected(dev, pci_channel_io_frozen) to all drivers within 212 + the hierarchy in question. Performing a link reset at the upstream port 213 + is then necessary. As different kinds of devices might use different 214 + approaches to reset the link, the AER port service driver is required to 215 + provide the function to reset the link. First, the kernel checks whether 216 + the upstream component has an AER driver. If it does, the kernel uses the 217 + reset_link callback of that driver. If the upstream component has no AER 218 + driver and the port is a downstream port, a hot reset is performed as the 219 + default by setting the Secondary Bus Reset bit of the Bridge Control 220 + register associated with the downstream port. Upstream ports, by contrast, 221 + should provide their own AER service drivers with a reset_link 222 + function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and 223 + reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling proceeds 224 + to mmio_enabled. 225 + 226 + Helper functions 227 + ---------------- 228 + :: 229 + 230 + int pci_enable_pcie_error_reporting(struct pci_dev *dev); 231 + 232 + pci_enable_pcie_error_reporting enables the device to send error 233 + messages to the root port when an error is detected. Note that devices 234 + do not enable error reporting by default, so device drivers need to 235 + call this function to enable it. 236 + 237 + :: 238 + 239 + int pci_disable_pcie_error_reporting(struct pci_dev *dev); 240 + 241 + pci_disable_pcie_error_reporting stops the device from sending error 242 + messages to the root port when an error is detected. 243 + 244 + :: 245 + 246 + int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); 247 + 248 + pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable 249 + error status register. 
 250 + 251 + Frequently Asked Questions 252 + -------------------------- 253 + 254 + Q: 255 + What happens if a PCI Express device driver does not provide an 256 + error recovery handler (pci_driver->err_handler is equal to NULL)? 257 + 258 + A: 259 + The devices bound to the driver won't be recovered. If the 260 + error is fatal, the kernel will print out warning messages. Please refer 261 + to the Developer Guide for more information. 262 + 263 + Q: 264 + What happens if an upstream port service driver does not provide 265 + the callback reset_link? 266 + 267 + A: 268 + Fatal error recovery will fail if the errors are reported by 269 + upstream ports that are served by that driver. 270 + 271 + Q: 272 + How does this infrastructure deal with a driver that is not PCI 273 + Express aware? 274 + 275 + A: 276 + This infrastructure calls the error callback functions of the 277 + driver when an error happens. But if the driver is not aware of 278 + PCI Express, the device might not report its own errors to the root 279 + port. 280 + 281 + Q: 282 + What modifications does such a driver need to make it compatible 283 + with the PCI Express AER Root driver? 284 + 285 + A: 286 + It could call the helper functions to enable AER in its devices and 287 + to clean up the uncorrectable error status register. Please refer to 288 + the Helper functions section above. 289 + 290 + Software error injection 291 + ======================== 292 + 293 + Debugging PCIe AER error recovery code is quite difficult because it 294 + is hard to trigger real hardware errors. Software-based error 295 + injection can be used to fake various kinds of PCIe errors. 296 + 297 + First you should enable PCIe AER software error injection in the kernel 298 + configuration; that is, the following item should be in your .config: 299 + 300 + CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m 301 + 302 + After rebooting with the new kernel or inserting the module, a device 303 + file named /dev/aer_inject should be created. 
 304 + 305 + Then, you need a user space tool named aer-inject, which can be 306 + obtained from: 307 + 308 + https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ 309 + 310 + More information about aer-inject can be found in the documentation 311 + that comes with its source code.
-267
Documentation/PCI/pcieaer-howto.txt
··· 1 - The PCI Express Advanced Error Reporting Driver Guide HOWTO 2 - T. Long Nguyen <tom.l.nguyen@intel.com> 3 - Yanmin Zhang <yanmin.zhang@intel.com> 4 - 07/29/2006 5 - 6 - 7 - 1. Overview 8 - 9 - 1.1 About this guide 10 - 11 - This guide describes the basics of the PCI Express Advanced Error 12 - Reporting (AER) driver and provides information on how to use it, as 13 - well as how to enable the drivers of endpoint devices to conform with 14 - PCI Express AER driver. 15 - 16 - 1.2 Copyright (C) Intel Corporation 2006. 17 - 18 - 1.3 What is the PCI Express AER Driver? 19 - 20 - PCI Express error signaling can occur on the PCI Express link itself 21 - or on behalf of transactions initiated on the link. PCI Express 22 - defines two error reporting paradigms: the baseline capability and 23 - the Advanced Error Reporting capability. The baseline capability is 24 - required of all PCI Express components providing a minimum defined 25 - set of error reporting requirements. Advanced Error Reporting 26 - capability is implemented with a PCI Express advanced error reporting 27 - extended capability structure providing more robust error reporting. 28 - 29 - The PCI Express AER driver provides the infrastructure to support PCI 30 - Express Advanced Error Reporting capability. The PCI Express AER 31 - driver provides three basic functions: 32 - 33 - - Gathers the comprehensive error information if errors occurred. 34 - - Reports error to the users. 35 - - Performs error recovery actions. 36 - 37 - AER driver only attaches root ports which support PCI-Express AER 38 - capability. 39 - 40 - 41 - 2. User Guide 42 - 43 - 2.1 Include the PCI Express AER Root Driver into the Linux Kernel 44 - 45 - The PCI Express AER Root driver is a Root Port service driver attached 46 - to the PCI Express Port Bus driver. If a user wants to use it, the driver 47 - has to be compiled. Option CONFIG_PCIEAER supports this capability. It 48 - depends on CONFIG_PCIEPORTBUS, so pls. 
set CONFIG_PCIEPORTBUS=y and 49 - CONFIG_PCIEAER = y. 50 - 51 - 2.2 Load PCI Express AER Root Driver 52 - 53 - Some systems have AER support in firmware. Enabling Linux AER support at 54 - the same time the firmware handles AER may result in unpredictable 55 - behavior. Therefore, Linux does not handle AER events unless the firmware 56 - grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 57 - Specification for details regarding _OSC usage. 58 - 59 - 2.3 AER error output 60 - 61 - When a PCIe AER error is captured, an error message will be output to 62 - console. If it's a correctable error, it is output as a warning. 63 - Otherwise, it is printed as an error. So users could choose different 64 - log level to filter out correctable error messages. 65 - 66 - Below shows an example: 67 - 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) 68 - 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 69 - 0000:50:00.0: [20] Unsupported Request (First) 70 - 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 71 - 72 - In the example, 'Requester ID' means the ID of the device who sends 73 - the error message to root port. Pls. refer to pci express specs for 74 - other fields. 75 - 76 - 2.4 AER Statistics / Counters 77 - 78 - When PCIe AER errors are captured, the counters / statistics are also exposed 79 - in the form of sysfs attributes which are documented at 80 - Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats 81 - 82 - 3. Developer Guide 83 - 84 - To enable AER aware support requires a software driver to configure 85 - the AER capability structure within its device and to provide callbacks. 86 - 87 - To support AER better, developers need understand how AER does work 88 - firstly. 89 - 90 - PCI Express errors are classified into two types: correctable errors 91 - and uncorrectable errors. 
This classification is based on the impacts 92 - of those errors, which may result in degraded performance or function 93 - failure. 94 - 95 - Correctable errors pose no impacts on the functionality of the 96 - interface. The PCI Express protocol can recover without any software 97 - intervention or any loss of data. These errors are detected and 98 - corrected by hardware. Unlike correctable errors, uncorrectable 99 - errors impact functionality of the interface. Uncorrectable errors 100 - can cause a particular transaction or a particular PCI Express link 101 - to be unreliable. Depending on those error conditions, uncorrectable 102 - errors are further classified into non-fatal errors and fatal errors. 103 - Non-fatal errors cause the particular transaction to be unreliable, 104 - but the PCI Express link itself is fully functional. Fatal errors, on 105 - the other hand, cause the link to be unreliable. 106 - 107 - When AER is enabled, a PCI Express device will automatically send an 108 - error message to the PCIe root port above it when the device captures 109 - an error. The Root Port, upon receiving an error reporting message, 110 - internally processes and logs the error message in its PCI Express 111 - capability structure. Error information being logged includes storing 112 - the error reporting agent's requestor ID into the Error Source 113 - Identification Registers and setting the error bits of the Root Error 114 - Status Register accordingly. If AER error reporting is enabled in Root 115 - Error Command Register, the Root Port generates an interrupt if an 116 - error is detected. 117 - 118 - Note that the errors as described above are related to the PCI Express 119 - hierarchy and links. These errors do not include any device specific 120 - errors because device specific errors will still get sent directly to 121 - the device driver. 
3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2 Provide callbacks

3.2.1 callback reset_link to reset the PCI Express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different requirements for resetting the PCI Express link, so
upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed info on when to call
reset_link.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

Data struct pci_driver has a pointer, err_handler, to point to
pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

3.2.2.1 Correctable errors

Correctable errors pose no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data.
These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, performing a link
reset at the upstream port is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) for all drivers associated
with the hierarchy in question. For example:
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort
If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and the EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover; if it can, the AER driver calls mmio_enabled
next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then performing a link reset at the upstream
port is necessary. As different kinds of devices might use different
approaches to reset the link, the AER port service driver is required
to provide a function to reset the link. First, the kernel checks
whether the upstream component has an AER driver. If it does, the
kernel uses that driver's reset_link callback. If the upstream
component has no AER driver and the port is a Downstream Port, we
perform a hot reset as the default by setting the Secondary Bus Reset
bit of the Bridge Control register associated with the Downstream
Port. Upstream Ports should provide their own AER service drivers with
a reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
3.3 helper functions

3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print out warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by
upstream ports that are attached to the service driver.

Q: How does this infrastructure deal with drivers that are not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the Root
Port.

Q: What modifications does such a driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in devices and
clean up the uncorrectable status register. Please refer to section 3.3.
4. Software error injection

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config:

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then you need a user-space tool named aer-inject, which can be obtained
from:
https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation
that comes with its source code.
+220
Documentation/PCI/picebus-howto.rst
···
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
The PCI Express Port Bus Driver Guide HOWTO
===========================================

:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
:Copyright: |copy| 2004 Intel Corporation

About this guide
================

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable service drivers to
register/unregister with the PCI Express Port Bus Driver.


What is the PCI Express Port Bus Driver
=======================================

A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex, and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port whose secondary bus
represents the switch's internal routing logic is called the
switch's Upstream Port. The switch's Downstream Ports bridge from the
switch's internal routing bus to buses representing the downstream
PCI Express links from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
PCI Express Port services include native hot-plug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
among, and handled by, corresponding service drivers.

Why use the PCI Express Port Bus Driver?
========================================

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution, each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hot-plug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:

  - Allow multiple service drivers to run simultaneously on
    a PCI-PCI Bridge Port device.

  - Allow service drivers to be implemented in an independent
    staged approach.

  - Allow one service driver to run on multiple PCI-PCI Bridge
    Port devices.

  - Manage and distribute resources of a PCI-PCI Bridge Port
    device to requested service drivers.

Configuring the PCI Express Port Bus Driver vs. Service Drivers
===============================================================

Including the PCI Express Port Bus Driver Support into the Kernel
-----------------------------------------------------------------

Whether the PCI Express Port Bus driver is included depends on whether
PCI Express support is included in the kernel config. The kernel
automatically includes the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

Enabling Service Driver Support
-------------------------------

PCI device drivers are implemented based on the Linux Device Driver
Model. All service drivers are PCI device drivers. As discussed above,
it is impossible to load any service driver once the kernel has loaded
the PCI Express Port Bus Driver. Meeting the PCI Express Port Bus
Driver Model requires some minimal changes to existing service
drivers, with no impact on their functionality.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
section 5.2.1 & 5.2.2). It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
the header file /include/linux/pcieport_if.h, before calling these
APIs. Failure to do so results in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading the service driver.

pcie_port_service_register
~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init.
Note that after the service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

pcie_port_service_unregister
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It is always called by a service driver when
the module exits.

Sample Code
~~~~~~~~~~~

Below is sample service driver code to initialize the port service
driver data structure.
::

  static struct pcie_port_service_id service_id[] = { {
    .vendor = PCI_ANY_ID,
    .device = PCI_ANY_ID,
    .port_type = PCIE_RC_PORT,
    .service_type = PCIE_PORT_SERVICE_AER,
    }, { /* end: all zeroes */ }
  };

  static struct pcie_port_service_driver root_aerdrv = {
    .name = (char *)device_name,
    .id_table = &service_id[0],

    .probe = aerdrv_load,
    .remove = aerdrv_unload,

    .suspend = aerdrv_suspend,
    .resume = aerdrv_resume,
  };

Below is sample code for registering/unregistering a service
driver.
153 + :: 154 + 155 + static int __init aerdrv_service_init(void) 156 + { 157 + int retval = 0; 158 + 159 + retval = pcie_port_service_register(&root_aerdrv); 160 + if (!retval) { 161 + /* 162 + * FIX ME 163 + */ 164 + } 165 + return retval; 166 + } 167 + 168 + static void __exit aerdrv_service_exit(void) 169 + { 170 + pcie_port_service_unregister(&root_aerdrv); 171 + } 172 + 173 + module_init(aerdrv_service_init); 174 + module_exit(aerdrv_service_exit); 175 + 176 + Possible Resource Conflicts 177 + =========================== 178 + 179 + Since all service drivers of a PCI-PCI Bridge Port device are 180 + allowed to run simultaneously, below lists a few of possible resource 181 + conflicts with proposed solutions. 182 + 183 + MSI and MSI-X Vector Resource 184 + ----------------------------- 185 + 186 + Once MSI or MSI-X interrupts are enabled on a device, it stays in this 187 + mode until they are disabled again. Since service drivers of the same 188 + PCI-PCI Bridge port share the same physical device, if an individual 189 + service driver enables or disables MSI/MSI-X mode it may result 190 + unpredictable behavior. 191 + 192 + To avoid this situation all service drivers are not permitted to 193 + switch interrupt mode on its device. The PCI Express Port Bus driver 194 + is responsible for determining the interrupt mode and this should be 195 + transparent to service drivers. Service drivers need to know only 196 + the vector IRQ assigned to the field irq of struct pcie_device, which 197 + is passed in when the PCI Express Port Bus driver probes each service 198 + driver. Service drivers should use (struct pcie_device*)dev->irq to 199 + call request_irq/free_irq. In addition, the interrupt mode is stored 200 + in the field interrupt_mode of struct pcie_device. 
PCI Memory/IO Mapped Regions
----------------------------

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port. In all cases the
registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and will not overwrite
another service driver's configuration settings.

PCI Config Registers
--------------------

Each service driver runs its PCI config operations on its own
capability structure, except for the PCI Express capability structure,
in which the Root Control register and Device Control register are
shared between PME and AER. This patch assumes that all service
drivers will be well behaved and will not overwrite another service
driver's configuration settings.
+3 -3
Documentation/admin-guide/kernel-parameters.txt
···
			For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
			are available

			See also Documentation/power/runtime_pm.rst, pci=noacpi

	acpi_apic_instance=	[ACPI, IOAPIC]
			Format: <int>
···
	acpi_sleep=	[HW,ACPI] Sleep options
			Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
				  old_ordering, nonvs, sci_force_enable, nobl }
			See Documentation/power/video.rst for information on
			s3_bios and s3_mode.
			s3_beep is for debugging; it makes the PC's speaker beep
			as soon as the kernel's real-mode entry point is called.
···
			Specify the offset from the beginning of the partition
			given by "resume=" at which the swap header is located,
			in <PAGE_SIZE> units (needed only for swap files).
			See Documentation/power/swsusp-and-swap-files.rst

	resumedelay=	[HIBERNATION] Delay (in seconds) to pause before attempting to
			read the resume files
+1 -1
Documentation/cpu-freq/core.txt
···
3. CPUFreq Table Generation with Operating Performance Point (OPP)
==================================================================
For details about OPP, see Documentation/power/opp.rst

dev_pm_opp_init_cpufreq_table -
	This function provides a ready to use conversion routine to translate
+2
Documentation/devicetree/bindings/pci/mobiveil-pcie.txt
···
  interrupt source. The value must be 1.
- compatible: Should contain "mbvl,gpex40-pcie"
- reg: Should contain PCIe registers location and length
	Mandatory:
	  "config_axi_slave": PCIe controller registers
	  "csr_axi_slave"  : Bridge config registers
	Optional:
	  "gpio_slave"     : GPIO registers to control slot power
	  "apb_csr"        : MSI registers
+8
Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt
···
  - afi
  - pcie_x

Optional properties:
- pinctrl-names: A list of pinctrl state names. Must contain the following
  entries:
  - "default": active state, puts PCIe I/O out of deep power down state
  - "idle": puts PCIe I/O into deep power down state
- pinctrl-0: phandle for the default/active state of pin configurations.
- pinctrl-1: phandle for the idle state of pin configurations.

Required properties on Tegra124 and later (deprecated):
- phys: Must contain an entry for each entry in phy-names.
- phy-names: Must include the following entries:
+3
Documentation/devicetree/bindings/pci/pci.txt
···
   unsupported link speed, for instance, trying to do training for
   unsupported link speed, etc. Must be '4' for gen4, '3' for gen3, '2'
   for gen2, and '1' for gen1. Any other values are invalid.
- reset-gpios:
   If present, this property specifies the PERST# GPIO. Host drivers can parse
   the GPIO and apply a fundamental reset to endpoints.

PCI-PCI Bridge properties
-------------------------
+23 -2
Documentation/devicetree/bindings/pci/qcom,pcie.txt
···
			- "qcom,pcie-msm8996" for msm8996 or apq8096
			- "qcom,pcie-ipq4019" for ipq4019
			- "qcom,pcie-ipq8074" for ipq8074
			- "qcom,pcie-qcs404" for qcs404

- reg:
	Usage: required
···
			- "ahb"	AHB clock
			- "aux"	Auxiliary clock

- clock-names:
	Usage: required for qcs404
	Value type: <stringlist>
	Definition: Should contain the following entries
			- "iface"	AHB clock
			- "aux"		Auxiliary clock
			- "master_bus"	AXI Master clock
			- "slave_bus"	AXI Slave clock

- resets:
	Usage: required
	Value type: <prop-encoded-array>
···
			- "ahb"			AHB Reset
			- "axi_m_sticky"	AXI Master Sticky reset

- reset-names:
	Usage: required for qcs404
	Value type: <stringlist>
	Definition: Should contain the following entries
			- "axi_m"		AXI Master reset
			- "axi_s"		AXI Slave reset
			- "axi_m_sticky"	AXI Master Sticky reset
			- "pipe_sticky"		PIPE sticky reset
			- "pwr"			PWR reset
			- "ahb"			AHB reset

- power-domains:
	Usage: required for apq8084 and msm8996/apq8096
	Value type: <prop-encoded-array>
···
	Definition: A phandle to the PCIe endpoint power supply

- phys:
	Usage: required for apq8084 and qcs404
	Value type: <phandle>
	Definition: List of phandle(s) as listed in phy-names property

- phy-names:
	Usage: required for apq8084 and qcs404
	Value type: <stringlist>
	Definition: Should contain "pciephy"
+1
Documentation/devicetree/bindings/pci/rcar-pci.txt
···
Required properties:
compatible: "renesas,pcie-r8a7743" for the R8A7743 SoC;
	    "renesas,pcie-r8a7744" for the R8A7744 SoC;
	    "renesas,pcie-r8a774a1" for the R8A774A1 SoC;
	    "renesas,pcie-r8a774c0" for the R8A774C0 SoC;
	    "renesas,pcie-r8a7779" for the R8A7779 SoC;
	    "renesas,pcie-r8a7790" for the R8A7790 SoC;
+3 -3
Documentation/driver-api/pm/devices.rst
···
flag is clear.

For more information about the runtime power management framework, refer to
:file:`Documentation/power/runtime_pm.rst`.


Calling Drivers to Enter and Leave System Sleep States
···
Devices may be defined as IRQ-safe which indicates to the PM core that their
runtime PM callbacks may be invoked with disabled interrupts (see
:file:`Documentation/power/runtime_pm.rst` for more information). If an
IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be
disallowed, unless the domain itself is defined as IRQ-safe. However, it
makes sense to define a PM domain as IRQ-safe only if all the devices in it
···
status in that case.

During system-wide resume from a sleep state it's easiest to put devices into
the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`.
[Refer to that document for more information regarding this particular issue as
well as for information on the device runtime power management framework in
general.]
+1 -1
Documentation/driver-api/usb/power-management.rst
···
call it a "dynamic suspend" (also known as a "runtime suspend" or
"selective suspend"). This document concentrates mostly on how
dynamic PM is implemented in the USB subsystem, although system PM is
covered to some extent (see ``Documentation/power/*.rst`` for more
information about system PM).

System PM support is present only if the kernel was built with
+1
Documentation/index.rst
···
   vm/index
   bpf/index
   usb/index
   PCI/index
   misc-devices/index

Architecture-specific documentation
+36
Documentation/power/apm-acpi.rst
···
============
APM or ACPI?
============

If you have a relatively recent x86 mobile, desktop, or server system,
odds are it supports either Advanced Power Management (APM) or
Advanced Configuration and Power Interface (ACPI). ACPI is the newer
of the two technologies and puts power management in the hands of the
operating system, allowing for more intelligent power management than
is possible with BIOS controlled APM.

The best way to determine which, if either, your system supports is to
build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is
enabled by default). If a working ACPI implementation is found, the
ACPI driver will override and disable APM, otherwise the APM driver
will be used.

No, sorry, you cannot have both ACPI and APM enabled and running at
once. Some people with broken ACPI or broken APM implementations
would like to use both to get a full set of working features, but you
simply cannot mix and match the two. Only one power management
interface can be in control of the machine at once. Think about it..

User-space Daemons
------------------
Both APM and ACPI rely on user-space daemons, apmd and acpid
respectively, to be completely functional. Obtain both of these
daemons from your Linux distribution or from the Internet (see below)
and be sure that they are started sometime in the system boot process.
Go ahead and start both. If ACPI or APM is not available on your
system the associated daemon will exit gracefully.

===== =======================================
apmd  http://ftp.debian.org/pool/main/a/apmd/
acpid http://acpid.sf.net/
===== =======================================
-32
Documentation/power/apm-acpi.txt
+269
Documentation/power/basic-pm-debugging.rst
···
=================================
Debugging hibernation and suspend
=================================

	(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)
===================================================

To check if hibernation works, you can try to hibernate in the "reboot" mode::

	# echo reboot > /sys/power/disk
	# echo disk > /sys/power/state

and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you have started the transition. If that happens,
hibernation is most likely to work correctly. Still, you need to repeat the
test at least a couple of times in a row for confidence. [This is necessary,
because some problems only show up on a second attempt at suspending and
resuming the system.] Moreover, hibernating in the "reboot" and "shutdown"
modes causes the PM core to skip some platform-related callbacks which on ACPI
systems might be necessary to make hibernation work. Thus, if your machine
fails to hibernate or resume in the "reboot" mode, you should try the
"platform" mode::

	# echo platform > /sys/power/disk
	# echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work::

	# echo shutdown > /sys/power/disk
	# echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).

If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.
a) Test modes of hibernation
----------------------------

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:

freezer
  - test the freezing of processes

devices
  - test the freezing of processes and suspending of devices

platform
  - test the freezing of processes, suspending of devices and platform
    global control methods [1]_

processors
  - test the freezing of processes, suspending of devices, platform
    global control methods [1]_ and the disabling of nonboot CPUs

core
  - test the freezing of processes, suspending of devices, platform global
    control methods\ [1]_, the disabling of nonboot CPUs and suspending
    of platform/system devices

.. [1]

    the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (eg. "devices" to test the freezing of processes and
suspending devices) and issue the standard hibernation commands. For example,
to use the "devices" test mode along with the "platform" mode of hibernation,
you should do the following::

	# echo devices > /sys/power/pm_test
	# echo platform > /sys/power/disk
	# echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
parameter), resume devices and thaw processes.
If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (eg. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait a
configurable number of seconds and invoke the platform (eg. ACPI) global
methods used to cancel hibernation etc.

Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test
contains a space-separated list of all available tests (including "none" that
represents the normal functionality) in which the current test level is
indicated by square brackets.

Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to make sure that any random factors are avoided).

If the "freezer" test fails, there is a task that cannot be frozen (in that case
it usually is possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration).
To find this driver, 118 + you can carry out a binary search according to the rules: 119 + 120 + - if the test fails, unload a half of the drivers currently loaded and repeat 121 + (that would probably involve rebooting the system, so always note what drivers 122 + have been loaded before the test), 123 + - if the test succeeds, load a half of the drivers you have unloaded most 124 + recently and repeat. 125 + 126 + Once you have found the failing driver (there can be more than just one of 127 + them), you have to unload it every time before hibernation. In that case please 128 + make sure to report the problem with the driver. 129 + 130 + It is also possible that the "devices" test will still fail after you have 131 + unloaded all modules. In that case, you may want to look in your kernel 132 + configuration for the drivers that can be compiled as modules (and test again 133 + with these drivers compiled as modules). You may also try to use some special 134 + kernel command line options such as "noapic", "noacpi" or even "acpi=off". 135 + 136 + If the "platform" test fails, there is a problem with the handling of the 137 + platform (eg. ACPI) firmware on your system. In that case the "platform" mode 138 + of hibernation is not likely to work. You can try the "shutdown" mode, but that 139 + is rather a poor man's workaround. 140 + 141 + If the "processors" test fails, the disabling/enabling of nonboot CPUs does not 142 + work (of course, this only may be an issue on SMP systems) and the problem 143 + should be reported. In that case you can also try to switch the nonboot CPUs 144 + off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and 145 + see if that works. 146 + 147 + If the "core" test fails, which means that suspending of the system/platform 148 + devices has failed (these devices are suspended on one CPU with interrupts off), 149 + the problem is most probably hardware-related and serious, so it should be 150 + reported. 
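The driver binary search described above can be sketched as a small helper. This is an illustrative sketch only: the module names are invented for the example, and on a real system you would take the list from lsmod and unload the printed half with modprobe -r before re-running the failing test.

```shell
#!/bin/sh
# Hedged sketch of the binary search described above: given the set of
# currently loaded modules, print the half to unload before retrying the
# "devices" test.  The module list below is a made-up example; on a real
# system you would take it from lsmod.
mods="snd_hda_intel iwlwifi e1000e btusb uvcvideo"
set -- $mods
half=$(( $# / 2 ))          # unload the first n/2 modules
i=0
for m in "$@"; do
    i=$(( i + 1 ))
    if [ "$i" -le "$half" ]; then
        echo "unload before next test: $m"
    fi
done
```

If the test then succeeds, reload half of the modules just unloaded and repeat, narrowing down to the offending driver as the rules above describe.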
 151 + 152 + A failure of any of the "platform", "processors" or "core" tests may cause your 153 + system to hang or become unstable, so please beware. Such a failure usually 154 + indicates a serious problem that very well may be related to the hardware, but 155 + please report it anyway. 156 + 157 + b) Testing minimal configuration 158 + -------------------------------- 159 + 160 + If all of the hibernation test modes work, you can boot the system with the 161 + "init=/bin/bash" command line parameter and attempt to hibernate in the 162 + "reboot", "shutdown" and "platform" modes. If that does not work, there 163 + probably is a problem with a driver statically compiled into the kernel and you 164 + can try to compile more drivers as modules, so that they can be tested 165 + individually. Otherwise, there is a problem with a modular driver and you can 166 + find it by loading a half of the modules you normally use and binary searching 167 + in accordance with the algorithm: 168 + - if there are n modules loaded and the attempt to suspend and resume fails, 169 + unload n/2 of the modules and try again (that would probably involve rebooting 170 + the system), 171 + - if there are n modules loaded and the attempt to suspend and resume succeeds, 172 + load n/2 modules more and try again. 173 + 174 + Again, if you find the offending module(s), they must be unloaded every time 175 + before hibernation, and please report the problem with them. 176 + 177 + c) Using the "test_resume" hibernation option 178 + --------------------------------------------- 179 + 180 + /sys/power/disk generally tells the kernel what to do after creating a 181 + hibernation image. One of the available options is "test_resume" which 182 + causes the just-created image to be used for immediate restoration. 
Namely, 183 + after doing:: 184 + 185 + # echo test_resume > /sys/power/disk 186 + # echo disk > /sys/power/state 187 + 188 + a hibernation image will be created and a resume from it will be triggered 189 + immediately without involving the platform firmware in any way. 190 + 191 + That test can be used to check if failures to resume from hibernation are 192 + related to bad interactions with the platform firmware. That is, if the above 193 + works every time, but resume from actual hibernation does not work or is 194 + unreliable, the platform firmware may be responsible for the failures. 195 + 196 + On architectures and platforms that support using different kernels to restore 197 + hibernation images (that is, the kernel used to read the image from storage and 198 + load it into memory is different from the one included in the image) or support 199 + kernel address space randomization, it also can be used to check if failures 200 + to resume may be related to the differences between the restore and image 201 + kernels. 202 + 203 + d) Advanced debugging 204 + --------------------- 205 + 206 + In case that hibernation does not work on your system even in the minimal 207 + configuration and compiling more drivers as modules is not practical or some 208 + modules cannot be unloaded, you can use one of the more advanced debugging 209 + techniques to find the problem. First, if there is a serial port in your box, 210 + you can boot the kernel with the 'no_console_suspend' parameter and try to log 211 + kernel messages using the serial console. This may provide you with some 212 + information about the reasons of the suspend (resume) failure. Alternatively, 213 + it may be possible to use a FireWire port for debugging with firescope 214 + (http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to 215 + use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst . 216 + 217 + 2. 
Testing suspend to RAM (STR) 218 + =============================== 219 + 220 + To verify that STR works, it is generally more convenient to use the s2ram 221 + tool available from http://suspend.sf.net and documented at 222 + http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). 223 + 224 + Namely, after writing "freezer", "devices", "platform", "processors", or "core" 225 + into /sys/power/pm_test (available if the kernel is compiled with 226 + CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding 227 + to the given string. The STR test modes are defined in the same way as for 228 + hibernation, so please refer to Section 1 for more information about them. In 229 + particular, the "core" test allows you to test everything except for the actual 230 + invocation of the platform firmware in order to put the system into the sleep 231 + state. 232 + 233 + Among other things, the testing with the help of /sys/power/pm_test may allow 234 + you to identify drivers that fail to suspend or resume their devices. They 235 + should be unloaded every time before an STR transition. 236 + 237 + Next, you can follow the instructions at S2RAM_LINK to test the system, but if 238 + it does not work "out of the box", you may need to boot it with 239 + "init=/bin/bash" and test s2ram in the minimal configuration. In that case, 240 + you may be able to search for failing drivers by following the procedure 241 + analogous to the one described in section 1. If you find some failing drivers, 242 + you will have to unload them every time before an STR transition (ie. before 243 + you run s2ram), and please report the problems with them. 244 + 245 + There is a debugfs entry which shows the suspend to RAM statistics. 
Here is an 246 + example of its output:: 247 + 248 + # mount -t debugfs none /sys/kernel/debug 249 + # cat /sys/kernel/debug/suspend_stats 250 + success: 20 251 + fail: 5 252 + failed_freeze: 0 253 + failed_prepare: 0 254 + failed_suspend: 5 255 + failed_suspend_noirq: 0 256 + failed_resume: 0 257 + failed_resume_noirq: 0 258 + failures: 259 + last_failed_dev: alarm 260 + adc 261 + last_failed_errno: -16 262 + -16 263 + last_failed_step: suspend 264 + suspend 265 + 266 + Field success means the success number of suspend to RAM, and field fail means 267 + the failure number. Others are the failure number of different steps of suspend 268 + to RAM. suspend_stats just lists the last 2 failed devices, error number and 269 + failed step of suspend.
-254
Documentation/power/basic-pm-debugging.txt
··· 1 - Debugging hibernation and suspend 2 - (C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL 3 - 4 - 1. Testing hibernation (aka suspend to disk or STD) 5 - 6 - To check if hibernation works, you can try to hibernate in the "reboot" mode: 7 - 8 - # echo reboot > /sys/power/disk 9 - # echo disk > /sys/power/state 10 - 11 - and the system should create a hibernation image, reboot, resume and get back to 12 - the command prompt where you have started the transition. If that happens, 13 - hibernation is most likely to work correctly. Still, you need to repeat the 14 - test at least a couple of times in a row for confidence. [This is necessary, 15 - because some problems only show up on a second attempt at suspending and 16 - resuming the system.] Moreover, hibernating in the "reboot" and "shutdown" 17 - modes causes the PM core to skip some platform-related callbacks which on ACPI 18 - systems might be necessary to make hibernation work. Thus, if your machine fails 19 - to hibernate or resume in the "reboot" mode, you should try the "platform" mode: 20 - 21 - # echo platform > /sys/power/disk 22 - # echo disk > /sys/power/state 23 - 24 - which is the default and recommended mode of hibernation. 25 - 26 - Unfortunately, the "platform" mode of hibernation does not work on some systems 27 - with broken BIOSes. In such cases the "shutdown" mode of hibernation might 28 - work: 29 - 30 - # echo shutdown > /sys/power/disk 31 - # echo disk > /sys/power/state 32 - 33 - (it is similar to the "reboot" mode, but it requires you to press the power 34 - button to make the system resume). 35 - 36 - If neither "platform" nor "shutdown" hibernation mode works, you will need to 37 - identify what goes wrong. 38 - 39 - a) Test modes of hibernation 40 - 41 - To find out why hibernation fails on your system, you can use a special testing 42 - facility available if the kernel is compiled with CONFIG_PM_DEBUG set. 
Then, 43 - there is the file /sys/power/pm_test that can be used to make the hibernation 44 - core run in a test mode. There are 5 test modes available: 45 - 46 - freezer 47 - - test the freezing of processes 48 - 49 - devices 50 - - test the freezing of processes and suspending of devices 51 - 52 - platform 53 - - test the freezing of processes, suspending of devices and platform 54 - global control methods(*) 55 - 56 - processors 57 - - test the freezing of processes, suspending of devices, platform 58 - global control methods(*) and the disabling of nonboot CPUs 59 - 60 - core 61 - - test the freezing of processes, suspending of devices, platform global 62 - control methods(*), the disabling of nonboot CPUs and suspending of 63 - platform/system devices 64 - 65 - (*) the platform global control methods are only available on ACPI systems 66 - and are only tested if the hibernation mode is set to "platform" 67 - 68 - To use one of them it is necessary to write the corresponding string to 69 - /sys/power/pm_test (eg. "devices" to test the freezing of processes and 70 - suspending devices) and issue the standard hibernation commands. For example, 71 - to use the "devices" test mode along with the "platform" mode of hibernation, 72 - you should do the following: 73 - 74 - # echo devices > /sys/power/pm_test 75 - # echo platform > /sys/power/disk 76 - # echo disk > /sys/power/state 77 - 78 - Then, the kernel will try to freeze processes, suspend devices, wait a few 79 - seconds (5 by default, but configurable by the suspend.pm_test_delay module 80 - parameter), resume devices and thaw processes. If "platform" is written to 81 - /sys/power/pm_test , then after suspending devices the kernel will additionally 82 - invoke the global control methods (eg. ACPI global control methods) used to 83 - prepare the platform firmware for hibernation. Next, it will wait a 84 - configurable number of seconds and invoke the platform (eg. 
ACPI) global 85 - methods used to cancel hibernation etc. 86 - 87 - Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal 88 - hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test 89 - contains a space-separated list of all available tests (including "none" that 90 - represents the normal functionality) in which the current test level is 91 - indicated by square brackets. 92 - 93 - Generally, as you can see, each test level is more "invasive" than the previous 94 - one and the "core" level tests the hardware and drivers as deeply as possible 95 - without creating a hibernation image. Obviously, if the "devices" test fails, 96 - the "platform" test will fail as well and so on. Thus, as a rule of thumb, you 97 - should try the test modes starting from "freezer", through "devices", "platform" 98 - and "processors" up to "core" (repeat the test on each level a couple of times 99 - to make sure that any random factors are avoided). 100 - 101 - If the "freezer" test fails, there is a task that cannot be frozen (in that case 102 - it usually is possible to identify the offending task by analysing the output of 103 - dmesg obtained after the failing test). Failure at this level usually means 104 - that there is a problem with the tasks freezer subsystem that should be 105 - reported. 106 - 107 - If the "devices" test fails, most likely there is a driver that cannot suspend 108 - or resume its device (in the latter case the system may hang or become unstable 109 - after the test, so please take that into consideration). To find this driver, 110 - you can carry out a binary search according to the rules: 111 - - if the test fails, unload a half of the drivers currently loaded and repeat 112 - (that would probably involve rebooting the system, so always note what drivers 113 - have been loaded before the test), 114 - - if the test succeeds, load a half of the drivers you have unloaded most 115 - recently and repeat. 
116 - 117 - Once you have found the failing driver (there can be more than just one of 118 - them), you have to unload it every time before hibernation. In that case please 119 - make sure to report the problem with the driver. 120 - 121 - It is also possible that the "devices" test will still fail after you have 122 - unloaded all modules. In that case, you may want to look in your kernel 123 - configuration for the drivers that can be compiled as modules (and test again 124 - with these drivers compiled as modules). You may also try to use some special 125 - kernel command line options such as "noapic", "noacpi" or even "acpi=off". 126 - 127 - If the "platform" test fails, there is a problem with the handling of the 128 - platform (eg. ACPI) firmware on your system. In that case the "platform" mode 129 - of hibernation is not likely to work. You can try the "shutdown" mode, but that 130 - is rather a poor man's workaround. 131 - 132 - If the "processors" test fails, the disabling/enabling of nonboot CPUs does not 133 - work (of course, this only may be an issue on SMP systems) and the problem 134 - should be reported. In that case you can also try to switch the nonboot CPUs 135 - off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and 136 - see if that works. 137 - 138 - If the "core" test fails, which means that suspending of the system/platform 139 - devices has failed (these devices are suspended on one CPU with interrupts off), 140 - the problem is most probably hardware-related and serious, so it should be 141 - reported. 142 - 143 - A failure of any of the "platform", "processors" or "core" tests may cause your 144 - system to hang or become unstable, so please beware. Such a failure usually 145 - indicates a serious problem that very well may be related to the hardware, but 146 - please report it anyway. 
147 - 148 - b) Testing minimal configuration 149 - 150 - If all of the hibernation test modes work, you can boot the system with the 151 - "init=/bin/bash" command line parameter and attempt to hibernate in the 152 - "reboot", "shutdown" and "platform" modes. If that does not work, there 153 - probably is a problem with a driver statically compiled into the kernel and you 154 - can try to compile more drivers as modules, so that they can be tested 155 - individually. Otherwise, there is a problem with a modular driver and you can 156 - find it by loading a half of the modules you normally use and binary searching 157 - in accordance with the algorithm: 158 - - if there are n modules loaded and the attempt to suspend and resume fails, 159 - unload n/2 of the modules and try again (that would probably involve rebooting 160 - the system), 161 - - if there are n modules loaded and the attempt to suspend and resume succeeds, 162 - load n/2 modules more and try again. 163 - 164 - Again, if you find the offending module(s), it(they) must be unloaded every time 165 - before hibernation, and please report the problem with it(them). 166 - 167 - c) Using the "test_resume" hibernation option 168 - 169 - /sys/power/disk generally tells the kernel what to do after creating a 170 - hibernation image. One of the available options is "test_resume" which 171 - causes the just created image to be used for immediate restoration. Namely, 172 - after doing: 173 - 174 - # echo test_resume > /sys/power/disk 175 - # echo disk > /sys/power/state 176 - 177 - a hibernation image will be created and a resume from it will be triggered 178 - immediately without involving the platform firmware in any way. 179 - 180 - That test can be used to check if failures to resume from hibernation are 181 - related to bad interactions with the platform firmware. 
That is, if the above 182 - works every time, but resume from actual hibernation does not work or is 183 - unreliable, the platform firmware may be responsible for the failures. 184 - 185 - On architectures and platforms that support using different kernels to restore 186 - hibernation images (that is, the kernel used to read the image from storage and 187 - load it into memory is different from the one included in the image) or support 188 - kernel address space randomization, it also can be used to check if failures 189 - to resume may be related to the differences between the restore and image 190 - kernels. 191 - 192 - d) Advanced debugging 193 - 194 - In case that hibernation does not work on your system even in the minimal 195 - configuration and compiling more drivers as modules is not practical or some 196 - modules cannot be unloaded, you can use one of the more advanced debugging 197 - techniques to find the problem. First, if there is a serial port in your box, 198 - you can boot the kernel with the 'no_console_suspend' parameter and try to log 199 - kernel messages using the serial console. This may provide you with some 200 - information about the reasons of the suspend (resume) failure. Alternatively, 201 - it may be possible to use a FireWire port for debugging with firescope 202 - (http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to 203 - use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt . 204 - 205 - 2. Testing suspend to RAM (STR) 206 - 207 - To verify that the STR works, it is generally more convenient to use the s2ram 208 - tool available from http://suspend.sf.net and documented at 209 - http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). 210 - 211 - Namely, after writing "freezer", "devices", "platform", "processors", or "core" 212 - into /sys/power/pm_test (available if the kernel is compiled with 213 - CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding 214 - to given string. 
The STR test modes are defined in the same way as for 215 - hibernation, so please refer to Section 1 for more information about them. In 216 - particular, the "core" test allows you to test everything except for the actual 217 - invocation of the platform firmware in order to put the system into the sleep 218 - state. 219 - 220 - Among other things, the testing with the help of /sys/power/pm_test may allow 221 - you to identify drivers that fail to suspend or resume their devices. They 222 - should be unloaded every time before an STR transition. 223 - 224 - Next, you can follow the instructions at S2RAM_LINK to test the system, but if 225 - it does not work "out of the box", you may need to boot it with 226 - "init=/bin/bash" and test s2ram in the minimal configuration. In that case, 227 - you may be able to search for failing drivers by following the procedure 228 - analogous to the one described in section 1. If you find some failing drivers, 229 - you will have to unload them every time before an STR transition (ie. before 230 - you run s2ram), and please report the problems with them. 231 - 232 - There is a debugfs entry which shows the suspend to RAM statistics. Here is an 233 - example of its output. 234 - # mount -t debugfs none /sys/kernel/debug 235 - # cat /sys/kernel/debug/suspend_stats 236 - success: 20 237 - fail: 5 238 - failed_freeze: 0 239 - failed_prepare: 0 240 - failed_suspend: 5 241 - failed_suspend_noirq: 0 242 - failed_resume: 0 243 - failed_resume_noirq: 0 244 - failures: 245 - last_failed_dev: alarm 246 - adc 247 - last_failed_errno: -16 248 - -16 249 - last_failed_step: suspend 250 - suspend 251 - Field success means the success number of suspend to RAM, and field fail means 252 - the failure number. Others are the failure number of different steps of suspend 253 - to RAM. suspend_stats just lists the last 2 failed devices, error number and 254 - failed step of suspend.
+205
Documentation/power/charger-manager.rst
··· 1 + =============== 2 + Charger Manager 3 + =============== 4 + 5 + (C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL 6 + 7 + Charger Manager provides in-kernel battery charger management that 8 + requires temperature monitoring during the suspend-to-RAM state 9 + and where each battery may have multiple chargers attached and the userland 10 + wants to look at the aggregated information of the multiple chargers. 11 + 12 + Charger Manager is a platform_driver with power-supply-class entries. 13 + An instance of Charger Manager (a platform-device created with Charger-Manager) 14 + represents an independent battery with chargers. If there are multiple 15 + batteries with their own chargers acting independently in a system, 16 + the system may need multiple instances of Charger Manager. 17 + 18 + 1. Introduction 19 + =============== 20 + 21 + Charger Manager supports the following: 22 + 23 + * Support for multiple chargers (e.g., a device with USB, AC, and solar panels) 24 + A system may have multiple chargers (or power sources) and some of 25 + them may be activated at the same time. Each charger may have its 26 + own power-supply-class and each power-supply-class can provide 27 + different information about the battery status. This framework 28 + aggregates charger-related information from multiple sources and 29 + shows combined information as a single power-supply-class. 30 + 31 + * Support for in suspend-to-RAM polling (with suspend_again callback) 32 + While the battery is being charged and the system is in suspend-to-RAM, 33 + we may need to monitor the battery health by looking at the ambient or 34 + battery temperature. We can accomplish this by waking up the system 35 + periodically. However, such a method wakes up devices unnecessarily for 36 + monitoring the battery health and tasks, and user processes that are 37 + supposed to be kept suspended. That, in turn, incurs unnecessary power 38 + consumption and slows down the charging process. 
Or even, such peak power 39 + consumption can stop chargers in the middle of charging 40 + (external power input < device power consumption), which not 41 + only affects the charging time, but also the lifespan of the battery. 42 + 43 + Charger Manager provides a function "cm_suspend_again" that can be 44 + used as the suspend_again callback of platform_suspend_ops. If the platform 45 + requires tasks other than cm_suspend_again, it may implement its own 46 + suspend_again callback that calls cm_suspend_again in the middle. 47 + Normally, the platform will need to resume and suspend some devices 48 + that are used by Charger Manager. 49 + 50 + * Support for premature full-battery event handling 51 + If the battery voltage drops by "fullbatt_vchkdrop_uV" after 52 + "fullbatt_vchkdrop_ms" from the full-battery event, the framework 53 + restarts charging. This check is also performed while suspended by 54 + setting the wakeup time accordingly and using suspend_again. 55 + 56 + * Support for uevent-notify 57 + With the charger-related events, the device sends 58 + notification to users with UEVENT. 59 + 60 + 2. Global Charger-Manager Data related with suspend_again 61 + ========================================================= 62 + In order to set up Charger Manager with the suspend-again feature 63 + (in-suspend monitoring), the user should provide charger_global_desc 64 + with setup_charger_manager(`struct charger_global_desc *`). 65 + This charger_global_desc data for in-suspend monitoring is global 66 + as the name suggests. Thus, the user needs to provide it only once even 67 + if there are multiple batteries. If there are multiple batteries, the 68 + multiple instances of Charger Manager share the same charger_global_desc 69 + and it will manage in-suspend monitoring for all instances of Charger Manager. 
 70 + 71 + The user needs to provide all three entries of `struct charger_global_desc` 72 + properly in order to activate in-suspend monitoring: 73 + 74 + `char *rtc_name;` 75 + The name of the rtc (e.g., "rtc0") used to wake up the system from 76 + suspend for Charger Manager. The alarm interrupt (AIE) of the rtc 77 + should be able to wake up the system from suspend. Charger Manager 78 + saves and restores the alarm value and uses the previously-defined 79 + alarm if it is going to go off earlier than Charger Manager so that 80 + Charger Manager does not interfere with previously-defined alarms. 81 + 82 + `bool (*rtc_only_wakeup)(void);` 83 + This callback should let CM know whether 84 + the wakeup-from-suspend is caused only by the alarm of "rtc" in the 85 + same struct. If any other wakeup source triggered the 86 + wakeup, it should return false. If the "rtc" is the only wakeup 87 + reason, it should return true. 88 + 89 + `bool assume_timer_stops_in_suspend;` 90 + If true, Charger Manager assumes that 91 + the timer (CM uses jiffies as the timer) stops during suspend. Then, CM 92 + assumes that the suspend-duration is the same as the alarm length. 93 + 94 + 95 + 3. How to setup suspend_again 96 + ============================= 97 + Charger Manager provides a function "extern bool cm_suspend_again(void)". 98 + When cm_suspend_again is called, it monitors every battery. The suspend_ops 99 + callback of the system's platform_suspend_ops can call the cm_suspend_again 100 + function to know whether Charger Manager wants to suspend again or not. 101 + If there are no other devices or tasks that want to use the suspend_again 102 + feature, the platform_suspend_ops may directly refer to cm_suspend_again 103 + for its suspend_again callback. 104 + 105 + cm_suspend_again() returns true (meaning "I want to suspend again") 106 + if the system was woken up by Charger Manager and the polling 107 + (in-suspend monitoring) results in "normal". 108 + 109 + 4. 
Charger-Manager Data (struct charger_desc) 110 + ============================================= 111 + For each battery charged independently from other batteries (if a series of 112 + batteries are charged by a single charger, they are counted as one independent 113 + battery), an instance of Charger Manager is attached to it. The following 114 + 115 + are the elements of struct charger_desc: 116 + 117 + `char *psy_name;` 118 + The power-supply-class name of the battery. Default is 119 + "battery" if psy_name is NULL. Users can access the psy entries 120 + at "/sys/class/power_supply/[psy_name]/". 121 + 122 + `enum polling_modes polling_mode;` 123 + CM_POLL_DISABLE: 124 + do not poll this battery. 125 + CM_POLL_ALWAYS: 126 + always poll this battery. 127 + CM_POLL_EXTERNAL_POWER_ONLY: 128 + poll this battery if and only if an external power 129 + source is attached. 130 + CM_POLL_CHARGING_ONLY: 131 + poll this battery if and only if the battery is being charged. 132 + 133 + `unsigned int fullbatt_vchkdrop_ms; / unsigned int fullbatt_vchkdrop_uV;` 134 + If both have non-zero values, Charger Manager will check the 135 + battery voltage drop fullbatt_vchkdrop_ms after the battery is fully 136 + charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger 137 + Manager will try to recharge the battery by disabling and enabling 138 + chargers. Recharging on the voltage-drop condition alone (without the 139 + delay condition) needs to be implemented with hardware interrupts from 140 + fuel gauges or charger devices/chips. 141 + 142 + `unsigned int fullbatt_uV;` 143 + If specified with a non-zero value, Charger Manager assumes 144 + that the battery is full (capacity = 100) if the battery is not being 145 + charged and the battery voltage is equal to or greater than 146 + fullbatt_uV. 147 + 148 + `unsigned int polling_interval_ms;` 149 + Required polling interval in ms. Charger Manager will poll 150 + this battery every polling_interval_ms or more frequently. 
 151 + 152 + `enum data_source battery_present;` 153 + CM_BATTERY_PRESENT: 154 + assume that the battery exists. 155 + CM_NO_BATTERY: 156 + assume that the battery does not exist. 157 + CM_FUEL_GAUGE: 158 + get battery presence information from the fuel gauge. 159 + CM_CHARGER_STAT: 160 + get battery presence from chargers. 161 + 162 + `char **psy_charger_stat;` 163 + An array ending with NULL that has the power-supply-class names of 164 + chargers. Each power-supply-class should provide "PRESENT" (if 165 + battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an 166 + external power source is attached or not), and "STATUS" (shows whether 167 + the battery is {"FULL" or not FULL} or {"FULL", "Charging", 168 + "Discharging", "NotCharging"}). 169 + 170 + `int num_charger_regulators; / struct regulator_bulk_data *charger_regulators;` 171 + Regulators representing the chargers in the form used by the 172 + regulator framework's bulk functions. 173 + 174 + `char *psy_fuel_gauge;` 175 + Power-supply-class name of the fuel gauge. 176 + 177 + `int (*temperature_out_of_range)(int *mC); / bool measure_battery_temp;` 178 + This callback returns 0 if the temperature is safe for charging, 179 + a positive number if it is too hot to charge, and a negative number 180 + if it is too cold to charge. Through the variable mC, the callback returns 181 + the temperature in 1/1000 of a degree centigrade. 182 + The source of the temperature can be the battery or the ambient one 183 + according to the value of measure_battery_temp. 184 + 185 + 186 + 5. Notify Charger-Manager of charger events: cm_notify_event() 187 + ============================================================== 188 + If a charger event needs to be notified to 189 + Charger Manager, the charger device driver that triggers the event can call 190 + cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. 191 + In the function, psy is the charger driver's power_supply pointer, which is 192 + associated with Charger-Manager. 
The parameter "type" 193 + is the same as the irq's type (enum cm_event_types). The event message "msg" is 194 + optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". 195 + 196 + 6. Other Considerations 197 + ======================= 198 + 199 + On charger/battery-related events such as battery-pulled-out, 200 + charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, 201 + and others critical to chargers, the system should be configured to wake up. 202 + At least the following should wake up the system from a suspend: 203 + a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) 204 + 205 + This is usually accomplished by configuring the PMIC as a wakeup source.
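Per section 4 above, a Charger Manager instance exposes the aggregated battery state to userspace under "/sys/class/power_supply/[psy_name]/". The sketch below is hedged: to stay runnable on any machine it builds a mock psy directory with invented values, whereas on a real system you would point psy at /sys/class/power_supply/battery (the default psy_name).

```shell
#!/bin/sh
# Hedged sketch: read the properties a power-supply-class entry exposes.
# A mock directory stands in for /sys/class/power_supply/battery so the
# example runs anywhere; the values below are invented for illustration.
psy=$(mktemp -d)/battery
mkdir -p "$psy"
echo Charging > "$psy/status"     # STATUS property
echo 1        > "$psy/online"     # external power attached?
echo 1        > "$psy/present"    # battery presence
for prop in status online present; do
    printf '%s: %s\n' "$prop" "$(cat "$psy/$prop")"
done
```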
-200
Documentation/power/charger-manager.txt
··· 1 - Charger Manager 2 - (C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL 3 - 4 - Charger Manager provides in-kernel battery charger management that 5 - requires temperature monitoring during suspend-to-RAM state 6 - and where each battery may have multiple chargers attached and the userland 7 - wants to look at the aggregated information of the multiple chargers. 8 - 9 - Charger Manager is a platform_driver with power-supply-class entries. 10 - An instance of Charger Manager (a platform-device created with Charger-Manager) 11 - represents an independent battery with chargers. If there are multiple 12 - batteries with their own chargers acting independently in a system, 13 - the system may need multiple instances of Charger Manager. 14 - 15 - 1. Introduction 16 - =============== 17 - 18 - Charger Manager supports the following: 19 - 20 - * Support for multiple chargers (e.g., a device with USB, AC, and solar panels) 21 - A system may have multiple chargers (or power sources) and some of 22 - they may be activated at the same time. Each charger may have its 23 - own power-supply-class and each power-supply-class can provide 24 - different information about the battery status. This framework 25 - aggregates charger-related information from multiple sources and 26 - shows combined information as a single power-supply-class. 27 - 28 - * Support for in suspend-to-RAM polling (with suspend_again callback) 29 - While the battery is being charged and the system is in suspend-to-RAM, 30 - we may need to monitor the battery health by looking at the ambient or 31 - battery temperature. We can accomplish this by waking up the system 32 - periodically. However, such a method wakes up devices unnecessarily for 33 - monitoring the battery health and tasks, and user processes that are 34 - supposed to be kept suspended. That, in turn, incurs unnecessary power 35 - consumption and slow down charging process. 
Or even, such peak power 36 - consumption can stop chargers in the middle of charging 37 - (external power input < device power consumption), which not 38 - only affects the charging time, but the lifespan of the battery. 39 - 40 - Charger Manager provides a function "cm_suspend_again" that can be 41 - used as suspend_again callback of platform_suspend_ops. If the platform 42 - requires tasks other than cm_suspend_again, it may implement its own 43 - suspend_again callback that calls cm_suspend_again in the middle. 44 - Normally, the platform will need to resume and suspend some devices 45 - that are used by Charger Manager. 46 - 47 - * Support for premature full-battery event handling 48 - If the battery voltage drops by "fullbatt_vchkdrop_uV" after 49 - "fullbatt_vchkdrop_ms" from the full-battery event, the framework 50 - restarts charging. This check is also performed while suspended by 51 - setting wakeup time accordingly and using suspend_again. 52 - 53 - * Support for uevent-notify 54 - With the charger-related events, the device sends 55 - notification to users with UEVENT. 56 - 57 - 2. Global Charger-Manager Data related with suspend_again 58 - ======================================================== 59 - In order to setup Charger Manager with suspend-again feature 60 - (in-suspend monitoring), the user should provide charger_global_desc 61 - with setup_charger_manager(struct charger_global_desc *). 62 - This charger_global_desc data for in-suspend monitoring is global 63 - as the name suggests. Thus, the user needs to provide only once even 64 - if there are multiple batteries. If there are multiple batteries, the 65 - multiple instances of Charger Manager share the same charger_global_desc 66 - and it will manage in-suspend monitoring for all instances of Charger Manager. 
67 - 68 - The user needs to provide all the three entries properly in order to activate 69 - in-suspend monitoring: 70 - 71 - struct charger_global_desc { 72 - 73 - char *rtc_name; 74 - : The name of rtc (e.g., "rtc0") used to wakeup the system from 75 - suspend for Charger Manager. The alarm interrupt (AIE) of the rtc 76 - should be able to wake up the system from suspend. Charger Manager 77 - saves and restores the alarm value and use the previously-defined 78 - alarm if it is going to go off earlier than Charger Manager so that 79 - Charger Manager does not interfere with previously-defined alarms. 80 - 81 - bool (*rtc_only_wakeup)(void); 82 - : This callback should let CM know whether 83 - the wakeup-from-suspend is caused only by the alarm of "rtc" in the 84 - same struct. If there is any other wakeup source triggered the 85 - wakeup, it should return false. If the "rtc" is the only wakeup 86 - reason, it should return true. 87 - 88 - bool assume_timer_stops_in_suspend; 89 - : if true, Charger Manager assumes that 90 - the timer (CM uses jiffies as timer) stops during suspend. Then, CM 91 - assumes that the suspend-duration is same as the alarm length. 92 - }; 93 - 94 - 3. How to setup suspend_again 95 - ============================= 96 - Charger Manager provides a function "extern bool cm_suspend_again(void)". 97 - When cm_suspend_again is called, it monitors every battery. The suspend_ops 98 - callback of the system's platform_suspend_ops can call cm_suspend_again 99 - function to know whether Charger Manager wants to suspend again or not. 100 - If there are no other devices or tasks that want to use suspend_again 101 - feature, the platform_suspend_ops may directly refer to cm_suspend_again 102 - for its suspend_again callback. 103 - 104 - The cm_suspend_again() returns true (meaning "I want to suspend again") 105 - if the system was woken up by Charger Manager and the polling 106 - (in-suspend monitoring) results in "normal". 107 - 108 - 4. 
Charger-Manager Data (struct charger_desc) 109 - ============================================= 110 - For each battery charged independently from other batteries (if a series of 111 - batteries are charged by a single charger, they are counted as one independent 112 - battery), an instance of Charger Manager is attached to it. 113 - 114 - struct charger_desc { 115 - 116 - char *psy_name; 117 - : The power-supply-class name of the battery. Default is 118 - "battery" if psy_name is NULL. Users can access the psy entries 119 - at "/sys/class/power_supply/[psy_name]/". 120 - 121 - enum polling_modes polling_mode; 122 - : CM_POLL_DISABLE: do not poll this battery. 123 - CM_POLL_ALWAYS: always poll this battery. 124 - CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if 125 - an external power source is attached. 126 - CM_POLL_CHARGING_ONLY: poll this battery if and only if the 127 - battery is being charged. 128 - 129 - unsigned int fullbatt_vchkdrop_ms; 130 - unsigned int fullbatt_vchkdrop_uV; 131 - : If both have non-zero values, Charger Manager will check the 132 - battery voltage drop fullbatt_vchkdrop_ms after the battery is fully 133 - charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger 134 - Manager will try to recharge the battery by disabling and enabling 135 - chargers. Recharge with voltage drop condition only (without delay 136 - condition) is needed to be implemented with hardware interrupts from 137 - fuel gauges or charger devices/chips. 138 - 139 - unsigned int fullbatt_uV; 140 - : If specified with a non-zero value, Charger Manager assumes 141 - that the battery is full (capacity = 100) if the battery is not being 142 - charged and the battery voltage is equal to or greater than 143 - fullbatt_uV. 144 - 145 - unsigned int polling_interval_ms; 146 - : Required polling interval in ms. Charger Manager will poll 147 - this battery every polling_interval_ms or more frequently. 
148 - 149 - enum data_source battery_present; 150 - : CM_BATTERY_PRESENT: assume that the battery exists. 151 - CM_NO_BATTERY: assume that the battery does not exists. 152 - CM_FUEL_GAUGE: get battery presence information from fuel gauge. 153 - CM_CHARGER_STAT: get battery presence from chargers. 154 - 155 - char **psy_charger_stat; 156 - : An array ending with NULL that has power-supply-class names of 157 - chargers. Each power-supply-class should provide "PRESENT" (if 158 - battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an 159 - external power source is attached or not), and "STATUS" (shows whether 160 - the battery is {"FULL" or not FULL} or {"FULL", "Charging", 161 - "Discharging", "NotCharging"}). 162 - 163 - int num_charger_regulators; 164 - struct regulator_bulk_data *charger_regulators; 165 - : Regulators representing the chargers in the form for 166 - regulator framework's bulk functions. 167 - 168 - char *psy_fuel_gauge; 169 - : Power-supply-class name of the fuel gauge. 170 - 171 - int (*temperature_out_of_range)(int *mC); 172 - bool measure_battery_temp; 173 - : This callback returns 0 if the temperature is safe for charging, 174 - a positive number if it is too hot to charge, and a negative number 175 - if it is too cold to charge. With the variable mC, the callback returns 176 - the temperature in 1/1000 of centigrade. 177 - The source of temperature can be battery or ambient one according to 178 - the value of measure_battery_temp. 179 - }; 180 - 181 - 5. Notify Charger-Manager of charger events: cm_notify_event() 182 - ========================================================= 183 - If there is an charger event is required to notify 184 - Charger Manager, a charger device driver that triggers the event can call 185 - cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. 186 - In the function, psy is the charger driver's power_supply pointer, which is 187 - associated with Charger-Manager. 
The parameter "type" 188 - is the same as irq's type (enum cm_event_types). The event message "msg" is 189 - optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". 190 - 191 - 6. Other Considerations 192 - ======================= 193 - 194 - At the charger/battery-related events such as battery-pulled-out, 195 - charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, 196 - and others critical to chargers, the system should be configured to wake up. 197 - At least the following should wake up the system from a suspend: 198 - a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) 199 - 200 - It is usually accomplished by configuring the PMIC as a wakeup source.
+51
Documentation/power/drivers-testing.rst
··· 1 + ==================================================== 2 + Testing suspend and resume support in device drivers 3 + ==================================================== 4 + 5 + (C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL 6 + 7 + 1. Preparing the test system 8 + ============================ 9 + 10 + Unfortunately, to effectively test the support for the system-wide suspend and 11 + resume transitions in a driver, it is necessary to suspend and resume a fully 12 + functional system with this driver loaded. Moreover, that should be done 13 + several times, preferably several times in a row, and separately for hibernation 14 + (aka suspend to disk or STD) and suspend to RAM (STR), because each of these 15 + cases involves slightly different operations and different interactions with 16 + the machine's BIOS. 17 + 18 + Of course, for this purpose the test system has to be known to suspend and 19 + resume without the driver being tested. Thus, if possible, you should first 20 + resolve all suspend/resume-related problems in the test system before you start 21 + testing the new driver. Please see Documentation/power/basic-pm-debugging.rst 22 + for more information about the debugging of suspend/resume functionality. 23 + 24 + 2. Testing the driver 25 + ===================== 26 + 27 + Once you have resolved the suspend/resume-related problems with your test system 28 + without the new driver, you are ready to test it: 29 + 30 + a) Build the driver as a module, load it and try the test modes of hibernation 31 + (see: Documentation/power/basic-pm-debugging.rst, 1). 32 + 33 + b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and 34 + "platform" modes (see: Documentation/power/basic-pm-debugging.rst, 1). 35 + 36 + c) Compile the driver directly into the kernel and try the test modes of 37 + hibernation. 38 + 39 + d) Attempt to hibernate with the driver compiled directly into the kernel 40 + in the "reboot", "shutdown" and "platform" modes. 
41 + 42 + e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst, 43 + 2). [As far as the STR tests are concerned, it should not matter whether or 44 + not the driver is built as a module.] 45 + 46 + f) Attempt to suspend to RAM using the s2ram tool with the driver loaded 47 + (see: Documentation/power/basic-pm-debugging.rst, 2). 48 + 49 + Each of the above tests should be repeated several times and the STD tests 50 + should be mixed with the STR tests. If any of them fails, the driver cannot be 51 + regarded as suspend/resume-safe.
-46
Documentation/power/drivers-testing.txt
··· 1 - Testing suspend and resume support in device drivers 2 - (C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL 3 - 4 - 1. Preparing the test system 5 - 6 - Unfortunately, to effectively test the support for the system-wide suspend and 7 - resume transitions in a driver, it is necessary to suspend and resume a fully 8 - functional system with this driver loaded. Moreover, that should be done 9 - several times, preferably several times in a row, and separately for hibernation 10 - (aka suspend to disk or STD) and suspend to RAM (STR), because each of these 11 - cases involves slightly different operations and different interactions with 12 - the machine's BIOS. 13 - 14 - Of course, for this purpose the test system has to be known to suspend and 15 - resume without the driver being tested. Thus, if possible, you should first 16 - resolve all suspend/resume-related problems in the test system before you start 17 - testing the new driver. Please see Documentation/power/basic-pm-debugging.txt 18 - for more information about the debugging of suspend/resume functionality. 19 - 20 - 2. Testing the driver 21 - 22 - Once you have resolved the suspend/resume-related problems with your test system 23 - without the new driver, you are ready to test it: 24 - 25 - a) Build the driver as a module, load it and try the test modes of hibernation 26 - (see: Documentation/power/basic-pm-debugging.txt, 1). 27 - 28 - b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and 29 - "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1). 30 - 31 - c) Compile the driver directly into the kernel and try the test modes of 32 - hibernation. 33 - 34 - d) Attempt to hibernate with the driver compiled directly into the kernel 35 - in the "reboot", "shutdown" and "platform" modes. 36 - 37 - e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt, 38 - 2). 
[As far as the STR tests are concerned, it should not matter whether or 39 - not the driver is built as a module.] 40 - 41 - f) Attempt to suspend to RAM using the s2ram tool with the driver loaded 42 - (see: Documentation/power/basic-pm-debugging.txt, 2). 43 - 44 - Each of the above tests should be repeated several times and the STD tests 45 - should be mixed with the STR tests. If any of them fails, the driver cannot be 46 - regarded as suspend/resume-safe.
+147
Documentation/power/energy-model.rst
··· 1 + ==================== 2 + Energy Model of CPUs 3 + ==================== 4 + 5 + 1. Overview 6 + ----------- 7 + 8 + The Energy Model (EM) framework serves as an interface between drivers knowing 9 + the power consumed by CPUs at various performance levels, and the kernel 10 + subsystems willing to use that information to make energy-aware decisions. 11 + 12 + The source of the information about the power consumed by CPUs can vary greatly 13 + from one platform to another. These power costs can be estimated using 14 + devicetree data in some cases. In others, the firmware will know better. 15 + Alternatively, userspace might be best positioned. And so on. In order to avoid 16 + having each and every client subsystem re-implement support for each and every 17 + possible source of information on its own, the EM framework intervenes as an 18 + abstraction layer which standardizes the format of power cost tables in the 19 + kernel, hence avoiding redundant work. 20 + 21 + The figure below depicts an example of drivers (Arm-specific here, but the 22 + approach is applicable to any architecture) providing power costs to the EM 23 + framework, and interested clients reading the data from it:: 24 + 25 + +---------------+ +-----------------+ +---------------+ 26 + | Thermal (IPA) | | Scheduler (EAS) | | Other | 27 + +---------------+ +-----------------+ +---------------+ 28 + | | em_pd_energy() | 29 + | | em_cpu_get() | 30 + +---------+ | +---------+ 31 + | | | 32 + v v v 33 + +---------------------+ 34 + | Energy Model | 35 + | Framework | 36 + +---------------------+ 37 + ^ ^ ^ 38 + | | | em_register_perf_domain() 39 + +----------+ | +---------+ 40 + | | | 41 + +---------------+ +---------------+ +--------------+ 42 + | cpufreq-dt | | arm_scmi | | Other | 43 + +---------------+ +---------------+ +--------------+ 44 + ^ ^ ^ 45 + | | | 46 + +--------------+ +---------------+ +--------------+ 47 + | Device Tree | | Firmware | | ? 
| 48 + +--------------+ +---------------+ +--------------+ 49 + 50 + The EM framework manages power cost tables per 'performance domain' in the 51 + system. A performance domain is a group of CPUs whose performance is scaled 52 + together. Performance domains generally have a 1-to-1 mapping with CPUFreq 53 + policies. All CPUs in a performance domain are required to have the same 54 + micro-architecture. CPUs in different performance domains can have different 55 + micro-architectures. 56 + 57 + 58 + 2. Core APIs 59 + ------------ 60 + 61 + 2.1 Config options 62 + ^^^^^^^^^^^^^^^^^^ 63 + 64 + CONFIG_ENERGY_MODEL must be enabled to use the EM framework. 65 + 66 + 67 + 2.2 Registration of performance domains 68 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 69 + 70 + Drivers are expected to register performance domains into the EM framework by 71 + calling the following API:: 72 + 73 + int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, 74 + struct em_data_callback *cb); 75 + 76 + Drivers must specify the CPUs of the performance domains using the cpumask 77 + argument, and provide a callback function returning <frequency, power> tuples 78 + for each capacity state. The callback function provided by the driver is free 79 + to fetch data from any relevant location (DT, firmware, ...), and by any means 80 + deemed necessary. See Section 3 for an example of a driver implementing this 81 + callback, and kernel/power/energy_model.c for further documentation on this 82 + API. 83 + 84 + 85 + 2.3 Accessing performance domains 86 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 87 + 88 + Subsystems interested in the energy model of a CPU can retrieve it using the 89 + em_cpu_get() API. The energy model tables are allocated once upon creation of 90 + the performance domains, and kept in memory untouched. 91 + 92 + The energy consumed by a performance domain can be estimated using the 93 + em_pd_energy() API. 
The estimation is performed assuming that the schedutil 94 + CPUfreq governor is in use. 95 + 96 + More details about the above APIs can be found in include/linux/energy_model.h. 97 + 98 + 99 + 3. Example driver 100 + ----------------- 101 + 102 + This section provides a simple example of a CPUFreq driver registering a 103 + performance domain in the Energy Model framework using the (fake) 'foo' 104 + protocol. The driver implements an est_power() function to be provided to the 105 + EM framework:: 106 + 107 + -> drivers/cpufreq/foo_cpufreq.c 108 + 109 + static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) 110 + { 111 + long freq, power; 112 + 113 + /* Use the 'foo' protocol to ceil the frequency */ 114 + freq = foo_get_freq_ceil(cpu, *KHz); 115 + if (freq < 0) 116 + return freq; 117 + 118 + /* Estimate the power cost for the CPU at the relevant freq. */ 119 + power = foo_estimate_power(cpu, freq); 120 + if (power < 0) 121 + return power; 122 + 123 + /* Return the values to the EM framework */ 124 + *mW = power; 125 + *KHz = freq; 126 + 127 + return 0; 128 + } 129 + 130 + static int foo_cpufreq_init(struct cpufreq_policy *policy) 131 + { 132 + struct em_data_callback em_cb = EM_DATA_CB(est_power); 133 + int nr_opp, ret; 134 + 135 + /* Do the actual CPUFreq init work ... */ 136 + ret = do_foo_cpufreq_init(policy); 137 + if (ret) 138 + return ret; 139 + 140 + /* Find the number of OPPs for this policy */ 141 + nr_opp = foo_get_nr_opp(policy); 142 + 143 + /* And register the new performance domain */ 144 + em_register_perf_domain(policy->cpus, nr_opp, &em_cb); 145 + 146 + return 0; 147 + }
-144
Documentation/power/energy-model.txt
··· 1 - ==================== 2 - Energy Model of CPUs 3 - ==================== 4 - 5 - 1. Overview 6 - ----------- 7 - 8 - The Energy Model (EM) framework serves as an interface between drivers knowing 9 - the power consumed by CPUs at various performance levels, and the kernel 10 - subsystems willing to use that information to make energy-aware decisions. 11 - 12 - The source of the information about the power consumed by CPUs can vary greatly 13 - from one platform to another. These power costs can be estimated using 14 - devicetree data in some cases. In others, the firmware will know better. 15 - Alternatively, userspace might be best positioned. And so on. In order to avoid 16 - each and every client subsystem to re-implement support for each and every 17 - possible source of information on its own, the EM framework intervenes as an 18 - abstraction layer which standardizes the format of power cost tables in the 19 - kernel, hence enabling to avoid redundant work. 20 - 21 - The figure below depicts an example of drivers (Arm-specific here, but the 22 - approach is applicable to any architecture) providing power costs to the EM 23 - framework, and interested clients reading the data from it. 24 - 25 - +---------------+ +-----------------+ +---------------+ 26 - | Thermal (IPA) | | Scheduler (EAS) | | Other | 27 - +---------------+ +-----------------+ +---------------+ 28 - | | em_pd_energy() | 29 - | | em_cpu_get() | 30 - +---------+ | +---------+ 31 - | | | 32 - v v v 33 - +---------------------+ 34 - | Energy Model | 35 - | Framework | 36 - +---------------------+ 37 - ^ ^ ^ 38 - | | | em_register_perf_domain() 39 - +----------+ | +---------+ 40 - | | | 41 - +---------------+ +---------------+ +--------------+ 42 - | cpufreq-dt | | arm_scmi | | Other | 43 - +---------------+ +---------------+ +--------------+ 44 - ^ ^ ^ 45 - | | | 46 - +--------------+ +---------------+ +--------------+ 47 - | Device Tree | | Firmware | | ? 
| 48 - +--------------+ +---------------+ +--------------+ 49 - 50 - The EM framework manages power cost tables per 'performance domain' in the 51 - system. A performance domain is a group of CPUs whose performance is scaled 52 - together. Performance domains generally have a 1-to-1 mapping with CPUFreq 53 - policies. All CPUs in a performance domain are required to have the same 54 - micro-architecture. CPUs in different performance domains can have different 55 - micro-architectures. 56 - 57 - 58 - 2. Core APIs 59 - ------------ 60 - 61 - 2.1 Config options 62 - 63 - CONFIG_ENERGY_MODEL must be enabled to use the EM framework. 64 - 65 - 66 - 2.2 Registration of performance domains 67 - 68 - Drivers are expected to register performance domains into the EM framework by 69 - calling the following API: 70 - 71 - int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, 72 - struct em_data_callback *cb); 73 - 74 - Drivers must specify the CPUs of the performance domains using the cpumask 75 - argument, and provide a callback function returning <frequency, power> tuples 76 - for each capacity state. The callback function provided by the driver is free 77 - to fetch data from any relevant location (DT, firmware, ...), and by any mean 78 - deemed necessary. See Section 3. for an example of driver implementing this 79 - callback, and kernel/power/energy_model.c for further documentation on this 80 - API. 81 - 82 - 83 - 2.3 Accessing performance domains 84 - 85 - Subsystems interested in the energy model of a CPU can retrieve it using the 86 - em_cpu_get() API. The energy model tables are allocated once upon creation of 87 - the performance domains, and kept in memory untouched. 88 - 89 - The energy consumed by a performance domain can be estimated using the 90 - em_pd_energy() API. The estimation is performed assuming that the schedutil 91 - CPUfreq governor is in use. 92 - 93 - More details about the above APIs can be found in include/linux/energy_model.h. 
94 - 95 - 96 - 3. Example driver 97 - ----------------- 98 - 99 - This section provides a simple example of a CPUFreq driver registering a 100 - performance domain in the Energy Model framework using the (fake) 'foo' 101 - protocol. The driver implements an est_power() function to be provided to the 102 - EM framework. 103 - 104 - -> drivers/cpufreq/foo_cpufreq.c 105 - 106 - 01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) 107 - 02 { 108 - 03 long freq, power; 109 - 04 110 - 05 /* Use the 'foo' protocol to ceil the frequency */ 111 - 06 freq = foo_get_freq_ceil(cpu, *KHz); 112 - 07 if (freq < 0); 113 - 08 return freq; 114 - 09 115 - 10 /* Estimate the power cost for the CPU at the relevant freq. */ 116 - 11 power = foo_estimate_power(cpu, freq); 117 - 12 if (power < 0); 118 - 13 return power; 119 - 14 120 - 15 /* Return the values to the EM framework */ 121 - 16 *mW = power; 122 - 17 *KHz = freq; 123 - 18 124 - 19 return 0; 125 - 20 } 126 - 21 127 - 22 static int foo_cpufreq_init(struct cpufreq_policy *policy) 128 - 23 { 129 - 24 struct em_data_callback em_cb = EM_DATA_CB(est_power); 130 - 25 int nr_opp, ret; 131 - 26 132 - 27 /* Do the actual CPUFreq init work ... */ 133 - 28 ret = do_foo_cpufreq_init(policy); 134 - 29 if (ret) 135 - 30 return ret; 136 - 31 137 - 32 /* Find the number of OPPs for this policy */ 138 - 33 nr_opp = foo_get_nr_opp(policy); 139 - 34 140 - 35 /* And register the new performance domain */ 141 - 36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); 142 - 37 143 - 38 return 0; 144 - 39 }
+244
Documentation/power/freezing-of-tasks.rst
··· 1 + ================= 2 + Freezing of tasks 3 + ================= 4 + 5 + (C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL 6 + 7 + I. What is the freezing of tasks? 8 + ================================= 9 + 10 + The freezing of tasks is a mechanism by which user space processes and some 11 + kernel threads are controlled during hibernation or system-wide suspend (on some 12 + architectures). 13 + 14 + II. How does it work? 15 + ===================== 16 + 17 + There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN 18 + and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have 19 + PF_NOFREEZE unset (all user space processes and some kernel threads) are 20 + regarded as 'freezable' and treated in a special way before the system enters a 21 + suspend state as well as before a hibernation image is created (in what follows 22 + we only consider hibernation, but the description also applies to suspend). 23 + 24 + Namely, as the first step of the hibernation procedure the function 25 + freeze_processes() (defined in kernel/power/process.c) is called. A system-wide 26 + variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate 27 + whether the system is to undergo a freezing operation. And freeze_processes() 28 + sets this variable. After this, it executes try_to_freeze_tasks() that sends a 29 + fake signal to all user space processes, and wakes up all the kernel threads. 30 + All freezable tasks must react to that by calling try_to_freeze(), which 31 + results in a call to __refrigerator() (defined in kernel/freezer.c), which sets 32 + the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes 33 + it loop until PF_FROZEN is cleared for it. Then, we say that the task is 34 + 'frozen' and therefore the set of functions handling this mechanism is referred 35 + to as 'the freezer' (these functions are defined in kernel/power/process.c, 36 + kernel/freezer.c & include/linux/freezer.h). 
User space processes are generally 37 + frozen before kernel threads. 38 + 39 + __refrigerator() must not be called directly. Instead, use the 40 + try_to_freeze() function (defined in include/linux/freezer.h), that checks 41 + if the task is to be frozen and makes the task enter __refrigerator(). 42 + 43 + For user space processes try_to_freeze() is called automatically from the 44 + signal-handling code, but the freezable kernel threads need to call it 45 + explicitly in suitable places or use the wait_event_freezable() or 46 + wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) 47 + that combine interruptible sleep with checking if the task is to be frozen and 48 + calling try_to_freeze(). The main loop of a freezable kernel thread may look 49 + like the following one:: 50 + 51 + set_freezable(); 52 + do { 53 + hub_events(); 54 + wait_event_freezable(khubd_wait, 55 + !list_empty(&hub_event_list) || 56 + kthread_should_stop()); 57 + } while (!kthread_should_stop() || !list_empty(&hub_event_list)); 58 + 59 + (from drivers/usb/core/hub.c::hub_thread()). 60 + 61 + If a freezable kernel thread fails to call try_to_freeze() after the freezer has 62 + initiated a freezing operation, the freezing of tasks will fail and the entire 63 + hibernation operation will be cancelled. For this reason, freezable kernel 64 + threads must call try_to_freeze() somewhere or use one of the 65 + wait_event_freezable() and wait_event_freezable_timeout() macros. 66 + 67 + After the system memory state has been restored from a hibernation image and 68 + devices have been reinitialized, the function thaw_processes() is called in 69 + order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that 70 + have been frozen leave __refrigerator() and continue running. 
71 + 72 + 73 + Rationale behind the functions dealing with freezing and thawing of tasks 74 + ------------------------------------------------------------------------- 75 + 76 + freeze_processes(): 77 + - freezes only userspace tasks 78 + 79 + freeze_kernel_threads(): 80 + - freezes all tasks (including kernel threads) because we can't freeze 81 + kernel threads without freezing userspace tasks 82 + 83 + thaw_kernel_threads(): 84 + - thaws only kernel threads; this is particularly useful if we need to do 85 + anything special in between thawing of kernel threads and thawing of 86 + userspace tasks, or if we want to postpone the thawing of userspace tasks 87 + 88 + thaw_processes(): 89 + - thaws all tasks (including kernel threads) because we can't thaw userspace 90 + tasks without thawing kernel threads 91 + 92 + 93 + III. Which kernel threads are freezable? 94 + ======================================== 95 + 96 + Kernel threads are not freezable by default. However, a kernel thread may clear 97 + PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE 98 + directly is not allowed). From this point it is regarded as freezable 99 + and must call try_to_freeze() in a suitable place. 100 + 101 + IV. Why do we do that? 102 + ====================== 103 + 104 + Generally speaking, there are a couple of reasons to use the freezing of tasks: 105 + 106 + 1. The principal reason is to prevent filesystems from being damaged after 107 + hibernation. At the moment we have no simple means of checkpointing 108 + filesystems, so if there are any modifications made to filesystem data and/or 109 + metadata on disks, we cannot bring them back to the state from before the 110 + modifications. 
At the same time each hibernation image contains some 111 + filesystem-related information that must be consistent with the state of the 112 + on-disk data and metadata after the system memory state has been restored 113 + from the image (otherwise the filesystems will be damaged in a nasty way, 114 + usually making them almost impossible to repair). We therefore freeze 115 + tasks that might cause the on-disk filesystems' data and metadata to be 116 + modified after the hibernation image has been created and before the 117 + system is finally powered off. The majority of these are user space 118 + processes, but if any of the kernel threads may cause something like this 119 + to happen, they have to be freezable. 120 + 121 + 2. Next, to create the hibernation image we need to free a sufficient amount of 122 + memory (approximately 50% of available RAM) and we need to do that before 123 + devices are deactivated, because we generally need them for swapping out. 124 + Then, after the memory for the image has been freed, we don't want tasks 125 + to allocate additional memory and we prevent them from doing that by 126 + freezing them earlier. [Of course, this also means that device drivers 127 + should not allocate substantial amounts of memory from their .suspend() 128 + callbacks before hibernation, but this is a separate issue.] 129 + 130 + 3. The third reason is to prevent user space processes and some kernel threads 131 + from interfering with the suspending and resuming of devices. A user space 132 + process running on a second CPU while we are suspending devices may, for 133 + example, be troublesome and without the freezing of tasks we would need some 134 + safeguards against race conditions that might occur in such a case. 
135 + 136 + Although Linus Torvalds doesn't like the freezing of tasks, he said this in one 137 + of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): 138 + 139 + "RJW:> Why we freeze tasks at all or why we freeze kernel threads? 140 + 141 + Linus: In many ways, 'at all'. 142 + 143 + I **do** realize the IO request queue issues, and that we cannot actually do 144 + s2ram with some devices in the middle of a DMA. So we want to be able to 145 + avoid *that*, there's no question about that. And I suspect that stopping 146 + user threads and then waiting for a sync is practically one of the easier 147 + ways to do so. 148 + 149 + So in practice, the 'at all' may become a 'why freeze kernel threads?' and 150 + freezing user threads I don't find really objectionable." 151 + 152 + Still, there are kernel threads that may want to be freezable. For example, if 153 + a kernel thread that belongs to a device driver accesses the device directly, it 154 + in principle needs to know when the device is suspended, so that it doesn't try 155 + to access it at that time. However, if the kernel thread is freezable, it will 156 + be frozen before the driver's .suspend() callback is executed and it will be 157 + thawed after the driver's .resume() callback has run, so it won't be accessing 158 + the device while it's suspended. 159 + 160 + 4. Another reason for freezing tasks is to prevent user space processes from 161 + realizing that hibernation (or suspend) operation takes place. Ideally, user 162 + space processes should not notice that such a system-wide operation has 163 + occurred and should continue running without any problems after the restore 164 + (or resume from suspend). Unfortunately, in the most general case this 165 + is quite difficult to achieve without the freezing of tasks. Consider, 166 + for example, a process that depends on all CPUs being online while it's 167 + running. 
Since we need to disable nonboot CPUs during the hibernation, 168 + if this process is not frozen, it may notice that the number of CPUs has 169 + changed and may start to work incorrectly because of that. 170 + 171 + V. Are there any problems related to the freezing of tasks? 172 + =========================================================== 173 + 174 + Yes, there are. 175 + 176 + First of all, the freezing of kernel threads may be tricky if they depend one 177 + on another. For example, if kernel thread A waits for a completion (in the 178 + TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B 179 + and B is frozen in the meantime, then A will be blocked until B is thawed, which 180 + may be undesirable. That's why kernel threads are not freezable by default. 181 + 182 + Second, there are the following two problems related to the freezing of user 183 + space processes: 184 + 185 + 1. Putting processes into an uninterruptible sleep distorts the load average. 186 + 2. Now that we have FUSE, plus the framework for doing device drivers in 187 + userspace, it gets even more complicated because some userspace processes are 188 + now doing the sorts of things that kernel threads do 189 + (https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). 190 + 191 + The problem 1. seems to be fixable, although it hasn't been fixed so far. The 192 + other one is more serious, but it seems that we can work around it by using 193 + hibernation (and suspend) notifiers (in that case, though, we won't be able to 194 + avoid the realization by the user space processes that the hibernation is taking 195 + place). 196 + 197 + There are also problems that the freezing of tasks tends to expose, although 198 + they are not directly related to it. 
For example, if request_firmware() is 199 + called from a device driver's .resume() routine, it will timeout and eventually 200 + fail, because the user land process that should respond to the request is frozen 201 + at this point. So, seemingly, the failure is due to the freezing of tasks. 202 + Suppose, however, that the firmware file is located on a filesystem accessible 203 + only through another device that hasn't been resumed yet. In that case, 204 + request_firmware() will fail regardless of whether or not the freezing of tasks 205 + is used. Consequently, the problem is not really related to the freezing of 206 + tasks, since it generally exists anyway. 207 + 208 + A driver must have all firmwares it may need in RAM before suspend() is called. 209 + If keeping them is not practical, for example due to their size, they must be 210 + requested early enough using the suspend notifier API described in 211 + Documentation/driver-api/pm/notifiers.rst. 212 + 213 + VI. Are there any precautions to be taken to prevent freezing failures? 214 + ======================================================================= 215 + 216 + Yes, there are. 217 + 218 + First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code 219 + from system-wide sleep such as suspend/hibernation is not encouraged. 220 + If possible, that piece of code must instead hook onto the suspend/hibernation 221 + notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code 222 + (kernel/cpu.c) for an example. 
223 + 224 + However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary, 225 + it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since 226 + that could lead to freezing failures, because if the suspend/hibernate code 227 + successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed 228 + to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE 229 + state. As a consequence, the freezer would not be able to freeze that task, 230 + leading to freezing failure. 231 + 232 + However, the [un]lock_system_sleep() APIs are safe to use in this scenario, 233 + since they ask the freezer to skip freezing this task, since it is anyway 234 + "frozen enough" as it is blocked on 'system_transition_mutex', which will be released 235 + only after the entire suspend/hibernation sequence is complete. 236 + So, to summarize, use [un]lock_system_sleep() instead of directly using 237 + mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures. 238 + 239 + V. Miscellaneous 240 + ================ 241 + 242 + /sys/power/pm_freeze_timeout controls how long it will cost at most to freeze 243 + all user space processes or all freezable kernel threads, in unit of millisecond. 244 + The default value is 20000, with range of unsigned integer.
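The recommendation above can be sketched as follows.  This is an illustrative fragment only, not buildable on its own: my_update_settings() is a hypothetical driver function, while lock_system_sleep()/unlock_system_sleep() are the real APIs from include/linux/suspend.h.

```c
#include <linux/suspend.h>

/*
 * A code path that must not run concurrently with system-wide
 * suspend/hibernation.  Using lock_system_sleep() instead of
 * mutex_lock(&system_transition_mutex) keeps the task "freezable
 * enough" while it waits for the lock, so a concurrent freezing
 * operation does not fail because of it.
 */
static void my_update_settings(void)	/* hypothetical */
{
	lock_system_sleep();
	/* ... touch state that must not change mid-suspend ... */
	unlock_system_sleep();
}
```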
Documentation/power/freezing-of-tasks.txt (deleted, 231 lines)
-Freezing of tasks
-	(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
-
-I. What is the freezing of tasks?
-
-The freezing of tasks is a mechanism by which user space processes and some
-kernel threads are controlled during hibernation or system-wide suspend (on
-some architectures).
-
-II. How does it work?
-
-There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
-and PF_FREEZER_SKIP (the last one is auxiliary).  The tasks that have
-PF_NOFREEZE unset (all user space processes and some kernel threads) are
-regarded as 'freezable' and treated in a special way before the system
-enters a suspend state as well as before a hibernation image is created
-(in what follows we only consider hibernation, but the description also
-applies to suspend).
-
-Namely, as the first step of the hibernation procedure the function
-freeze_processes() (defined in kernel/power/process.c) is called.  A
-system-wide variable system_freezing_cnt (as opposed to a per-task flag)
-is used to indicate whether the system is to undergo a freezing operation.
-And freeze_processes() sets this variable.  After this, it executes
-try_to_freeze_tasks() that sends a fake signal to all user space processes,
-and wakes up all the kernel threads.  All freezable tasks must react to
-that by calling try_to_freeze(), which results in a call to
-__refrigerator() (defined in kernel/freezer.c), which sets the task's
-PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes it
-loop until PF_FROZEN is cleared for it.  Then, we say that the task is
-'frozen' and therefore the set of functions handling this mechanism is
-referred to as 'the freezer' (these functions are defined in
-kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h).
-User space processes are generally frozen before kernel threads.
···
Documentation/power/index.rst (new, 46 lines)
+:orphan:
+
+================
+Power Management
+================
+
+.. toctree::
+   :maxdepth: 1
+
+   apm-acpi
+   basic-pm-debugging
+   charger-manager
+   drivers-testing
+   energy-model
+   freezing-of-tasks
+   interface
+   opp
+   pci
+   pm_qos_interface
+   power_supply_class
+   runtime_pm
+   s2ram
+   suspend-and-cpuhotplug
+   suspend-and-interrupts
+   swsusp-and-swap-files
+   swsusp-dmcrypt
+   swsusp
+   video
+   tricks
+
+   userland-swsusp
+
+   powercap/powercap
+
+   regulator/consumer
+   regulator/design
+   regulator/machine
+   regulator/overview
+   regulator/regulator
+
+.. only:: subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
Documentation/power/interface.rst (new, 79 lines)
+===========================================
+Power Management Interface for System Sleep
+===========================================
+
+Copyright (c) 2016 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The power management subsystem provides userspace with a unified sysfs
+interface for system sleep regardless of the underlying system architecture
+or platform.  The interface is located in the /sys/power/ directory
+(assuming that sysfs is mounted at /sys).
+
+/sys/power/state is the system sleep state control file.
+
+Reading from it returns a list of supported sleep states, encoded as:
+
+- 'freeze' (Suspend-to-Idle)
+- 'standby' (Power-On Suspend)
+- 'mem' (Suspend-to-RAM)
+- 'disk' (Suspend-to-Disk)
+
+Suspend-to-Idle is always supported.  Suspend-to-Disk is always supported
+too, as long as the kernel has been configured to support hibernation at
+all (i.e. CONFIG_HIBERNATION is set in the kernel configuration file).
+Support for Suspend-to-RAM and Power-On Suspend depends on the capabilities
+of the platform.
+
+If one of the strings listed in /sys/power/state is written to it, the
+system will attempt to transition into the corresponding sleep state.
+Refer to Documentation/admin-guide/pm/sleep-states.rst for a description of
+each of those states.
+
+/sys/power/disk controls the operating mode of hibernation
+(Suspend-to-Disk).  Specifically, it tells the kernel what to do after
+creating a hibernation image.
+
+Reading from it returns a list of supported options encoded as:
+
+- 'platform' (put the system into sleep using a platform-provided method)
+- 'shutdown' (shut the system down)
+- 'reboot' (reboot the system)
+- 'suspend' (trigger a Suspend-to-RAM transition)
+- 'test_resume' (resume-after-hibernation test mode)
+
+The currently selected option is printed in square brackets.
+
+The 'platform' option is only available if the platform provides a special
+mechanism to put the system to sleep after creating a hibernation image
+(ACPI does that, for example).  The 'suspend' option is available if
+Suspend-to-RAM is supported.  Refer to
+Documentation/power/basic-pm-debugging.rst for the description of the
+'test_resume' option.
+
+To select an option, write the string representing it to /sys/power/disk.
+
+/sys/power/image_size controls the size of hibernation images.
+
+A string representing a non-negative integer can be written to it; that
+number will be used as a best-effort upper limit of the image size, in
+bytes.  The hibernation core will do its best to ensure that the image
+size does not exceed that number.  However, if that turns out to be
+impossible to achieve, a hibernation image will still be created and its
+size will be as small as possible.  In particular, writing '0' to this
+file will cause hibernation images to be as small as possible.
+
+Reading from this file returns the current image size limit, which is set
+to around 2/5 of available RAM by default.
+
+/sys/power/pm_trace controls the PM trace mechanism, which saves the last
+suspend or resume event point in the RTC across reboots.
+
+It helps to debug hard lockups or reboots due to device driver failures
+that occur during system suspend or resume (which is more common) more
+effectively.
+
+If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume
+event point in turn will be stored in the RTC memory (overwriting the
+actual RTC information), so it will survive a system crash if one occurs
+right after storing it, and it can be used later to identify the driver
+that caused the crash to happen (see Documentation/power/s2ram.rst for
+more information).
+
+Initially it contains '0', which may be changed to '1' by writing a string
+representing a nonzero integer into it.
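As a quick illustration of the control files described above (illustrative commands only; they require root and a platform that actually supports the states shown, and the last one suspends the machine):

```
# cat /sys/power/state            # e.g. "freeze mem disk"
# cat /sys/power/disk             # e.g. "[platform] shutdown reboot suspend test_resume"
# echo shutdown > /sys/power/disk     # select the hibernation mode
# echo 0 > /sys/power/image_size      # ask for the smallest possible image
# echo disk > /sys/power/state        # start hibernation
```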
Documentation/power/interface.txt (deleted, 77 lines)
···
Documentation/power/opp.rst (new, 379 lines)
··· 1 + ========================================== 2 + Operating Performance Points (OPP) Library 3 + ========================================== 4 + 5 + (C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated 6 + 7 + .. Contents 8 + 9 + 1. Introduction 10 + 2. Initial OPP List Registration 11 + 3. OPP Search Functions 12 + 4. OPP Availability Control Functions 13 + 5. OPP Data Retrieval Functions 14 + 6. Data Structures 15 + 16 + 1. Introduction 17 + =============== 18 + 19 + 1.1 What is an Operating Performance Point (OPP)? 20 + ------------------------------------------------- 21 + 22 + Complex SoCs of today consists of a multiple sub-modules working in conjunction. 23 + In an operational system executing varied use cases, not all modules in the SoC 24 + need to function at their highest performing frequency all the time. To 25 + facilitate this, sub-modules in a SoC are grouped into domains, allowing some 26 + domains to run at lower voltage and frequency while other domains run at 27 + voltage/frequency pairs that are higher. 28 + 29 + The set of discrete tuples consisting of frequency and voltage pairs that 30 + the device will support per domain are called Operating Performance Points or 31 + OPPs. 32 + 33 + As an example: 34 + 35 + Let us consider an MPU device which supports the following: 36 + {300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V}, 37 + {1GHz at minimum voltage of 1.3V} 38 + 39 + We can represent these as three OPPs as the following {Hz, uV} tuples: 40 + 41 + - {300000000, 1000000} 42 + - {800000000, 1200000} 43 + - {1000000000, 1300000} 44 + 45 + 1.2 Operating Performance Points Library 46 + ---------------------------------------- 47 + 48 + OPP library provides a set of helper functions to organize and query the OPP 49 + information. The library is located in drivers/base/power/opp.c and the header 50 + is located in include/linux/pm_opp.h. 
The OPP library can be enabled by enabling 51 + CONFIG_PM_OPP from the power management menuconfig menu. The OPP library depends on 52 + CONFIG_PM because certain SoC frameworks, such as Texas Instruments' OMAP framework, allow 53 + optionally booting at a certain OPP without needing cpufreq. 54 + 55 + Typical usage of the OPP library is as follows:: 56 + 57 + (users) -> registers a set of default OPPs -> (library) 58 + SoC framework -> modifies certain OPPs when required -> OPP layer 59 + -> queries to search/retrieve information -> 60 + 61 + The OPP layer expects each domain to be represented by a unique device pointer. The SoC 62 + framework registers a set of initial OPPs per device with the OPP layer. This 63 + list is expected to be optimally small, typically around 5 OPPs per device. 64 + This initial list contains a set of OPPs that the framework expects to be safely 65 + enabled by default in the system. 66 + 67 + Note on OPP Availability 68 + ^^^^^^^^^^^^^^^^^^^^^^^^ 69 + 70 + As the system proceeds to operate, the SoC framework may choose to make certain 71 + OPPs available or not available on each device based on various external 72 + factors. Example usage: thermal management or other exceptional situations where the 73 + SoC framework might choose to disable a higher-frequency OPP to safely continue 74 + operations until that OPP can be re-enabled if possible. 75 + 76 + The OPP library facilitates this concept in its implementation. The following 77 + operational functions operate only on available OPPs: 78 + dev_pm_opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count 79 + 80 + dev_pm_opp_find_freq_exact is meant to be used to find the OPP pointer, which can then 81 + be passed to the dev_pm_opp_enable/disable functions to make an OPP available as required.
82 + 83 + WARNING: Users of the OPP library should refresh their availability count using 84 + dev_pm_opp_get_opp_count if the dev_pm_opp_enable/disable functions are invoked for a device. The 85 + exact mechanism to trigger these, or the notification mechanism to other 86 + dependent subsystems such as cpufreq, is left to the discretion of the SoC-specific 87 + framework which uses the OPP library. Similar care needs to be taken 88 + to refresh the cpufreq table after these operations. 89 + 90 + 2. Initial OPP List Registration 91 + ================================ 92 + The SoC implementation calls the dev_pm_opp_add function iteratively to add OPPs per 93 + device. It is expected that the SoC framework will register the OPP entries 94 + optimally; typical numbers are expected to be less than 5. The list generated by 95 + registering the OPPs is maintained by the OPP library throughout the device 96 + operation. The SoC framework can subsequently control the availability of the 97 + OPPs dynamically using the dev_pm_opp_enable / disable functions. 98 + 99 + dev_pm_opp_add 100 + Add a new OPP for a specific domain represented by the device pointer. 101 + The OPP is defined using the frequency and voltage. Once added, the OPP 102 + is assumed to be available and control of its availability can be done 103 + with the dev_pm_opp_enable/disable functions. The OPP library internally stores 104 + and manages this information in the opp struct. This function may be 105 + used by the SoC framework to define an optimal list as per the demands of 106 + the SoC usage environment. 107 + 108 + WARNING: 109 + Do not use this function in interrupt context. 110 + 111 + Example:: 112 + 113 + soc_pm_init() 114 + { 115 + /* Do things */ 116 + r = dev_pm_opp_add(mpu_dev, 1000000, 900000); 117 + if (r) { 118 + pr_err("%s: unable to register mpu opp(%d)\n", __func__, r); 119 + goto no_cpufreq; 120 + } 121 + /* Do cpufreq things */ 122 + no_cpufreq: 123 + /* Do remaining things */ 124 + } 125 + 126 + 3.
OPP Search Functions 127 + ======================= 128 + High-level frameworks such as cpufreq operate on frequencies. To map a 129 + frequency back to the corresponding OPP, the OPP library provides handy functions 130 + to search the OPP list that the OPP library internally manages. These search 131 + functions return the matching pointer representing the OPP if a match is 132 + found, otherwise they return an error. These errors are expected to be handled by standard 133 + error checks such as IS_ERR() and appropriate actions taken by the caller. 134 + 135 + Callers of these functions shall call dev_pm_opp_put() after they have used the 136 + OPP. Otherwise the memory for the OPP will never get freed, resulting in a 137 + memory leak. 138 + 139 + dev_pm_opp_find_freq_exact 140 + Search for an OPP based on an *exact* frequency and 141 + availability. This function is especially useful to enable an OPP which 142 + is not available by default. 143 + Example: In a case where the SoC framework detects a situation where a 144 + higher frequency could be made available, it can use this function to 145 + find the OPP prior to calling dev_pm_opp_enable to actually make 146 + it available:: 147 + 148 + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); 149 + dev_pm_opp_put(opp); 150 + /* don't operate on the pointer.. just do a sanity check.. */ 151 + if (IS_ERR(opp)) { 152 + pr_err("frequency not disabled!\n"); 153 + /* trigger appropriate actions.. */ 154 + } else { 155 + dev_pm_opp_enable(dev, 1000000000); 156 + } 157 + 158 + NOTE: 159 + This is the only search function that operates on OPPs which are 160 + not available. 161 + 162 + dev_pm_opp_find_freq_floor 163 + Search for an available OPP which is *at most* the 164 + provided frequency. This function is useful while searching for a lower 165 + match OR operating on OPP information in the order of decreasing 166 + frequency.
167 + Example: To find the highest OPP for a device:: 168 + 169 + freq = ULONG_MAX; 170 + opp = dev_pm_opp_find_freq_floor(dev, &freq); 171 + dev_pm_opp_put(opp); 172 + 173 + dev_pm_opp_find_freq_ceil 174 + Search for an available OPP which is *at least* the 175 + provided frequency. This function is useful while searching for a 176 + higher match OR operating on OPP information in the order of increasing 177 + frequency. 178 + Example 1: To find the lowest OPP for a device:: 179 + 180 + freq = 0; 181 + opp = dev_pm_opp_find_freq_ceil(dev, &freq); 182 + dev_pm_opp_put(opp); 183 + 184 + Example 2: A simplified implementation of a SoC cpufreq_driver->target:: 185 + 186 + soc_cpufreq_target(..) 187 + { 188 + /* Do stuff like policy checks etc. */ 189 + /* Find the best frequency match for the req */ 190 + opp = dev_pm_opp_find_freq_ceil(dev, &freq); 191 + dev_pm_opp_put(opp); 192 + if (!IS_ERR(opp)) 193 + soc_switch_to_freq_voltage(freq); 194 + else 195 + /* do something when we can't satisfy the req */ 196 + /* do other stuff */ 197 + } 198 + 199 + 4. OPP Availability Control Functions 200 + ===================================== 201 + A default OPP list registered with the OPP library may not cater to all possible 202 + situations. The OPP library provides a set of functions to modify the 203 + availability of an OPP within the OPP list. This allows SoC frameworks to have 204 + fine-grained dynamic control of which sets of OPPs are operationally available. 205 + These functions are intended to *temporarily* remove an OPP in conditions such 206 + as thermal considerations (e.g. don't use OPPx until the temperature drops). 207 + 208 + WARNING: 209 + Do not use these functions in interrupt context. 210 + 211 + dev_pm_opp_enable 212 + Make an OPP available for operation. 213 + Example: Let's say that the 1GHz OPP is to be made available only if the 214 + SoC temperature is lower than a certain threshold.
The SoC framework 215 + implementation might choose to do something as follows:: 216 + 217 + if (cur_temp < temp_low_thresh) { 218 + /* Enable 1GHz if it was disabled */ 219 + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); 220 + dev_pm_opp_put(opp); 221 + /* just error check */ 222 + if (!IS_ERR(opp)) 223 + ret = dev_pm_opp_enable(dev, 1000000000); 224 + else 225 + goto try_something_else; 226 + } 227 + 228 + dev_pm_opp_disable 229 + Make an OPP unavailable for operation. 230 + Example: Let's say that the 1GHz OPP is to be disabled if the temperature 231 + exceeds a threshold value. The SoC framework implementation might 232 + choose to do something as follows:: 233 + 234 + if (cur_temp > temp_high_thresh) { 235 + /* Disable 1GHz if it was enabled */ 236 + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true); 237 + dev_pm_opp_put(opp); 238 + /* just error check */ 239 + if (!IS_ERR(opp)) 240 + ret = dev_pm_opp_disable(dev, 1000000000); 241 + else 242 + goto try_something_else; 243 + } 244 + 245 + 5. OPP Data Retrieval Functions 246 + =============================== 247 + Since the OPP library abstracts away the OPP information, a set of functions to pull 248 + information from the OPP structure is necessary. Once an OPP pointer is 249 + retrieved using the search functions, the following functions can be used by the SoC 250 + framework to retrieve the information represented inside the OPP layer. 251 + 252 + dev_pm_opp_get_voltage 253 + Retrieve the voltage represented by the OPP pointer.
254 + Example: At a cpufreq transition to a different frequency, the SoC 255 + framework needs to set the voltage represented by the OPP, using 256 + the regulator framework, on the Power Management chip providing the 257 + voltage:: 258 + 259 + soc_switch_to_freq_voltage(freq) 260 + { 261 + /* do things */ 262 + opp = dev_pm_opp_find_freq_ceil(dev, &freq); 263 + v = dev_pm_opp_get_voltage(opp); 264 + dev_pm_opp_put(opp); 265 + if (v) 266 + regulator_set_voltage(.., v); 267 + /* do other things */ 268 + } 269 + 270 + dev_pm_opp_get_freq 271 + Retrieve the frequency represented by the OPP pointer. 272 + Example: Let's say the SoC framework uses a couple of helper functions; 273 + we could pass OPP pointers to them instead of passing 274 + quite a few additional data parameters:: 275 + 276 + soc_cpufreq_target(..) 277 + { 278 + /* do things.. */ 279 + max_freq = ULONG_MAX; 280 + max_opp = dev_pm_opp_find_freq_floor(dev, &max_freq); 281 + requested_opp = dev_pm_opp_find_freq_ceil(dev, &freq); 282 + if (!IS_ERR(max_opp) && !IS_ERR(requested_opp)) 283 + r = soc_test_validity(max_opp, requested_opp); 284 + dev_pm_opp_put(max_opp); 285 + dev_pm_opp_put(requested_opp); 286 + /* do other things */ 287 + } 288 + soc_test_validity(..) 289 + { 290 + if (dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp)) 291 + return -EINVAL; 292 + if (dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp)) 293 + return -EINVAL; 294 + /* do things..
*/ 295 + } 296 + 297 + dev_pm_opp_get_opp_count 298 + Retrieve the number of available OPPs for a device. 299 + Example: Let's say a co-processor in the SoC needs to know the available 300 + frequencies in a table; the main processor can notify it as follows:: 301 + 302 + soc_notify_coproc_available_frequencies() 303 + { 304 + /* Do things */ 305 + num_available = dev_pm_opp_get_opp_count(dev); 306 + speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL); 307 + /* populate the table in increasing order */ 308 + i = 0; freq = 0; 309 + while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) { 310 + speeds[i] = freq; 311 + freq++; 312 + i++; 313 + dev_pm_opp_put(opp); 314 + } 315 + 316 + soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available); 317 + /* Do other things */ 318 + } 319 + 320 + 6. Data Structures 321 + ================== 322 + Typically an SoC contains multiple voltage domains which are variable. Each 323 + domain is represented by a device pointer. The relationship to OPP can be 324 + represented as follows:: 325 + 326 + SoC 327 + |- device 1 328 + | |- opp 1 (availability, freq, voltage) 329 + | |- opp 2 .. 330 + ... ... 331 + | `- opp n .. 332 + |- device 2 333 + ... 334 + `- device m 335 + 336 + The OPP library maintains an internal list that the SoC framework populates and 337 + that is accessed by various functions as described above. However, the structures 338 + representing the actual OPPs and domains are internal to the OPP library itself 339 + to allow for a suitable abstraction reusable across systems. 340 + 341 + struct dev_pm_opp 342 + The internal data structure of the OPP library which is used to 343 + represent an OPP. In addition to the frequency, voltage and availability 344 + information, it also contains internal bookkeeping information required 345 + for the OPP library to operate on it. A pointer to this structure is 346 + provided back to users such as the SoC framework to be used as an 347 + identifier for the OPP in interactions with the OPP layer.
348 + 349 + WARNING: 350 + The struct dev_pm_opp pointer should not be parsed or modified by 351 + users. The defaults for an instance are populated by 352 + dev_pm_opp_add, but the availability of the OPP can be modified 353 + by the dev_pm_opp_enable/disable functions. 354 + 355 + struct device 356 + This is used to identify a domain to the OPP layer. The 357 + nature of the device and its implementation is left to the user of the 358 + OPP library, such as the SoC framework. 359 + 360 + Overall, in a simplistic view, the data structure operations are represented as 361 + follows:: 362 + 363 + Initialization / modification: 364 + +-----+ /- dev_pm_opp_enable 365 + dev_pm_opp_add --> | opp | <------- 366 + | +-----+ \- dev_pm_opp_disable 367 + \-------> domain_info(device) 368 + 369 + Search functions: 370 + /-- dev_pm_opp_find_freq_ceil ---\ +-----+ 371 + domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | 372 + \-- dev_pm_opp_find_freq_floor ---/ +-----+ 373 + 374 + Retrieval functions: 375 + +-----+ /- dev_pm_opp_get_voltage 376 + | opp | <--- 377 + +-----+ \- dev_pm_opp_get_freq 378 + 379 + domain_info <- dev_pm_opp_get_opp_count
-342
Documentation/power/opp.txt
+1135
Documentation/power/pci.rst
··· 1 + ==================== 2 + PCI Power Management 3 + ==================== 4 + 5 + Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. 6 + 7 + An overview of concepts and the Linux kernel's interfaces related to PCI power 8 + management. Based on previous work by Patrick Mochel <mochel@transmeta.com> 9 + (and others). 10 + 11 + This document only covers the aspects of power management specific to PCI 12 + devices. For a general description of the kernel's interfaces related to device 13 + power management, refer to Documentation/driver-api/pm/devices.rst and 14 + Documentation/power/runtime_pm.rst. 15 + 16 + .. contents: 17 + 18 + 1. Hardware and Platform Support for PCI Power Management 19 + 2. PCI Subsystem and Device Power Management 20 + 3. PCI Device Drivers and Power Management 21 + 4. Resources 22 + 23 + 24 + 1. Hardware and Platform Support for PCI Power Management 25 + ========================================================= 26 + 27 + 1.1. Native and Platform-Based Power Management 28 + ----------------------------------------------- 29 + 30 + In general, power management is a feature allowing one to save energy by putting 31 + devices into states in which they draw less power (low-power states) at the 32 + price of reduced functionality or performance. 33 + 34 + Usually, a device is put into a low-power state when it is underutilized or 35 + completely inactive. However, when it is necessary to use the device once 36 + again, it has to be put back into the "fully functional" state (full-power 37 + state). This may happen when there are some data for the device to handle or 38 + as a result of an external event requiring the device to be active, which may 39 + be signaled by the device itself. 40 + 41 + PCI devices may be put into low-power states in two ways, by using the device 42 + capabilities introduced by the PCI Bus Power Management Interface Specification, 43 + or with the help of platform firmware, such as an ACPI BIOS.
In the first 44 + approach, which is referred to as native PCI power management (native PCI PM) 45 + in what follows, the device power state is changed as a result of writing a 46 + specific value into one of its standard configuration registers. The second 47 + approach requires the platform firmware to provide special methods that may be 48 + used by the kernel to change the device's power state. 49 + 50 + Devices supporting native PCI PM can usually generate wakeup signals called 51 + Power Management Events (PMEs) to let the kernel know about external events 52 + requiring the device to be active. After receiving a PME the kernel is supposed 53 + to put the device that sent it into the full-power state. However, the PCI Bus 54 + Power Management Interface Specification doesn't define any standard method of 55 + delivering the PME from the device to the CPU and the operating system kernel. 56 + It is assumed that the platform firmware will perform this task and therefore, 57 + even though a PCI device is set up to generate PMEs, it may also be necessary to 58 + prepare the platform firmware for notifying the CPU of the PMEs coming from the 59 + device (e.g. by generating interrupts). 60 + 61 + In turn, if the methods provided by the platform firmware are used for changing 62 + the power state of a device, usually the platform also provides a method for 63 + preparing the device to generate wakeup signals. In that case, however, it 64 + is often also necessary to prepare the device for generating PMEs using the 65 + native PCI PM mechanism, because the method provided by the platform depends on 66 + that. 67 + 68 + Thus in many situations both the native and the platform-based power management 69 + mechanisms have to be used simultaneously to obtain the desired result. 70 + 71 + 1.2.
Native PCI Power Management 72 + -------------------------------- 73 + 74 + The PCI Bus Power Management Interface Specification (PCI PM Spec) was 75 + introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a 76 + standard interface for performing various operations related to power 77 + management. 78 + 79 + The implementation of the PCI PM Spec is optional for conventional PCI devices, 80 + but it is mandatory for PCI Express devices. If a device supports the PCI PM 81 + Spec, it has an 8 byte power management capability field in its PCI 82 + configuration space. This field is used to describe and control the standard 83 + features related to the native PCI power management. 84 + 85 + The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses 86 + (B0-B3). The higher the number, the less power is drawn by the device or bus 87 + in that state. However, the higher the number, the longer the latency for 88 + the device or bus to return to the full-power state (D0 or B0, respectively). 89 + 90 + There are two variants of the D3 state defined by the specification. The first 91 + one is D3hot, referred to as the software accessible D3, because devices can be 92 + programmed to go into it. The second one, D3cold, is the state that PCI devices 93 + are in when the supply voltage (Vcc) is removed from them. It is not possible 94 + to program a PCI device to go into D3cold, although there may be a programmable 95 + interface for putting the bus the device is on into a state in which Vcc is 96 + removed from all devices on the bus. 97 + 98 + PCI bus power management, however, is not supported by the Linux kernel at the 99 + time of this writing and therefore it is not covered by this document. 100 + 101 + Note that every PCI device can be in the full-power state (D0) or in D3cold, 102 + regardless of whether or not it implements the PCI PM Spec. 
In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0. The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold). While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled. This allows the device to be
programmatically put into D0. Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

+---------------+------------+
| Current State | New State  |
+---------------+------------+
| D0            | D1, D2, D3 |
+---------------+------------+
| D1            | D2, D3     |
+---------------+------------+
| D2            | D3         |
+---------------+------------+
| D1, D2, D3    | D0         |
+---------------+------------+

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored). In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in a low-power state (D1-D3), but they are not required to be capable
of generating PMEs from all supported low-power states. In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------

The platform firmware support for the power management of PCI devices is
system-specific. However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state. These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS. The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses. This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not
associated with any particular devices, and device control methods, which have
to be defined separately for each device supposed to be handled with the help of
the platform. This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance. The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).
Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state. These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device. In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx. Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods). If the current power state of the device is D3, it can
only be put into D0 this way.

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state. ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0. In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system. The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state. It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.

1.4. Wakeup Signaling
---------------------

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate. If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them. In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs), which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon. Every GPE is associated with one or more sources of potentially
interesting events. In particular, a GPE may be associated with a PCI device
capable of signaling wakeup. The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.
The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later). The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup). The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices. Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports.
For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex. Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified. [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers. The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents. Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------

The PCI subsystem participates in the power management of PCI devices in a
number of ways.
First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks::

    const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
    };

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers. They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on::

    struct pci_dev {
        ...
        pci_power_t current_state;      /* Current operating state. */
        int pm_cap;                     /* PM capability offset in the
                                           configuration space */
        unsigned int pme_support:5;     /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int pme_interrupt:1;   /* Is native PCIe PME signaling used? */
        unsigned int d1_support:1;      /* Low power state D1 is supported */
        unsigned int d2_support:1;      /* Low power state D2 is supported */
        unsigned int no_d1d2:1;         /* D1 and D2 are forbidden */
        unsigned int wakeup_prepared:1; /* Device prepared for wake up */
        unsigned int d3_delay;          /* D3->D0 transition time in ms */
        ...
    };

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose. This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case, the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object. Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs. The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS. If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.
For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------

The PCI subsystem plays a vital role in the runtime power management of PCI
devices. For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::

    pci_pm_runtime_suspend()
    pci_pm_runtime_resume()
    pci_pm_runtime_idle()

that are executed by the core runtime PM routines. It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend(), which for PCI devices call
pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action. If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup. The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.
To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state. The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume(), which both call
pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below). However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers. Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations. First, it may be called at the request of the device's driver, for
example if there are some data for it to process. Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.
The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
case it is expected to suspend the device if that makes sense. Usually,
however, the PCI subsystem doesn't really know if the device really can be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

2.4. System-Wide Power Transitions
----------------------------------

There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst. Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose. They are executed in phases such that
each phase involves executing the same subsystem-level callback for every device
belonging to the given subsystem before the next phase begins. These phases
always run after tasks have been frozen.

2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

    prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases::

    pci_pm_prepare()
    pci_pm_suspend()
    pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume().
Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned. Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine). Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running. It first checks if the device's driver
implements legacy PCI suspend routines (see Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that).
Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success. Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails. Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state. Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states. However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose). PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

    resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases::

    pci_pm_resume_noirq()
    pci_pm_resume()
    pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary. This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended). If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned. Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend).
Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation
^^^^^^^^^^^^^^^^^^^^^^^^^

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium. The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

    prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks::

    pci_pm_prepare()
    pci_pm_freeze()
    pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.
The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks. It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
device for signaling wakeup or to put it into a low-power state. Still, it
saves the device's standard configuration registers if they haven't been saved
by one of the driver's callbacks.

Once the image has been created, it has to be saved. However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving. Thus they have to be brought
back to the fully functional state and this is done in the following phases:

    thaw_noirq, thaw, complete

using the following PCI bus type's callbacks::

    pci_pm_thaw_noirq()
    pci_pm_thaw()
    pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
but it doesn't put the device into the full-power state and doesn't attempt to
restore its standard configuration registers. It also executes the device
driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
driver's pm->thaw() callback instead of pm->resume(). It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.
The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done
in three phases:

    prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend.  The other
two phases are analogous to the suspend and suspend_noirq phases,
respectively.  The PCI subsystem-level callbacks they correspond to::

    pci_pm_poweroff()
    pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(),
respectively, although they don't attempt to save the device's standard
configuration registers.

2.4.4. System Restore
^^^^^^^^^^^^^^^^^^^^^

System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation
system activity can be resumed.

As described in Documentation/driver-api/pm/devices.rst, the hibernation
image is loaded into memory by a fresh instance of the kernel, called the
boot kernel, which in turn is loaded and run by a boot loader in the usual
way.  After the boot kernel has loaded the image, it needs to replace its own
code and data with the code and data of the "hibernated" kernel stored within
the image, called the image kernel.  For this purpose all devices are frozen
just like before creating the image during hibernation, in the

    prepare, freeze, freeze_noirq

phases described above.  However, the devices affected by these phases are
only those having drivers in the boot kernel; other devices will still be in
whatever state the boot loader left them.
Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases (that will only affect the devices
having drivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is
the usual situation, control is passed to the image kernel, which then
becomes responsible for bringing the system back to the working state.  To
achieve this, it must restore the devices' pre-hibernation functionality,
which is done much like waking up from the memory sleep state, although it
involves different phases:

    restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks::

    pci_pm_restore_noirq()
    pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and
pci_pm_resume(), respectively, but they execute the device driver's
pm->restore_noirq() and pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system
resume.


3. PCI Device Drivers and Power Management
==========================================

3.1. Power Management Callbacks
-------------------------------

PCI device drivers participate in power management by providing callbacks to
be executed by the PCI subsystem's power management routines described above
and by controlling the runtime power management of their devices.
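As a rough illustration of how these pieces fit together, a driver using the
recommended struct dev_pm_ops approach might be wired up as follows.  This is
only a sketch; the "foo" driver and all of its functions are hypothetical
names, not anything defined by the PCI core::

    #include <linux/pci.h>
    #include <linux/pm.h>

    /* Hypothetical system sleep callbacks; see the subsections below
     * for what each callback is expected to do.
     */
    static int foo_suspend(struct device *dev)
    {
            /* Quiesce the device here; the PCI core takes care of
             * saving config space, arming wakeup and choosing the
             * low-power state by itself.
             */
            return 0;
    }

    static int foo_resume(struct device *dev)
    {
            /* Reprogram the device here; it is already back in D0
             * with its standard configuration registers restored.
             */
            return 0;
    }

    static const struct dev_pm_ops foo_pm_ops = {
            .suspend = foo_suspend,
            .resume  = foo_resume,
    };

    static struct pci_driver foo_driver = {
            .name      = "foo",
            .id_table  = foo_id_table,  /* defined elsewhere */
            .probe     = foo_probe,     /* defined elsewhere */
            .remove    = foo_remove,    /* defined elsewhere */
            .driver.pm = &foo_pm_ops,
    };

Once driver.pm is set like this, the PCI subsystem invokes the callbacks from
foo_pm_ops during the system-wide transitions described in Section 2.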
At the time of this writing there are two ways to define power management
callbacks for a PCI device driver: the recommended one, based on using a
dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst,
and the "legacy" one, in which the .suspend(), .suspend_late(),
.resume_early(), and .resume() callbacks from struct pci_driver are used.
The legacy approach, however, doesn't allow one to define runtime power
management callbacks and is not really suitable for any new drivers.
Therefore it is not covered by this document (refer to the source code to
learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops
object containing pointers to power management (PM) callbacks that will be
executed by the PCI subsystem's PM routines in various circumstances.  A
pointer to the driver's struct dev_pm_ops object has to be assigned to the
driver.pm field in its struct pci_driver object.  Once that has happened, the
"legacy" PM callbacks in struct pci_driver are ignored (even if they are not
NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
subsystem will handle the device in a simplified default manner.  If they are
defined, though, they are expected to behave as described in the following
subsections.

3.1.1. prepare()
^^^^^^^^^^^^^^^^

The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
saving a hibernation image and during system restore, when a hibernation
image has just been loaded into memory.

This callback is only necessary if the driver's device has children that in
general may be registered at any time.
In that case the role of the prepare() callback is to prevent new children of
the device from being registered until one of the resume_noirq(),
thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations
preparing the device to be suspended, although it should not allocate memory
(if additional memory is required to suspend the device, it has to be
preallocated earlier, for example in a suspend/hibernate notifier as
described in Documentation/driver-api/pm/notifiers.rst).

3.1.2. suspend()
^^^^^^^^^^^^^^^^

The suspend() callback is only executed during system suspend, after
prepare() callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into
a low-power state by the PCI subsystem.  It is not required (in fact it is
not even recommended) that a PCI driver's suspend() callback save the
standard configuration registers of the device, prepare it for waking up the
system, or put it into a low-power state.  All of these operations can very
well be taken care of by the PCI subsystem, without the driver's
participation.

However, in some rare cases it is convenient to carry out these operations in
a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard
configuration registers, to prepare it for system wakeup (if necessary), and
to put it into a low-power state, respectively.  Moreover, if the driver
calls pci_save_state(), the PCI subsystem will not execute either
pci_prepare_to_sleep() or pci_set_power_state() for its device, so the driver
is then responsible for handling the device as appropriate.
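For instance, a driver that chooses to handle these operations itself might
do something along these lines.  This is only a sketch with a hypothetical
driver name; error handling and the device-specific quiescing are omitted::

    #include <linux/pci.h>

    static int foo_suspend(struct device *dev)
    {
            struct pci_dev *pdev = to_pci_dev(dev);

            /* Device-specific quiescing would go here. */

            /* Saving the state here tells the PCI core not to call
             * pci_prepare_to_sleep() or pci_set_power_state() for the
             * device itself, so the driver takes over from this point.
             */
            pci_save_state(pdev);

            /* Arm wakeup (if appropriate) and enter a low-power state. */
            pci_prepare_to_sleep(pdev);
            return 0;
    }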
While the suspend() callback is being executed, the driver's interrupt
handler can be invoked to handle an interrupt from the device, so all
suspend-related operations relying on the driver's ability to handle
interrupts should be carried out in this callback.

3.1.3. suspend_noirq()
^^^^^^^^^^^^^^^^^^^^^^

The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's
interrupt handler will not be invoked while suspend_noirq() is running.  Thus
suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().

3.1.4. freeze()
^^^^^^^^^^^^^^^

The freeze() callback is hibernation-specific and is executed in two
situations: during hibernation, after prepare() callbacks have been executed
for all devices in preparation for the creation of a system image, and during
restore, after a system image has been loaded into memory from persistent
storage and the prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback
described above.  In fact, they only need to be different in the rare cases
when the driver takes the responsibility for putting the device into a
low-power state.

In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state.  Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().

3.1.5. freeze_noirq()
^^^^^^^^^^^^^^^^^^^^^

The freeze_noirq() callback is hibernation-specific.
It is executed during 794 + hibernation, after prepare() and freeze() callbacks have been executed for all 795 + devices in preparation for the creation of a system image, and during restore, 796 + after a system image has been loaded into memory and after prepare() and 797 + freeze() callbacks have been executed for all devices. It is always executed 798 + after device interrupts have been disabled by the PM core. 799 + 800 + The role of this callback is analogous to the role of the suspend_noirq() 801 + callback described above and it very rarely is necessary to define 802 + freeze_noirq(). 803 + 804 + The difference between freeze_noirq() and freeze() is analogous to the 805 + difference between suspend_noirq() and suspend(). 806 + 807 + 3.1.6. poweroff() 808 + ^^^^^^^^^^^^^^^^^ 809 + 810 + The poweroff() callback is hibernation-specific. It is executed when the system 811 + is about to be powered off after saving a hibernation image to a persistent 812 + storage. prepare() callbacks are executed for all devices before poweroff() is 813 + called. 814 + 815 + The role of this callback is analogous to the role of the suspend() and freeze() 816 + callbacks described above, although it does not need to save the contents of 817 + the device's registers. In particular, if the driver wants to put the device 818 + into a low-power state itself instead of allowing the PCI subsystem to do that, 819 + the poweroff() callback should use pci_prepare_to_sleep() and 820 + pci_set_power_state() to prepare the device for system wakeup and to put it 821 + into a low-power state, respectively, but it need not save the device's standard 822 + configuration registers. 823 + 824 + 3.1.7. poweroff_noirq() 825 + ^^^^^^^^^^^^^^^^^^^^^^^ 826 + 827 + The poweroff_noirq() callback is hibernation-specific. It is executed after 828 + poweroff() callbacks have been executed for all devices in the system. 
The role of this callback is analogous to the role of the suspend_noirq() and
freeze_noirq() callbacks described above, but it does not need to save the
contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()
^^^^^^^^^^^^^^^^^^^^^

The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs.  The driver's interrupt handler will
not be invoked while resume_noirq() is running, so this callback can carry
out operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full power
state in the resume_noirq phase of system resume and restores their standard
configuration registers, resume_noirq() is usually not necessary.  In general
it should only be used for performing operations that would lead to race
conditions if carried out by resume().

3.1.9. resume()
^^^^^^^^^^^^^^^

The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of
the device and bringing it back to the fully functional state.  The device
should be able to process I/O in the usual way after resume() has returned.

3.1.10. thaw_noirq()
^^^^^^^^^^^^^^^^^^^^

The thaw_noirq() callback is hibernation-specific.  It is executed after a
system image has been created and the non-boot CPUs have been enabled by the
PM core, in the thaw_noirq phase of hibernation.
It also may be executed if the loading of a hibernation image fails during
system restore (it is then executed after enabling the non-boot CPUs).  The
driver's interrupt handler will not be invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq().  The
difference between these two callbacks is that thaw_noirq() is executed after
freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.

3.1.11. thaw()
^^^^^^^^^^^^^^

The thaw() callback is hibernation-specific.  It is executed after
thaw_noirq() callbacks have been executed for all devices in the system and
after device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.

3.1.12. restore_noirq()
^^^^^^^^^^^^^^^^^^^^^^^

The restore_noirq() callback is hibernation-specific.  It is executed in the
restore_noirq phase of hibernation, when the boot kernel has passed control
to the image kernel and the non-boot CPUs have been enabled by the image
kernel's PM core.

This callback is analogous to resume_noirq() with the exception that it
cannot make any assumptions about the previous state of the device, even if
the BIOS (or generally the platform firmware) is known to preserve that state
over a suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().

3.1.13. restore()
^^^^^^^^^^^^^^^^^

The restore() callback is hibernation-specific.
It is executed after 907 + restore_noirq() callbacks have been executed for all devices in the system and 908 + after the PM core has enabled device drivers' interrupt handlers to be invoked. 909 + 910 + This callback is analogous to resume(), just like restore_noirq() is analogous 911 + to resume_noirq(). Consequently, the difference between restore_noirq() and 912 + restore() is analogous to the difference between resume_noirq() and resume(). 913 + 914 + For the vast majority of PCI device drivers there is no difference between 915 + resume() and restore(). 916 + 917 + 3.1.14. complete() 918 + ^^^^^^^^^^^^^^^^^^ 919 + 920 + The complete() callback is executed in the following situations: 921 + 922 + - during system resume, after resume() callbacks have been executed for all 923 + devices, 924 + - during hibernation, before saving the system image, after thaw() callbacks 925 + have been executed for all devices, 926 + - during system restore, when the system is going back to its pre-hibernation 927 + state, after restore() callbacks have been executed for all devices. 928 + 929 + It also may be executed if the loading of a hibernation image into memory fails 930 + (in that case it is run after thaw() callbacks have been executed for all 931 + devices that have drivers in the boot kernel). 932 + 933 + This callback is entirely optional, although it may be necessary if the 934 + prepare() callback performs operations that need to be reversed. 935 + 936 + 3.1.15. runtime_suspend() 937 + ^^^^^^^^^^^^^^^^^^^^^^^^^ 938 + 939 + The runtime_suspend() callback is specific to device runtime power management 940 + (runtime PM). It is executed by the PM core's runtime PM framework when the 941 + device is about to be suspended (i.e. quiesced and put into a low-power state) 942 + at run time. 
This callback is responsible for freezing the device and preparing it to be
put into a low-power state, but it must allow the PCI subsystem to perform
all of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()
^^^^^^^^^^^^^^^^^^^^^^^^

The runtime_resume() callback is specific to device runtime PM.  It is
executed by the PM core's runtime PM framework when the device is about to be
resumed (i.e. put into the full-power state and programmed to process I/O
normally) at run time.

This callback is responsible for restoring the normal functionality of the
device after it has been put into the full-power state by the PCI subsystem.
The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.

3.1.17. runtime_idle()
^^^^^^^^^^^^^^^^^^^^^^

The runtime_idle() callback is specific to device runtime PM.  It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
the device according to the PM core's information.  In particular, it is
automatically executed right after runtime_resume() has returned in case the
resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0,
the PCI subsystem will call pm_runtime_suspend() for the device, which in
turn will cause the driver's runtime_suspend() callback to be executed.

3.1.18. Pointing Multiple Callback Pointers to One Routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it often is convenient to
point two or more members of struct dev_pm_ops to the same routine.
There are 980 + a few convenience macros that can be used for this purpose. 981 + 982 + The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one 983 + suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() 984 + members and one resume routine pointed to by the .resume(), .thaw(), and 985 + .restore() members. The other function pointers in this struct dev_pm_ops are 986 + unset. 987 + 988 + The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it 989 + additionally sets the .runtime_resume() pointer to the same value as 990 + .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to 991 + the same value as .suspend() (and .freeze() and .poweroff()). 992 + 993 + The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct 994 + dev_pm_ops to indicate that one suspend routine is to be pointed to by the 995 + .suspend(), .freeze(), and .poweroff() members and one resume routine is to 996 + be pointed to by the .resume(), .thaw(), and .restore() members. 997 + 998 + 3.1.19. Driver Flags for Power Management 999 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1000 + 1001 + The PM core allows device drivers to set flags that influence the handling of 1002 + power management for the devices by the core itself and by middle layer code 1003 + including the PCI bus type. The flags should be set once at the driver probe 1004 + time with the help of the dev_pm_set_driver_flags() function and they should not 1005 + be updated directly afterwards. 1006 + 1007 + The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete 1008 + mechanism allowing device suspend/resume callbacks to be skipped if the device 1009 + is in runtime suspend when the system suspend starts. That also affects all of 1010 + the ancestors of the device, so this flag should only be used if absolutely 1011 + necessary. 
The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a
positive value from pci_pm_prepare() if the ->prepare callback provided by
the driver of the device returns a positive value.  That allows the driver to
opt out of using the direct-complete mechanism dynamically.

The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
perspective the device can be safely left in runtime suspend during system
suspend.  That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
to skip resuming the device from runtime suspend unless there are
PCI-specific reasons for doing that.  Also, it causes
pci_pm_suspend_late/noirq(), pci_pm_freeze_late/noirq() and
pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
suspend at the beginning of the "late" phase of the system-wide transition
under way.  Moreover, if the device is in runtime suspend in
pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime power management
status will be changed to "active" (as it is going to be put into D0 going
forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), the
function will set the power.direct_complete flag for it (to make the PM core
skip the subsequent "thaw" callbacks for it) and return.

Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the
device to be left in suspend after system-wide transitions to the working
state.
This flag is checked by the PM core, but the PCI bus type informs the PM core
which devices may be left in suspend from its perspective (that happens
during the "noirq" phase of system-wide suspend and analogous transitions),
and it then uses the dev_pm_may_skip_resume() helper to decide whether or not
to return from pci_pm_resume_noirq() early: if dev_pm_may_skip_resume()
returns "true" for the device, the PM core will skip the remaining resume
callbacks for it during the transition under way and will set its runtime PM
status to "suspended".

3.2. Device Runtime Power Management
------------------------------------

In addition to providing device power management callbacks, PCI device
drivers are responsible for controlling the runtime power management (runtime
PM) of their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device
drivers implement it at least in the cases where there is a reliable way of
verifying that the device is not used (like when the network cable is
detached from an Ethernet adapter or there are no devices attached to a USB
controller).

To support the PCI runtime PM the driver first needs to implement the
runtime_suspend() and runtime_resume() callbacks.  It also may need to
implement the runtime_idle() callback to prevent the device from being
suspended again every time right after the runtime_resume() callback has
returned (alternatively, the runtime_suspend() callback will have to check if
the device should really be suspended and return -EAGAIN if that is not the
case).

The runtime PM of PCI devices is enabled by default by the PCI core.  PCI
device drivers do not need to enable it and should not attempt to do so.
However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid()
helper function.  In addition to that, the runtime PM usage counter of each
PCI device is incremented by local_pci_probe() before executing the probe
callback provided by the device's driver.

If a PCI driver implements the runtime PM callbacks and intends to use the
runtime PM framework provided by the PM core and the PCI subsystem, it needs
to decrement the device's runtime PM usage counter in its probe callback
function.  If it doesn't do that, the counter will always be different from
zero for the device and it will never be runtime-suspended.  The simplest way
to do that is by calling pm_runtime_put_noidle(), but if the driver wants to
schedule an autosuspend right away, for example, it may call
pm_runtime_put_autosuspend() instead for this purpose.  Generally, it just
needs to call a function that decrements the device's usage counter from its
probe routine to make runtime PM work for the device.

It is important to remember that the driver's runtime_suspend() callback may
be executed right after the usage counter has been decremented, because user
space may already have caused the pm_runtime_allow() helper function
unblocking the runtime PM of the device to run via sysfs, so the driver must
be prepared to cope with that.

The driver itself should not call pm_runtime_allow(), though.  Instead, it
should let user space or some platform-specific code do that (user space can
do it via sysfs as stated above), but it must be prepared to handle the
runtime PM of the device correctly as soon as pm_runtime_allow() is called
(which may happen at any time, even before the driver is loaded).
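As an illustration, a minimal probe/remove pair supporting runtime PM might
look like this.  The driver and function names are hypothetical, and device
setup and teardown are omitted::

    #include <linux/pci.h>
    #include <linux/pm_runtime.h>

    static int foo_probe(struct pci_dev *pdev,
                         const struct pci_device_id *id)
    {
            /* Device initialization would go here. */

            /* Balance the usage counter increment done by
             * local_pci_probe(), so the device can be runtime-suspended
             * once user space runs pm_runtime_allow() for it.
             */
            pm_runtime_put_noidle(&pdev->dev);
            return 0;
    }

    static void foo_remove(struct pci_dev *pdev)
    {
            /* Balance the decrement done in foo_probe(). */
            pm_runtime_get_noresume(&pdev->dev);

            /* Device teardown would go here. */
    }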
When the driver's remove callback runs, it has to balance the decrement of
the device's runtime PM usage counter carried out at probe time.  For this
reason, if it has decremented the counter in its probe callback, it must run
pm_runtime_get_noresume() in its remove callback.  [Since the core carries
out a runtime resume of the device and bumps up the device's usage counter
before running the driver's remove callback, the runtime PM of the device is
effectively disabled for the duration of the remove execution and all runtime
PM helper functions incrementing the device's usage counter are then
effectively equivalent to pm_runtime_get_noresume().]

The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which cases it is reasonable to
subsequently request that they be suspended).  These requests are represented
by work items put into the power management workqueue, pm_wq.  Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices.  For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.rst.

Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq.  In the majority of cases this also is done by their
drivers that use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to
Documentation/power/runtime_pm.rst.


4. Resources
============

PCI Local Bus Specification, Rev. 3.0

PCI Bus Power Management Interface Specification, Rev. 1.2

Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b

PCI Express Base Specification, Rev. 2.0

Documentation/driver-api/pm/devices.rst

Documentation/power/runtime_pm.rst
··· 1 - PCI Power Management 2 - 3 - Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. 4 - 5 - An overview of concepts and the Linux kernel's interfaces related to PCI power 6 - management. Based on previous work by Patrick Mochel <mochel@transmeta.com> 7 - (and others). 8 - 9 - This document only covers the aspects of power management specific to PCI 10 - devices. For general description of the kernel's interfaces related to device 11 - power management refer to Documentation/driver-api/pm/devices.rst and 12 - Documentation/power/runtime_pm.txt. 13 - 14 - --------------------------------------------------------------------------- 15 - 16 - 1. Hardware and Platform Support for PCI Power Management 17 - 2. PCI Subsystem and Device Power Management 18 - 3. PCI Device Drivers and Power Management 19 - 4. Resources 20 - 21 - 22 - 1. Hardware and Platform Support for PCI Power Management 23 - ========================================================= 24 - 25 - 1.1. Native and Platform-Based Power Management 26 - ----------------------------------------------- 27 - In general, power management is a feature allowing one to save energy by putting 28 - devices into states in which they draw less power (low-power states) at the 29 - price of reduced functionality or performance. 30 - 31 - Usually, a device is put into a low-power state when it is underutilized or 32 - completely inactive. However, when it is necessary to use the device once 33 - again, it has to be put back into the "fully functional" state (full-power 34 - state). This may happen when there are some data for the device to handle or 35 - as a result of an external event requiring the device to be active, which may 36 - be signaled by the device itself. 37 - 38 - PCI devices may be put into low-power states in two ways, by using the device 39 - capabilities introduced by the PCI Bus Power Management Interface Specification, 40 - or with the help of platform firmware, such as an ACPI BIOS. 
In the first approach, that is referred to as the native PCI power management
(native PCI PM) in what follows, the device power state is changed as a result
of writing a specific value into one of its standard configuration registers.
The second approach requires the platform firmware to provide special methods
that may be used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals
called Power Management Events (PMEs) to let the kernel know about external
events requiring the device to be active.  After receiving a PME the kernel is
supposed to put the device that sent it into the full-power state.  However,
the PCI Bus Power Management Interface Specification doesn't define any
standard method of delivering the PME from the device to the CPU and the
operating system kernel.  It is assumed that the platform firmware will
perform this task and therefore, even though a PCI device is set up to
generate PMEs, it also may be necessary to prepare the platform firmware for
notifying the CPU of the PMEs coming from the device (e.g. by generating
interrupts).

In turn, if the methods provided by the platform firmware are used for
changing the power state of a device, usually the platform also provides a
method for preparing the device to generate wakeup signals.  In that case,
however, it often also is necessary to prepare the device for generating PMEs
using the native PCI PM mechanism, because the method provided by the platform
depends on that.

Thus in many situations both the native and the platform-based power
management mechanisms have to be used simultaneously to obtain the desired
result.

1.2. Native PCI Power Management
--------------------------------
The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI
devices, but it is mandatory for PCI Express devices.  If a device supports
the PCI PM Spec, it has an 8 byte power management capability field in its PCI
configuration space.  This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3).  The higher the number, the less power is drawn by the device or bus
in that state.  However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification.  The
first one is D3hot, referred to as the software accessible D3, because devices
can be programmed to go into it.  The second one, D3cold, is the state that
PCI devices are in when the supply voltage (Vcc) is removed from them.  It is
not possible to program a PCI device to go into D3cold, although there may be
a programmable interface for putting the bus the device is on into a state in
which Vcc is removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec.
In addition to that, if the PCI PM Spec is implemented by the device, it must
support D3hot as well as D0.  The support for the D1 and D2 power states is
optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).  While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses),
although its I/O and memory spaces are then disabled.  This allows the device
to be programmatically put into D0.  Thus the kernel can switch the device
back and forth between D0 and the supported low-power states (except for
D3cold) and the possible power state transitions the device can undergo are
the following:

    +----------------------------+
    | Current State | New State  |
    +----------------------------+
    | D0            | D1, D2, D3 |
    +----------------------------+
    | D1            | D2, D3     |
    +----------------------------+
    | D2            | D3         |
    +----------------------------+
    | D1, D2, D3    | D0         |
    +----------------------------+

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored).  In that case the device returns to D0
with a full power-on reset sequence and the power-on defaults are restored to
the device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in a low-power state (D1-D3), but they are not required to be capable
of generating PMEs from all supported low-power states.  In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------
The platform firmware support for the power management of PCI devices is
system-specific.  However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS.  The kernel loads them from the BIOS and
executes them as needed using an AML interpreter that translates the AML byte
code into computations and memory or I/O space accesses.  This way, in theory,
a BIOS writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, that are not
associated with any particular devices, and device control methods, that have
to be defined separately for each device supposed to be handled with the help
of the platform.  This means, in particular, that ACPI device control methods
can only be used to handle devices that the BIOS writer knew about in advance.
The ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).
Moreover, for each power state of a device there is a set of power resources
that have to be enabled for the device to be put into that state.  These power
resources are controlled (i.e. enabled or disabled) with the help of their own
control methods, _ON and _OFF, that have to be defined individually for each
of them.

To put a device into the ACPI power state Dx (where x is a number between 0
and 3 inclusive) the kernel is supposed to (1) enable the power resources
required by the device in this state using their _ON control methods and (2)
execute the _PSx control method defined for the device.  In addition to that,
if the device is going to be put into a low-power state (D1-D3) and is
supposed to generate wakeup signals from that state, the _DSW (or _PSW,
replaced with _DSW by ACPI 3.0) control method defined for it has to be
executed before _PSx.  Power resources that are not required by the device in
the target power state and are not required any more by any other device
should be disabled (by executing their _OFF control methods).  If the current
power state of the device is D3, it can only be put into D0 this way.

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state.
ACPI defines four system sleep states, S1, S2, S3, and S4, and denotes the
system working state as S0.  In general, the target system sleep (or working)
state determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state,
the lowest power (highest number) state it can be put into is also determined
by the target state of the system.  The kernel is then supposed to use the
device's _SxW control method to obtain the number of that state.  It also is
supposed to use the device's _PRW control method to learn which power
resources need to be enabled for the device to be able to generate wakeup
signals.

1.4. Wakeup Signaling
---------------------
Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate.  If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them.  In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need
to be acted upon.  Every GPE is associated with one or more sources of
potentially interesting events.  In particular, a GPE may be associated with a
PCI device capable of signaling wakeup.  The information on the connections
between GPEs and event sources is recorded in the system's ACPI BIOS from
where it can be read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.
The GPEs associated with PCI bridges may also be triggered in response to a
wakeup signal from one of the devices below the bridge (this also is the case
for root bridges) and, for example, native PCI PMEs from devices unknown to
the system's ACPI BIOS may be handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core
logic (the device that was the source of the signal causing the system wakeup
to occur may be identified later).  The GPEs used in such situations are
referred to as wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.  Namely, the PCI Express Base Specification
introduced a native mechanism for converting native PCI PMEs into interrupts
generated by root ports.
For conventional PCI devices native PMEs are out-of-band, so they are routed
separately and they need not pass through bridges (in principle they may be
routed directly to the system's core logic), but for PCI Express devices they
are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex.  Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may
be read by the interrupt handler allowing the device to be identified.  [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

In principle the native PCI Express PME signaling may also be used on
ACPI-based systems along with the GPEs, but to use it the kernel has to ask
the system's ACPI BIOS to release control of root port configuration
registers.  The ACPI BIOS, however, is not required to allow the kernel to
control these registers and if it doesn't do that, the kernel must not modify
their contents.  Of course the native PCI Express PME signaling cannot be used
by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------
The PCI Subsystem participates in the power management of PCI devices in a
number of ways.
First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks:

const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
};

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers.  They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several
fields that these callbacks operate on:

struct pci_dev {
        ...
        pci_power_t current_state;      /* Current operating state. */
        int pm_cap;                     /* PM capability offset in the
                                           configuration space */
        unsigned int pme_support:5;     /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int pme_interrupt:1;   /* Is native PCIe PME signaling used? */
        unsigned int d1_support:1;      /* Low power state D1 is supported */
        unsigned int d2_support:1;      /* Low power state D2 is supported */
        unsigned int no_d1d2:1;         /* D1 and D2 are forbidden */
        unsigned int wakeup_prepared:1; /* Device prepared for wake up */
        unsigned int d3_delay;          /* D3->D0 transition time in ms */
        ...
};

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------
The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose.  This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's
struct pci_dev object.  Next, the function checks which PCI low-power states
are supported by the device and from which low-power states the device can
generate native PCI PMEs.  The power management fields of the device's struct
pci_dev and the struct device embedded in it are updated accordingly and the
generation of PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS.  If that is the
case, the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.
For driverless devices, however, this functionality is limited to a few basic
operations carried out during system-wide transitions to a sleep state and
back to the working state.

2.3. Runtime Device Power Management
------------------------------------
The PCI subsystem plays a vital role in the runtime power management of PCI
devices.  For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.txt.
Namely, it provides subsystem-level callbacks:

    pci_pm_runtime_suspend()
    pci_pm_runtime_resume()
    pci_pm_runtime_idle()

that are executed by the core runtime PM routines.  It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI
devices in low-power states, which at the time of this writing works for both
the native PCI Express PME signaling and the ACPI GPE-based wakeup signaling
described in Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest
number) state from which it can signal wakeup.  The exact method of signaling
wakeup is system-dependent and is determined by the PCI subsystem on the basis
of the reported capabilities of the device and the platform firmware.
To prepare the device for signaling wakeup and put it into the selected
low-power state, the PCI subsystem can use the platform firmware as well as
the device's native PCI PM capabilities, if supported.

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state.  The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices.  Again, this only works if the
device's driver provides a pm->runtime_resume() callback (see below).
However, before the driver's callback is executed, pci_pm_runtime_resume()
brings the device back into the full-power state, prevents it from signaling
wakeup while in that state and restores its standard configuration registers.
Thus the driver's callback need not worry about the PCI-specific aspects of
the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations.  First, it may be called at the request of the device's driver,
for example if there are some data for it to process.  Second, it may be
called as a result of a wakeup signal from the device itself (this sometimes
is referred to as "remote wakeup").  Of course, for this purpose the wakeup
signal is handled in one of the ways described in Section 1 and finally
converted into a notification for the PCI subsystem after the source device
has been identified.
The pci_pm_runtime_idle() function, called for PCI devices by
pm_runtime_idle() and pm_request_idle(), executes the device driver's
pm->runtime_idle() callback, if defined, and if that callback doesn't return
an error code (or is not present at all), suspends the device with the help of
pm_runtime_suspend().  Sometimes pci_pm_runtime_idle() is called automatically
by the PM core (for example, it is called right after the device has just been
resumed), in which case it is expected to suspend the device if that makes
sense.  Usually, however, the PCI subsystem doesn't really know if the device
really can be suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

2.4. System-Wide Power Transitions
----------------------------------
There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst.  Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose.  They are executed in phases such that
each phase involves executing the same subsystem-level callback for every
device belonging to the given subsystem before the next phase begins.  These
phases always run after tasks have been frozen.

2.4.1. System Suspend

When the system is going into a sleep state in which the contents of memory
will be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

    prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these
phases:

    pci_pm_prepare()
    pci_pm_suspend()
    pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume().
Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned.  Next,
if the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine).  Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of
them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running.  It first checks if the device's driver
implements legacy PCI suspend routines (see Section 3), in which case the
legacy late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback
hasn't done that).
Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the
routine returns success.  Otherwise the device driver's pm->suspend_noirq()
callback is executed, if present, and its result is returned if it fails.
Next, if the device's standard configuration registers haven't been saved yet
(one of the device driver's callbacks executed before might do that),
pci_pm_suspend_noirq() saves them, prepares the device to signal wakeup (if
necessary) and puts it into a low-power state.

The low-power state to put the device into is the lowest-power (highest
number) state from which it can signal wakeup while the system is in the
target sleep state.  Just like in the runtime PM case described above, the
mechanism of signaling wakeup is system-dependent and determined by the PCI
subsystem, which is also responsible for preparing the device to signal wakeup
from the system's target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks)
are generally not expected to prepare devices for signaling wakeup or to put
them into low-power states.  However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard
configuration registers, pci_pm_suspend_noirq() will assume that the device
has been prepared to signal wakeup and put into a low-power state by the
driver (the driver is then assumed to have used the helper functions provided
by the PCI subsystem for this purpose).  PCI device drivers are not encouraged
to do that, but in some rare cases doing that in the driver may be the optimum
approach.

2.4.2. System Resume

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

    resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases:

    pci_pm_resume_noirq()
    pci_pm_resume()
    pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary.  This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared
interrupts by drivers whose devices are still suspended).  If legacy PCI power
management callbacks (see Section 3) are implemented by the device's driver,
the legacy early resume callback is executed and its result is returned.
Otherwise, the device driver's pm->resume_noirq() callback is executed, if
defined, and its result is returned.

The pci_pm_resume() routine first checks if the device's standard
configuration registers have been restored and restores them if that's not the
case (this only is necessary in the error path during a failing suspend).
Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't
depend on each other in a known way, the pci_pm_resume() routine may be
executed for both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation

System hibernation is more complicated than system suspend, because it
requires a system image to be created and written into a persistent storage
medium.  The image is created atomically and all devices are quiesced, or
frozen, before that happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system
RAM to be free) in the following three phases:

    prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks:

    pci_pm_prepare()
    pci_pm_freeze()
    pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of
pm->suspend(), and it doesn't apply the suspend-related hardware quirks.
It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare
the device for signaling wakeup and put it into a low-power state.  Still, it
saves the device's standard configuration registers if they haven't been saved
by one of the driver's callbacks.

Once the image has been created, it has to be saved.  However, at this point
all devices are frozen and they cannot handle I/O, while their ability to
handle I/O is obviously necessary for the image saving.  Thus they have to be
brought back to the fully functional state and this is done in the following
phases:

    thaw_noirq, thaw, complete

using the following PCI bus type's callbacks:

    pci_pm_thaw_noirq()
    pci_pm_thaw()
    pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
but it doesn't put the device into the full power state and doesn't attempt to
restore its standard configuration registers.  It also executes the device
driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the
device driver's pm->thaw() callback instead of pm->resume().  It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems).
This is done in three phases:

    prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend.  The other two phases are analogous to the suspend and suspend_noirq phases, respectively. The PCI subsystem-level callbacks they correspond to,

    pci_pm_poweroff()
    pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, although they don't attempt to save the device's standard configuration registers.

2.4.4. System Restore

System restore requires a hibernation image to be loaded into memory and the pre-hibernation memory contents to be restored before the pre-hibernation system activity can be resumed.

As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded into memory by a fresh instance of the kernel, called the boot kernel, which in turn is loaded and run by a boot loader in the usual way.  After the boot kernel has loaded the image, it needs to replace its own code and data with the code and data of the "hibernated" kernel stored within the image, called the image kernel.  For this purpose all devices are frozen just like before creating the image during hibernation, in the

    prepare, freeze, freeze_noirq

phases described above.  However, the devices affected by these phases are only those having drivers in the boot kernel; other devices will still be in whatever state the boot loader left them in.

Should the restoration of the pre-hibernation memory contents fail, the boot kernel would go through the "thawing" procedure described above, using the thaw_noirq, thaw, and complete phases (that will only affect the devices having drivers in the boot kernel), and then continue running normally.
If the pre-hibernation memory contents are restored successfully, which is the usual situation, control is passed to the image kernel, which then becomes responsible for bringing the system back to the working state.  To achieve this, it must restore the devices' pre-hibernation functionality, which is done much like waking up from the memory sleep state, although it involves different phases:

    restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases described above, respectively, and correspond to the following PCI subsystem callbacks:

    pci_pm_restore_noirq()
    pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), respectively, but they execute the device driver's pm->restore_noirq() and pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system resume.


3. PCI Device Drivers and Power Management
==========================================

3.1. Power Management Callbacks
-------------------------------
PCI device drivers participate in power management by providing callbacks to be executed by the PCI subsystem's power management routines described above and by controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management callbacks for a PCI device driver: the recommended one, based on using a dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the "legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and .resume() callbacks from struct pci_driver are used.
The legacy approach, however, doesn't allow one to define runtime power management callbacks and is not really suitable for any new drivers.  Therefore it is not covered by this document (refer to the source code to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object containing pointers to power management (PM) callbacks that will be executed by the PCI subsystem's PM routines in various circumstances.  A pointer to the driver's struct dev_pm_ops object has to be assigned to the driver.pm field in its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks in struct pci_driver are ignored (even if they are not NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI subsystem will handle the device in a simplified default manner.  If they are defined, though, they are expected to behave as described in the following subsections.

3.1.1. prepare()

The prepare() callback is executed during system suspend, during hibernation (when a hibernation image is about to be created), during power-off after saving a hibernation image and during system restore, when a hibernation image has just been loaded into memory.

This callback is only necessary if the driver's device has children that in general may be registered at any time.  In that case the role of the prepare() callback is to prevent new children of the device from being registered until one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
In addition to that the prepare() callback may carry out some operations preparing the device to be suspended, although it should not allocate memory (if additional memory is required to suspend the device, it has to be preallocated earlier, for example in a suspend/hibernate notifier as described in Documentation/driver-api/pm/notifiers.rst).

3.1.2. suspend()

The suspend() callback is only executed during system suspend, after prepare() callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into a low-power state by the PCI subsystem.  It is not required (in fact it is not even recommended) that a PCI driver's suspend() callback save the standard configuration registers of the device, prepare it for waking up the system, or put it into a low-power state.  All of these operations can very well be taken care of by the PCI subsystem, without the driver's participation.

However, in some rare cases it is convenient to carry out these operations in a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and pci_set_power_state() should be used to save the device's standard configuration registers, to prepare it for system wakeup (if necessary), and to put it into a low-power state, respectively.  Moreover, if the driver calls pci_save_state(), the PCI subsystem will not execute either pci_prepare_to_sleep() or pci_set_power_state() for its device, so the driver is then responsible for handling the device as appropriate.

While the suspend() callback is being executed, the driver's interrupt handler can be invoked to handle an interrupt from the device, so all suspend-related operations relying on the driver's ability to handle interrupts should be carried out in this callback.

3.1.3. suspend_noirq()
The suspend_noirq() callback is only executed during system suspend, after suspend() callbacks have been executed for all devices in the system and after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's interrupt handler will not be invoked while suspend_noirq() is running.  Thus suspend_noirq() can carry out operations that would cause race conditions to arise if they were performed in suspend().

3.1.4. freeze()

The freeze() callback is hibernation-specific and is executed in two situations: during hibernation, after prepare() callbacks have been executed for all devices in preparation for the creation of a system image, and during restore, after a system image has been loaded into memory from persistent storage and the prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback described above.  In fact, they only need to be different in the rare cases when the driver takes the responsibility for putting the device into a low-power state.

In those rare cases the freeze() callback should not prepare the device for system wakeup or put it into a low-power state.  Still, either it or freeze_noirq() should save the device's standard configuration registers using pci_save_state().

3.1.5. freeze_noirq()

The freeze_noirq() callback is hibernation-specific.  It is executed during hibernation, after prepare() and freeze() callbacks have been executed for all devices in preparation for the creation of a system image, and during restore, after a system image has been loaded into memory and after prepare() and freeze() callbacks have been executed for all devices.
It is always executed after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq() callback described above and it very rarely is necessary to define freeze_noirq().

The difference between freeze_noirq() and freeze() is analogous to the difference between suspend_noirq() and suspend().

3.1.6. poweroff()

The poweroff() callback is hibernation-specific.  It is executed when the system is about to be powered off after saving a hibernation image to persistent storage.  prepare() callbacks are executed for all devices before poweroff() is called.

The role of this callback is analogous to the role of the suspend() and freeze() callbacks described above, although it does not need to save the contents of the device's registers.  In particular, if the driver wants to put the device into a low-power state itself instead of allowing the PCI subsystem to do that, the poweroff() callback should use pci_prepare_to_sleep() and pci_set_power_state() to prepare the device for system wakeup and to put it into a low-power state, respectively, but it need not save the device's standard configuration registers.

3.1.7. poweroff_noirq()

The poweroff_noirq() callback is hibernation-specific.  It is executed after poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and freeze_noirq() callbacks described above, but it does not need to save the contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()
The resume_noirq() callback is only executed during system resume, after the PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not be invoked while resume_noirq() is running, so this callback can carry out operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full power state in the resume_noirq phase of system resume and restores their standard configuration registers, resume_noirq() is usually not necessary.  In general it should only be used for performing operations that would lead to race conditions if carried out by resume().

3.1.9. resume()

The resume() callback is only executed during system resume, after resume_noirq() callbacks have been executed for all devices in the system and device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the device and bringing it back to the fully functional state.  The device should be able to process I/O in the usual way after resume() has returned.

3.1.10. thaw_noirq()

The thaw_noirq() callback is hibernation-specific.  It is executed after a system image has been created and the non-boot CPUs have been enabled by the PM core, in the thaw_noirq phase of hibernation.  It also may be executed if the loading of a hibernation image fails during system restore (it is then executed after enabling the non-boot CPUs).  The driver's interrupt handler will not be invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq().
The difference between these two callbacks is that thaw_noirq() is executed after freeze() and freeze_noirq(), so in general it does not need to modify the contents of the device's registers.

3.1.11. thaw()

The thaw() callback is hibernation-specific.  It is executed after thaw_noirq() callbacks have been executed for all devices in the system and after device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of the device, so that it will work in the usual way after thaw() has returned.

3.1.12. restore_noirq()

The restore_noirq() callback is hibernation-specific.  It is executed in the restore_noirq phase of hibernation, when the boot kernel has passed control to the image kernel and the non-boot CPUs have been enabled by the image kernel's PM core.

This callback is analogous to resume_noirq() with the exception that it cannot make any assumptions about the previous state of the device, even if the BIOS (or generally the platform firmware) is known to preserve that state over a suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between resume_noirq() and restore_noirq().

3.1.13. restore()

The restore() callback is hibernation-specific.  It is executed after restore_noirq() callbacks have been executed for all devices in the system and after the PM core has enabled device drivers' interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous to resume_noirq().  Consequently, the difference between restore_noirq() and restore() is analogous to the difference between resume_noirq() and resume().
For the vast majority of PCI device drivers there is no difference between resume() and restore().

3.1.14. complete()

The complete() callback is executed in the following situations:

- during system resume, after resume() callbacks have been executed for all devices,
- during hibernation, before saving the system image, after thaw() callbacks have been executed for all devices,
- during system restore, when the system is going back to its pre-hibernation state, after restore() callbacks have been executed for all devices.

It also may be executed if the loading of a hibernation image into memory fails (in that case it is run after thaw() callbacks have been executed for all devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()

The runtime_suspend() callback is specific to device runtime power management (runtime PM).  It is executed by the PM core's runtime PM framework when the device is about to be suspended (i.e. quiesced and put into a low-power state) at run time.

This callback is responsible for freezing the device and preparing it to be put into a low-power state, but it must allow the PCI subsystem to perform all of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()

The runtime_resume() callback is specific to device runtime PM.  It is executed by the PM core's runtime PM framework when the device is about to be resumed (i.e. put into the full-power state and programmed to process I/O normally) at run time.
This callback is responsible for restoring the normal functionality of the device after it has been put into the full-power state by the PCI subsystem. The device is expected to be able to process I/O in the usual way after runtime_resume() has returned.

3.1.17. runtime_idle()

The runtime_idle() callback is specific to device runtime PM.  It is executed by the PM core's runtime PM framework whenever it may be desirable to suspend the device according to the PM core's information.  In particular, it is automatically executed right after runtime_resume() has returned in case the resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the PCI subsystem will call pm_runtime_suspend() for the device, which in turn will cause the driver's runtime_suspend() callback to be executed.

3.1.18. Pointing Multiple Callback Pointers to One Routine

Although in principle each of the callbacks described in the previous subsections can be defined as a separate function, it often is convenient to point two or more members of struct dev_pm_ops to the same routine.  There are a few convenience macros that can be used for this purpose.

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() members and one resume routine pointed to by the .resume(), .thaw(), and .restore() members.  The other function pointers in this struct dev_pm_ops are unset.
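As an illustration, the macro might be used as in the following sketch, where all "foo" identifiers are hypothetical and foo_stop_dma()/foo_start_dma() stand in for whatever device-specific quiescing a real driver performs:

```c
/* Hypothetical driver code; not a definitive implementation. */
static int foo_suspend(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	/* Quiesce the device; saving config space, arming wakeup and
	 * choosing the low-power state are left to the PCI subsystem. */
	foo_stop_dma(pdev);
	return 0;
}

static int foo_resume(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	/* The PCI subsystem has already put the device back into D0
	 * and restored its standard configuration registers. */
	foo_start_dma(pdev);
	return 0;
}

static SIMPLE_DEV_PM_OPS(foo_pm_ops, foo_suspend, foo_resume);

static struct pci_driver foo_pci_driver = {
	.name		= "foo",
	.id_table	= foo_pci_ids,
	.probe		= foo_probe,
	.remove		= foo_remove,
	.driver		= {
		.pm	= &foo_pm_ops,
	},
};
```

Here foo_suspend() doubles as the .freeze() and .poweroff() callback and foo_resume() as .thaw() and .restore(), which is appropriate whenever the driver does not need to distinguish system suspend from hibernation.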
The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it additionally sets the .runtime_resume() pointer to the same value as .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct dev_pm_ops to indicate that one suspend routine is to be pointed to by the .suspend(), .freeze(), and .poweroff() members and one resume routine is to be pointed to by the .resume(), .thaw(), and .restore() members.

3.1.19. Driver Flags for Power Management

The PM core allows device drivers to set flags that influence the handling of power management for the devices by the core itself and by middle layer code including the PCI bus type.  The flags should be set once at driver probe time with the help of the dev_pm_set_driver_flags() function and they should not be updated directly afterwards.

The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete mechanism allowing device suspend/resume callbacks to be skipped if the device is in runtime suspend when the system suspend starts.  That also affects all of the ancestors of the device, so this flag should only be used if absolutely necessary.

The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a positive value from pci_pm_prepare() if the ->prepare callback provided by the driver of the device returns a positive value.  That allows the driver to opt out from using the direct-complete mechanism dynamically.

The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's perspective the device can be safely left in runtime suspend during system suspend.
That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() to skip resuming the device from runtime suspend unless there are PCI-specific reasons for doing that.  Also, it causes pci_pm_suspend_late/noirq(), pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early if the device remains in runtime suspend at the beginning of the "late" phase of the system-wide transition under way.  Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime power management status will be changed to "active" (as it is going to be put into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return.

Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the device to be left in suspend after system-wide transitions to the working state. This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during the "noirq" phase of system-wide suspend and analogous transitions) and next it uses the dev_pm_may_skip_resume() helper to decide whether or not to return from pci_pm_resume_noirq() early, as the PM core will skip the remaining resume callbacks for the device during the transition under way and will set its runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for it.

3.2. Device Runtime Power Management
------------------------------------
In addition to providing device power management callbacks PCI device drivers are responsible for controlling the runtime power management (runtime PM) of their devices.
The PCI device runtime PM is optional, but it is recommended that PCI device drivers implement it at least in the cases where there is a reliable way of verifying that the device is not used (like when the network cable is detached from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM the driver first needs to implement the runtime_suspend() and runtime_resume() callbacks.  It also may need to implement the runtime_idle() callback to prevent the device from being suspended again every time right after the runtime_resume() callback has returned (alternatively, the runtime_suspend() callback will have to check if the device should really be suspended and return -EAGAIN if that is not the case).

The runtime PM of PCI devices is enabled by default by the PCI core.  PCI device drivers do not need to enable it and should not attempt to do so. However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid() helper function.  In addition to that, the runtime PM usage counter of each PCI device is incremented by local_pci_probe() before executing the probe callback provided by the device's driver.

If a PCI driver implements the runtime PM callbacks and intends to use the runtime PM framework provided by the PM core and the PCI subsystem, it needs to decrement the device's runtime PM usage counter in its probe callback function.  If it doesn't do that, the counter will always be different from zero for the device and it will never be runtime-suspended.  The simplest way to do that is by calling pm_runtime_put_noidle(), but if the driver wants to schedule an autosuspend right away, for example, it may call pm_runtime_put_autosuspend() instead for this purpose.
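A minimal sketch of this pattern (driver names and the elided device-specific parts are hypothetical):

```c
/* Hypothetical driver code illustrating the usage counter handling. */
static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err = pcim_enable_device(pdev);

	if (err)
		return err;

	/* ... device-specific initialization ... */

	/* Balance the increment done by local_pci_probe() so the
	 * device can be runtime-suspended once that is allowed. */
	pm_runtime_put_noidle(&pdev->dev);
	return 0;
}

static void foo_remove(struct pci_dev *pdev)
{
	/* Balance the pm_runtime_put_noidle() call made in probe. */
	pm_runtime_get_noresume(&pdev->dev);

	/* ... device-specific teardown ... */
}
```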
Generally, it just needs to call a function that decrements the device's usage counter from its probe routine to make runtime PM work for the device.

It is important to remember that the driver's runtime_suspend() callback may be executed right after the usage counter has been decremented, because user space may already have caused the pm_runtime_allow() helper function, which unblocks the runtime PM of the device, to run via sysfs, so the driver must be prepared to cope with that.

The driver itself should not call pm_runtime_allow(), though.  Instead, it should let user space or some platform-specific code do that (user space can do it via sysfs as stated above), but it must be prepared to handle the runtime PM of the device correctly as soon as pm_runtime_allow() is called (which may happen at any time, even before the driver is loaded).

When the driver's remove callback runs, it has to balance the decrement of the device's runtime PM usage counter made at probe time.  For this reason, if it has decremented the counter in its probe callback, it must run pm_runtime_get_noresume() in its remove callback.  [Since the core carries out a runtime resume of the device and bumps up the device's usage counter before running the driver's remove callback, the runtime PM of the device is effectively disabled for the duration of the remove execution and all runtime PM helper functions incrementing the device's usage counter are then effectively equivalent to pm_runtime_get_noresume().]

The runtime PM framework works by processing requests to suspend or resume devices, or to check if they are idle (in which case it is reasonable to subsequently request that they be suspended).  These requests are represented by work items put into the power management workqueue, pm_wq.
Although there are a few situations in which power management requests are automatically queued by the PM core (for example, after processing a request to resume a device the PM core automatically queues a request to check if the device is idle), device drivers are generally responsible for queuing power management requests for their devices.  For this purpose they should use the runtime PM helper functions provided by the PM core, discussed in Documentation/power/runtime_pm.txt.

Devices can also be suspended and resumed synchronously, without placing a request into pm_wq.  In the majority of cases this also is done by their drivers that use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to Documentation/power/runtime_pm.txt.


4. Resources
============

PCI Local Bus Specification, Rev. 3.0
PCI Bus Power Management Interface Specification, Rev. 1.2
Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
PCI Express Base Specification, Rev. 2.0
Documentation/driver-api/pm/devices.rst
Documentation/power/runtime_pm.txt
Documentation/power/pm_qos_interface.rst
===============================
PM Quality Of Service Interface
===============================

This interface provides a kernel and user mode interface for registering performance expectations by drivers, subsystems and user space applications on one of the parameters.

Two different PM QoS frameworks are available:

1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput, memory_bandwidth.
2. The per-device PM QoS framework, which provides the API to manage the per-device latency constraints and PM QoS flags.

Each parameter has defined units:

* latency: usec
* timeout: usec
* throughput: kbs (kilo bit / sec)
* memory bandwidth: mbs (mega bit / sec)


1. PM QoS framework
===================

The infrastructure exposes multiple misc device nodes, one per implemented parameter.  The set of parameters implemented is defined by pm_qos_power_init() and pm_qos_params.h.  This is done because having the available parameters be runtime configurable or changeable from a driver was seen as too easy to abuse.

For each parameter a list of performance requests is maintained along with an aggregated target value.  The aggregated target value is updated with changes to the request list or elements of the list.  Typically the aggregated target value is simply the max or min of the request values held in the parameter list elements.

Note: the aggregated target value is implemented as an atomic variable so that reading the aggregated value does not require any locking mechanism.


From kernel mode the use of this interface is simple:

void pm_qos_add_request(handle, param_class, target_value):
Will insert an element into the list for that identified PM QoS class with the target value.
Upon change to this list the new target is recomputed and any registered notifiers are called only if the target value is now different.  Clients of pm_qos need to save the returned handle for future use in other pm_qos API functions.

void pm_qos_update_request(handle, new_target_value):
Will update the list element pointed to by the handle with the new target value and recompute the new aggregated target, calling the notification tree if the target is changed.

void pm_qos_remove_request(handle):
Will remove the element.  After removal it will update the aggregate target and call the notification tree if the target was changed as a result of removing the request.

int pm_qos_request(param_class):
Returns the aggregated value for a given PM QoS class.

int pm_qos_request_active(handle):
Returns whether the request is still active, i.e. it has not been removed from a PM QoS class constraints list.

int pm_qos_add_notifier(param_class, notifier):
Adds a notification callback function to the PM QoS class.  The callback is called when the aggregated value for the PM QoS class is changed.

int pm_qos_remove_notifier(int param_class, notifier):
Removes the notification callback function for the PM QoS class.


From user mode:

Only processes can register a pm_qos request.  To provide for automatic cleanup of a process, the interface requires the process to register its parameter requests in the following way:

To register the default pm_qos target for the specific parameter, the process must open one of /dev/[cpu_dma_latency, network_latency, network_throughput].

As long as the device node is held open that process has a registered request on the parameter.

To change the requested target value the process needs to write an s32 value to the open device node.
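For example, a user space program could register such a request with a small helper like the one below (a sketch; the helper name and error handling are illustrative, and opening /dev/cpu_dma_latency requires appropriate privileges):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Open a PM QoS device node and register a request for target_us.
 * The request stays in effect for as long as the returned file
 * descriptor is kept open; closing it removes the request.
 */
static int pm_qos_open_request(const char *path, int32_t target_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	/* The kernel expects a raw (binary) s32 value. */
	if (write(fd, &target_us, sizeof(target_us)) !=
	    (ssize_t)sizeof(target_us)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A latency-sensitive application would call pm_qos_open_request("/dev/cpu_dma_latency", 10), keep the descriptor open for the duration of its time-critical work, and close() it afterwards.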
Alternatively the user mode program could write a hex string for the value using the 10-character-long format, e.g. "0x12345678".  This translates to a pm_qos_update_request call.

To remove the user mode request for a target value simply close the device node.


2. PM QoS per-device latency and flags framework
================================================

For each device, there are three lists of PM QoS requests.  Two of them are maintained along with the aggregated targets of resume latency and active state latency tolerance (in microseconds) and the third one is for PM QoS flags. Values are updated in response to changes of the request list.

The target values of resume latency and active state latency tolerance are simply the minimum of the request values held in the parameter list elements. The PM QoS flags aggregate value is the gather (bitwise OR) of all list elements' values.  One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF.

Note: The aggregated target values are implemented in such a way that reading the aggregated value does not require any locking mechanism.


From kernel mode the use of this interface is the following:

int dev_pm_qos_add_request(device, handle, type, value):
Will insert an element into the list for that identified device with the target value.  Upon change to this list the new target is recomputed and any registered notifiers are called only if the target value is now different. Clients of dev_pm_qos need to save the handle for future use in other dev_pm_qos API functions.

int dev_pm_qos_update_request(handle, new_value):
Will update the list element pointed to by the handle with the new target value and recompute the new aggregated target, calling the notification trees if the target is changed.
int dev_pm_qos_remove_request(handle):
  Will remove the element. After removal it will update the aggregate target
  and call the notification trees if the target was changed as a result of
  removing the request.

s32 dev_pm_qos_read_value(device):
  Returns the aggregated value for a given device's constraints list.

enum pm_qos_flags_status dev_pm_qos_flags(device, mask)
  Check PM QoS flags of the given device against the given mask of flags.
  The meaning of the return values is as follows:

  PM_QOS_FLAGS_ALL:
    All flags from the mask are set
  PM_QOS_FLAGS_SOME:
    Some flags from the mask are set
  PM_QOS_FLAGS_NONE:
    No flags from the mask are set
  PM_QOS_FLAGS_UNDEFINED:
    The device's PM QoS structure has not been initialized
    or the list of requests is empty.

int dev_pm_qos_add_ancestor_request(dev, handle, type, value)
  Add a PM QoS request for the first direct ancestor of the given device whose
  power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
  or whose power.set_latency_tolerance callback pointer is not NULL (for
  DEV_PM_QOS_LATENCY_TOLERANCE requests).

int dev_pm_qos_expose_latency_limit(device, value)
  Add a request to the device's PM QoS list of resume latency constraints and
  create a sysfs attribute pm_qos_resume_latency_us under the device's power
  directory allowing user space to manipulate that request.

void dev_pm_qos_hide_latency_limit(device)
  Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
  PM QoS list of resume latency constraints and remove sysfs attribute
  pm_qos_resume_latency_us from the device's power directory.
int dev_pm_qos_expose_flags(device, value)
  Add a request to the device's PM QoS list of flags and create sysfs attribute
  pm_qos_no_power_off under the device's power directory allowing user space to
  change the value of the PM_QOS_FLAG_NO_POWER_OFF flag.

void dev_pm_qos_hide_flags(device)
  Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS
  list of flags and remove sysfs attribute pm_qos_no_power_off from the
  device's power directory.

Notification mechanisms:

The per-device PM QoS framework has a per-device notification tree.

int dev_pm_qos_add_notifier(device, notifier):
  Adds a notification callback function for the device.
  The callback is called when the aggregated value of the device constraints
  list is changed (for resume latency device PM QoS only).

int dev_pm_qos_remove_notifier(device, notifier):
  Removes the notification callback function for the device.


Active state latency tolerance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This device PM QoS type is used to support systems in which hardware may switch
to energy-saving operation modes on the fly. In those systems, if the operation
mode chosen by the hardware attempts to save energy in an overly aggressive
way, it may cause excess latencies to be visible to software, causing it to
miss certain protocol requirements or target frame or sample rates etc.

If there is a latency tolerance control mechanism for a given device available
to software, the .set_latency_tolerance callback in that device's dev_pm_info
structure should be populated. The routine pointed to by it should implement
whatever is necessary to transfer the effective requirement value to the
hardware.
Whenever the effective latency tolerance changes for the device, its
.set_latency_tolerance() callback will be executed and the effective value will
be passed to it. If that value is negative, which means that the list of
latency tolerance requirements for the device is empty, the callback is
expected to switch the underlying hardware latency tolerance control mechanism
to an autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in
turn, and the hardware supports a special "no requirement" setting, the
callback is expected to use it. That allows software to prevent the hardware
from automatically updating the device's latency tolerance in response to its
power state changes (e.g. during transitions from D3cold to D0), which
generally may be done in the autonomous latency tolerance control mode.

If .set_latency_tolerance() is present for the device, sysfs attribute
pm_qos_latency_tolerance_us will be present in the device's power directory.
Then, user space can use that attribute to specify its latency tolerance
requirement for the device, if any. Writing "any" to it means "no requirement,
but do not let the hardware control latency tolerance" and writing "auto" to it
allows the hardware to be switched to the autonomous mode if there are no other
requirements from the kernel side in the device's list.

Kernel code can use the functions described above along with the
DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update
latency tolerance requirements for devices.
-212
Documentation/power/pm_qos_interface.txt
+282
Documentation/power/power_supply_class.rst
========================
Linux power supply class
========================

Synopsis
~~~~~~~~
Power supply class used to represent battery, UPS, AC or DC power supply
properties to user-space.

It defines a core set of attributes, which should be applicable to (almost)
every power supply out there. Attributes are available via sysfs and uevent
interfaces.

Each attribute has well defined meaning, up to unit of measure used. While
the attributes provided are believed to be universally applicable to any
power supply, specific monitoring hardware may not be able to provide them
all, so any of them may be skipped.

Power supply class is extensible, and allows drivers to define their own
attributes. The core attribute set is subject to the standard Linux evolution
(i.e. if it is found that some attribute is applicable to many power supply
types or their drivers, it can be added to the core set).

It also integrates with the LED framework, for the purpose of providing
typically expected feedback of battery charging/fully charged status and
AC/USB power supply online status. (Note that specific details of the
indication (including whether to use it at all) are fully controllable by
user and/or specific machine defaults, per design principles of the LED
framework.)


Attributes/properties
~~~~~~~~~~~~~~~~~~~~~
Power supply class has a predefined set of attributes; this eliminates code
duplication across drivers. Power supply class insists on reusing its
predefined attributes *and* their units.

So, userspace gets a predictable set of attributes and their units for any
kind of power supply, and can process/present them to a user in a consistent
manner. Results for different power supplies and machines are also directly
comparable.
See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c
for examples of how to declare and handle attributes.


Units
~~~~~
Quoting include/linux/power_supply.h:

  All voltages, currents, charges, energies, time and temperatures in µV,
  µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise
  stated. It's driver's job to convert its raw values to units in which
  this class operates.


Attributes/properties detailed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+--------------------------------------------------------------------------+
| **Charge/Energy/Capacity - how to not confuse**                          |
+--------------------------------------------------------------------------+
| **Because both "charge" (µAh) and "energy" (µWh) represents "capacity"   |
| of battery, this class distinguish these terms. Don't mix them!**        |
|                                                                          |
| - `CHARGE_*`                                                             |
|   attributes represents capacity in µAh only.                            |
| - `ENERGY_*`                                                             |
|   attributes represents capacity in µWh only.                            |
| - `CAPACITY`                                                             |
|   attribute represents capacity in *percents*, from 0 to 100.            |
+--------------------------------------------------------------------------+

Postfixes:

_AVG
  *hardware* averaged value, use it if your hardware is really able to
  report averaged values.
_NOW
  momentary/instantaneous values.

STATUS
  this attribute represents operating status (charging, full,
  discharging (i.e. powering a load), etc.). This corresponds to
  `BATTERY_STATUS_*` values, as defined in battery.h.

CHARGE_TYPE
  batteries can typically charge at different rates.
  This defines trickle and fast charges. For batteries that
  are already charged or discharging, 'n/a' can be displayed (or
  'unknown', if the status is not known).
AUTHENTIC
  indicates the power supply (battery or charger) connected
  to the platform is authentic(1) or non authentic(0).

HEALTH
  represents health of the battery, values corresponds to
  POWER_SUPPLY_HEALTH_*, defined in battery.h.

VOLTAGE_OCV
  open circuit voltage of the battery.

VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN
  design values for maximal and minimal power supply voltages.
  Maximal/minimal means values of voltages when battery considered
  "full"/"empty" at normal conditions. Yes, there is no direct relation
  between voltage and battery capacity, but some dumb
  batteries use voltage for very approximated calculation of capacity.
  Battery driver also can use this attribute just to inform userspace
  about maximal and minimal voltage thresholds of a given battery.

VOLTAGE_MAX, VOLTAGE_MIN
  same as _DESIGN voltage values except that these ones should be used
  if hardware could only guess (measure and retain) the thresholds of a
  given power supply.

VOLTAGE_BOOT
  Reports the voltage measured during boot

CURRENT_BOOT
  Reports the current measured during boot

CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN
  design charge values, when battery considered full/empty.

ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN
  same as above but for energy.

CHARGE_FULL, CHARGE_EMPTY
  These attributes means "last remembered value of charge when battery
  became full/empty". It also could mean "value of charge when battery
  considered full/empty at given conditions (temperature, age)".
  I.e. these attributes represents real thresholds, not design values.

ENERGY_FULL, ENERGY_EMPTY
  same as above but for energy.

CHARGE_COUNTER
  the current charge counter (in µAh). This could easily
  be negative; there is no empty or full value. It is only useful for
  relative, time-based measurements.

PRECHARGE_CURRENT
  the maximum charge current during precharge phase of charge cycle
  (typically 20% of battery capacity).

CHARGE_TERM_CURRENT
  Charge termination current. The charge cycle terminates when battery
  voltage is above recharge threshold, and charge current is below
  this setting (typically 10% of battery capacity).

CONSTANT_CHARGE_CURRENT
  constant charge current programmed by charger.
CONSTANT_CHARGE_CURRENT_MAX
  maximum charge current supported by the power supply object.

CONSTANT_CHARGE_VOLTAGE
  constant charge voltage programmed by charger.
CONSTANT_CHARGE_VOLTAGE_MAX
  maximum charge voltage supported by the power supply object.

INPUT_CURRENT_LIMIT
  input current limit programmed by charger. Indicates
  the current drawn from a charging source.

CHARGE_CONTROL_LIMIT
  current charge control limit setting
CHARGE_CONTROL_LIMIT_MAX
  maximum charge control limit setting

CALIBRATE
  battery or coulomb counter calibration status

CAPACITY
  capacity in percents.
CAPACITY_ALERT_MIN
  minimum capacity alert value in percents.
CAPACITY_ALERT_MAX
  maximum capacity alert value in percents.
CAPACITY_LEVEL
  capacity level. This corresponds to POWER_SUPPLY_CAPACITY_LEVEL_*.

TEMP
  temperature of the power supply.
TEMP_ALERT_MIN
  minimum battery temperature alert.
TEMP_ALERT_MAX
  maximum battery temperature alert.
TEMP_AMBIENT
  ambient temperature.
TEMP_AMBIENT_ALERT_MIN
  minimum ambient temperature alert.
TEMP_AMBIENT_ALERT_MAX
  maximum ambient temperature alert.
TEMP_MIN
  minimum operatable temperature
TEMP_MAX
  maximum operatable temperature

TIME_TO_EMPTY
  seconds left for battery to be considered empty
  (i.e. while battery powers a load)
TIME_TO_FULL
  seconds left for battery to be considered full
  (i.e. while battery is charging)


Battery <-> external power supply interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often power supplies are acting as supplies and supplicants at the same
time. Batteries are good example. So, batteries usually care if they're
externally powered or not.

For that case, power supply class implements notification mechanism for
batteries.

External power supply (AC) lists supplicants (batteries) names in
"supplied_to" struct member, and each power_supply_changed() call
issued by external power supply will notify supplicants via
external_power_changed callback.


Devicetree battery characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Drivers should call power_supply_get_battery_info() to obtain battery
characteristics from a devicetree battery node, defined in
Documentation/devicetree/bindings/power/supply/battery.txt. This is
implemented in drivers/power/supply/bq27xxx_battery.c.

Properties in struct power_supply_battery_info and their counterparts in the
battery node have names corresponding to elements in enum power_supply_property,
for naming consistency between sysfs attributes and battery node properties.


QA
~~

Q:
  Where is POWER_SUPPLY_PROP_XYZ attribute?
A:
  If you cannot find an attribute suitable for your driver needs, feel free
  to add it and send a patch along with your driver.

  The attributes available currently are the ones currently provided by the
  drivers written.
  Good candidates to add in future: model/part#, cycle_time, manufacturer,
  etc.


Q:
  I have some very specific attribute (e.g. battery color), should I add
  this attribute to standard ones?
A:
  Most likely, no. Such attribute can be placed in the driver itself, if
  it is useful. Of course, if the attribute in question applicable to
  large set of batteries, provided by many drivers, and/or comes from
  some general battery specification/standard, it may be a candidate to
  be added to the core attribute set.


Q:
  Suppose, my battery monitoring chip/firmware does not provides capacity
  in percents, but provides charge_{now,full,empty}. Should I calculate
  percentage capacity manually, inside the driver, and register CAPACITY
  attribute? The same question about time_to_empty/time_to_full.
A:
  Most likely, no. This class is designed to export properties which are
  directly measurable by the specific hardware available.

  Inferring not available properties using some heuristics or mathematical
  model is not subject of work for a battery driver. Such functionality
  should be factored out, and in fact, apm_power, the driver to serve
  legacy APM API on top of power supply class, uses a simple heuristic of
  approximating remaining battery capacity based on its charge, current,
  voltage and so on. But full-fledged battery model is likely not subject
  for kernel at all, as it would require floating point calculation to deal
  with things like differential equations and Kalman filters. This is
  better be handled by batteryd/libbattery, yet to be written.
-231
Documentation/power/power_supply_class.txt
··· 1 - Linux power supply class 2 - ======================== 3 - 4 - Synopsis 5 - ~~~~~~~~ 6 - Power supply class used to represent battery, UPS, AC or DC power supply 7 - properties to user-space. 8 - 9 - It defines core set of attributes, which should be applicable to (almost) 10 - every power supply out there. Attributes are available via sysfs and uevent 11 - interfaces. 12 - 13 - Each attribute has well defined meaning, up to unit of measure used. While 14 - the attributes provided are believed to be universally applicable to any 15 - power supply, specific monitoring hardware may not be able to provide them 16 - all, so any of them may be skipped. 17 - 18 - Power supply class is extensible, and allows to define drivers own attributes. 19 - The core attribute set is subject to the standard Linux evolution (i.e. 20 - if it will be found that some attribute is applicable to many power supply 21 - types or their drivers, it can be added to the core set). 22 - 23 - It also integrates with LED framework, for the purpose of providing 24 - typically expected feedback of battery charging/fully charged status and 25 - AC/USB power supply online status. (Note that specific details of the 26 - indication (including whether to use it at all) are fully controllable by 27 - user and/or specific machine defaults, per design principles of LED 28 - framework). 29 - 30 - 31 - Attributes/properties 32 - ~~~~~~~~~~~~~~~~~~~~~ 33 - Power supply class has predefined set of attributes, this eliminates code 34 - duplication across drivers. Power supply class insist on reusing its 35 - predefined attributes *and* their units. 36 - 37 - So, userspace gets predictable set of attributes and their units for any 38 - kind of power supply, and can process/present them to a user in consistent 39 - manner. Results for different power supplies and machines are also directly 40 - comparable. 
41 - 42 - See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c 43 - for the example how to declare and handle attributes. 44 - 45 - 46 - Units 47 - ~~~~~ 48 - Quoting include/linux/power_supply.h: 49 - 50 - All voltages, currents, charges, energies, time and temperatures in µV, 51 - µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise 52 - stated. It's driver's job to convert its raw values to units in which 53 - this class operates. 54 - 55 - 56 - Attributes/properties detailed 57 - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 58 - 59 - ~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~ 60 - ~ ~ 61 - ~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~ 62 - ~ of battery, this class distinguish these terms. Don't mix them! ~ 63 - ~ ~ 64 - ~ CHARGE_* attributes represents capacity in µAh only. ~ 65 - ~ ENERGY_* attributes represents capacity in µWh only. ~ 66 - ~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~ 67 - ~ ~ 68 - ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 69 - 70 - Postfixes: 71 - _AVG - *hardware* averaged value, use it if your hardware is really able to 72 - report averaged values. 73 - _NOW - momentary/instantaneous values. 74 - 75 - STATUS - this attribute represents operating status (charging, full, 76 - discharging (i.e. powering a load), etc.). This corresponds to 77 - BATTERY_STATUS_* values, as defined in battery.h. 78 - 79 - CHARGE_TYPE - batteries can typically charge at different rates. 80 - This defines trickle and fast charges. For batteries that 81 - are already charged or discharging, 'n/a' can be displayed (or 82 - 'unknown', if the status is not known). 83 - 84 - AUTHENTIC - indicates the power supply (battery or charger) connected 85 - to the platform is authentic(1) or non authentic(0). 86 - 87 - HEALTH - represents health of the battery, values corresponds to 88 - POWER_SUPPLY_HEALTH_*, defined in battery.h. 
89 - 90 - VOLTAGE_OCV - open circuit voltage of the battery. 91 - 92 - VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and 93 - minimal power supply voltages. Maximal/minimal means values of voltages 94 - when battery considered "full"/"empty" at normal conditions. Yes, there is 95 - no direct relation between voltage and battery capacity, but some dumb 96 - batteries use voltage for very approximated calculation of capacity. 97 - Battery driver also can use this attribute just to inform userspace 98 - about maximal and minimal voltage thresholds of a given battery. 99 - 100 - VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that 101 - these ones should be used if hardware could only guess (measure and 102 - retain) the thresholds of a given power supply. 103 - 104 - VOLTAGE_BOOT - Reports the voltage measured during boot 105 - 106 - CURRENT_BOOT - Reports the current measured during boot 107 - 108 - CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when 109 - battery considered full/empty. 110 - 111 - ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy. 112 - 113 - CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value 114 - of charge when battery became full/empty". It also could mean "value of 115 - charge when battery considered full/empty at given conditions (temperature, 116 - age)". I.e. these attributes represents real thresholds, not design values. 117 - 118 - ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. 119 - 120 - CHARGE_COUNTER - the current charge counter (in µAh). This could easily 121 - be negative; there is no empty or full value. It is only useful for 122 - relative, time-based measurements. 123 - 124 - PRECHARGE_CURRENT - the maximum charge current during precharge phase 125 - of charge cycle (typically 20% of battery capacity). 126 - CHARGE_TERM_CURRENT - Charge termination current. 
The charge cycle
terminates when the battery voltage is above the recharge threshold and the
charge current is below this setting (typically 10% of battery capacity).

CONSTANT_CHARGE_CURRENT - constant charge current programmed by the charger.
CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the
power supply object.

CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by the charger.
CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the
power supply object.

INPUT_CURRENT_LIMIT - input current limit programmed by the charger. Indicates
the current drawn from a charging source.

CHARGE_CONTROL_LIMIT - current charge control limit setting.
CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting.

CALIBRATE - battery or coulomb counter calibration status.

CAPACITY - capacity in percent.
CAPACITY_ALERT_MIN - minimum capacity alert value in percent.
CAPACITY_ALERT_MAX - maximum capacity alert value in percent.
CAPACITY_LEVEL - capacity level. This corresponds to
POWER_SUPPLY_CAPACITY_LEVEL_*.

TEMP - temperature of the power supply.
TEMP_ALERT_MIN - minimum battery temperature alert.
TEMP_ALERT_MAX - maximum battery temperature alert.
TEMP_AMBIENT - ambient temperature.
TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert.
TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert.
TEMP_MIN - minimum operable temperature.
TEMP_MAX - maximum operable temperature.

TIME_TO_EMPTY - seconds left for the battery to be considered empty (i.e.
while the battery powers a load).
TIME_TO_FULL - seconds left for the battery to be considered full (i.e.
while the battery is charging).


Battery <-> external power supply interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often power supplies act as supplies and supplicants at the same
time. Batteries are a good example. So, batteries usually care whether they're
externally powered or not.

For that case, the power supply class implements a notification mechanism for
batteries.

An external power supply (AC) lists the names of its supplicants (batteries)
in the "supplied_to" struct member, and each power_supply_changed() call
issued by the external power supply will notify the supplicants via the
external_power_changed callback.


Devicetree battery characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Drivers should call power_supply_get_battery_info() to obtain battery
characteristics from a devicetree battery node, defined in
Documentation/devicetree/bindings/power/supply/battery.txt. This is
implemented in drivers/power/supply/bq27xxx_battery.c.

Properties in struct power_supply_battery_info and their counterparts in the
battery node have names corresponding to elements in enum power_supply_property,
for naming consistency between sysfs attributes and battery node properties.


QA
~~
Q: Where is the POWER_SUPPLY_PROP_XYZ attribute?
A: If you cannot find an attribute suitable for your driver's needs, feel free
to add it and send a patch along with your driver.

The attributes currently available are the ones currently provided by the
drivers written.

Good candidates to add in the future: model/part#, cycle_time, manufacturer,
etc.


Q: I have some very specific attribute (e.g. battery color), should I add
this attribute to the standard ones?
A: Most likely, no. Such an attribute can be placed in the driver itself, if
it is useful.
Of course, if the attribute in question is applicable to a
large set of batteries, provided by many drivers, and/or comes from
some general battery specification/standard, it may be a candidate to
be added to the core attribute set.


Q: Suppose my battery monitoring chip/firmware does not provide capacity
in percent, but provides charge_{now,full,empty}. Should I calculate
percentage capacity manually, inside the driver, and register a CAPACITY
attribute? The same question applies to time_to_empty/time_to_full.
A: Most likely, no. This class is designed to export properties which are
directly measurable by the specific hardware available.

Inferring unavailable properties using heuristics or a mathematical
model is not the job of a battery driver. Such functionality
should be factored out, and in fact, apm_power, the driver that serves the
legacy APM API on top of the power supply class, uses a simple heuristic of
approximating remaining battery capacity based on its charge, current,
voltage and so on. But a full-fledged battery model is likely not a subject
for the kernel at all, as it would require floating point calculation to deal
with things like differential equations and Kalman filters. This is
better handled by batteryd/libbattery, yet to be written.
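The µAh/µWh/percent conventions described above lend themselves to a short arithmetic sketch. The helpers below are hypothetical illustrations (they are not part of the power supply class API), using the integer math a driver would typically use:

```c
#include <assert.h>

/*
 * Illustrative (hypothetical) helpers for the class unit conventions:
 * CHARGE_* is µAh, ENERGY_* is µWh, CAPACITY is percent.
 */

/* Convert a charge (µAh) to an energy (µWh) at a given voltage (µV). */
static long long charge_uah_to_energy_uwh(long long charge_uah,
					  long long voltage_uv)
{
	/* µAh * µV / 1e6 = µWh */
	return charge_uah * voltage_uv / 1000000LL;
}

/* Compute CAPACITY (0..100 percent) from CHARGE_NOW/CHARGE_FULL (µAh). */
static int capacity_percent(long long charge_now_uah,
			    long long charge_full_uah)
{
	if (charge_full_uah <= 0)
		return -1;	/* unknown */
	return (int)(charge_now_uah * 100 / charge_full_uah);
}
```

For example, a 2,000,000 µAh (2 Ah) charge at 3,700,000 µV (3.7 V) corresponds to 7,400,000 µWh (7.4 Wh), which is exactly the charge/energy distinction the box above warns about.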
+257
Documentation/power/powercap/powercap.rst
··· =======================
Power Capping Framework
=======================

The power capping framework provides a consistent interface between the kernel
and the user space that allows power capping drivers to expose the settings to
user space in a uniform way.

Terminology
===========

The framework exposes power capping devices to user space via sysfs in the
form of a tree of objects. The objects at the root level of the tree represent
'control types', which correspond to different methods of power capping. For
example, the intel-rapl control type represents the Intel "Running Average
Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
corresponds to the use of idle injection for controlling power.

Power zones represent different parts of the system, which can be controlled and
monitored using the power capping method determined by the control type the
given zone belongs to. They each contain attributes for monitoring power, as
well as controls represented in the form of power constraints. If the parts of
the system represented by different power zones are hierarchical (that is, one
bigger part consists of multiple smaller parts that each have their own power
controls), those power zones may also be organized in a hierarchy with one
parent power zone containing multiple subzones and so on to reflect the power
control topology of the system. In that case, it is possible to apply power
capping to a set of devices together using the parent power zone and if more
fine grained control is required, it can be applied through the subzones.
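The hierarchical capping rule in the paragraph above can be stated compactly: a subzone's effective cap is its own limit clipped by the caps of all of its ancestors. This is a purely illustrative sketch of that rule (not framework code; the "0 means unconstrained" convention is an assumption of the sketch):

```c
#include <assert.h>

/*
 * Illustrative only: effective cap of a zone given its own limit and the
 * tightest cap among its ancestors. 0 means "unconstrained" here.
 */
static long long effective_cap_uw(long long own_limit_uw,
				  long long parent_cap_uw)
{
	if (own_limit_uw == 0)
		return parent_cap_uw;
	if (parent_cap_uw == 0 || own_limit_uw < parent_cap_uw)
		return own_limit_uw;
	return parent_cap_uw;
}
```

So a subzone asking for 60 W under a 45 W package cap is effectively held to 45 W, while a subzone asking for 30 W keeps its own tighter limit.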

Example sysfs interface tree::

  /sys/devices/virtual/powercap
  └──intel-rapl
      ├──intel-rapl:0
      │   ├──constraint_0_name
      │   ├──constraint_0_power_limit_uw
      │   ├──constraint_0_time_window_us
      │   ├──constraint_1_name
      │   ├──constraint_1_power_limit_uw
      │   ├──constraint_1_time_window_us
      │   ├──device -> ../../intel-rapl
      │   ├──energy_uj
      │   ├──intel-rapl:0:0
      │   │   ├──constraint_0_name
      │   │   ├──constraint_0_power_limit_uw
      │   │   ├──constraint_0_time_window_us
      │   │   ├──constraint_1_name
      │   │   ├──constraint_1_power_limit_uw
      │   │   ├──constraint_1_time_window_us
      │   │   ├──device -> ../../intel-rapl:0
      │   │   ├──energy_uj
      │   │   ├──max_energy_range_uj
      │   │   ├──name
      │   │   ├──enabled
      │   │   ├──power
      │   │   │   ├──async
      │   │   │   []
      │   │   ├──subsystem -> ../../../../../../class/power_cap
      │   │   └──uevent
      │   ├──intel-rapl:0:1
      │   │   ├──constraint_0_name
      │   │   ├──constraint_0_power_limit_uw
      │   │   ├──constraint_0_time_window_us
      │   │   ├──constraint_1_name
      │   │   ├──constraint_1_power_limit_uw
      │   │   ├──constraint_1_time_window_us
      │   │   ├──device -> ../../intel-rapl:0
      │   │   ├──energy_uj
      │   │   ├──max_energy_range_uj
      │   │   ├──name
      │   │   ├──enabled
      │   │   ├──power
      │   │   │   ├──async
      │   │   │   []
      │   │   ├──subsystem -> ../../../../../../class/power_cap
      │   │   └──uevent
      │   ├──max_energy_range_uj
      │   ├──max_power_range_uw
      │   ├──name
      │   ├──enabled
      │   ├──power
      │   │   ├──async
      │   │   []
      │   ├──subsystem -> ../../../../../class/power_cap
      │   ├──enabled
      │   ├──uevent
      ├──intel-rapl:1
      │   ├──constraint_0_name
      │   ├──constraint_0_power_limit_uw
      │   ├──constraint_0_time_window_us
      │   ├──constraint_1_name
      │   ├──constraint_1_power_limit_uw
      │   ├──constraint_1_time_window_us
      │   ├──device -> ../../intel-rapl
      │   ├──energy_uj
      │   ├──intel-rapl:1:0
      │   │   ├──constraint_0_name
      │   │   ├──constraint_0_power_limit_uw
      │   │   ├──constraint_0_time_window_us
      │   │   ├──constraint_1_name
      │   │   ├──constraint_1_power_limit_uw
      │   │   ├──constraint_1_time_window_us
      │   │   ├──device -> ../../intel-rapl:1
      │   │   ├──energy_uj
      │   │   ├──max_energy_range_uj
      │   │   ├──name
      │   │   ├──enabled
      │   │   ├──power
      │   │   │   ├──async
      │   │   │   []
      │   │   ├──subsystem -> ../../../../../../class/power_cap
      │   │   └──uevent
      │   ├──intel-rapl:1:1
      │   │   ├──constraint_0_name
      │   │   ├──constraint_0_power_limit_uw
      │   │   ├──constraint_0_time_window_us
      │   │   ├──constraint_1_name
      │   │   ├──constraint_1_power_limit_uw
      │   │   ├──constraint_1_time_window_us
      │   │   ├──device -> ../../intel-rapl:1
      │   │   ├──energy_uj
      │   │   ├──max_energy_range_uj
      │   │   ├──name
      │   │   ├──enabled
      │   │   ├──power
      │   │   │   ├──async
      │   │   │   []
      │   │   ├──subsystem -> ../../../../../../class/power_cap
      │   │   └──uevent
      │   ├──max_energy_range_uj
      │   ├──max_power_range_uw
      │   ├──name
      │   ├──enabled
      │   ├──power
      │   │   ├──async
      │   │   []
      │   ├──subsystem -> ../../../../../class/power_cap
      │   ├──uevent
      ├──power
      │   ├──async
      │   []
      ├──subsystem -> ../../../../class/power_cap
      ├──enabled
      └──uevent

The above example illustrates a case in which the Intel RAPL technology,
available in Intel® 64 and IA-32 Processor Architectures, is used. There is
one control type called intel-rapl which contains two power zones,
intel-rapl:0 and intel-rapl:1, representing CPU packages.
Each of these power zones contains
two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
"core" and the "uncore" parts of the given CPU package, respectively. All of
the zones and subzones contain energy monitoring attributes (energy_uj,
max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
to be applied (the constraints in the 'package' power zones apply to the whole
CPU packages and the subzone constraints only apply to the respective parts of
the given package individually). Since Intel RAPL doesn't provide an
instantaneous power value, there is no power_uw attribute.

In addition to that, each power zone contains a name attribute, allowing the
part of the system represented by that zone to be identified.
For example::

    cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
    package-0

The Intel RAPL technology allows two constraints, short term and long term,
with two different time windows to be applied to each power zone. Thus for
each zone there are 2 attributes representing the constraint names, 2 power
limits and 2 attributes representing the sizes of the time windows. The
constraint_j_* attributes correspond to the jth constraint (j = 0, 1).

For example::

    constraint_0_name
    constraint_0_power_limit_uw
    constraint_0_time_window_us
    constraint_1_name
    constraint_1_power_limit_uw
    constraint_1_time_window_us

Power Zone Attributes
=====================

Monitoring attributes
---------------------

energy_uj (rw)
    Current energy counter in micro-joules. Write "0" to reset.
    If the counter can not be reset, then this attribute is read-only.

max_energy_range_uj (ro)
    Range of the above energy counter in micro-joules.

power_uw (ro)
    Current power in micro-watts.

max_power_range_uw (ro)
    Range of the above power value in micro-watts.

name (ro)
    Name of this power zone.

It is possible that some domains have both power ranges and energy counter
ranges; however, only one is mandatory.

Constraints
-----------

constraint_X_power_limit_uw (rw)
    Power limit in micro-watts, which should be applicable for the
    time window specified by "constraint_X_time_window_us".

constraint_X_time_window_us (rw)
    Time window in micro-seconds.

constraint_X_name (ro)
    An optional name of the constraint.

constraint_X_max_power_uw (ro)
    Maximum allowed power in micro-watts.

constraint_X_min_power_uw (ro)
    Minimum allowed power in micro-watts.

constraint_X_max_time_window_us (ro)
    Maximum allowed time window in micro-seconds.

constraint_X_min_time_window_us (ro)
    Minimum allowed time window in micro-seconds.

All fields other than power_limit_uw and time_window_us are optional.

Common zone and control type attributes
---------------------------------------

enabled (rw)
    Enable/disable controls at the zone level or for all zones using
    a control type.

Power Cap Client Driver Interface
=================================

The API summary:

Call powercap_register_control_type() to register a control type object.
Call powercap_register_zone() to register a power zone (under a given
control type), either as a top-level power zone or as a subzone of another
power zone registered earlier.
The number of constraints in a power zone and the corresponding callbacks have
to be defined prior to calling powercap_register_zone() to register that zone.

To free a power zone, call powercap_unregister_zone().
To free a control type object, call powercap_unregister_control_type().
Detailed API documentation can be generated using kernel-doc on
include/linux/powercap.h.
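Because RAPL zones expose an energy counter but no power_uw attribute, userspace commonly derives average power from two energy_uj samples taken some time apart. A minimal sketch of that calculation (hypothetical helper; a real reader must also fetch max_energy_range_uj from sysfs to handle counter wraparound, as assumed here):

```c
#include <assert.h>

/*
 * Average power (µW) between two energy_uj samples taken dt_us apart.
 * If the counter wrapped, max_range_uj (from max_energy_range_uj) is
 * used to recover the true delta. Illustrative helper only.
 */
static long long avg_power_uw(long long e1_uj, long long e2_uj,
			      long long max_range_uj, long long dt_us)
{
	long long delta_uj = e2_uj - e1_uj;

	if (delta_uj < 0)		/* counter wrapped around */
		delta_uj += max_range_uj;
	/* µJ per µs is W, so scale by 1e6 to get µW */
	return delta_uj * 1000000 / dt_us;
}
```

For instance, a 1,000,000 µJ (1 J) increase over 1,000,000 µs (1 s) yields 1,000,000 µW (1 W).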
-236
Documentation/power/powercap/powercap.txt
+229
Documentation/power/regulator/consumer.rst
··· ===================================
Regulator Consumer Driver Interface
===================================

This text describes the regulator interface for consumer device drivers.
Please see overview.txt for a description of the terms used in this text.


1. Consumer Regulator Access (static & dynamic drivers)
=======================================================

A consumer driver can get access to its supply regulator by calling::

    regulator = regulator_get(dev, "Vcc");

The consumer passes in its struct device pointer and power supply ID. The core
then finds the correct regulator by consulting a machine specific lookup table.
If the lookup is successful then this call will return a pointer to the struct
regulator that supplies this consumer.

To release the regulator the consumer driver should call::

    regulator_put(regulator);

Consumers can be supplied by more than one regulator, e.g. a codec consumer
with analog and digital supplies::

    digital = regulator_get(dev, "Vcc");  /* digital core */
    analog = regulator_get(dev, "Avdd");  /* analog */

The regulator access functions regulator_get() and regulator_put() will
usually be called in your device driver's probe() and remove() respectively.


2. Regulator Output Enable & Disable (static & dynamic drivers)
===============================================================

A consumer can enable its power supply by calling::

    int regulator_enable(regulator);

NOTE:
The supply may already be enabled before regulator_enable() is called.
This may happen if the consumer shares the regulator or the regulator has been
previously enabled by bootloader or kernel board initialization code.

A consumer can determine if a regulator is enabled by calling::

    int regulator_is_enabled(regulator);

This will return > zero when the regulator is enabled.

A consumer can disable its supply when no longer needed by calling::

    int regulator_disable(regulator);

NOTE:
This may not disable the supply if it's shared with other consumers. The
regulator will only be disabled when the enabled reference count is zero.

Finally, a regulator can be forcefully disabled in the case of an emergency::

    int regulator_force_disable(regulator);

NOTE:
This will immediately and forcefully shut down the regulator output. All
consumers will be powered off.


3. Regulator Voltage Control & Status (dynamic drivers)
=======================================================

Some consumer drivers need to be able to dynamically change their supply
voltage to match system operating points, e.g. CPUfreq drivers can scale
voltage along with frequency to save power, SD drivers may need to select the
correct card voltage, etc.

Consumers can control their supply voltage by calling::

    int regulator_set_voltage(regulator, min_uV, max_uV);

where min_uV and max_uV are the minimum and maximum acceptable voltages in
microvolts.

NOTE: this can be called when the regulator is enabled or disabled. If called
when enabled, then the voltage changes instantly, otherwise the voltage
configuration changes and the voltage is physically set when the regulator is
next enabled.

The regulator's configured voltage output can be found by calling::

    int regulator_get_voltage(regulator);

NOTE:
get_voltage() will return the configured output voltage whether the
regulator is enabled or disabled and should NOT be used to determine regulator
output state.
However this can be used in conjunction with is_enabled() to
determine the regulator's physical output voltage.


4. Regulator Current Limit Control & Status (dynamic drivers)
=============================================================

Some consumer drivers need to be able to dynamically change their supply
current limit to match system operating points, e.g. an LCD backlight driver
can change the current limit to vary the backlight brightness, USB drivers may
want to set the limit to 500mA when supplying power.

Consumers can control their supply current limit by calling::

    int regulator_set_current_limit(regulator, min_uA, max_uA);

where min_uA and max_uA are the minimum and maximum acceptable current limits
in microamps.

NOTE:
this can be called when the regulator is enabled or disabled. If called
when enabled, then the current limit changes instantly, otherwise the current
limit configuration changes and the current limit is physically set when the
regulator is next enabled.

A regulator's current limit can be found by calling::

    int regulator_get_current_limit(regulator);

NOTE:
get_current_limit() will return the current limit whether the regulator
is enabled or disabled and should not be used to determine regulator current
load.


5. Regulator Operating Mode Control & Status (dynamic drivers)
==============================================================

Some consumers can further save system power by changing the operating mode of
their supply regulator to be more efficient when the consumer's operating
state changes, e.g. a consumer driver is idle and subsequently draws less
current.

Regulator operating mode can be changed indirectly or directly.

Indirect operating mode control.
--------------------------------

Consumer drivers can request a change in their supply regulator operating mode
by calling::

    int regulator_set_load(struct regulator *regulator, int load_uA);

This will cause the core to recalculate the total load on the regulator (based
on all its consumers) and change operating mode (if necessary and permitted)
to best match the current operating load.

The load_uA value can be determined from the consumer's datasheet, e.g. most
datasheets have tables showing the maximum current consumed in certain
situations.

Most consumers will use indirect operating mode control since they have no
knowledge of the regulator or whether the regulator is shared with other
consumers.

Direct operating mode control.
------------------------------

Bespoke or tightly coupled drivers may want to directly control regulator
operating mode depending on their operating point. This can be achieved by
calling::

    int regulator_set_mode(struct regulator *regulator, unsigned int mode);
    unsigned int regulator_get_mode(struct regulator *regulator);

Direct mode will only be used by consumers that *know* about the regulator and
are not sharing the regulator with other consumers.


6. Regulator Events
===================

Regulators can notify consumers of external events. Events could be received by
consumers under regulator stress or failure conditions.
181 + 182 + Consumers can register interest in regulator events by calling:: 183 + 184 + int regulator_register_notifier(struct regulator *regulator, 185 + struct notifier_block *nb); 186 + 187 + Consumers can unregister interest by calling:: 188 + 189 + int regulator_unregister_notifier(struct regulator *regulator, 190 + struct notifier_block *nb); 191 + 192 + Regulators use the kernel notifier framework to send events to their interested 193 + consumers. 194 + 195 + 7. Regulator Direct Register Access 196 + =================================== 197 + 198 + Some kinds of power management hardware or firmware are designed such that 199 + they need to do low-level hardware access to regulators, with no involvement 200 + from the kernel. Examples of such devices are: 201 + 202 + - clocksource with a voltage-controlled oscillator and control logic to change 203 + the supply voltage over I2C to achieve a desired output clock rate 204 + - thermal management firmware that can issue an arbitrary I2C transaction to 205 + perform system poweroff during overtemperature conditions 206 + 207 + To set up such a device/firmware, various parameters, like the I2C address of the 208 + regulator, addresses of various regulator registers, etc., need to be configured 209 + for it. The regulator framework provides the following helpers for querying 210 + these details. 211 + 212 + Bus-specific details, like I2C addresses or transfer rates, are handled by the 213 + regmap framework. 
To get the regulator's regmap (if supported), use:: 214 + 215 + struct regmap *regulator_get_regmap(struct regulator *regulator); 216 + 217 + To obtain the hardware register offset and bitmask for the regulator's voltage 218 + selector register, use:: 219 + 220 + int regulator_get_hardware_vsel_register(struct regulator *regulator, 221 + unsigned *vsel_reg, 222 + unsigned *vsel_mask); 223 + 224 + To convert a regulator framework voltage selector code (used by 225 + regulator_list_voltage) to a hardware-specific voltage selector that can be 226 + directly written to the voltage selector register, use:: 227 + 228 + int regulator_list_hardware_vsel(struct regulator *regulator, 229 + unsigned selector);
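The enable/disable-aware semantics documented in sections 3 and 4 (a voltage or current limit set while the regulator is disabled only takes physical effect at the next enable, and the getters report the configured value regardless of output state) can be sketched as a toy user-space model. All names here (`toy_regulator`, `toy_set_voltage`, ...) are invented for illustration; this is not kernel code:

```c
#include <assert.h>

/* Toy model of a consumer-visible regulator: illustration only. */
struct toy_regulator {
	int enabled;        /* is the output currently on? */
	int configured_uV;  /* voltage requested via the set call */
	int physical_uV;    /* voltage actually present at the output */
};

/* Models regulator_set_voltage(): the configuration always changes,
 * but the physical output only changes if the regulator is enabled. */
static void toy_set_voltage(struct toy_regulator *r, int uV)
{
	r->configured_uV = uV;
	if (r->enabled)
		r->physical_uV = uV;
}

/* Models regulator_enable(): the stored configuration is applied
 * when the regulator is next enabled. */
static void toy_enable(struct toy_regulator *r)
{
	r->enabled = 1;
	r->physical_uV = r->configured_uV;
}

/* Models regulator_get_voltage(): returns the configured voltage
 * whether or not the output is on - hence the NOTE that it must not
 * be used to determine output state. */
static int toy_get_voltage(const struct toy_regulator *r)
{
	return r->configured_uV;
}
```

As the text suggests for get_voltage()/is_enabled(), only the pair (`toy_get_voltage()`, `enabled`) tells you the physical output voltage.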
-218
Documentation/power/regulator/consumer.txt
··· 1 - Regulator Consumer Driver Interface 2 - =================================== 3 - 4 - This text describes the regulator interface for consumer device drivers. 5 - Please see overview.txt for a description of the terms used in this text. 6 - 7 - 8 - 1. Consumer Regulator Access (static & dynamic drivers) 9 - ======================================================= 10 - 11 - A consumer driver can get access to its supply regulator by calling :- 12 - 13 - regulator = regulator_get(dev, "Vcc"); 14 - 15 - The consumer passes in its struct device pointer and power supply ID. The core 16 - then finds the correct regulator by consulting a machine specific lookup table. 17 - If the lookup is successful then this call will return a pointer to the struct 18 - regulator that supplies this consumer. 19 - 20 - To release the regulator the consumer driver should call :- 21 - 22 - regulator_put(regulator); 23 - 24 - Consumers can be supplied by more than one regulator e.g. codec consumer with 25 - analog and digital supplies :- 26 - 27 - digital = regulator_get(dev, "Vcc"); /* digital core */ 28 - analog = regulator_get(dev, "Avdd"); /* analog */ 29 - 30 - The regulator access functions regulator_get() and regulator_put() will 31 - usually be called in your device drivers probe() and remove() respectively. 32 - 33 - 34 - 2. Regulator Output Enable & Disable (static & dynamic drivers) 35 - ==================================================================== 36 - 37 - A consumer can enable its power supply by calling:- 38 - 39 - int regulator_enable(regulator); 40 - 41 - NOTE: The supply may already be enabled before regulator_enabled() is called. 42 - This may happen if the consumer shares the regulator or the regulator has been 43 - previously enabled by bootloader or kernel board initialization code. 
44 - 45 - A consumer can determine if a regulator is enabled by calling :- 46 - 47 - int regulator_is_enabled(regulator); 48 - 49 - This will return > zero when the regulator is enabled. 50 - 51 - 52 - A consumer can disable its supply when no longer needed by calling :- 53 - 54 - int regulator_disable(regulator); 55 - 56 - NOTE: This may not disable the supply if it's shared with other consumers. The 57 - regulator will only be disabled when the enabled reference count is zero. 58 - 59 - Finally, a regulator can be forcefully disabled in the case of an emergency :- 60 - 61 - int regulator_force_disable(regulator); 62 - 63 - NOTE: this will immediately and forcefully shutdown the regulator output. All 64 - consumers will be powered off. 65 - 66 - 67 - 3. Regulator Voltage Control & Status (dynamic drivers) 68 - ====================================================== 69 - 70 - Some consumer drivers need to be able to dynamically change their supply 71 - voltage to match system operating points. e.g. CPUfreq drivers can scale 72 - voltage along with frequency to save power, SD drivers may need to select the 73 - correct card voltage, etc. 74 - 75 - Consumers can control their supply voltage by calling :- 76 - 77 - int regulator_set_voltage(regulator, min_uV, max_uV); 78 - 79 - Where min_uV and max_uV are the minimum and maximum acceptable voltages in 80 - microvolts. 81 - 82 - NOTE: this can be called when the regulator is enabled or disabled. If called 83 - when enabled, then the voltage changes instantly, otherwise the voltage 84 - configuration changes and the voltage is physically set when the regulator is 85 - next enabled. 86 - 87 - The regulators configured voltage output can be found by calling :- 88 - 89 - int regulator_get_voltage(regulator); 90 - 91 - NOTE: get_voltage() will return the configured output voltage whether the 92 - regulator is enabled or disabled and should NOT be used to determine regulator 93 - output state. 
However this can be used in conjunction with is_enabled() to 94 - determine the regulator physical output voltage. 95 - 96 - 97 - 4. Regulator Current Limit Control & Status (dynamic drivers) 98 - =========================================================== 99 - 100 - Some consumer drivers need to be able to dynamically change their supply 101 - current limit to match system operating points. e.g. LCD backlight driver can 102 - change the current limit to vary the backlight brightness, USB drivers may want 103 - to set the limit to 500mA when supplying power. 104 - 105 - Consumers can control their supply current limit by calling :- 106 - 107 - int regulator_set_current_limit(regulator, min_uA, max_uA); 108 - 109 - Where min_uA and max_uA are the minimum and maximum acceptable current limit in 110 - microamps. 111 - 112 - NOTE: this can be called when the regulator is enabled or disabled. If called 113 - when enabled, then the current limit changes instantly, otherwise the current 114 - limit configuration changes and the current limit is physically set when the 115 - regulator is next enabled. 116 - 117 - A regulators current limit can be found by calling :- 118 - 119 - int regulator_get_current_limit(regulator); 120 - 121 - NOTE: get_current_limit() will return the current limit whether the regulator 122 - is enabled or disabled and should not be used to determine regulator current 123 - load. 124 - 125 - 126 - 5. Regulator Operating Mode Control & Status (dynamic drivers) 127 - ============================================================= 128 - 129 - Some consumers can further save system power by changing the operating mode of 130 - their supply regulator to be more efficient when the consumers operating state 131 - changes. e.g. consumer driver is idle and subsequently draws less current 132 - 133 - Regulator operating mode can be changed indirectly or directly. 134 - 135 - Indirect operating mode control. 
136 - -------------------------------- 137 - Consumer drivers can request a change in their supply regulator operating mode 138 - by calling :- 139 - 140 - int regulator_set_load(struct regulator *regulator, int load_uA); 141 - 142 - This will cause the core to recalculate the total load on the regulator (based 143 - on all its consumers) and change operating mode (if necessary and permitted) 144 - to best match the current operating load. 145 - 146 - The load_uA value can be determined from the consumer's datasheet. e.g. most 147 - datasheets have tables showing the maximum current consumed in certain 148 - situations. 149 - 150 - Most consumers will use indirect operating mode control since they have no 151 - knowledge of the regulator or whether the regulator is shared with other 152 - consumers. 153 - 154 - Direct operating mode control. 155 - ------------------------------ 156 - Bespoke or tightly coupled drivers may want to directly control regulator 157 - operating mode depending on their operating point. This can be achieved by 158 - calling :- 159 - 160 - int regulator_set_mode(struct regulator *regulator, unsigned int mode); 161 - unsigned int regulator_get_mode(struct regulator *regulator); 162 - 163 - Direct mode will only be used by consumers that *know* about the regulator and 164 - are not sharing the regulator with other consumers. 165 - 166 - 167 - 6. Regulator Events 168 - =================== 169 - Regulators can notify consumers of external events. Events could be received by 170 - consumers under regulator stress or failure conditions. 
171 - 172 - Consumers can register interest in regulator events by calling :- 173 - 174 - int regulator_register_notifier(struct regulator *regulator, 175 - struct notifier_block *nb); 176 - 177 - Consumers can unregister interest by calling :- 178 - 179 - int regulator_unregister_notifier(struct regulator *regulator, 180 - struct notifier_block *nb); 181 - 182 - Regulators use the kernel notifier framework to send event to their interested 183 - consumers. 184 - 185 - 7. Regulator Direct Register Access 186 - =================================== 187 - Some kinds of power management hardware or firmware are designed such that 188 - they need to do low-level hardware access to regulators, with no involvement 189 - from the kernel. Examples of such devices are: 190 - 191 - - clocksource with a voltage-controlled oscillator and control logic to change 192 - the supply voltage over I2C to achieve a desired output clock rate 193 - - thermal management firmware that can issue an arbitrary I2C transaction to 194 - perform system poweroff during overtemperature conditions 195 - 196 - To set up such a device/firmware, various parameters like I2C address of the 197 - regulator, addresses of various regulator registers etc. need to be configured 198 - to it. The regulator framework provides the following helpers for querying 199 - these details. 200 - 201 - Bus-specific details, like I2C addresses or transfer rates are handled by the 202 - regmap framework. 
To get the regulator's regmap (if supported), use :- 203 - 204 - struct regmap *regulator_get_regmap(struct regulator *regulator); 205 - 206 - To obtain the hardware register offset and bitmask for the regulator's voltage 207 - selector register, use :- 208 - 209 - int regulator_get_hardware_vsel_register(struct regulator *regulator, 210 - unsigned *vsel_reg, 211 - unsigned *vsel_mask); 212 - 213 - To convert a regulator framework voltage selector code (used by 214 - regulator_list_voltage) to a hardware-specific voltage selector that can be 215 - directly written to the voltage selector register, use :- 216 - 217 - int regulator_list_hardware_vsel(struct regulator *regulator, 218 - unsigned selector);
+38
Documentation/power/regulator/design.rst
··· 1 + ========================== 2 + Regulator API design notes 3 + ========================== 4 + 5 + This document provides a brief, partially structured, overview of some 6 + of the design considerations which impact the regulator API design. 7 + 8 + Safety 9 + ------ 10 + 11 + - Errors in regulator configuration can have very serious consequences 12 + for the system, potentially including lasting hardware damage. 13 + - It is not possible to automatically determine the power configuration 14 + of the system - software-equivalent variants of the same chip may 15 + have different power requirements, and not all components with power 16 + requirements are visible to software. 17 + 18 + .. note:: 19 + 20 + The API should make no changes to the hardware state unless it has 21 + specific knowledge that these changes are safe to perform on this 22 + particular system. 23 + 24 + Consumer use cases 25 + ------------------ 26 + 27 + - The overwhelming majority of devices in a system will have no 28 + requirement to do any runtime configuration of their power beyond 29 + being able to turn it on or off. 30 + 31 + - Many of the power supplies in the system will be shared between many 32 + different consumers. 33 + 34 + .. note:: 35 + 36 + The consumer API should be structured so that these use cases are 37 + very easy to handle and so that consumers will work with shared 38 + supplies without any additional effort.
-33
Documentation/power/regulator/design.txt
··· 1 - Regulator API design notes 2 - ========================== 3 - 4 - This document provides a brief, partially structured, overview of some 5 - of the design considerations which impact the regulator API design. 6 - 7 - Safety 8 - ------ 9 - 10 - - Errors in regulator configuration can have very serious consequences 11 - for the system, potentially including lasting hardware damage. 12 - - It is not possible to automatically determine the power configuration 13 - of the system - software-equivalent variants of the same chip may 14 - have different power requirements, and not all components with power 15 - requirements are visible to software. 16 - 17 - => The API should make no changes to the hardware state unless it has 18 - specific knowledge that these changes are safe to perform on this 19 - particular system. 20 - 21 - Consumer use cases 22 - ------------------ 23 - 24 - - The overwhelming majority of devices in a system will have no 25 - requirement to do any runtime configuration of their power beyond 26 - being able to turn it on or off. 27 - 28 - - Many of the power supplies in the system will be shared between many 29 - different consumers. 30 - 31 - => The consumer API should be structured so that these use cases are 32 - very easy to handle and so that consumers will work with shared 33 - supplies without any additional effort.
+97
Documentation/power/regulator/machine.rst
··· 1 + ================================== 2 + Regulator Machine Driver Interface 3 + ================================== 4 + 5 + The regulator machine driver interface is intended for board/machine specific 6 + initialisation code to configure the regulator subsystem. 7 + 8 + Consider the following machine:: 9 + 10 + Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] 11 + | 12 + +-> [Consumer B @ 3.3V] 13 + 14 + The drivers for consumers A & B must be mapped to the correct regulator in 15 + order to control their power supplies. This mapping can be achieved in machine 16 + initialisation code by creating a struct regulator_consumer_supply for 17 + each regulator:: 18 + 19 + struct regulator_consumer_supply { 20 + const char *dev_name; /* consumer dev_name() */ 21 + const char *supply; /* consumer supply - e.g. "vcc" */ 22 + }; 23 + 24 + e.g. for the machine above:: 25 + 26 + static struct regulator_consumer_supply regulator1_consumers[] = { 27 + REGULATOR_SUPPLY("Vcc", "consumer B"), 28 + }; 29 + 30 + static struct regulator_consumer_supply regulator2_consumers[] = { 31 + REGULATOR_SUPPLY("Vcc", "consumer A"), 32 + }; 33 + 34 + This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 35 + to the 'Vcc' supply for Consumer A. 36 + 37 + Constraints can now be registered by defining a struct regulator_init_data 38 + for each regulator power domain. 
This structure also maps the consumers 39 + to their supply regulators:: 40 + 41 + static struct regulator_init_data regulator1_data = { 42 + .constraints = { 43 + .name = "Regulator-1", 44 + .min_uV = 3300000, 45 + .max_uV = 3300000, 46 + .valid_modes_mask = REGULATOR_MODE_NORMAL, 47 + }, 48 + .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), 49 + .consumer_supplies = regulator1_consumers, 50 + }; 51 + 52 + The name field should be set to something that is usefully descriptive 53 + for the board, both for configuration of supplies for other regulators and 54 + for use in logging and other diagnostic output. Normally the name 55 + used for the supply rail in the schematic is a good choice. If no 56 + name is provided then the subsystem will choose one. 57 + 58 + Regulator-1 supplies power to Regulator-2. This relationship must be registered 59 + with the core so that Regulator-1 is also enabled when Consumer A enables its 60 + supply (Regulator-2). The supply regulator is set by the supply_regulator 61 + field below:: 62 + 63 + static struct regulator_init_data regulator2_data = { 64 + .supply_regulator = "Regulator-1", 65 + .constraints = { 66 + .min_uV = 1800000, 67 + .max_uV = 2000000, 68 + .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, 69 + .valid_modes_mask = REGULATOR_MODE_NORMAL, 70 + }, 71 + .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), 72 + .consumer_supplies = regulator2_consumers, 73 + }; 74 + 75 + Finally, the regulator devices must be registered in the usual manner:: 76 + 77 + static struct platform_device regulator_devices[] = { 78 + { 79 + .name = "regulator", 80 + .id = DCDC_1, 81 + .dev = { 82 + .platform_data = &regulator1_data, 83 + }, 84 + }, 85 + { 86 + .name = "regulator", 87 + .id = DCDC_2, 88 + .dev = { 89 + .platform_data = &regulator2_data, 90 + }, 91 + }, 92 + }; 93 + /* register regulator 1 device */ 94 + platform_device_register(&regulator_devices[0]); 95 + 96 + /* register regulator 2 device */ 97 + 
platform_device_register(&regulator_devices[1]);
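The supply relationship configured above via `.supply_regulator` means that enabling Regulator-2 implicitly enables Regulator-1, and a shared regulator only turns off when its last user releases it. A toy user-space sketch of that reference-counted chaining (invented names, not the kernel's implementation):

```c
#include <stddef.h>
#include <assert.h>

/* Toy model of the supply relationship set up by supply_regulator. */
struct toy_reg {
	const char *name;
	struct toy_reg *supply;  /* from .supply_regulator, or NULL */
	int use_count;           /* enable reference count */
};

/* Enabling a regulator first enables whatever supplies it, so
 * "Regulator-1 is also enabled when Consumer A enables Regulator-2". */
static void toy_reg_enable(struct toy_reg *r)
{
	if (r->supply)
		toy_reg_enable(r->supply);
	r->use_count++;
}

/* Disabling drops one reference at each level; the output really
 * turns off only when a regulator's count reaches zero. */
static void toy_reg_disable(struct toy_reg *r)
{
	r->use_count--;
	if (r->supply)
		toy_reg_disable(r->supply);
}
```

With `r2.supply = &r1`, disabling r2 still leaves r1 on if Consumer B holds its own reference, matching the shared-supply behaviour described in consumer.rst.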
-96
Documentation/power/regulator/machine.txt
··· 1 - Regulator Machine Driver Interface 2 - =================================== 3 - 4 - The regulator machine driver interface is intended for board/machine specific 5 - initialisation code to configure the regulator subsystem. 6 - 7 - Consider the following machine :- 8 - 9 - Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] 10 - | 11 - +-> [Consumer B @ 3.3V] 12 - 13 - The drivers for consumers A & B must be mapped to the correct regulator in 14 - order to control their power supplies. This mapping can be achieved in machine 15 - initialisation code by creating a struct regulator_consumer_supply for 16 - each regulator. 17 - 18 - struct regulator_consumer_supply { 19 - const char *dev_name; /* consumer dev_name() */ 20 - const char *supply; /* consumer supply - e.g. "vcc" */ 21 - }; 22 - 23 - e.g. for the machine above 24 - 25 - static struct regulator_consumer_supply regulator1_consumers[] = { 26 - REGULATOR_SUPPLY("Vcc", "consumer B"), 27 - }; 28 - 29 - static struct regulator_consumer_supply regulator2_consumers[] = { 30 - REGULATOR_SUPPLY("Vcc", "consumer A"), 31 - }; 32 - 33 - This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 34 - to the 'Vcc' supply for Consumer A. 35 - 36 - Constraints can now be registered by defining a struct regulator_init_data 37 - for each regulator power domain. This structure also maps the consumers 38 - to their supply regulators :- 39 - 40 - static struct regulator_init_data regulator1_data = { 41 - .constraints = { 42 - .name = "Regulator-1", 43 - .min_uV = 3300000, 44 - .max_uV = 3300000, 45 - .valid_modes_mask = REGULATOR_MODE_NORMAL, 46 - }, 47 - .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), 48 - .consumer_supplies = regulator1_consumers, 49 - }; 50 - 51 - The name field should be set to something that is usefully descriptive 52 - for the board for configuration of supplies for other regulators and 53 - for use in logging and other diagnostic output. 
Normally the name 54 - used for the supply rail in the schematic is a good choice. If no 55 - name is provided then the subsystem will choose one. 56 - 57 - Regulator-1 supplies power to Regulator-2. This relationship must be registered 58 - with the core so that Regulator-1 is also enabled when Consumer A enables its 59 - supply (Regulator-2). The supply regulator is set by the supply_regulator 60 - field below and co:- 61 - 62 - static struct regulator_init_data regulator2_data = { 63 - .supply_regulator = "Regulator-1", 64 - .constraints = { 65 - .min_uV = 1800000, 66 - .max_uV = 2000000, 67 - .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, 68 - .valid_modes_mask = REGULATOR_MODE_NORMAL, 69 - }, 70 - .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), 71 - .consumer_supplies = regulator2_consumers, 72 - }; 73 - 74 - Finally the regulator devices must be registered in the usual manner. 75 - 76 - static struct platform_device regulator_devices[] = { 77 - { 78 - .name = "regulator", 79 - .id = DCDC_1, 80 - .dev = { 81 - .platform_data = &regulator1_data, 82 - }, 83 - }, 84 - { 85 - .name = "regulator", 86 - .id = DCDC_2, 87 - .dev = { 88 - .platform_data = &regulator2_data, 89 - }, 90 - }, 91 - }; 92 - /* register regulator 1 device */ 93 - platform_device_register(&regulator_devices[0]); 94 - 95 - /* register regulator 2 device */ 96 - platform_device_register(&regulator_devices[1]);
+178
Documentation/power/regulator/overview.rst
··· 1 + ============================================= 2 + Linux voltage and current regulator framework 3 + ============================================= 4 + 5 + About 6 + ===== 7 + 8 + This framework is designed to provide a standard kernel interface to control 9 + voltage and current regulators. 10 + 11 + The intention is to allow systems to dynamically control regulator power output 12 + in order to save power and prolong battery life. This applies to both voltage 13 + regulators (where voltage output is controllable) and current sinks (where 14 + current limit is controllable). 15 + 16 + (C) 2008 Wolfson Microelectronics PLC. 17 + 18 + Author: Liam Girdwood <lrg@slimlogic.co.uk> 19 + 20 + 21 + Nomenclature 22 + ============ 23 + 24 + Some terms used in this document: 25 + 26 + - Regulator 27 + - Electronic device that supplies power to other devices. 28 + Most regulators can enable and disable their output while 29 + some can control their output voltage and/or current. 30 + 31 + Input Voltage -> Regulator -> Output Voltage 32 + 33 + 34 + - PMIC 35 + - Power Management IC. An IC that contains numerous 36 + regulators and often contains other subsystems. 37 + 38 + 39 + - Consumer 40 + - Electronic device that is supplied power by a regulator. 41 + Consumers can be classified into two types: 42 + 43 + Static: consumer does not change its supply voltage or 44 + current limit. It only needs to enable or disable its 45 + power supply. Its supply voltage is set by the hardware, 46 + bootloader, firmware or kernel board initialisation code. 47 + 48 + Dynamic: consumer needs to change its supply voltage or 49 + current limit to meet operation demands. 50 + 51 + 52 + - Power Domain 53 + - Electronic circuit that is supplied its input power by the 54 + output power of a regulator, switch or by another power 55 + domain. 56 + 57 + The supply regulator may be behind a switch(s). 
i.e.:: 58 + 59 + Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] 60 + | | 61 + | +-> [Consumer B], [Consumer C] 62 + | 63 + +-> [Consumer D], [Consumer E] 64 + 65 + That is one regulator and three power domains: 66 + 67 + - Domain 1: Switch-1, Consumers D & E. 68 + - Domain 2: Switch-2, Consumers B & C. 69 + - Domain 3: Consumer A. 70 + 71 + and this represents a "supplies" relationship: 72 + 73 + Domain-1 --> Domain-2 --> Domain-3. 74 + 75 + A power domain may have regulators that are supplied power 76 + by other regulators. i.e.:: 77 + 78 + Regulator-1 -+-> Regulator-2 -+-> [Consumer A] 79 + | 80 + +-> [Consumer B] 81 + 82 + This gives us two regulators and two power domains: 83 + 84 + - Domain 1: Regulator-2, Consumer B. 85 + - Domain 2: Consumer A. 86 + 87 + and a "supplies" relationship: 88 + 89 + Domain-1 --> Domain-2 90 + 91 + 92 + - Constraints 93 + - Constraints are used to define power levels for performance 94 + and hardware protection. Constraints exist at three levels: 95 + 96 + Regulator Level: This is defined by the regulator hardware 97 + operating parameters and is specified in the regulator 98 + datasheet. i.e. 99 + 100 + - voltage output is in the range 800mV -> 3500mV. 101 + - regulator current output limit is 20mA @ 5V but is 102 + 10mA @ 10V. 103 + 104 + Power Domain Level: This is defined in software by kernel 105 + level board initialisation code. It is used to constrain a 106 + power domain to a particular power range. i.e. 107 + 108 + - Domain-1 voltage is 3300mV 109 + - Domain-2 voltage is 1400mV -> 1600mV 110 + - Domain-3 current limit is 0mA -> 20mA. 111 + 112 + Consumer Level: This is defined by consumer drivers 113 + dynamically setting voltage or current limit levels. 114 + 115 + e.g. a consumer backlight driver asks for a current increase 116 + from 5mA to 10mA to increase LCD illumination. This passes 117 + through the levels as follows: 118 + 119 + Consumer: need to increase LCD brightness. 
Lookup and 120 + request next current mA value in brightness table (the 121 + consumer driver could be used on several different 122 + personalities based upon the same reference device). 123 + 124 + Power Domain: is the new current limit within the domain 125 + operating limits for this domain and system state (e.g. 126 + battery power, USB power) 127 + 128 + Regulator Domains: is the new current limit within the 129 + regulator operating parameters for input/output voltage. 130 + 131 + If the regulator request passes all the constraint tests 132 + then the new regulator value is applied. 133 + 134 + 135 + Design 136 + ====== 137 + 138 + The framework is designed and targeted at SoC based devices but may also be 139 + relevant to non SoC devices and is split into the following four interfaces:- 140 + 141 + 142 + 1. Consumer driver interface. 143 + 144 + This uses a similar API to the kernel clock interface in that consumer 145 + drivers can get and put a regulator (like they can with clocks atm) and 146 + get/set voltage, current limit, mode, enable and disable. This should 147 + allow consumers complete control over their supply voltage and current 148 + limit. This also compiles out if not in use so drivers can be reused in 149 + systems with no regulator based power control. 150 + 151 + See Documentation/power/regulator/consumer.rst 152 + 153 + 2. Regulator driver interface. 154 + 155 + This allows regulator drivers to register their regulators and provide 156 + operations to the core. It also has a notifier call chain for propagating 157 + regulator events to clients. 158 + 159 + See Documentation/power/regulator/regulator.rst 160 + 161 + 3. Machine interface. 162 + 163 + This interface is for machine specific code and allows the creation of 164 + voltage/current domains (with constraints) for each regulator. It can 165 + provide regulator constraints that will prevent device damage through 166 + overvoltage or overcurrent caused by buggy client drivers. 
It also 167 + allows the creation of a regulator tree whereby some regulators are 168 + supplied by others (similar to a clock tree). 169 + 170 + See Documentation/power/regulator/machine.rst 171 + 172 + 4. Userspace ABI. 173 + 174 + The framework also exports a lot of useful voltage/current/opmode data to 175 + userspace via sysfs. This could be used to help monitor device power 176 + consumption and status. 177 + 178 + See Documentation/ABI/testing/sysfs-class-regulator
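The constraint cascade described in the Nomenclature section (a consumer request is checked against power-domain limits and then regulator datasheet limits, and "the new regulator value is applied" only if every test passes) can be sketched as a toy check, with invented names and not the framework's actual code:

```c
#include <assert.h>

/* Toy sketch of the constraint levels: illustration only. */
struct toy_range { int min_uA, max_uA; };

static int in_range(struct toy_range c, int uA)
{
	return uA >= c.min_uA && uA <= c.max_uA;
}

/* A consumer request (e.g. the backlight asking for 10mA) is checked
 * against the power-domain constraints and then the regulator's
 * datasheet limits; it is applied only if all tests pass. */
static int toy_request_current(struct toy_range domain,
			       struct toy_range datasheet,
			       int requested_uA, int *applied_uA)
{
	if (!in_range(domain, requested_uA))
		return -1;	/* outside domain operating limits */
	if (!in_range(datasheet, requested_uA))
		return -1;	/* outside regulator datasheet limits */
	*applied_uA = requested_uA;
	return 0;
}
```

With Domain-3's 0mA -> 20mA limit from the example above, a 10mA request is applied while a 25mA request is rejected and the previous value stands.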
-171
Documentation/power/regulator/overview.txt
··· 1 - Linux voltage and current regulator framework 2 - ============================================= 3 - 4 - About 5 - ===== 6 - 7 - This framework is designed to provide a standard kernel interface to control 8 - voltage and current regulators. 9 - 10 - The intention is to allow systems to dynamically control regulator power output 11 - in order to save power and prolong battery life. This applies to both voltage 12 - regulators (where voltage output is controllable) and current sinks (where 13 - current limit is controllable). 14 - 15 - (C) 2008 Wolfson Microelectronics PLC. 16 - Author: Liam Girdwood <lrg@slimlogic.co.uk> 17 - 18 - 19 - Nomenclature 20 - ============ 21 - 22 - Some terms used in this document:- 23 - 24 - o Regulator - Electronic device that supplies power to other devices. 25 - Most regulators can enable and disable their output while 26 - some can control their output voltage and or current. 27 - 28 - Input Voltage -> Regulator -> Output Voltage 29 - 30 - 31 - o PMIC - Power Management IC. An IC that contains numerous regulators 32 - and often contains other subsystems. 33 - 34 - 35 - o Consumer - Electronic device that is supplied power by a regulator. 36 - Consumers can be classified into two types:- 37 - 38 - Static: consumer does not change its supply voltage or 39 - current limit. It only needs to enable or disable its 40 - power supply. Its supply voltage is set by the hardware, 41 - bootloader, firmware or kernel board initialisation code. 42 - 43 - Dynamic: consumer needs to change its supply voltage or 44 - current limit to meet operation demands. 45 - 46 - 47 - o Power Domain - Electronic circuit that is supplied its input power by the 48 - output power of a regulator, switch or by another power 49 - domain. 50 - 51 - The supply regulator may be behind a switch(s). i.e. 
    Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A]
               |             |
               |             +-> [Consumer B], [Consumer C]
               |
               +-> [Consumer D], [Consumer E]

  That is one regulator and three power domains:

  Domain 1: Switch-1, Consumers D & E.
  Domain 2: Switch-2, Consumers B & C.
  Domain 3: Consumer A.

  and this represents a "supplies" relationship:

  Domain-1 --> Domain-2 --> Domain-3.

  A power domain may have regulators that are supplied power
  by other regulators. i.e.

    Regulator-1 -+-> Regulator-2 -+-> [Consumer A]
                 |
                 +-> [Consumer B]

  This gives us two regulators and two power domains:

  Domain 1: Regulator-2, Consumer B.
  Domain 2: Consumer A.

  and a "supplies" relationship:

  Domain-1 --> Domain-2


o Constraints - Constraints are used to define power levels for performance
  and hardware protection. Constraints exist at three levels:

  Regulator Level: This is defined by the regulator hardware operating
  parameters and is specified in the regulator datasheet. i.e.

    - voltage output is in the range 800mV -> 3500mV.
    - regulator current output limit is 20mA @ 5V but is 10mA @ 10V.

  Power Domain Level: This is defined in software by kernel level board
  initialisation code. It is used to constrain a power domain to a
  particular power range. i.e.

    - Domain-1 voltage is 3300mV
    - Domain-2 voltage is 1400mV -> 1600mV
    - Domain-3 current limit is 0mA -> 20mA.

  Consumer Level: This is defined by consumer drivers dynamically setting
  voltage or current limit levels.

  e.g. a consumer backlight driver asks for a current increase from 5mA to
  10mA to increase LCD illumination. This passes through the levels as
  follows :-

  Consumer: need to increase LCD brightness. Lookup and request next
  current mA value in brightness table (the consumer driver could be used
  on several different personalities based upon the same reference device).

  Power Domain: is the new current limit within the domain operating limits
  for this domain and system state (e.g. battery power, USB power)?

  Regulator Level: is the new current limit within the regulator operating
  parameters for input/output voltage?

  If the regulator request passes all the constraint tests then the new
  regulator value is applied.


Design
======

The framework is designed and targeted at SoC-based devices but may also be
relevant to non-SoC devices. It is split into the following four interfaces :-


1. Consumer driver interface.

   This uses a similar API to the kernel clock interface in that consumer
   drivers can get and put a regulator (as they can with clocks) and
   get/set voltage, current limit, mode, enable and disable. This should
   allow consumers complete control over their supply voltage and current
   limit. It also compiles out if not in use, so drivers can be reused in
   systems with no regulator-based power control.

   See Documentation/power/regulator/consumer.txt

2. Regulator driver interface.

   This allows regulator drivers to register their regulators and provide
   operations to the core. It also has a notifier call chain for
   propagating regulator events to clients.

   See Documentation/power/regulator/regulator.txt

3. Machine interface.

   This interface is for machine-specific code and allows the creation of
   voltage/current domains (with constraints) for each regulator. It can
   provide regulator constraints that will prevent device damage through
   overvoltage or overcurrent caused by buggy client drivers. It also
   allows the creation of a regulator tree whereby some regulators are
   supplied by others (similar to a clock tree).

   See Documentation/power/regulator/machine.txt

4. Userspace ABI.

   The framework also exports a lot of useful voltage/current/opmode data
   to userspace via sysfs. This could be used to help monitor device power
   consumption and status.

   See Documentation/ABI/testing/sysfs-class-regulator
Documentation/power/regulator/regulator.rst

==========================
Regulator Driver Interface
==========================

The regulator driver interface is relatively simple and designed to allow
regulator drivers to register their services with the core framework.


Registration
============

Drivers can register a regulator by calling::

  struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc,
                                           const struct regulator_config *config);

This will register the regulator's capabilities and operations to the
regulator core.

Regulators can be unregistered by calling::

  void regulator_unregister(struct regulator_dev *rdev);


Regulator Events
================

Regulators can send events (e.g. overtemperature, undervoltage, etc) to
consumer drivers by calling::

  int regulator_notifier_call_chain(struct regulator_dev *rdev,
                                    unsigned long event, void *data);
Documentation/power/regulator/regulator.txt
Documentation/power/runtime_pm.rst
==================================================
Runtime Power Management Framework for I/O Devices
==================================================

(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

(C) 2010 Alan Stern <stern@rowland.harvard.edu>

(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

1. Introduction
===============

Support for runtime power management (runtime PM) of I/O devices is provided
at the power management core (PM core) level by means of:

* The power management workqueue pm_wq in which bus types and device drivers
  can put their PM-related work items.  It is strongly recommended that pm_wq
  be used for queuing all work items related to runtime PM, because this
  allows them to be synchronized with system-wide power transitions (suspend
  to RAM, hibernation and resume from system sleep states).  pm_wq is declared
  in include/linux/pm_runtime.h and defined in kernel/power/main.c.

* A number of runtime PM fields in the 'power' member of 'struct device'
  (which is of the type 'struct dev_pm_info', defined in include/linux/pm.h)
  that can be used for synchronizing runtime PM operations with one another.

* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in
  include/linux/pm.h).

* A set of helper functions defined in drivers/base/power/runtime.c that can
  be used for carrying out runtime PM operations in such a way that the
  synchronization between them is taken care of by the PM core.  Bus types and
  device drivers are encouraged to use these functions.

The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM
fields of 'struct dev_pm_info' and the core helper functions provided for
runtime PM are described below.

2. Device Runtime PM Callbacks
==============================

There are three device runtime PM callbacks defined in 'struct dev_pm_ops'::

  struct dev_pm_ops {
        ...
        int (*runtime_suspend)(struct device *dev);
        int (*runtime_resume)(struct device *dev);
        int (*runtime_idle)(struct device *dev);
        ...
  };

The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
are executed by the PM core for the device's subsystem, which may be any of
the following:

1. PM domain of the device, if the device's PM domain object, dev->pm_domain,
   is present.

2. Device type of the device, if both dev->type and dev->type->pm are present.

3. Device class of the device, if both dev->class and dev->class->pm are
   present.

4. Bus type of the device, if both dev->bus and dev->bus->pm are present.

If the subsystem chosen by applying the above rules doesn't provide the
relevant callback, the PM core will invoke the corresponding driver callback
stored in dev->driver->pm directly (if present).

The PM core always checks which callback to use in the order given above, so
the priority order of callbacks from high to low is: PM domain, device type,
class and bus type.  Moreover, the higher-priority one will always take
precedence over a lower-priority one.  The PM domain, bus type, device type
and class callbacks are referred to as subsystem-level callbacks in what
follows.

By default, the callbacks are always invoked in process context with
interrupts enabled.  However, the pm_runtime_irq_safe() helper function can be
used to tell the PM core that it is safe to run the ->runtime_suspend(),
->runtime_resume() and ->runtime_idle() callbacks for the given device in
atomic context with interrupts disabled.
This implies that the callback routines in question must not block or sleep,
but it also means that the synchronous helper functions listed at the end of
Section 4 may be used for that device within an interrupt handler or generally
in an atomic context.

The subsystem-level suspend callback, if present, is **entirely responsible**
for handling the suspend of the device as appropriate, which may, but need
not, include executing the device driver's own ->runtime_suspend() callback
(from the PM core's point of view it is not necessary to implement a
->runtime_suspend() callback in a device driver as long as the subsystem-level
suspend callback knows what to do to handle the device).

* Once the subsystem-level suspend callback (or the driver suspend callback,
  if invoked directly) has completed successfully for the given device, the PM
  core regards the device as suspended, which need not mean that it has been
  put into a low power state.  It is supposed to mean, however, that the
  device will not process data and will not communicate with the CPU(s) and
  RAM until the appropriate resume callback is executed for it.  The runtime
  PM status of a device after successful execution of the suspend callback is
  'suspended'.

* If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
  status remains 'active', which means that the device _must_ be fully
  operational afterwards.

* If the suspend callback returns an error code different from -EBUSY and
  -EAGAIN, the PM core regards this as a fatal error and will refuse to run
  the helper functions described in Section 4 for the device until its status
  is directly set to either 'active' or 'suspended' (the PM core provides
  special helper functions for this purpose).

In particular, if the driver requires remote wakeup capability (i.e. a
hardware mechanism allowing the device to request a change of its power state,
such as PCI PME) for proper functioning and device_can_wakeup() returns
'false' for the device, then ->runtime_suspend() should return -EBUSY.  On the
other hand, if device_can_wakeup() returns 'true' for the device and the
device is put into a low-power state during the execution of the suspend
callback, it is expected that remote wakeup will be enabled for the device.
Generally, remote wakeup should be enabled for all input devices put into
low-power states at run time.

The subsystem-level resume callback, if present, is **entirely responsible**
for handling the resume of the device as appropriate, which may, but need not,
include executing the device driver's own ->runtime_resume() callback (from
the PM core's point of view it is not necessary to implement a
->runtime_resume() callback in a device driver as long as the subsystem-level
resume callback knows what to do to handle the device).

* Once the subsystem-level resume callback (or the driver resume callback, if
  invoked directly) has completed successfully, the PM core regards the device
  as fully operational, which means that the device _must_ be able to complete
  I/O operations as needed.  The runtime PM status of the device is then
  'active'.

* If the resume callback returns an error code, the PM core regards this as a
  fatal error and will refuse to run the helper functions described in Section
  4 for the device, until its status is directly set to either 'active' or
  'suspended' (by means of special helper functions provided by the PM core
  for this purpose).
The idle callback (a subsystem-level one, if present, or the driver one) is
executed by the PM core whenever the device appears to be idle, which is
indicated to the PM core by two counters, the device's usage counter and the
counter of 'active' children of the device.

* If any of these counters is decreased using a helper function provided by
  the PM core and it turns out to be equal to zero, the other counter is
  checked.  If that counter also is equal to zero, the PM core executes the
  idle callback with the device as its argument.

The action performed by the idle callback is totally dependent on the
subsystem (or driver) in question, but the expected and recommended action is
to check if the device can be suspended (i.e. if all of the conditions
necessary for suspending the device are satisfied) and to queue up a suspend
request for the device in that case.  If there is no idle callback, or if the
callback returns 0, then the PM core will attempt to carry out a runtime
suspend of the device, also respecting devices configured for autosuspend.  In
essence this means a call to pm_runtime_autosuspend() (note that drivers need
to update the device's last-busy mark, via pm_runtime_mark_last_busy(), to
control the delay in this case).  To prevent this (for example, if the
callback routine has started a delayed suspend), the routine must return a
non-zero value.  Negative error return codes are ignored by the PM core.

The helper functions provided by the PM core, described in Section 4,
guarantee that the following constraints are met with respect to runtime PM
callbacks for one device:

(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
    ->runtime_suspend() in parallel with ->runtime_resume() or with another
    instance of ->runtime_suspend() for the same device) with the exception
    that ->runtime_suspend() or ->runtime_resume() can be executed in
    parallel with ->runtime_idle() (although ->runtime_idle() will not be
    started while any of the other callbacks is being executed for the same
    device).

(2) ->runtime_idle() and ->runtime_suspend() can only be executed for
    'active' devices (i.e. the PM core will only execute ->runtime_idle() or
    ->runtime_suspend() for the devices whose runtime PM status is 'active').

(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a
    device whose usage counter is equal to zero _and_ either whose counter of
    'active' children is equal to zero, or whose 'power.ignore_children' flag
    is set.

(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the
    PM core will only execute ->runtime_resume() for the devices whose
    runtime PM status is 'suspended').

Additionally, the helper functions provided by the PM core obey the following
rules:

* If ->runtime_suspend() is about to be executed or there's a pending request
  to execute it, ->runtime_idle() will not be executed for the same device.

* A request to execute or to schedule the execution of ->runtime_suspend()
  will cancel any pending requests to execute ->runtime_idle() for the same
  device.

* If ->runtime_resume() is about to be executed or there's a pending request
  to execute it, the other callbacks will not be executed for the same device.

* A request to execute ->runtime_resume() will cancel any pending or
  scheduled requests to execute the other callbacks for the same device,
  except for scheduled autosuspends.
3. Runtime PM Device Fields
===========================

The following device runtime PM fields are present in 'struct dev_pm_info', as
defined in include/linux/pm.h:

`struct timer_list suspend_timer;`
  - timer used for scheduling (delayed) suspend and autosuspend requests

`unsigned long timer_expires;`
  - timer expiration time, in jiffies (if this is different from zero, the
    timer is running and will expire at that time, otherwise the timer is not
    running)

`struct work_struct work;`
  - work structure used for queuing up requests (i.e. work items in pm_wq)

`wait_queue_head_t wait_queue;`
  - wait queue used if any of the helper functions needs to wait for another
    one to complete

`spinlock_t lock;`
  - lock used for synchronization

`atomic_t usage_count;`
  - the usage counter of the device

`atomic_t child_count;`
  - the count of 'active' children of the device

`unsigned int ignore_children;`
  - if set, the value of child_count is ignored (but still updated)

`unsigned int disable_depth;`
  - used for disabling the helper functions (they work normally if this is
    equal to zero); the initial value of it is 1 (i.e. runtime PM is
    initially disabled for all devices)

`int runtime_error;`
  - if set, there was a fatal error (one of the callbacks returned an error
    code as described in Section 2), so the helper functions will not work
    until this flag is cleared; this is the error code returned by the
    failing callback

`unsigned int idle_notification;`
  - if set, ->runtime_idle() is being executed

`unsigned int request_pending;`
  - if set, there's a pending request (i.e. a work item queued up into pm_wq)

`enum rpm_request request;`
  - type of request that's pending (valid if request_pending is set)

`unsigned int deferred_resume;`
  - set if ->runtime_resume() is about to be run while ->runtime_suspend() is
    being executed for that device and it is not practical to wait for the
    suspend to complete; means "start a resume as soon as you've suspended"

`enum rpm_status runtime_status;`
  - the runtime PM status of the device; this field's initial value is
    RPM_SUSPENDED, which means that each device is initially regarded by the
    PM core as 'suspended', regardless of its real hardware status

`unsigned int runtime_auto;`
  - if set, indicates that the user space has allowed the device driver to
    power manage the device at run time via the
    /sys/devices/.../power/control interface; it may only be modified with
    the help of the pm_runtime_allow() and pm_runtime_forbid() helper
    functions

`unsigned int no_callbacks;`
  - indicates that the device does not use the runtime PM callbacks (see
    Section 8); it may be modified only by the pm_runtime_no_callbacks()
    helper function

`unsigned int irq_safe;`
  - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks
    will be invoked with the spinlock held and interrupts disabled

`unsigned int use_autosuspend;`
  - indicates that the device's driver supports delayed autosuspend (see
    Section 9); it may be modified only by the
    pm_runtime{_dont}_use_autosuspend() helper functions

`unsigned int timer_autosuspends;`
  - indicates that the PM core should attempt to carry out an autosuspend
    when the timer expires rather than a normal suspend

`int autosuspend_delay;`
  - the delay time (in milliseconds) to be used for autosuspend

`unsigned long last_busy;`
  - the time (in jiffies) when the pm_runtime_mark_last_busy() helper
    function was last called for this device; used in calculating inactivity
    periods for autosuspend

All of the above fields are members of the 'power' member of 'struct device'.

4. Runtime PM Device Helper Functions
=====================================

The following runtime PM helper functions are defined in
drivers/base/power/runtime.c and include/linux/pm_runtime.h:

`void pm_runtime_init(struct device *dev);`
  - initialize the device runtime PM fields in 'struct dev_pm_info'

`void pm_runtime_remove(struct device *dev);`
  - make sure that the runtime PM of the device will be disabled after
    removing the device from the device hierarchy

`int pm_runtime_idle(struct device *dev);`
  - execute the subsystem-level idle callback for the device; returns an
    error code on failure, where -EINPROGRESS means that ->runtime_idle() is
    already being executed; if there is no callback or the callback returns 0
    then run pm_runtime_autosuspend(dev) and return its result

`int pm_runtime_suspend(struct device *dev);`
  - execute the subsystem-level suspend callback for the device; returns 0 on
    success, 1 if the device's runtime PM status was already 'suspended', or
    an error code on failure, where -EAGAIN or -EBUSY means it is safe to
    attempt to suspend the device again in future and -EACCES means that
    'power.disable_depth' is different from 0

`int pm_runtime_autosuspend(struct device *dev);`
  - same as pm_runtime_suspend() except that the autosuspend delay is taken
    into account; if pm_runtime_autosuspend_expiration() says the delay has
    not yet expired then an autosuspend is scheduled for the appropriate time
    and 0 is returned

`int pm_runtime_resume(struct device *dev);`
  - execute the subsystem-level resume callback for the device; returns 0 on
    success, 1 if the device's runtime PM status was already 'active', or an
    error code on failure, where -EAGAIN means it may be safe to attempt to
    resume the device again in future, but 'power.runtime_error' should be
    checked additionally, and -EACCES means that 'power.disable_depth' is
    different from 0

`int pm_request_idle(struct device *dev);`
  - submit a request to execute the subsystem-level idle callback for the
    device (the request is represented by a work item in pm_wq); returns 0 on
    success or an error code if the request has not been queued up

`int pm_request_autosuspend(struct device *dev);`
  - schedule the execution of the subsystem-level suspend callback for the
    device when the autosuspend delay has expired; if the delay has already
    expired then the work item is queued up immediately

`int pm_schedule_suspend(struct device *dev, unsigned int delay);`
  - schedule the execution of the subsystem-level suspend callback for the
    device in future, where 'delay' is the time to wait before queuing up a
    suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work
    item is queued up immediately); returns 0 on success, 1 if the device's
    runtime PM status was already 'suspended', or an error code if the
    request hasn't been scheduled (or queued up if 'delay' is 0); if the
    execution of ->runtime_suspend() is already scheduled and not yet
    expired, the new value of 'delay' will be used as the time to wait

`int pm_request_resume(struct device *dev);`
  - submit a request to execute the subsystem-level resume callback for the
    device (the request is represented by a work item in pm_wq); returns 0 on
    success, 1 if the device's runtime PM status was already 'active', or an
    error code if the request hasn't been queued up

`void pm_runtime_get_noresume(struct device *dev);`
  - increment the device's usage counter

`int pm_runtime_get(struct device *dev);`
  - increment the device's usage counter, run pm_request_resume(dev) and
    return its result

`int pm_runtime_get_sync(struct device *dev);`
  - increment the device's usage counter, run pm_runtime_resume(dev) and
    return its result

`int pm_runtime_get_if_in_use(struct device *dev);`
  - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the
    runtime PM status is RPM_ACTIVE and the runtime PM usage counter is
    nonzero, increment the counter and return 1; otherwise return 0 without
    changing the counter

`void pm_runtime_put_noidle(struct device *dev);`
  - decrement the device's usage counter

`int pm_runtime_put(struct device *dev);`
  - decrement the device's usage counter; if the result is 0 then run
    pm_request_idle(dev) and return its result

`int pm_runtime_put_autosuspend(struct device *dev);`
  - decrement the device's usage counter; if the result is 0 then run
    pm_request_autosuspend(dev) and return its result

`int pm_runtime_put_sync(struct device *dev);`
  - decrement the device's usage counter; if the result is 0 then run
    pm_runtime_idle(dev) and return its result

`int pm_runtime_put_sync_suspend(struct device *dev);`
  - decrement the device's usage counter; if the result is 0 then run
    pm_runtime_suspend(dev) and return its result

`int pm_runtime_put_sync_autosuspend(struct device *dev);`
  - decrement the device's usage counter; if the result is 0 then run
    pm_runtime_autosuspend(dev) and return its result

`void pm_runtime_enable(struct device *dev);`
  - decrement the device's 'power.disable_depth' field; if that field is
    equal to zero, the runtime PM helper functions can execute the
    subsystem-level callbacks described in Section 2 for the device

`int pm_runtime_disable(struct device *dev);`
  - increment the device's 'power.disable_depth' field (if the value of that
    field was previously zero, this prevents subsystem-level runtime PM
    callbacks from being run for the device) and make sure that all of the
    pending runtime PM operations on the device are either completed or
    canceled; returns 1 if there was a resume request pending and it was
    necessary to execute the subsystem-level resume callback for the device
    to satisfy that request, otherwise 0 is returned

`int pm_runtime_barrier(struct device *dev);`
  - check if there's a resume request pending for the device and resume it
    (synchronously) in that case, cancel any other pending runtime PM
    requests regarding it and wait for all runtime PM operations on it in
    progress to complete; returns 1 if there was a resume request pending and
    it was necessary to execute the subsystem-level resume callback for the
    device to satisfy that request, otherwise 0 is returned

`void pm_suspend_ignore_children(struct device *dev, bool enable);`
  - set/unset the power.ignore_children flag of the device

`int pm_runtime_set_active(struct device *dev);`
  - clear the device's 'power.runtime_error' flag, set the device's runtime
    PM status to 'active' and update its parent's counter of 'active'
    children as appropriate (it is only valid to use this function if
    'power.runtime_error' is set or 'power.disable_depth' is greater than
    zero); it will fail and return an error code if the device has a parent
    which is not active and whose 'power.ignore_children' flag is unset

`void pm_runtime_set_suspended(struct device *dev);`
  - clear the device's 'power.runtime_error' flag, set the device's runtime
    PM status to 'suspended' and update its parent's counter of 'active'
    children as appropriate (it is only valid to use this function if
    'power.runtime_error' is set or 'power.disable_depth' is greater than
    zero)

`bool pm_runtime_active(struct device *dev);`
  - return true if the device's runtime PM status is 'active' or its
    'power.disable_depth' field is not equal to zero, or false otherwise

`bool pm_runtime_suspended(struct device *dev);`
  - return true if the device's runtime PM status is 'suspended' and its
    'power.disable_depth' field is equal to zero, or false otherwise

`bool pm_runtime_status_suspended(struct device *dev);`
  - return true if the device's runtime PM status is 'suspended'

`void pm_runtime_allow(struct device *dev);`
  - set the power.runtime_auto flag for the device and decrease its usage
    counter (used by the /sys/devices/.../power/control interface to
    effectively allow the device to be power managed at run time)

`void pm_runtime_forbid(struct device *dev);`
  - unset the power.runtime_auto flag for the device and increase its usage
    counter (used by the /sys/devices/.../power/control interface to
    effectively prevent the device from being power managed at run time)

`void pm_runtime_no_callbacks(struct device *dev);`
  - set the power.no_callbacks flag for the device and remove the runtime
    PM attributes from /sys/devices/.../power (or prevent them from being
    added when the device is registered)

`void pm_runtime_irq_safe(struct device *dev);`
  - set the power.irq_safe flag for the device, causing the runtime PM
    callbacks to be invoked with interrupts off

`bool pm_runtime_is_irq_safe(struct device *dev);`
  - return true if the power.irq_safe flag was set for the device, causing
    the runtime PM callbacks to be invoked with interrupts off

`void pm_runtime_mark_last_busy(struct device *dev);`
  - set the power.last_busy field to the current time

`void pm_runtime_use_autosuspend(struct device *dev);`
  - set the power.use_autosuspend flag, enabling autosuspend delays; call
    pm_runtime_get_sync if the flag was previously cleared and
    power.autosuspend_delay is negative

`void pm_runtime_dont_use_autosuspend(struct device *dev);`
  - clear the power.use_autosuspend flag, disabling autosuspend delays;
    decrement the device's usage counter if the flag was previously set and
    power.autosuspend_delay is negative; call pm_runtime_idle

`void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);`
  - set the power.autosuspend_delay value to 'delay' (expressed in
    milliseconds); if 'delay' is negative then runtime suspends are
    prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be
    called or the device's usage counter may be decremented and
    pm_runtime_idle called depending on whether power.autosuspend_delay is
    changed to or from a negative value; if power.use_autosuspend is clear,
    pm_runtime_idle is called

`unsigned long pm_runtime_autosuspend_expiration(struct device *dev);`
  - calculate the time when the current autosuspend delay period will
    expire, based on power.last_busy and power.autosuspend_delay; if the
    delay time is 1000 ms or larger then the expiration time is rounded up
    to the nearest second; returns 0 if the delay period has already expired
    or power.use_autosuspend isn't set, otherwise returns the expiration
    time in jiffies

It is safe to execute the following helper functions from interrupt context:

- pm_request_idle()
- pm_request_autosuspend()
- pm_schedule_suspend()
- pm_request_resume()
- pm_runtime_get_noresume()
- pm_runtime_get()
- pm_runtime_put_noidle()
- pm_runtime_put()
- pm_runtime_put_autosuspend()
- pm_runtime_enable()
- pm_suspend_ignore_children()
- pm_runtime_set_active()
- pm_runtime_set_suspended()
- pm_runtime_suspended()
- pm_runtime_mark_last_busy()
- pm_runtime_autosuspend_expiration()

If pm_runtime_irq_safe() has been called for a device then the following
helper functions may also be used in interrupt context:

- pm_runtime_idle()
- pm_runtime_suspend()
- pm_runtime_autosuspend()
- pm_runtime_resume()
- pm_runtime_get_sync()
- pm_runtime_put_sync()
- pm_runtime_put_sync_suspend()
- pm_runtime_put_sync_autosuspend()

5. Runtime PM Initialization, Device Probing and Removal
========================================================

Initially, runtime PM is disabled for all devices, which means that the
majority of the runtime PM helper functions described in Section 4 will
return -EAGAIN until pm_runtime_enable() is called for the device.

In addition to that, the initial runtime PM status of all devices is
'suspended', but it need not reflect the actual physical state of the device.
Thus, if the device is initially active (i.e. it is able to process I/O), its
runtime PM status must be changed to 'active', with the help of
pm_runtime_set_active(), before pm_runtime_enable() is called for the device.

However, if the device has a parent and the parent's runtime PM is enabled,
calling pm_runtime_set_active() for the device will affect the parent, unless
the parent's 'power.ignore_children' flag is set.  Namely, in that case the
parent won't be able to suspend at run time, using the PM core's helper
functions, as long as the child's status is 'active', even if the child's
runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for
the child yet or pm_runtime_disable() has been called for it).
For this reason, 563 + once pm_runtime_set_active() has been called for the device, pm_runtime_enable() 564 + should be called for it too as soon as reasonably possible or its runtime PM 565 + status should be changed back to 'suspended' with the help of 566 + pm_runtime_set_suspended(). 567 + 568 + If the default initial runtime PM status of the device (i.e. 'suspended') 569 + reflects the actual state of the device, its bus type's or its driver's 570 + ->probe() callback will likely need to wake it up using one of the PM core's 571 + helper functions described in Section 4. In that case, pm_runtime_resume() 572 + should be used. Of course, for this purpose the device's runtime PM has to be 573 + enabled earlier by calling pm_runtime_enable(). 574 + 575 + Note, if the device may execute pm_runtime calls during the probe (such as 576 + if it is registers with a subsystem that may call back in) then the 577 + pm_runtime_get_sync() call paired with a pm_runtime_put() call will be 578 + appropriate to ensure that the device is not put back to sleep during the 579 + probe. This can happen with systems such as the network device layer. 580 + 581 + It may be desirable to suspend the device once ->probe() has finished. 582 + Therefore the driver core uses the asynchronous pm_request_idle() to submit a 583 + request to execute the subsystem-level idle callback for the device at that 584 + time. A driver that makes use of the runtime autosuspend feature, may want to 585 + update the last busy mark before returning from ->probe(). 586 + 587 + Moreover, the driver core prevents runtime PM callbacks from racing with the bus 588 + notifier callback in __device_release_driver(), which is necessary, because the 589 + notifier is used by some subsystems to carry out operations affecting the 590 + runtime PM functionality. It does so by calling pm_runtime_get_sync() before 591 + driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. 
This resumes the device if it's in the suspended state and prevents it from
being suspended again while those routines are being executed.

To allow bus types and drivers to put devices into the suspended state by
calling pm_runtime_suspend() from their ->remove() routines, the driver core
executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER
notifications in __device_release_driver().  This requires bus types and
drivers to make their ->remove() callbacks avoid races with runtime PM
directly, but it also allows more flexibility in the handling of devices
during the removal of their drivers.

Drivers should undo in their ->remove() callback the runtime PM changes done
in ->probe().  Usually this means calling pm_runtime_disable(),
pm_runtime_dont_use_autosuspend() etc.

User space can effectively disallow the driver of the device to power manage
it at run time by changing the value of its /sys/devices/.../power/control
attribute to "on", which causes pm_runtime_forbid() to be called.  In
principle, this mechanism may also be used by the driver to effectively turn
off the runtime power management of the device until user space turns it on.
Namely, during the initialization the driver can make sure that the runtime PM
status of the device is 'active' and call pm_runtime_forbid().  It should be
noted, however, that if user space has already intentionally changed the value
of /sys/devices/.../power/control to "auto" to allow the driver to power
manage the device at run time, the driver may confuse it by using
pm_runtime_forbid() this way.
6. Runtime PM and System Sleep
==============================

Runtime PM and system sleep (i.e., system suspend and hibernation, also known
as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of
ways.  If a device is active when a system sleep starts, everything is
straightforward.  But what should happen if the device is already suspended?

The device may have different wake-up settings for runtime PM and system
sleep.  For example, remote wake-up may be enabled for runtime suspend but
disallowed for system sleep (device_may_wakeup(dev) returns 'false').  When
this happens, the subsystem-level system suspend callback is responsible for
changing the device's wake-up setting (it may leave that to the device
driver's system suspend routine).  It may be necessary to resume the device
and suspend it again in order to do so.  The same is true if the driver uses
different power levels or other settings for runtime suspend and system sleep.

During system resume, the simplest approach is to bring all devices back to
full power, even if they had been suspended before the system suspend began.
There are several reasons for this, including:

  * The device might need to switch power levels, wake-up settings, etc.

  * Remote wake-up events might have been lost by the firmware.

  * The device's children may need the device to be at full power in order
    to resume themselves.

  * The driver's idea of the device state may not agree with the device's
    physical state.  This can happen during resume from hibernation.

  * The device might need to be reset.

  * Even though the device was suspended, if its usage counter was > 0 then
    most likely it would need a runtime resume in the near future anyway.
If the device had been suspended before the system suspend began and it's
brought back to full power during resume, then its runtime PM status will have
to be updated to reflect the actual post-system sleep status.  The way to do
this is:

  - pm_runtime_disable(dev);
  - pm_runtime_set_active(dev);
  - pm_runtime_enable(dev);

The PM core always increments the runtime usage counter before calling the
->suspend() callback and decrements it after calling the ->resume() callback.
Hence disabling runtime PM temporarily like this will not cause any runtime
suspend attempts to be permanently lost.  If the usage count goes to zero
following the return of the ->resume() callback, the ->runtime_idle() callback
will be invoked as usual.

On some systems, however, system sleep is not entered through a global
firmware or hardware operation.  Instead, all hardware components are put into
low-power states directly by the kernel in a coordinated way.  Then, the
system sleep state effectively follows from the states the hardware components
end up in and the system is woken up from that state by a hardware interrupt
or a similar mechanism entirely under the kernel's control.  As a result, the
kernel never gives control away and the states of all devices during resume
are precisely known to it.  If that is the case and none of the situations
listed above takes place (in particular, if the system is not waking up from
hibernation), it may be more efficient to leave the devices that had been
suspended before the system suspend began in the suspended state.

To this end, the PM core provides a mechanism allowing some coordination
between different levels of device hierarchy.
Namely, if a system suspend .prepare() callback returns a positive number for
a device, that indicates to the PM core that the device appears to be
runtime-suspended and its state is fine, so it may be left in runtime suspend
provided that all of its descendants are also left in runtime suspend.  If
that happens, the PM core will not execute any system suspend and resume
callbacks for all of those devices, except for the complete callback, which is
then entirely responsible for handling the device as appropriate.  This only
applies to system suspend transitions that are not related to hibernation (see
Documentation/driver-api/pm/devices.rst for more information).

The PM core does its best to reduce the probability of race conditions between
the runtime PM and system suspend/resume (and hibernation) callbacks by
carrying out the following operations:

  * During system suspend pm_runtime_get_noresume() is called for every device
    right before executing the subsystem-level .prepare() callback for it and
    pm_runtime_barrier() is called for every device right before executing the
    subsystem-level .suspend() callback for it.  In addition to that the PM
    core calls __pm_runtime_disable() with 'false' as the second argument for
    every device right before executing the subsystem-level .suspend_late()
    callback for it.

  * During system resume pm_runtime_enable() and pm_runtime_put() are called
    for every device right after executing the subsystem-level .resume_early()
    callback and right after executing the subsystem-level .complete()
    callback for it, respectively.
7. Generic subsystem callbacks
==============================

Subsystems may wish to conserve code space by using the set of generic power
management callbacks provided by the PM core, defined in
drivers/base/power/generic_ops.c:

`int pm_generic_runtime_suspend(struct device *dev);`
  - invoke the ->runtime_suspend() callback provided by the driver of this
    device and return its result, or return 0 if not defined

`int pm_generic_runtime_resume(struct device *dev);`
  - invoke the ->runtime_resume() callback provided by the driver of this
    device and return its result, or return 0 if not defined

`int pm_generic_suspend(struct device *dev);`
  - if the device has not been suspended at run time, invoke the ->suspend()
    callback provided by its driver and return its result, or return 0 if not
    defined

`int pm_generic_suspend_noirq(struct device *dev);`
  - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq()
    callback provided by the device's driver and return its result, or return
    0 if not defined

`int pm_generic_resume(struct device *dev);`
  - invoke the ->resume() callback provided by the driver of this device and,
    if successful, change the device's runtime PM status to 'active'

`int pm_generic_resume_noirq(struct device *dev);`
  - invoke the ->resume_noirq() callback provided by the driver of this device

`int pm_generic_freeze(struct device *dev);`
  - if the device has not been suspended at run time, invoke the ->freeze()
    callback provided by its driver and return its result, or return 0 if not
    defined

`int pm_generic_freeze_noirq(struct device *dev);`
  - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq()
    callback provided by the device's driver and return its result, or return
    0 if not defined

`int pm_generic_thaw(struct device *dev);`
  - if the device has not been suspended at run time, invoke the ->thaw()
    callback provided by its driver and return its result, or return 0 if not
    defined

`int pm_generic_thaw_noirq(struct device *dev);`
  - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq()
    callback provided by the device's driver and return its result, or return
    0 if not defined

`int pm_generic_poweroff(struct device *dev);`
  - if the device has not been suspended at run time, invoke the ->poweroff()
    callback provided by its driver and return its result, or return 0 if not
    defined

`int pm_generic_poweroff_noirq(struct device *dev);`
  - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq()
    callback provided by the device's driver and return its result, or return
    0 if not defined

`int pm_generic_restore(struct device *dev);`
  - invoke the ->restore() callback provided by the driver of this device and,
    if successful, change the device's runtime PM status to 'active'

`int pm_generic_restore_noirq(struct device *dev);`
  - invoke the ->restore_noirq() callback provided by the device's driver

These functions are the defaults used by the PM core, if a subsystem doesn't
provide its own callbacks for ->runtime_idle(), ->runtime_suspend(),
->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the
subsystem-level dev_pm_ops structure.
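As a sketch of how a subsystem might plug these defaults in (the "foo" bus
type and all of its names are hypothetical; only the pm_generic_* callbacks
come from the list above)::

    /* Illustrative only: a hypothetical bus type wiring the generic
     * callbacks into its subsystem-level dev_pm_ops. */
    static const struct dev_pm_ops foo_bus_pm_ops = {
        .runtime_suspend = pm_generic_runtime_suspend,
        .runtime_resume  = pm_generic_runtime_resume,
        .suspend         = pm_generic_suspend,
        .resume          = pm_generic_resume,
        .freeze          = pm_generic_freeze,
        .thaw            = pm_generic_thaw,
        .poweroff        = pm_generic_poweroff,
        .restore         = pm_generic_restore,
    };

    struct bus_type foo_bus_type = {
        .name = "foo",
        .pm   = &foo_bus_pm_ops,
    };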
Device drivers that wish to use the same function as a system suspend, freeze,
poweroff and runtime suspend callback, and similarly for system resume, thaw,
restore, and runtime resume, can achieve this with the help of the
UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its
last argument to NULL).

8. "No-Callback" Devices
========================

Some "devices" are only logical sub-devices of their parent and cannot be
power-managed on their own.  (The prototype example is a USB interface.
Entire USB devices can go into low-power mode or send wake-up requests, but
neither is possible for individual interfaces.)  The drivers for these devices
have no need of runtime PM callbacks; if the callbacks did exist,
->runtime_suspend() and ->runtime_resume() would always return 0 without doing
anything else and ->runtime_idle() would always call pm_runtime_suspend().

Subsystems can tell the PM core about these devices by calling
pm_runtime_no_callbacks().  This should be done after the device structure is
initialized and before it is registered (although after device registration is
also okay).  The routine will set the device's power.no_callbacks flag and
prevent the non-debugging runtime PM sysfs attributes from being created.

When power.no_callbacks is set, the PM core will not invoke the
->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks.
Instead it will assume that suspends and resumes always succeed and that idle
devices should be suspended.

As a consequence, the PM core will never directly inform the device's
subsystem or driver about runtime power changes.  Instead, the driver for the
device's parent must take responsibility for telling the device's driver when
the parent's power state changes.
9. Autosuspend, or automatically-delayed suspends
=================================================

Changing a device's power state isn't free; it requires both time and energy.
A device should be put in a low-power state only when there's some reason to
think it will remain in that state for a substantial time.  A common heuristic
says that a device which hasn't been used for a while is liable to remain
unused; following this advice, drivers should not allow devices to be
suspended at runtime until they have been inactive for some minimum period.
Even when the heuristic ends up being non-optimal, it will still prevent
devices from "bouncing" too rapidly between low-power and full-power states.

The term "autosuspend" is an historical remnant.  It doesn't mean that the
device is automatically suspended (the subsystem or driver still has to call
the appropriate PM routines); rather it means that runtime suspends will
automatically be delayed until the desired period of inactivity has elapsed.

Inactivity is determined based on the power.last_busy field.  Drivers should
call pm_runtime_mark_last_busy() to update this field after carrying out I/O,
typically just before calling pm_runtime_put_autosuspend().  The desired
length of the inactivity period is a matter of policy.  Subsystems can set
this length initially by calling pm_runtime_set_autosuspend_delay(), but after
device registration the length should be controlled by user space, using the
/sys/devices/.../power/autosuspend_delay_ms attribute.
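The last-busy bookkeeping described above might be wired up as in the
following sketch; the "foo" driver, its functions and the 2000 ms delay are
illustrative assumptions, only the pm_runtime_* helpers are real::

    /* Illustrative sketch (hypothetical foo driver): set an initial
     * 2-second autosuspend delay and keep the last-busy mark fresh. */
    static int foo_probe(struct platform_device *pdev)
    {
        pm_runtime_set_autosuspend_delay(&pdev->dev, 2000);  /* ms */
        pm_runtime_use_autosuspend(&pdev->dev);
        pm_runtime_set_active(&pdev->dev);
        pm_runtime_enable(&pdev->dev);
        return 0;
    }

    static void foo_io_done(struct device *dev)
    {
        pm_runtime_mark_last_busy(dev);    /* restart the inactivity period */
        pm_runtime_put_autosuspend(dev);   /* suspend only after the delay */
    }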
In order to use autosuspend, subsystems or drivers must call
pm_runtime_use_autosuspend() (preferably before registering the device), and
thereafter they should use the various `*_autosuspend()` helper functions
instead of the non-autosuspend counterparts::

    Instead of: pm_runtime_suspend    use: pm_runtime_autosuspend;
    Instead of: pm_schedule_suspend   use: pm_request_autosuspend;
    Instead of: pm_runtime_put        use: pm_runtime_put_autosuspend;
    Instead of: pm_runtime_put_sync   use: pm_runtime_put_sync_autosuspend.

Drivers may also continue to use the non-autosuspend helper functions; they
will behave normally, which means sometimes taking the autosuspend delay into
account (see pm_runtime_idle).

Under some circumstances a driver or subsystem may want to prevent a device
from autosuspending immediately, even though the usage counter is zero and the
autosuspend delay time has expired.  If the ->runtime_suspend() callback
returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time
is in the future (as it normally would be if the callback invoked
pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
autosuspend.  The ->runtime_suspend() callback can't do this rescheduling
itself because no suspend requests of any kind are accepted while the device
is suspending (i.e., while the callback is running).

The implementation is well suited for asynchronous use in interrupt contexts.
However such use inevitably involves races, because the PM core can't
synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
This synchronization must be handled by the driver, using its private lock.
Here is a schematic pseudo-code example::

    foo_read_or_write(struct foo_priv *foo, void *data)
    {
        lock(&foo->private_lock);
        add_request_to_io_queue(foo, data);
        if (foo->num_pending_requests++ == 0)
            pm_runtime_get(&foo->dev);
        if (!foo->is_suspended)
            foo_process_next_request(foo);
        unlock(&foo->private_lock);
    }

    foo_io_completion(struct foo_priv *foo, void *req)
    {
        lock(&foo->private_lock);
        if (--foo->num_pending_requests == 0) {
            pm_runtime_mark_last_busy(&foo->dev);
            pm_runtime_put_autosuspend(&foo->dev);
        } else {
            foo_process_next_request(foo);
        }
        unlock(&foo->private_lock);
        /* Send req result back to the user ... */
    }

    int foo_runtime_suspend(struct device *dev)
    {
        struct foo_priv *foo = container_of(dev, ...);
        int ret = 0;

        lock(&foo->private_lock);
        if (foo->num_pending_requests > 0) {
            ret = -EBUSY;
        } else {
            /* ... suspend the device ... */
            foo->is_suspended = 1;
        }
        unlock(&foo->private_lock);
        return ret;
    }

    int foo_runtime_resume(struct device *dev)
    {
        struct foo_priv *foo = container_of(dev, ...);

        lock(&foo->private_lock);
        /* ... resume the device ... */
        foo->is_suspended = 0;
        pm_runtime_mark_last_busy(&foo->dev);
        if (foo->num_pending_requests > 0)
            foo_process_next_request(foo);
        unlock(&foo->private_lock);
        return 0;
    }

The important point is that after foo_io_completion() asks for an autosuspend,
the foo_runtime_suspend() callback may race with foo_read_or_write().
Therefore foo_runtime_suspend() has to check whether there are any pending I/O
requests (while holding the private lock) before allowing the suspend to
proceed.
In addition, the power.autosuspend_delay field can be changed by user space at
any time.  If a driver cares about this, it can call
pm_runtime_autosuspend_expiration() from within the ->runtime_suspend()
callback while holding its private lock.  If the function returns a nonzero
value then the delay has not yet expired and the callback should return
-EAGAIN.
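Continuing the pseudo-code example above, this check might look as follows
(again a sketch with hypothetical "foo" names; only the
pm_runtime_autosuspend_expiration() helper is from this document)::

    /* Illustrative: honor a user-space change of autosuspend_delay_ms
     * from within the hypothetical foo driver's ->runtime_suspend(). */
    int foo_runtime_suspend(struct device *dev)
    {
        struct foo_priv *foo = dev_get_drvdata(dev);
        int ret = 0;

        lock(&foo->private_lock);
        if (pm_runtime_autosuspend_expiration(dev)) {
            /* Delay not expired yet; the PM core reschedules the
             * autosuspend when it sees -EAGAIN. */
            ret = -EAGAIN;
        } else if (foo->num_pending_requests > 0) {
            ret = -EBUSY;
        } else {
            /* ... suspend the device ... */
            foo->is_suspended = 1;
        }
        unlock(&foo->private_lock);
        return ret;
    }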
Documentation/power/runtime_pm.txt
Runtime Power Management Framework for I/O Devices

(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
(C) 2010 Alan Stern <stern@rowland.harvard.edu>
(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

1. Introduction

Support for runtime power management (runtime PM) of I/O devices is provided
at the power management core (PM core) level by means of:

* The power management workqueue pm_wq in which bus types and device drivers
  can put their PM-related work items.  It is strongly recommended that pm_wq
  be used for queuing all work items related to runtime PM, because this
  allows them to be synchronized with system-wide power transitions (suspend
  to RAM, hibernation and resume from system sleep states).  pm_wq is declared
  in include/linux/pm_runtime.h and defined in kernel/power/main.c.

* A number of runtime PM fields in the 'power' member of 'struct device'
  (which is of the type 'struct dev_pm_info', defined in include/linux/pm.h)
  that can be used for synchronizing runtime PM operations with one another.

* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in
  include/linux/pm.h).

* A set of helper functions defined in drivers/base/power/runtime.c that can
  be used for carrying out runtime PM operations in such a way that the
  synchronization between them is taken care of by the PM core.  Bus types and
  device drivers are encouraged to use these functions.

The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM
fields of 'struct dev_pm_info' and the core helper functions provided for
runtime PM are described below.

2. Device Runtime PM Callbacks

There are three device runtime PM callbacks defined in 'struct dev_pm_ops':

struct dev_pm_ops {
	...
	int (*runtime_suspend)(struct device *dev);
	int (*runtime_resume)(struct device *dev);
	int (*runtime_idle)(struct device *dev);
	...
};
The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
are executed by the PM core for the device's subsystem that may be either of
the following:

  1. PM domain of the device, if the device's PM domain object,
     dev->pm_domain, is present.

  2. Device type of the device, if both dev->type and dev->type->pm are
     present.

  3. Device class of the device, if both dev->class and dev->class->pm are
     present.

  4. Bus type of the device, if both dev->bus and dev->bus->pm are present.

If the subsystem chosen by applying the above rules doesn't provide the
relevant callback, the PM core will invoke the corresponding driver callback
stored in dev->driver->pm directly (if present).

The PM core always checks which callback to use in the order given above, so
the priority order of callbacks from high to low is: PM domain, device type,
class and bus type.  Moreover, the high-priority one will always take
precedence over a low-priority one.  The PM domain, bus type, device type and
class callbacks are referred to as subsystem-level callbacks in what follows.

By default, the callbacks are always invoked in process context with
interrupts enabled.  However, the pm_runtime_irq_safe() helper function can be
used to tell the PM core that it is safe to run the ->runtime_suspend(),
->runtime_resume() and ->runtime_idle() callbacks for the given device in
atomic context with interrupts disabled.  This implies that the callback
routines in question must not block or sleep, but it also means that the
synchronous helper functions listed at the end of Section 4 may be used for
that device within an interrupt handler or generally in an atomic context.
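The selection order above can be modelled by a small standalone program (a toy
model for illustration, not kernel code; all names in it are made up)::

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Toy model of how the PM core picks the subsystem whose callbacks
     * it will run: PM domain first, then device type, then class, then
     * bus type, falling back to the driver's own callback. */
    typedef int (*rpm_cb)(void);

    struct toy_dev {
        rpm_cb pm_domain_cb;  /* stands in for dev->pm_domain */
        rpm_cb type_cb;       /* stands in for dev->type->pm */
        rpm_cb class_cb;      /* stands in for dev->class->pm */
        rpm_cb bus_cb;        /* stands in for dev->bus->pm */
        rpm_cb driver_cb;     /* stands in for dev->driver->pm */
    };

    static rpm_cb pick_callback(const struct toy_dev *dev)
    {
        if (dev->pm_domain_cb)
            return dev->pm_domain_cb;
        if (dev->type_cb)
            return dev->type_cb;
        if (dev->class_cb)
            return dev->class_cb;
        if (dev->bus_cb)
            return dev->bus_cb;
        return dev->driver_cb;  /* driver callback only as a fallback */
    }

    static int domain_cb(void) { return 1; }
    static int bus_cb(void)    { return 4; }
    static int driver_cb(void) { return 5; }

    int main(void)
    {
        struct toy_dev d = { .pm_domain_cb = domain_cb, .bus_cb = bus_cb,
                             .driver_cb = driver_cb };

        assert(pick_callback(&d)() == 1);  /* PM domain wins */
        d.pm_domain_cb = NULL;
        assert(pick_callback(&d)() == 4);  /* then bus type */
        d.bus_cb = NULL;
        assert(pick_callback(&d)() == 5);  /* driver as last resort */
        printf("ok\n");
        return 0;
    }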
The subsystem-level suspend callback, if present, is _entirely_ _responsible_
for handling the suspend of the device as appropriate, which may, but need
not, include executing the device driver's own ->runtime_suspend() callback
(from the PM core's point of view it is not necessary to implement a
->runtime_suspend() callback in a device driver as long as the subsystem-level
suspend callback knows what to do to handle the device).

  * Once the subsystem-level suspend callback (or the driver suspend callback,
    if invoked directly) has completed successfully for the given device, the
    PM core regards the device as suspended, which need not mean that it has
    been put into a low power state.  It is supposed to mean, however, that
    the device will not process data and will not communicate with the CPU(s)
    and RAM until the appropriate resume callback is executed for it.  The
    runtime PM status of a device after successful execution of the suspend
    callback is 'suspended'.

  * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
    status remains 'active', which means that the device _must_ be fully
    operational afterwards.

  * If the suspend callback returns an error code different from -EBUSY and
    -EAGAIN, the PM core regards this as a fatal error and will refuse to run
    the helper functions described in Section 4 for the device until its
    status is directly set to either 'active' or 'suspended' (the PM core
    provides special helper functions for this purpose).

In particular, if the driver requires remote wakeup capability (i.e. a
hardware mechanism allowing the device to request a change of its power state,
such as PCI PME) for proper functioning and device_can_wakeup() returns
'false' for the device, then ->runtime_suspend() should return -EBUSY.
On the other hand, if device_can_wakeup() returns 'true' for the device and
the device is put into a low-power state during the execution of the suspend
callback, it is expected that remote wakeup will be enabled for the device.
Generally, remote wakeup should be enabled for all input devices put into
low-power states at run time.

The subsystem-level resume callback, if present, is _entirely_ _responsible_
for handling the resume of the device as appropriate, which may, but need not,
include executing the device driver's own ->runtime_resume() callback (from
the PM core's point of view it is not necessary to implement a
->runtime_resume() callback in a device driver as long as the subsystem-level
resume callback knows what to do to handle the device).

  * Once the subsystem-level resume callback (or the driver resume callback,
    if invoked directly) has completed successfully, the PM core regards the
    device as fully operational, which means that the device _must_ be able to
    complete I/O operations as needed.  The runtime PM status of the device is
    then 'active'.

  * If the resume callback returns an error code, the PM core regards this as
    a fatal error and will refuse to run the helper functions described in
    Section 4 for the device, until its status is directly set to either
    'active' or 'suspended' (by means of special helper functions provided by
    the PM core for this purpose).

The idle callback (a subsystem-level one, if present, or the driver one) is
executed by the PM core whenever the device appears to be idle, which is
indicated to the PM core by two counters, the device's usage counter and the
counter of 'active' children of the device.
  * If any of these counters is decreased using a helper function provided by
    the PM core and it turns out to be equal to zero, the other counter is
    checked.  If that counter also is equal to zero, the PM core executes the
    idle callback with the device as its argument.

The action performed by the idle callback is totally dependent on the
subsystem (or driver) in question, but the expected and recommended action is
to check if the device can be suspended (i.e. if all of the conditions
necessary for suspending the device are satisfied) and to queue up a suspend
request for the device in that case.  If there is no idle callback, or if the
callback returns 0, then the PM core will attempt to carry out a runtime
suspend of the device, also respecting devices configured for autosuspend.  In
essence this means a call to pm_runtime_autosuspend() (do note that drivers
need to update the device's last busy mark, with pm_runtime_mark_last_busy(),
to control the delay under this circumstance).  To prevent this (for example,
if the callback routine has started a delayed suspend), the routine must
return a non-zero value.  Negative error return codes are ignored by the PM
core.

The helper functions provided by the PM core, described in Section 4,
guarantee that the following constraints are met with respect to runtime PM
callbacks for one device:

(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
    ->runtime_suspend() in parallel with ->runtime_resume() or with another
    instance of ->runtime_suspend() for the same device) with the exception
    that ->runtime_suspend() or ->runtime_resume() can be executed in parallel
    with ->runtime_idle() (although ->runtime_idle() will not be started while
    any of the other callbacks is being executed for the same device).
(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active'
    devices (i.e. the PM core will only execute ->runtime_idle() or
    ->runtime_suspend() for the devices whose runtime PM status is 'active').

(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device
    whose usage counter is equal to zero _and_ either whose counter of
    'active' children is equal to zero, or whose 'power.ignore_children' flag
    is set.

(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the
    PM core will only execute ->runtime_resume() for the devices whose runtime
    PM status is 'suspended').

Additionally, the helper functions provided by the PM core obey the following
rules:

  * If ->runtime_suspend() is about to be executed or there's a pending
    request to execute it, ->runtime_idle() will not be executed for the same
    device.

  * A request to execute or to schedule the execution of ->runtime_suspend()
    will cancel any pending requests to execute ->runtime_idle() for the same
    device.

  * If ->runtime_resume() is about to be executed or there's a pending request
    to execute it, the other callbacks will not be executed for the same
    device.

  * A request to execute ->runtime_resume() will cancel any pending or
    scheduled requests to execute the other callbacks for the same device,
    except for scheduled autosuspends.
3. Runtime PM Device Fields

The following device runtime PM fields are present in 'struct dev_pm_info', as
defined in include/linux/pm.h:

  struct timer_list suspend_timer;
    - timer used for scheduling (delayed) suspend and autosuspend requests

  unsigned long timer_expires;
    - timer expiration time, in jiffies (if this is different from zero, the
      timer is running and will expire at that time, otherwise the timer is
      not running)

  struct work_struct work;
    - work structure used for queuing up requests (i.e. work items in pm_wq)

  wait_queue_head_t wait_queue;
    - wait queue used if any of the helper functions needs to wait for another
      one to complete

  spinlock_t lock;
    - lock used for synchronization

  atomic_t usage_count;
    - the usage counter of the device

  atomic_t child_count;
    - the count of 'active' children of the device

  unsigned int ignore_children;
    - if set, the value of child_count is ignored (but still updated)

  unsigned int disable_depth;
    - used for disabling the helper functions (they work normally if this is
      equal to zero); the initial value of it is 1 (i.e. runtime PM is
      initially disabled for all devices)

  int runtime_error;
    - if set, there was a fatal error (one of the callbacks returned an error
      code as described in Section 2), so the helper functions will not work
      until this flag is cleared; this is the error code returned by the
      failing callback

  unsigned int idle_notification;
    - if set, ->runtime_idle() is being executed

  unsigned int request_pending;
    - if set, there's a pending request (i.e. a work item queued up into
      pm_wq)
  enum rpm_request request;
    - type of request that's pending (valid if request_pending is set)

  unsigned int deferred_resume;
    - set if ->runtime_resume() is about to be run while ->runtime_suspend()
      is being executed for that device and it is not practical to wait for
      the suspend to complete; means "start a resume as soon as you've
      suspended"

  enum rpm_status runtime_status;
    - the runtime PM status of the device; this field's initial value is
      RPM_SUSPENDED, which means that each device is initially regarded by the
      PM core as 'suspended', regardless of its real hardware status

  unsigned int runtime_auto;
    - if set, indicates that the user space has allowed the device driver to
      power manage the device at run time via the
      /sys/devices/.../power/control interface; it may only be modified with
      the help of the pm_runtime_allow() and pm_runtime_forbid() helper
      functions

  unsigned int no_callbacks;
    - indicates that the device does not use the runtime PM callbacks (see
      Section 8); it may be modified only by the pm_runtime_no_callbacks()
      helper function

  unsigned int irq_safe;
    - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks
      will be invoked with the spinlock held and interrupts disabled

  unsigned int use_autosuspend;
    - indicates that the device's driver supports delayed autosuspend (see
      Section 9); it may be modified only by the
      pm_runtime{_dont}_use_autosuspend() helper functions

  unsigned int timer_autosuspends;
    - indicates that the PM core should attempt to carry out an autosuspend
      when the timer expires rather than a normal suspend

  int autosuspend_delay;
    - the delay time (in milliseconds) to be used for autosuspend

  unsigned long last_busy;
    - the time (in jiffies) when the pm_runtime_mark_last_busy() helper
      function was last called for this device; used in calculating inactivity
      periods for autosuspend
the pm_runtime_mark_last_busy() helper 290 - function was last called for this device; used in calculating inactivity 291 - periods for autosuspend 292 - 293 - All of the above fields are members of the 'power' member of 'struct device'. 294 - 295 - 4. Runtime PM Device Helper Functions 296 - 297 - The following runtime PM helper functions are defined in 298 - drivers/base/power/runtime.c and include/linux/pm_runtime.h: 299 - 300 - void pm_runtime_init(struct device *dev); 301 - - initialize the device runtime PM fields in 'struct dev_pm_info' 302 - 303 - void pm_runtime_remove(struct device *dev); 304 - - make sure that the runtime PM of the device will be disabled after 305 - removing the device from device hierarchy 306 - 307 - int pm_runtime_idle(struct device *dev); 308 - - execute the subsystem-level idle callback for the device; returns an 309 - error code on failure, where -EINPROGRESS means that ->runtime_idle() is 310 - already being executed; if there is no callback or the callback returns 0 311 - then run pm_runtime_autosuspend(dev) and return its result 312 - 313 - int pm_runtime_suspend(struct device *dev); 314 - - execute the subsystem-level suspend callback for the device; returns 0 on 315 - success, 1 if the device's runtime PM status was already 'suspended', or 316 - error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt 317 - to suspend the device again in future and -EACCES means that 318 - 'power.disable_depth' is different from 0 319 - 320 - int pm_runtime_autosuspend(struct device *dev); 321 - - same as pm_runtime_suspend() except that the autosuspend delay is taken 322 - into account; if pm_runtime_autosuspend_expiration() says the delay has 323 - not yet expired then an autosuspend is scheduled for the appropriate time 324 - and 0 is returned 325 - 326 - int pm_runtime_resume(struct device *dev); 327 - - execute the subsystem-level resume callback for the device; returns 0 on 328 - success, 1 if the device's runtime PM 
status was already 'active' or 329 - error code on failure, where -EAGAIN means it may be safe to attempt to 330 - resume the device again in future, but 'power.runtime_error' should be 331 - checked additionally, and -EACCES means that 'power.disable_depth' is 332 - different from 0 333 - 334 - int pm_request_idle(struct device *dev); 335 - - submit a request to execute the subsystem-level idle callback for the 336 - device (the request is represented by a work item in pm_wq); returns 0 on 337 - success or error code if the request has not been queued up 338 - 339 - int pm_request_autosuspend(struct device *dev); 340 - - schedule the execution of the subsystem-level suspend callback for the 341 - device when the autosuspend delay has expired; if the delay has already 342 - expired then the work item is queued up immediately 343 - 344 - int pm_schedule_suspend(struct device *dev, unsigned int delay); 345 - - schedule the execution of the subsystem-level suspend callback for the 346 - device in future, where 'delay' is the time to wait before queuing up a 347 - suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work 348 - item is queued up immediately); returns 0 on success, 1 if the device's PM 349 - runtime status was already 'suspended', or error code if the request 350 - hasn't been scheduled (or queued up if 'delay' is 0); if the execution of 351 - ->runtime_suspend() is already scheduled and not yet expired, the new 352 - value of 'delay' will be used as the time to wait 353 - 354 - int pm_request_resume(struct device *dev); 355 - - submit a request to execute the subsystem-level resume callback for the 356 - device (the request is represented by a work item in pm_wq); returns 0 on 357 - success, 1 if the device's runtime PM status was already 'active', or 358 - error code if the request hasn't been queued up 359 - 360 - void pm_runtime_get_noresume(struct device *dev); 361 - - increment the device's usage counter 362 - 363 - int 
pm_runtime_get(struct device *dev); 364 - - increment the device's usage counter, run pm_request_resume(dev) and 365 - return its result 366 - 367 - int pm_runtime_get_sync(struct device *dev); 368 - - increment the device's usage counter, run pm_runtime_resume(dev) and 369 - return its result 370 - 371 - int pm_runtime_get_if_in_use(struct device *dev); 372 - - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the 373 - runtime PM status is RPM_ACTIVE and the runtime PM usage counter is 374 - nonzero, increment the counter and return 1; otherwise return 0 without 375 - changing the counter 376 - 377 - void pm_runtime_put_noidle(struct device *dev); 378 - - decrement the device's usage counter 379 - 380 - int pm_runtime_put(struct device *dev); 381 - - decrement the device's usage counter; if the result is 0 then run 382 - pm_request_idle(dev) and return its result 383 - 384 - int pm_runtime_put_autosuspend(struct device *dev); 385 - - decrement the device's usage counter; if the result is 0 then run 386 - pm_request_autosuspend(dev) and return its result 387 - 388 - int pm_runtime_put_sync(struct device *dev); 389 - - decrement the device's usage counter; if the result is 0 then run 390 - pm_runtime_idle(dev) and return its result 391 - 392 - int pm_runtime_put_sync_suspend(struct device *dev); 393 - - decrement the device's usage counter; if the result is 0 then run 394 - pm_runtime_suspend(dev) and return its result 395 - 396 - int pm_runtime_put_sync_autosuspend(struct device *dev); 397 - - decrement the device's usage counter; if the result is 0 then run 398 - pm_runtime_autosuspend(dev) and return its result 399 - 400 - void pm_runtime_enable(struct device *dev); 401 - - decrement the device's 'power.disable_depth' field; if that field is equal 402 - to zero, the runtime PM helper functions can execute subsystem-level 403 - callbacks described in Section 2 for the device 404 - 405 - int pm_runtime_disable(struct device *dev); 406 - - increment 
the device's 'power.disable_depth' field (if the value of that 407 - field was previously zero, this prevents subsystem-level runtime PM 408 - callbacks from being run for the device), make sure that all of the 409 - pending runtime PM operations on the device are either completed or 410 - canceled; returns 1 if there was a resume request pending and it was 411 - necessary to execute the subsystem-level resume callback for the device 412 - to satisfy that request, otherwise 0 is returned 413 - 414 - int pm_runtime_barrier(struct device *dev); 415 - - check if there's a resume request pending for the device and resume it 416 - (synchronously) in that case, cancel any other pending runtime PM requests 417 - regarding it and wait for all runtime PM operations on it in progress to 418 - complete; returns 1 if there was a resume request pending and it was 419 - necessary to execute the subsystem-level resume callback for the device to 420 - satisfy that request, otherwise 0 is returned 421 - 422 - void pm_suspend_ignore_children(struct device *dev, bool enable); 423 - - set/unset the power.ignore_children flag of the device 424 - 425 - int pm_runtime_set_active(struct device *dev); 426 - - clear the device's 'power.runtime_error' flag, set the device's runtime 427 - PM status to 'active' and update its parent's counter of 'active' 428 - children as appropriate (it is only valid to use this function if 429 - 'power.runtime_error' is set or 'power.disable_depth' is greater than 430 - zero); it will fail and return error code if the device has a parent 431 - which is not active and the 'power.ignore_children' flag of which is unset 432 - 433 - void pm_runtime_set_suspended(struct device *dev); 434 - - clear the device's 'power.runtime_error' flag, set the device's runtime 435 - PM status to 'suspended' and update its parent's counter of 'active' 436 - children as appropriate (it is only valid to use this function if 437 - 'power.runtime_error' is set or 
'power.disable_depth' is greater than 438 - zero) 439 - 440 - bool pm_runtime_active(struct device *dev); 441 - - return true if the device's runtime PM status is 'active' or its 442 - 'power.disable_depth' field is not equal to zero, or false otherwise 443 - 444 - bool pm_runtime_suspended(struct device *dev); 445 - - return true if the device's runtime PM status is 'suspended' and its 446 - 'power.disable_depth' field is equal to zero, or false otherwise 447 - 448 - bool pm_runtime_status_suspended(struct device *dev); 449 - - return true if the device's runtime PM status is 'suspended' 450 - 451 - void pm_runtime_allow(struct device *dev); 452 - - set the power.runtime_auto flag for the device and decrease its usage 453 - counter (used by the /sys/devices/.../power/control interface to 454 - effectively allow the device to be power managed at run time) 455 - 456 - void pm_runtime_forbid(struct device *dev); 457 - - unset the power.runtime_auto flag for the device and increase its usage 458 - counter (used by the /sys/devices/.../power/control interface to 459 - effectively prevent the device from being power managed at run time) 460 - 461 - void pm_runtime_no_callbacks(struct device *dev); 462 - - set the power.no_callbacks flag for the device and remove the runtime 463 - PM attributes from /sys/devices/.../power (or prevent them from being 464 - added when the device is registered) 465 - 466 - void pm_runtime_irq_safe(struct device *dev); 467 - - set the power.irq_safe flag for the device, causing the runtime-PM 468 - callbacks to be invoked with interrupts off 469 - 470 - bool pm_runtime_is_irq_safe(struct device *dev); 471 - - return true if power.irq_safe flag was set for the device, causing 472 - the runtime-PM callbacks to be invoked with interrupts off 473 - 474 - void pm_runtime_mark_last_busy(struct device *dev); 475 - - set the power.last_busy field to the current time 476 - 477 - void pm_runtime_use_autosuspend(struct device *dev); 478 - - set the 
power.use_autosuspend flag, enabling autosuspend delays; call 479 - pm_runtime_get_sync if the flag was previously cleared and 480 - power.autosuspend_delay is negative 481 - 482 - void pm_runtime_dont_use_autosuspend(struct device *dev); 483 - - clear the power.use_autosuspend flag, disabling autosuspend delays; 484 - decrement the device's usage counter if the flag was previously set and 485 - power.autosuspend_delay is negative; call pm_runtime_idle 486 - 487 - void pm_runtime_set_autosuspend_delay(struct device *dev, int delay); 488 - - set the power.autosuspend_delay value to 'delay' (expressed in 489 - milliseconds); if 'delay' is negative then runtime suspends are 490 - prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be 491 - called or the device's usage counter may be decremented and 492 - pm_runtime_idle called depending on if power.autosuspend_delay is 493 - changed to or from a negative value; if power.use_autosuspend is clear, 494 - pm_runtime_idle is called 495 - 496 - unsigned long pm_runtime_autosuspend_expiration(struct device *dev); 497 - - calculate the time when the current autosuspend delay period will expire, 498 - based on power.last_busy and power.autosuspend_delay; if the delay time 499 - is 1000 ms or larger then the expiration time is rounded up to the 500 - nearest second; returns 0 if the delay period has already expired or 501 - power.use_autosuspend isn't set, otherwise returns the expiration time 502 - in jiffies 503 - 504 - It is safe to execute the following helper functions from interrupt context: 505 - 506 - pm_request_idle() 507 - pm_request_autosuspend() 508 - pm_schedule_suspend() 509 - pm_request_resume() 510 - pm_runtime_get_noresume() 511 - pm_runtime_get() 512 - pm_runtime_put_noidle() 513 - pm_runtime_put() 514 - pm_runtime_put_autosuspend() 515 - pm_runtime_enable() 516 - pm_suspend_ignore_children() 517 - pm_runtime_set_active() 518 - pm_runtime_set_suspended() 519 - pm_runtime_suspended() 520 - 
pm_runtime_mark_last_busy() 521 - pm_runtime_autosuspend_expiration() 522 - 523 - If pm_runtime_irq_safe() has been called for a device then the following helper 524 - functions may also be used in interrupt context: 525 - 526 - pm_runtime_idle() 527 - pm_runtime_suspend() 528 - pm_runtime_autosuspend() 529 - pm_runtime_resume() 530 - pm_runtime_get_sync() 531 - pm_runtime_put_sync() 532 - pm_runtime_put_sync_suspend() 533 - pm_runtime_put_sync_autosuspend() 534 - 535 - 5. Runtime PM Initialization, Device Probing and Removal 536 - 537 - Initially, the runtime PM is disabled for all devices, which means that the 538 - majority of the runtime PM helper functions described in Section 4 will return 539 - -EAGAIN until pm_runtime_enable() is called for the device. 540 - 541 - In addition to that, the initial runtime PM status of all devices is 542 - 'suspended', but it need not reflect the actual physical state of the device. 543 - Thus, if the device is initially active (i.e. it is able to process I/O), its 544 - runtime PM status must be changed to 'active', with the help of 545 - pm_runtime_set_active(), before pm_runtime_enable() is called for the device. 546 - 547 - However, if the device has a parent and the parent's runtime PM is enabled, 548 - calling pm_runtime_set_active() for the device will affect the parent, unless 549 - the parent's 'power.ignore_children' flag is set. Namely, in that case the 550 - parent won't be able to suspend at run time, using the PM core's helper 551 - functions, as long as the child's status is 'active', even if the child's 552 - runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for 553 - the child yet or pm_runtime_disable() has been called for it). 
For this reason, once pm_runtime_set_active() has been called for the device,
pm_runtime_enable() should be called for it too as soon as reasonably possible
or its runtime PM status should be changed back to 'suspended' with the help of
pm_runtime_set_suspended().

If the default initial runtime PM status of the device (i.e. 'suspended')
reflects the actual state of the device, its bus type's or its driver's
->probe() callback will likely need to wake it up using one of the PM core's
helper functions described in Section 4. In that case, pm_runtime_resume()
should be used. Of course, for this purpose the device's runtime PM has to be
enabled earlier by calling pm_runtime_enable().

Note that if the device may execute pm_runtime calls during the probe (for
example, if it registers with a subsystem that may call back in), then a
pm_runtime_get_sync() call paired with a pm_runtime_put() call will be
appropriate to ensure that the device is not put back to sleep during the
probe. This can happen with systems such as the network device layer.

It may be desirable to suspend the device once ->probe() has finished.
Therefore the driver core uses the asynchronous pm_request_idle() to submit a
request to execute the subsystem-level idle callback for the device at that
time. A driver that makes use of the runtime autosuspend feature may want to
update the last busy mark before returning from ->probe().

Moreover, the driver core prevents runtime PM callbacks from racing with the
bus notifier callback in __device_release_driver(), which is necessary because
the notifier is used by some subsystems to carry out operations affecting the
runtime PM functionality. It does so by calling pm_runtime_get_sync() before
driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This
resumes the device if it's in the suspended state and prevents it from being
suspended again while those routines are being executed.

To allow bus types and drivers to put devices into the suspended state by
calling pm_runtime_suspend() from their ->remove() routines, the driver core
executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER
notifications in __device_release_driver(). This requires bus types and
drivers to make their ->remove() callbacks avoid races with runtime PM
directly, but it also allows for more flexibility in the handling of devices
during the removal of their drivers.

A driver's ->remove() callback should undo the runtime PM changes done in
->probe(). Usually this means calling pm_runtime_disable(),
pm_runtime_dont_use_autosuspend() etc.

User space can effectively disallow the driver of the device from power
managing it at run time by changing the value of its
/sys/devices/.../power/control attribute to "on", which causes
pm_runtime_forbid() to be called. In principle, this mechanism may also be
used by the driver to effectively turn off the runtime power management of the
device until user space turns it on. Namely, during the initialization the
driver can make sure that the runtime PM status of the device is 'active' and
call pm_runtime_forbid(). It should be noted, however, that if user space has
already intentionally changed the value of /sys/devices/.../power/control to
"auto" to allow the driver to power manage the device at run time, the driver
may confuse it by using pm_runtime_forbid() this way.

6. Runtime PM and System Sleep

Runtime PM and system sleep (i.e., system suspend and hibernation, also known
as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of
ways.
If a device is active when a system sleep starts, everything is
straightforward. But what should happen if the device is already suspended?

The device may have different wake-up settings for runtime PM and system sleep.
For example, remote wake-up may be enabled for runtime suspend but disallowed
for system sleep (device_may_wakeup(dev) returns 'false'). When this happens,
the subsystem-level system suspend callback is responsible for changing the
device's wake-up setting (it may leave that to the device driver's system
suspend routine). It may be necessary to resume the device and suspend it
again in order to do so. The same is true if the driver uses different power
levels or other settings for runtime suspend and system sleep.

During system resume, the simplest approach is to bring all devices back to
full power, even if they had been suspended before the system suspend began.
There are several reasons for this, including:

  * The device might need to switch power levels, wake-up settings, etc.

  * Remote wake-up events might have been lost by the firmware.

  * The device's children may need the device to be at full power in order
    to resume themselves.

  * The driver's idea of the device state may not agree with the device's
    physical state. This can happen during resume from hibernation.

  * The device might need to be reset.

  * Even though the device was suspended, if its usage counter was > 0 then
    most likely it would need a runtime resume in the near future anyway.

If the device had been suspended before the system suspend began and it's
brought back to full power during resume, then its runtime PM status will have
to be updated to reflect the actual post-system sleep status. The way to do
this is:

	pm_runtime_disable(dev);
	pm_runtime_set_active(dev);
	pm_runtime_enable(dev);

The PM core always increments the runtime usage counter before calling the
->suspend() callback and decrements it after calling the ->resume() callback.
Hence disabling runtime PM temporarily like this will not cause any runtime
suspend attempts to be permanently lost. If the usage count goes to zero
following the return of the ->resume() callback, the ->runtime_idle() callback
will be invoked as usual.

On some systems, however, system sleep is not entered through a global firmware
or hardware operation. Instead, all hardware components are put into low-power
states directly by the kernel in a coordinated way. Then, the system sleep
state effectively follows from the states the hardware components end up in
and the system is woken up from that state by a hardware interrupt or a
similar mechanism entirely under the kernel's control. As a result, the kernel
never gives control away and the states of all devices during resume are
precisely known to it. If that is the case and none of the situations listed
above takes place (in particular, if the system is not waking up from
hibernation), it may be more efficient to leave the devices that had been
suspended before the system suspend began in the suspended state.

To this end, the PM core provides a mechanism allowing some coordination
between different levels of device hierarchy. Namely, if a system suspend
.prepare() callback returns a positive number for a device, that indicates to
the PM core that the device appears to be runtime-suspended and its state is
fine, so it may be left in runtime suspend provided that all of its
descendants are also left in runtime suspend. If that happens, the PM core
will not execute any system suspend and resume callbacks for any of those
devices, except for the complete callback, which is then entirely responsible
for handling the device as appropriate. This only applies to system suspend
transitions that are not related to hibernation (see
Documentation/driver-api/pm/devices.rst for more information).

The PM core does its best to reduce the probability of race conditions between
the runtime PM and system suspend/resume (and hibernation) callbacks by
carrying out the following operations:

  * During system suspend pm_runtime_get_noresume() is called for every device
    right before executing the subsystem-level .prepare() callback for it and
    pm_runtime_barrier() is called for every device right before executing the
    subsystem-level .suspend() callback for it. In addition to that the PM
    core calls __pm_runtime_disable() with 'false' as the second argument for
    every device right before executing the subsystem-level .suspend_late()
    callback for it.

  * During system resume pm_runtime_enable() and pm_runtime_put() are called
    for every device right after executing the subsystem-level .resume_early()
    callback and right after executing the subsystem-level .complete() callback
    for it, respectively.

7. Generic subsystem callbacks

Subsystems may wish to conserve code space by using the set of generic power
management callbacks provided by the PM core, defined in
drivers/base/power/generic_ops.c:

  int pm_generic_runtime_suspend(struct device *dev);
    - invoke the ->runtime_suspend() callback provided by the driver of this
      device and return its result, or return 0 if not defined

  int pm_generic_runtime_resume(struct device *dev);
    - invoke the ->runtime_resume() callback provided by the driver of this
      device and return its result, or return 0 if not defined

  int pm_generic_suspend(struct device *dev);
    - if the device has not been suspended at run time, invoke the ->suspend()
      callback provided by its driver and return its result, or return 0 if not
      defined

  int pm_generic_suspend_noirq(struct device *dev);
    - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq()
      callback provided by the device's driver and return its result, or return
      0 if not defined

  int pm_generic_resume(struct device *dev);
    - invoke the ->resume() callback provided by the driver of this device and,
      if successful, change the device's runtime PM status to 'active'

  int pm_generic_resume_noirq(struct device *dev);
    - invoke the ->resume_noirq() callback provided by the driver of this device

  int pm_generic_freeze(struct device *dev);
    - if the device has not been suspended at run time, invoke the ->freeze()
      callback provided by its driver and return its result, or return 0 if not
      defined

  int pm_generic_freeze_noirq(struct device *dev);
    - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq()
      callback provided by the device's driver and return its result, or return
      0 if not defined

  int pm_generic_thaw(struct device *dev);
    - if the device has not been suspended at run time, invoke the ->thaw()
      callback provided by its driver and return its result, or return 0 if not
      defined

  int pm_generic_thaw_noirq(struct device *dev);
    - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq()
      callback provided by the device's driver and return its result, or return
      0 if not defined

  int pm_generic_poweroff(struct device *dev);
    - if the device has not been suspended at run time, invoke the ->poweroff()
      callback provided by its driver and return its result, or return 0 if not
      defined

  int pm_generic_poweroff_noirq(struct device *dev);
    - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq()
      callback provided by the device's driver and return its result, or return
      0 if not defined

  int pm_generic_restore(struct device *dev);
    - invoke the ->restore() callback provided by the driver of this device and,
      if successful, change the device's runtime PM status to 'active'

  int pm_generic_restore_noirq(struct device *dev);
    - invoke the ->restore_noirq() callback provided by the device's driver

These functions are the defaults used by the PM core if a subsystem doesn't
provide its own callbacks for ->runtime_idle(), ->runtime_suspend(),
->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the
subsystem-level dev_pm_ops structure.
Device drivers that wish to use the same function as a system suspend, freeze,
poweroff and runtime suspend callback, and similarly for system resume, thaw,
restore, and runtime resume, can achieve this with the help of the
UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its
last argument to NULL).

8. "No-Callback" Devices

Some "devices" are only logical sub-devices of their parent and cannot be
power-managed on their own. (The prototype example is a USB interface. Entire
USB devices can go into low-power mode or send wake-up requests, but neither is
possible for individual interfaces.) The drivers for these devices have no
need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend()
and ->runtime_resume() would always return 0 without doing anything else and
->runtime_idle() would always call pm_runtime_suspend().

Subsystems can tell the PM core about these devices by calling
pm_runtime_no_callbacks(). This should be done after the device structure is
initialized and before it is registered (although after device registration is
also okay). The routine will set the device's power.no_callbacks flag and
prevent the non-debugging runtime PM sysfs attributes from being created.

When power.no_callbacks is set, the PM core will not invoke the
->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks.
Instead it will assume that suspends and resumes always succeed and that idle
devices should be suspended.

As a consequence, the PM core will never directly inform the device's subsystem
or driver about runtime power changes. Instead, the driver for the device's
parent must take responsibility for telling the device's driver when the
parent's power state changes.

9. Autosuspend, or automatically-delayed suspends

Changing a device's power state isn't free; it requires both time and energy.
A device should be put in a low-power state only when there's some reason to
think it will remain in that state for a substantial time. A common heuristic
says that a device which hasn't been used for a while is liable to remain
unused; following this advice, drivers should not allow devices to be suspended
at runtime until they have been inactive for some minimum period. Even when
the heuristic ends up being non-optimal, it will still prevent devices from
"bouncing" too rapidly between low-power and full-power states.

The term "autosuspend" is an historical remnant. It doesn't mean that the
device is automatically suspended (the subsystem or driver still has to call
the appropriate PM routines); rather it means that runtime suspends will
automatically be delayed until the desired period of inactivity has elapsed.

Inactivity is determined based on the power.last_busy field. Drivers should
call pm_runtime_mark_last_busy() to update this field after carrying out I/O,
typically just before calling pm_runtime_put_autosuspend(). The desired length
of the inactivity period is a matter of policy. Subsystems can set this length
initially by calling pm_runtime_set_autosuspend_delay(), but after device
registration the length should be controlled by user space, using the
/sys/devices/.../power/autosuspend_delay_ms attribute.
In order to use autosuspend, subsystems or drivers must call
pm_runtime_use_autosuspend() (preferably before registering the device), and
thereafter they should use the various *_autosuspend() helper functions
instead of the non-autosuspend counterparts:

  Instead of: pm_runtime_suspend    use: pm_runtime_autosuspend;
  Instead of: pm_schedule_suspend   use: pm_request_autosuspend;
  Instead of: pm_runtime_put        use: pm_runtime_put_autosuspend;
  Instead of: pm_runtime_put_sync   use: pm_runtime_put_sync_autosuspend.

Drivers may also continue to use the non-autosuspend helper functions; they
will behave normally, which means sometimes taking the autosuspend delay into
account (see pm_runtime_idle).

Under some circumstances a driver or subsystem may want to prevent a device
from autosuspending immediately, even though the usage counter is zero and the
autosuspend delay time has expired. If the ->runtime_suspend() callback
returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time
is in the future (as it normally would be if the callback invoked
pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
autosuspend. The ->runtime_suspend() callback can't do this rescheduling
itself because no suspend requests of any kind are accepted while the device
is suspending (i.e., while the callback is running).

The implementation is well suited for asynchronous use in interrupt contexts.
However, such use inevitably involves races, because the PM core can't
synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
This synchronization must be handled by the driver, using its private lock.
Here is a schematic pseudo-code example:

  foo_read_or_write(struct foo_priv *foo, void *data)
  {
	lock(&foo->private_lock);
	add_request_to_io_queue(foo, data);
	if (foo->num_pending_requests++ == 0)
		pm_runtime_get(&foo->dev);
	if (!foo->is_suspended)
		foo_process_next_request(foo);
	unlock(&foo->private_lock);
  }

  foo_io_completion(struct foo_priv *foo, void *req)
  {
	lock(&foo->private_lock);
	if (--foo->num_pending_requests == 0) {
		pm_runtime_mark_last_busy(&foo->dev);
		pm_runtime_put_autosuspend(&foo->dev);
	} else {
		foo_process_next_request(foo);
	}
	unlock(&foo->private_lock);
	/* Send req result back to the user ... */
  }

  int foo_runtime_suspend(struct device *dev)
  {
	struct foo_priv *foo = container_of(dev, ...);
	int ret = 0;

	lock(&foo->private_lock);
	if (foo->num_pending_requests > 0) {
		ret = -EBUSY;
	} else {
		/* ... suspend the device ... */
		foo->is_suspended = 1;
	}
	unlock(&foo->private_lock);
	return ret;
  }

  int foo_runtime_resume(struct device *dev)
  {
	struct foo_priv *foo = container_of(dev, ...);

	lock(&foo->private_lock);
	/* ... resume the device ... */
	foo->is_suspended = 0;
	pm_runtime_mark_last_busy(&foo->dev);
	if (foo->num_pending_requests > 0)
		foo_process_next_request(foo);
	unlock(&foo->private_lock);
	return 0;
  }

The important point is that after foo_io_completion() asks for an autosuspend,
the foo_runtime_suspend() callback may race with foo_read_or_write().
Therefore foo_runtime_suspend() has to check whether there are any pending I/O
requests (while holding the private lock) before allowing the suspend to
proceed.

In addition, the power.autosuspend_delay field can be changed by user space at
any time. If a driver cares about this, it can call
pm_runtime_autosuspend_expiration() from within the ->runtime_suspend()
callback while holding its private lock. If the function returns a nonzero
value then the delay has not yet expired and the callback should return
-EAGAIN.
+87
Documentation/power/s2ram.rst
··· 1 + ======================== 2 + How to get s2ram working 3 + ======================== 4 + 5 + 2006 Linus Torvalds 6 + 2006 Pavel Machek 7 + 8 + 1) Check suspend.sf.net; the s2ram program there has a long whitelist of 9 + "known ok" machines, along with tricks to use on each one. 10 + 11 + 2) If that does not help, try reading tricks.txt and 12 + video.txt. Perhaps the problem is as simple as a broken module, and 13 + a simple module unload can fix it. 14 + 15 + 3) You can use Linus' TRACE_RESUME infrastructure, described below. 16 + 17 + Using TRACE_RESUME 18 + ~~~~~~~~~~~~~~~~~~ 19 + 20 + I've been working at making the machines I have able to STR, and almost 21 + always it's a driver that is buggy. Thank God for the suspend/resume 22 + debugging - the thing that Chuck tried to disable. That's often the _only_ 23 + way to debug these things, and it's actually pretty powerful (but 24 + time-consuming - having to insert TRACE_RESUME() markers into the device 25 + driver that doesn't resume and recompile and reboot). 26 + 27 + Anyway, the way to debug this for people who are interested (have a 28 + machine that doesn't boot) is: 29 + 30 + - enable PM_DEBUG, and PM_TRACE 31 + 32 + - use a script like this:: 33 + 34 + #!/bin/sh 35 + sync 36 + echo 1 > /sys/power/pm_trace 37 + echo mem > /sys/power/state 38 + 39 + to suspend 40 + 41 + - if it doesn't come back up (which is usually the problem), reboot by 42 + holding the power button down, and look at the dmesg output for things 43 + like:: 44 + 45 + Magic number: 4:156:725 46 + hash matches drivers/base/power/resume.c:28 47 + hash matches device 0000:01:00.0 48 + 49 + which means that the last trace event was just before trying to resume 50 + device 0000:01:00.0. Then figure out what driver is controlling that 51 + device (lspci and /sys/devices/pci* is your friend), and see if you can 52 + fix it, disable it, or trace into its resume function. 

53 + 54 + If no device matches the hash (or any matches appear to be false positives), 55 + the culprit may be a device from a loadable kernel module that is not loaded 56 + until after the hash is checked. You can check the hash against the current 57 + devices again after more modules are loaded using sysfs:: 58 + 59 + cat /sys/power/pm_trace_dev_match 60 + 61 + For example, the above happens to be the VGA device on my EVO, which I 62 + used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out 63 + that "radeonfb" simply cannot resume that device - it tries to set the 64 + PLL's, and it just _hangs_. Using the regular VGA console and letting X 65 + resume it instead works fine. 66 + 67 + NOTE 68 + ==== 69 + pm_trace uses the system's Real Time Clock (RTC) to save the magic number. 70 + Reason for this is that the RTC is the only reliably available piece of 71 + hardware during resume operations where a value can be set that will 72 + survive a reboot. 73 + 74 + pm_trace is not compatible with asynchronous suspend, so it turns 75 + asynchronous suspend off (which may work around timing or 76 + ordering-sensitive bugs). 77 + 78 + Consequence is that after a resume (even if it is successful) your system 79 + clock will have a value corresponding to the magic number instead of the 80 + correct date/time! It is therefore advisable to use a program like ntp-date 81 + or rdate to reset the correct date/time from an external time source when 82 + using this trace option. 83 + 84 + As the clock keeps ticking it is also essential that the reboot is done 85 + quickly after the resume failure. The trace option does not use the seconds 86 + or the low order bits of the minutes of the RTC, but a too long delay will 87 + corrupt the magic value.
-85
Documentation/power/s2ram.txt
··· 1 - How to get s2ram working 2 - ~~~~~~~~~~~~~~~~~~~~~~~~ 3 - 2006 Linus Torvalds 4 - 2006 Pavel Machek 5 - 6 - 1) Check suspend.sf.net, program s2ram there has long whitelist of 7 - "known ok" machines, along with tricks to use on each one. 8 - 9 - 2) If that does not help, try reading tricks.txt and 10 - video.txt. Perhaps problem is as simple as broken module, and 11 - simple module unload can fix it. 12 - 13 - 3) You can use Linus' TRACE_RESUME infrastructure, described below. 14 - 15 - Using TRACE_RESUME 16 - ~~~~~~~~~~~~~~~~~~ 17 - 18 - I've been working at making the machines I have able to STR, and almost 19 - always it's a driver that is buggy. Thank God for the suspend/resume 20 - debugging - the thing that Chuck tried to disable. That's often the _only_ 21 - way to debug these things, and it's actually pretty powerful (but 22 - time-consuming - having to insert TRACE_RESUME() markers into the device 23 - driver that doesn't resume and recompile and reboot). 24 - 25 - Anyway, the way to debug this for people who are interested (have a 26 - machine that doesn't boot) is: 27 - 28 - - enable PM_DEBUG, and PM_TRACE 29 - 30 - - use a script like this: 31 - 32 - #!/bin/sh 33 - sync 34 - echo 1 > /sys/power/pm_trace 35 - echo mem > /sys/power/state 36 - 37 - to suspend 38 - 39 - - if it doesn't come back up (which is usually the problem), reboot by 40 - holding the power button down, and look at the dmesg output for things 41 - like 42 - 43 - Magic number: 4:156:725 44 - hash matches drivers/base/power/resume.c:28 45 - hash matches device 0000:01:00.0 46 - 47 - which means that the last trace event was just before trying to resume 48 - device 0000:01:00.0. Then figure out what driver is controlling that 49 - device (lspci and /sys/devices/pci* is your friend), and see if you can 50 - fix it, disable it, or trace into its resume function. 
51 - 52 - If no device matches the hash (or any matches appear to be false positives), 53 - the culprit may be a device from a loadable kernel module that is not loaded 54 - until after the hash is checked. You can check the hash against the current 55 - devices again after more modules are loaded using sysfs: 56 - 57 - cat /sys/power/pm_trace_dev_match 58 - 59 - For example, the above happens to be the VGA device on my EVO, which I 60 - used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out 61 - that "radeonfb" simply cannot resume that device - it tries to set the 62 - PLL's, and it just _hangs_. Using the regular VGA console and letting X 63 - resume it instead works fine. 64 - 65 - NOTE 66 - ==== 67 - pm_trace uses the system's Real Time Clock (RTC) to save the magic number. 68 - Reason for this is that the RTC is the only reliably available piece of 69 - hardware during resume operations where a value can be set that will 70 - survive a reboot. 71 - 72 - pm_trace is not compatible with asynchronous suspend, so it turns 73 - asynchronous suspend off (which may work around timing or 74 - ordering-sensitive bugs). 75 - 76 - Consequence is that after a resume (even if it is successful) your system 77 - clock will have a value corresponding to the magic number instead of the 78 - correct date/time! It is therefore advisable to use a program like ntp-date 79 - or rdate to reset the correct date/time from an external time source when 80 - using this trace option. 81 - 82 - As the clock keeps ticking it is also essential that the reboot is done 83 - quickly after the resume failure. The trace option does not use the seconds 84 - or the low order bits of the minutes of the RTC, but a too long delay will 85 - corrupt the magic value.
+286
Documentation/power/suspend-and-cpuhotplug.rst
··· 1 + ==================================================================== 2 + Interaction of Suspend code (S3) with the CPU hotplug infrastructure 3 + ==================================================================== 4 + 5 + (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> 6 + 7 + 8 + I. Differences between CPU hotplug and Suspend-to-RAM 9 + ====================================================== 10 + 11 + How does the regular CPU hotplug code differ from how the Suspend-to-RAM 12 + infrastructure uses it internally? And where do they share common code? 13 + 14 + Well, a picture is worth a thousand words... So ASCII art follows :-) 15 + 16 + [This depicts the current design in the kernel, and focusses only on the 17 + interactions involving the freezer and CPU hotplug and also tries to explain 18 + the locking involved. It outlines the notifications involved as well. 19 + But please note that here, only the call paths are illustrated, with the aim 20 + of describing where they take different paths and where they share code. 21 + What happens when regular CPU hotplug and Suspend-to-RAM race with each other 22 + is not depicted here.] 
23 + 24 + On a high level, the suspend-resume cycle goes like this:: 25 + 26 + |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | 27 + |tasks | | cpus | | | | cpus | |tasks| 28 + 29 + 30 + More details follow:: 31 + 32 + Suspend call path 33 + ----------------- 34 + 35 + Write 'mem' to 36 + /sys/power/state 37 + sysfs file 38 + | 39 + v 40 + Acquire system_transition_mutex lock 41 + | 42 + v 43 + Send PM_SUSPEND_PREPARE 44 + notifications 45 + | 46 + v 47 + Freeze tasks 48 + | 49 + | 50 + v 51 + disable_nonboot_cpus() 52 + /* start */ 53 + | 54 + v 55 + Acquire cpu_add_remove_lock 56 + | 57 + v 58 + Iterate over CURRENTLY 59 + online CPUs 60 + | 61 + | 62 + | ---------- 63 + v | L 64 + ======> _cpu_down() | 65 + | [This takes cpuhotplug.lock | 66 + Common | before taking down the CPU | 67 + code | and releases it when done] | O 68 + | While it is at it, notifications | 69 + | are sent when notable events occur, | 70 + ======> by running all registered callbacks. | 71 + | | O 72 + | | 73 + | | 74 + v | 75 + Note down these cpus in | P 76 + frozen_cpus mask ---------- 77 + | 78 + v 79 + Disable regular cpu hotplug 80 + by increasing cpu_hotplug_disabled 81 + | 82 + v 83 + Release cpu_add_remove_lock 84 + | 85 + v 86 + /* disable_nonboot_cpus() complete */ 87 + | 88 + v 89 + Do suspend 90 + 91 + 92 + 93 + Resuming back is likewise, with the counterparts being (in the order of 94 + execution during resume): 95 + 96 + * enable_nonboot_cpus() which involves:: 97 + 98 + | Acquire cpu_add_remove_lock 99 + | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug 100 + | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] 101 + | Release cpu_add_remove_lock 102 + v 103 + 104 + * thaw tasks 105 + * send PM_POST_SUSPEND notifications 106 + * Release system_transition_mutex lock. 
107 + 108 + 109 + It is to be noted here that the system_transition_mutex lock is acquired at the very 110 + beginning, when we are just starting out to suspend, and then released only 111 + after the entire cycle is complete (i.e., suspend + resume). 112 + 113 + :: 114 + 115 + 116 + 117 + Regular CPU hotplug call path 118 + ----------------------------- 119 + 120 + Write 0 (or 1) to 121 + /sys/devices/system/cpu/cpu*/online 122 + sysfs file 123 + | 124 + | 125 + v 126 + cpu_down() 127 + | 128 + v 129 + Acquire cpu_add_remove_lock 130 + | 131 + v 132 + If cpu_hotplug_disabled > 0 133 + return gracefully 134 + | 135 + | 136 + v 137 + ======> _cpu_down() 138 + | [This takes cpuhotplug.lock 139 + Common | before taking down the CPU 140 + code | and releases it when done] 141 + | While it is at it, notifications 142 + | are sent when notable events occur, 143 + ======> by running all registered callbacks. 144 + | 145 + | 146 + v 147 + Release cpu_add_remove_lock 148 + [That's it!, for 149 + regular CPU hotplug] 150 + 151 + 152 + 153 + So, as can be seen from the two diagrams (the parts marked as "Common code"), 154 + regular CPU hotplug and the suspend code path converge at the _cpu_down() and 155 + _cpu_up() functions. They differ in the arguments passed to these functions, 156 + in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' 157 + argument. But during suspend, since the tasks are already frozen by the time 158 + the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called 159 + with the 'tasks_frozen' argument set to 1. 160 + [See below for some known issues regarding this.] 
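The regular hotplug path above is driven entirely from sysfs, in the same style as the doc's own scripts. A read-only sketch (the write commands are shown only as comments, since they require root and actually take a CPU offline):

```shell
#!/bin/sh
# Show which CPUs are currently online (read-only, no root needed).
online=$(cat /sys/devices/system/cpu/online 2>/dev/null || echo "unknown")
echo "online CPUs: ${online}"

# The regular hotplug operations the diagram describes would be:
#   echo 0 > /sys/devices/system/cpu/cpu1/online   # cpu_down() path
#   echo 1 > /sys/devices/system/cpu/cpu1/online   # cpu_up() path
# Both are refused while cpu_hotplug_disabled > 0, i.e. inside the
# suspend window between disable_nonboot_cpus() and enable_nonboot_cpus().
```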
161 + 162 + 163 + Important files and functions/entry points: 164 + ------------------------------------------- 165 + 166 + - kernel/power/process.c : freeze_processes(), thaw_processes() 167 + - kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() 168 + - kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() 169 + 170 + 171 + 172 + II. What are the issues involved in CPU hotplug? 173 + ------------------------------------------------ 174 + 175 + There are some interesting situations involving CPU hotplug and microcode 176 + update on the CPUs, as discussed below: 177 + 178 + [Please bear in mind that the kernel requests the microcode images from 179 + userspace, using the request_firmware() function defined in 180 + drivers/base/firmware_loader/main.c] 181 + 182 + 183 + a. When all the CPUs are identical: 184 + 185 + This is the most common situation and it is quite straightforward: we want 186 + to apply the same microcode revision to each of the CPUs. 187 + To take the example of x86, the collect_cpu_info() function defined in 188 + arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU 189 + and thereby in applying the correct microcode revision to it. 190 + But note that the kernel does not maintain a common microcode image for 191 + all the CPUs, in order to handle case 'b' described below. 192 + 193 + 194 + b. When some of the CPUs are different from the rest: 195 + 196 + In this case, since we probably need to apply different microcode revisions 197 + to different CPUs, the kernel maintains a copy of the correct microcode 198 + image for each CPU (after appropriate CPU type/model discovery using 199 + functions such as collect_cpu_info()). 200 + 201 + 202 + c. 
When a CPU is physically hot-unplugged and a new (and possibly different 203 + type of) CPU is hot-plugged into the system: 204 + 205 + In the current design of the kernel, whenever a CPU is taken offline during 206 + a regular CPU hotplug operation, upon receiving the CPU_DEAD notification 207 + (which is sent by the CPU hotplug code), the microcode update driver's 208 + callback for that event reacts by freeing the kernel's copy of the 209 + microcode image for that CPU. 210 + 211 + Hence, when a new CPU is brought online, since the kernel finds that it 212 + doesn't have the microcode image, it does the CPU type/model discovery 213 + afresh and then requests the userspace for the appropriate microcode image 214 + for that CPU, which is subsequently applied. 215 + 216 + For example, in x86, the mc_cpu_callback() function (which is the microcode 217 + update driver's callback registered for CPU hotplug events) calls 218 + microcode_update_cpu() which would call microcode_init_cpu() in this case, 219 + instead of microcode_resume_cpu() when it finds that the kernel doesn't 220 + have a valid microcode image. This ensures that the CPU type/model 221 + discovery is performed and the right microcode is applied to the CPU after 222 + getting it from userspace. 223 + 224 + 225 + d. Handling microcode update during suspend/hibernate: 226 + 227 + Strictly speaking, during a CPU hotplug operation which does not involve 228 + physically removing or inserting CPUs, the CPUs are not actually powered 229 + off during a CPU offline. They are just put to the lowest C-states possible. 230 + Hence, in such a case, it is not really necessary to re-apply microcode 231 + when the CPUs are brought back online, since they wouldn't have lost the 232 + image during the CPU offline operation. 233 + 234 + This is the usual scenario encountered during a resume after a suspend. 
235 + However, in the case of hibernation, since all the CPUs are completely 236 + powered off, during restore it becomes necessary to apply the microcode 237 + images to all the CPUs. 238 + 239 + [Note that we don't expect someone to physically pull out nodes and insert 240 + nodes with a different type of CPUs in-between a suspend-resume or a 241 + hibernate/restore cycle.] 242 + 243 + In the current design of the kernel however, during a CPU offline operation 244 + as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), 245 + the existing copy of microcode image in the kernel is not freed up. 246 + And during the CPU online operations (during resume/restore), since the 247 + kernel finds that it already has copies of the microcode images for all the 248 + CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU 249 + type/model and the need for validating whether the microcode revisions are 250 + right for the CPUs or not (due to the above assumption that physical CPU 251 + hotplug will not be done in-between suspend/resume or hibernate/restore 252 + cycles). 253 + 254 + 255 + III. Known problems 256 + =================== 257 + 258 + Are there any known problems when regular CPU hotplug and suspend race 259 + with each other? 260 + 261 + Yes, they are listed below: 262 + 263 + 1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to 264 + the _cpu_down() and _cpu_up() functions is *always* 0. 265 + This might not reflect the true current state of the system, since the 266 + tasks could have been frozen by an out-of-band event such as a suspend 267 + operation in progress. Hence, the cpuhp_tasks_frozen variable will not 268 + reflect the frozen state and the CPU hotplug callbacks which evaluate 269 + that variable might execute the wrong code path. 270 + 271 + 2. 
If a regular CPU hotplug stress test happens to race with the freezer due 272 + to a suspend operation in progress at the same time, then we could hit the 273 + situation described below: 274 + 275 + * A regular cpu online operation continues its journey from userspace 276 + into the kernel, since the freezing has not yet begun. 277 + * Then the freezer gets to work and freezes userspace. 278 + * If the cpu online operation has not yet completed the microcode update by now, 279 + it will now start waiting on the frozen userspace in the 280 + TASK_UNINTERRUPTIBLE state, in order to get the microcode image. 281 + * Now the freezer continues and tries to freeze the remaining tasks. But 282 + due to the wait mentioned above, the freezer won't be able to freeze 283 + the cpu online hotplug task, and hence freezing of tasks fails. 284 + 285 + As a result of this task freezing failure, the suspend operation gets 286 + aborted.
-274
Documentation/power/suspend-and-cpuhotplug.txt
··· 1 - Interaction of Suspend code (S3) with the CPU hotplug infrastructure 2 - 3 - (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> 4 - 5 - 6 - I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM 7 - infrastructure uses it internally? And where do they share common code? 8 - 9 - Well, a picture is worth a thousand words... So ASCII art follows :-) 10 - 11 - [This depicts the current design in the kernel, and focusses only on the 12 - interactions involving the freezer and CPU hotplug and also tries to explain 13 - the locking involved. It outlines the notifications involved as well. 14 - But please note that here, only the call paths are illustrated, with the aim 15 - of describing where they take different paths and where they share code. 16 - What happens when regular CPU hotplug and Suspend-to-RAM race with each other 17 - is not depicted here.] 18 - 19 - On a high level, the suspend-resume cycle goes like this: 20 - 21 - |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | 22 - |tasks | | cpus | | | | cpus | |tasks| 23 - 24 - 25 - More details follow: 26 - 27 - Suspend call path 28 - ----------------- 29 - 30 - Write 'mem' to 31 - /sys/power/state 32 - sysfs file 33 - | 34 - v 35 - Acquire system_transition_mutex lock 36 - | 37 - v 38 - Send PM_SUSPEND_PREPARE 39 - notifications 40 - | 41 - v 42 - Freeze tasks 43 - | 44 - | 45 - v 46 - disable_nonboot_cpus() 47 - /* start */ 48 - | 49 - v 50 - Acquire cpu_add_remove_lock 51 - | 52 - v 53 - Iterate over CURRENTLY 54 - online CPUs 55 - | 56 - | 57 - | ---------- 58 - v | L 59 - ======> _cpu_down() | 60 - | [This takes cpuhotplug.lock | 61 - Common | before taking down the CPU | 62 - code | and releases it when done] | O 63 - | While it is at it, notifications | 64 - | are sent when notable events occur, | 65 - ======> by running all registered callbacks. 
| 66 - | | O 67 - | | 68 - | | 69 - v | 70 - Note down these cpus in | P 71 - frozen_cpus mask ---------- 72 - | 73 - v 74 - Disable regular cpu hotplug 75 - by increasing cpu_hotplug_disabled 76 - | 77 - v 78 - Release cpu_add_remove_lock 79 - | 80 - v 81 - /* disable_nonboot_cpus() complete */ 82 - | 83 - v 84 - Do suspend 85 - 86 - 87 - 88 - Resuming back is likewise, with the counterparts being (in the order of 89 - execution during resume): 90 - * enable_nonboot_cpus() which involves: 91 - | Acquire cpu_add_remove_lock 92 - | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug 93 - | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] 94 - | Release cpu_add_remove_lock 95 - v 96 - 97 - * thaw tasks 98 - * send PM_POST_SUSPEND notifications 99 - * Release system_transition_mutex lock. 100 - 101 - 102 - It is to be noted here that the system_transition_mutex lock is acquired at the very 103 - beginning, when we are just starting out to suspend, and then released only 104 - after the entire cycle is complete (i.e., suspend + resume). 105 - 106 - 107 - 108 - Regular CPU hotplug call path 109 - ----------------------------- 110 - 111 - Write 0 (or 1) to 112 - /sys/devices/system/cpu/cpu*/online 113 - sysfs file 114 - | 115 - | 116 - v 117 - cpu_down() 118 - | 119 - v 120 - Acquire cpu_add_remove_lock 121 - | 122 - v 123 - If cpu_hotplug_disabled > 0 124 - return gracefully 125 - | 126 - | 127 - v 128 - ======> _cpu_down() 129 - | [This takes cpuhotplug.lock 130 - Common | before taking down the CPU 131 - code | and releases it when done] 132 - | While it is at it, notifications 133 - | are sent when notable events occur, 134 - ======> by running all registered callbacks. 
135 - | 136 - | 137 - v 138 - Release cpu_add_remove_lock 139 - [That's it!, for 140 - regular CPU hotplug] 141 - 142 - 143 - 144 - So, as can be seen from the two diagrams (the parts marked as "Common code"), 145 - regular CPU hotplug and the suspend code path converge at the _cpu_down() and 146 - _cpu_up() functions. They differ in the arguments passed to these functions, 147 - in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' 148 - argument. But during suspend, since the tasks are already frozen by the time 149 - the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called 150 - with the 'tasks_frozen' argument set to 1. 151 - [See below for some known issues regarding this.] 152 - 153 - 154 - Important files and functions/entry points: 155 - ------------------------------------------ 156 - 157 - kernel/power/process.c : freeze_processes(), thaw_processes() 158 - kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() 159 - kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() 160 - 161 - 162 - 163 - II. What are the issues involved in CPU hotplug? 164 - ------------------------------------------- 165 - 166 - There are some interesting situations involving CPU hotplug and microcode 167 - update on the CPUs, as discussed below: 168 - 169 - [Please bear in mind that the kernel requests the microcode images from 170 - userspace, using the request_firmware() function defined in 171 - drivers/base/firmware_loader/main.c] 172 - 173 - 174 - a. When all the CPUs are identical: 175 - 176 - This is the most common situation and it is quite straightforward: we want 177 - to apply the same microcode revision to each of the CPUs. 178 - To give an example of x86, the collect_cpu_info() function defined in 179 - arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU 180 - and thereby in applying the correct microcode revision to it. 
181 - But note that the kernel does not maintain a common microcode image for the 182 - all CPUs, in order to handle case 'b' described below. 183 - 184 - 185 - b. When some of the CPUs are different than the rest: 186 - 187 - In this case since we probably need to apply different microcode revisions 188 - to different CPUs, the kernel maintains a copy of the correct microcode 189 - image for each CPU (after appropriate CPU type/model discovery using 190 - functions such as collect_cpu_info()). 191 - 192 - 193 - c. When a CPU is physically hot-unplugged and a new (and possibly different 194 - type of) CPU is hot-plugged into the system: 195 - 196 - In the current design of the kernel, whenever a CPU is taken offline during 197 - a regular CPU hotplug operation, upon receiving the CPU_DEAD notification 198 - (which is sent by the CPU hotplug code), the microcode update driver's 199 - callback for that event reacts by freeing the kernel's copy of the 200 - microcode image for that CPU. 201 - 202 - Hence, when a new CPU is brought online, since the kernel finds that it 203 - doesn't have the microcode image, it does the CPU type/model discovery 204 - afresh and then requests the userspace for the appropriate microcode image 205 - for that CPU, which is subsequently applied. 206 - 207 - For example, in x86, the mc_cpu_callback() function (which is the microcode 208 - update driver's callback registered for CPU hotplug events) calls 209 - microcode_update_cpu() which would call microcode_init_cpu() in this case, 210 - instead of microcode_resume_cpu() when it finds that the kernel doesn't 211 - have a valid microcode image. This ensures that the CPU type/model 212 - discovery is performed and the right microcode is applied to the CPU after 213 - getting it from userspace. 214 - 215 - 216 - d. 
Handling microcode update during suspend/hibernate: 217 - 218 - Strictly speaking, during a CPU hotplug operation which does not involve 219 - physically removing or inserting CPUs, the CPUs are not actually powered 220 - off during a CPU offline. They are just put to the lowest C-states possible. 221 - Hence, in such a case, it is not really necessary to re-apply microcode 222 - when the CPUs are brought back online, since they wouldn't have lost the 223 - image during the CPU offline operation. 224 - 225 - This is the usual scenario encountered during a resume after a suspend. 226 - However, in the case of hibernation, since all the CPUs are completely 227 - powered off, during restore it becomes necessary to apply the microcode 228 - images to all the CPUs. 229 - 230 - [Note that we don't expect someone to physically pull out nodes and insert 231 - nodes with a different type of CPUs in-between a suspend-resume or a 232 - hibernate/restore cycle.] 233 - 234 - In the current design of the kernel however, during a CPU offline operation 235 - as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), 236 - the existing copy of microcode image in the kernel is not freed up. 237 - And during the CPU online operations (during resume/restore), since the 238 - kernel finds that it already has copies of the microcode images for all the 239 - CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU 240 - type/model and the need for validating whether the microcode revisions are 241 - right for the CPUs or not (due to the above assumption that physical CPU 242 - hotplug will not be done in-between suspend/resume or hibernate/restore 243 - cycles). 244 - 245 - 246 - III. Are there any known problems when regular CPU hotplug and suspend race 247 - with each other? 248 - 249 - Yes, they are listed below: 250 - 251 - 1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to 252 - the _cpu_down() and _cpu_up() functions is *always* 0. 
253 - This might not reflect the true current state of the system, since the 254 - tasks could have been frozen by an out-of-band event such as a suspend 255 - operation in progress. Hence, the cpuhp_tasks_frozen variable will not 256 - reflect the frozen state and the CPU hotplug callbacks which evaluate 257 - that variable might execute the wrong code path. 258 - 259 - 2. If a regular CPU hotplug stress test happens to race with the freezer due 260 - to a suspend operation in progress at the same time, then we could hit the 261 - situation described below: 262 - 263 - * A regular cpu online operation continues its journey from userspace 264 - into the kernel, since the freezing has not yet begun. 265 - * Then freezer gets to work and freezes userspace. 266 - * If cpu online has not yet completed the microcode update stuff by now, 267 - it will now start waiting on the frozen userspace in the 268 - TASK_UNINTERRUPTIBLE state, in order to get the microcode image. 269 - * Now the freezer continues and tries to freeze the remaining tasks. But 270 - due to this wait mentioned above, the freezer won't be able to freeze 271 - the cpu online hotplug task and hence freezing of tasks fails. 272 - 273 - As a result of this task freezing failure, the suspend operation gets 274 - aborted.
+137
Documentation/power/suspend-and-interrupts.rst
··· 1 + ==================================== 2 + System Suspend and Device Interrupts 3 + ==================================== 4 + 5 + Copyright (C) 2014 Intel Corp. 6 + Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> 7 + 8 + 9 + Suspending and Resuming Device IRQs 10 + ----------------------------------- 11 + 12 + Device interrupt request lines (IRQs) are generally disabled during system 13 + suspend after the "late" phase of suspending devices (that is, after all of the 14 + ->prepare, ->suspend and ->suspend_late callbacks have been executed for all 15 + devices). That is done by suspend_device_irqs(). 16 + 17 + The rationale for doing so is that after the "late" phase of device suspend 18 + there is no legitimate reason why any interrupts from suspended devices should 19 + trigger, and if any devices have not been suspended properly yet, it is better to 20 + block interrupts from them anyway. Also, in the past we had problems with 21 + interrupt handlers for shared IRQs whose device drivers were 22 + not prepared for interrupts triggering after their devices had been suspended. 23 + In some cases they would attempt to access, for example, memory address spaces 24 + of suspended devices and cause unpredictable behavior to ensue as a result. 25 + Unfortunately, such problems are very difficult to debug and the introduction 26 + of suspend_device_irqs(), along with the "noirq" phase of device suspend and 27 + resume, was the only practical way to mitigate them. 28 + 29 + Device IRQs are re-enabled during system resume, right before the "early" phase 30 + of resuming devices (that is, before starting to execute ->resume_early 31 + callbacks for devices). The function doing that is resume_device_irqs(). 
32 + 33 + 34 + The IRQF_NO_SUSPEND Flag 35 + ------------------------ 36 + 37 + There are interrupts that can legitimately trigger during the entire system 38 + suspend-resume cycle, including the "noirq" phases of suspending and resuming 39 + devices as well as during the time when nonboot CPUs are taken offline and 40 + brought back online. That applies to timer interrupts in the first place, 41 + but also to IPIs and to some other special-purpose interrupts. 42 + 43 + The IRQF_NO_SUSPEND flag is used to indicate that to the IRQ subsystem when 44 + requesting a special-purpose interrupt. It causes suspend_device_irqs() to 45 + leave the corresponding IRQ enabled so as to allow the interrupt to work as 46 + expected during the suspend-resume cycle, but does not guarantee that the 47 + interrupt will wake the system from a suspended state -- for such cases it is 48 + necessary to use enable_irq_wake(). 49 + 50 + Note that the IRQF_NO_SUSPEND flag affects the entire IRQ and not just one 51 + user of it. Thus, if the IRQ is shared, all of the interrupt handlers installed 52 + for it will be executed as usual after suspend_device_irqs(), even if the 53 + IRQF_NO_SUSPEND flag was not passed to request_irq() (or equivalent) by some of 54 + the IRQ's users. For this reason, using IRQF_NO_SUSPEND and IRQF_SHARED at the 55 + same time should be avoided. 56 + 57 + 58 + System Wakeup Interrupts, enable_irq_wake() and disable_irq_wake() 59 + ------------------------------------------------------------------ 60 + 61 + System wakeup interrupts generally need to be configured to wake up the system 62 + from sleep states, especially if they are used for different purposes (e.g. as 63 + I/O interrupts) in the working state. 64 + 65 + That may involve turning on a special signal handling logic within the platform 66 + (such as an SoC) so that signals from a given line are routed in a different way 67 + during system sleep so as to trigger a system wakeup when needed. 
For example, 68 + the platform may include a dedicated interrupt controller used specifically for 69 + handling system wakeup events. Then, if a given interrupt line is supposed to 70 + wake up the system from sleep states, the corresponding input of that interrupt 71 + controller needs to be enabled to receive signals from the line in question. 72 + After wakeup, it is generally better to disable that input to prevent the 73 + dedicated controller from triggering interrupts unnecessarily. 74 + 75 + The IRQ subsystem provides two helper functions to be used by device drivers for 76 + those purposes. Namely, enable_irq_wake() turns on the platform's logic for 77 + handling the given IRQ as a system wakeup interrupt line and disable_irq_wake() 78 + turns that logic off. 79 + 80 + Calling enable_irq_wake() causes suspend_device_irqs() to treat the given IRQ 81 + in a special way. Namely, the IRQ remains enabled, but on the first interrupt 82 + it will be disabled, marked as pending and "suspended" so that it will be 83 + re-enabled by resume_device_irqs() during the subsequent system resume. Also, 84 + the PM core is notified about the event, which causes the system suspend in 85 + progress to be aborted (that doesn't have to happen immediately, but at one 86 + of the points where the suspend thread looks for pending wakeup events). 87 + 88 + This way every interrupt from a wakeup interrupt source will either cause the 89 + system suspend currently in progress to be aborted or wake up the system if 90 + already suspended. However, after suspend_device_irqs() interrupt handlers are 91 + not executed for system wakeup IRQs. They are only executed for IRQF_NO_SUSPEND 92 + IRQs at that time, but those IRQs should not be configured for system wakeup 93 + using enable_irq_wake(). 
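Whether a driver actually calls enable_irq_wake() for its IRQ is usually governed by the device's user-visible wakeup policy, which user space can toggle through the device's sysfs ``power/wakeup`` attribute (drivers typically consult it via device_may_wakeup()). A minimal sketch; the helper name ``set_wakeup`` and the example device path are illustrative, not part of any standard tool:

```shell
# Toggle a device's wakeup policy via its sysfs "power/wakeup" attribute.
# Drivers commonly check this policy (device_may_wakeup()) to decide
# whether to call enable_irq_wake() on their IRQ during suspend.
# $1 = device sysfs directory, $2 = "enabled" or "disabled"
set_wakeup() {
    printf '%s\n' "$2" > "$1/power/wakeup"
}

# e.g. (path illustrative):
#   set_wakeup /sys/devices/platform/serial8250 enabled
```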
94 + 95 + 96 + Interrupts and Suspend-to-Idle 97 + ------------------------------ 98 + 99 + Suspend-to-idle (also known as the "freeze" sleep state) is a relatively new 100 + system sleep state that works by idling all of the processors and waiting for 101 + interrupts right after the "noirq" phase of suspending devices. 102 + 103 + Of course, this means that all of the interrupts with the IRQF_NO_SUSPEND flag 104 + set will bring CPUs out of idle while in that state, but they will not cause the 105 + IRQ subsystem to trigger a system wakeup. 106 + 107 + System wakeup interrupts, in turn, will trigger wakeup from suspend-to-idle in 108 + analogy with what they do in the full system suspend case. The only difference 109 + is that the wakeup from suspend-to-idle is signaled using the usual working 110 + state interrupt delivery mechanisms and doesn't require the platform to use 111 + any special interrupt handling logic for it to work. 112 + 113 + 114 + IRQF_NO_SUSPEND and enable_irq_wake() 115 + ------------------------------------- 116 + 117 + There are very few valid reasons to use both enable_irq_wake() and the 118 + IRQF_NO_SUSPEND flag on the same IRQ, and it is never valid to use both for the 119 + same device. 120 + 121 + First of all, if the IRQ is not shared, the rules for handling IRQF_NO_SUSPEND 122 + interrupts (interrupt handlers are invoked after suspend_device_irqs()) are 123 + directly at odds with the rules for handling system wakeup interrupts (interrupt 124 + handlers are not invoked after suspend_device_irqs()). 125 + 126 + Second, both enable_irq_wake() and IRQF_NO_SUSPEND apply to entire IRQs and not 127 + to individual interrupt handlers, so sharing an IRQ between a system wakeup 128 + interrupt source and an IRQF_NO_SUSPEND interrupt source does not generally 129 + make sense. 130 + 131 + In rare cases an IRQ can be shared between a wakeup device driver and an 132 + IRQF_NO_SUSPEND user. 
In order for this to be safe, the wakeup device driver 133 + must be able to discern spurious IRQs from genuine wakeup events (signalling 134 + the latter to the core with pm_system_wakeup()), must use enable_irq_wake() to 135 + ensure that the IRQ will function as a wakeup source, and must request the IRQ 136 + with IRQF_COND_SUSPEND to tell the core that it meets these requirements. If 137 + these requirements are not met, it is not valid to use IRQF_COND_SUSPEND.
-135
Documentation/power/suspend-and-interrupts.txt
+63
Documentation/power/swsusp-and-swap-files.rst
··· 1 + =============================================== 2 + Using swap files with software suspend (swsusp) 3 + =============================================== 4 + 5 + (C) 2006 Rafael J. Wysocki <rjw@sisk.pl> 6 + 7 + The Linux kernel handles swap files in almost the same way as it handles swap 8 + partitions and there are only two differences between these two types of swap 9 + areas: 10 + (1) swap files need not be contiguous, 11 + (2) the header of a swap file is not in the first block of the partition that 12 + holds it. From swsusp's point of view (1) is not a problem, because it is 13 + already taken care of by the swap-handling code, but (2) has to be taken into 14 + consideration. 15 + 16 + In principle the location of a swap file's header may be determined with the 17 + help of an appropriate filesystem driver. Unfortunately, however, it requires the 18 + filesystem holding the swap file to be mounted, and if this filesystem is 19 + journaled, it cannot be mounted during resume from disk. For this reason, to 20 + identify a swap file, swsusp uses the name of the partition that holds the file 21 + and the offset from the beginning of the partition at which the swap file's 22 + header is located. For convenience, this offset is expressed in <PAGE_SIZE> 23 + units. 24 + 25 + In order to use a swap file with swsusp, you need to: 26 + 27 + 1) Create the swap file and make it active, e.g.:: 28 + 29 + # dd if=/dev/zero of=<swap_file_path> bs=1024 count=<swap_file_size_in_k> 30 + # mkswap <swap_file_path> 31 + # swapon <swap_file_path> 32 + 33 + 2) Use an application that will bmap the swap file with the help of the 34 + FIBMAP ioctl and determine the location of the file's swap header, as the 35 + offset, in <PAGE_SIZE> units, from the beginning of the partition which 36 + holds the swap file. 
37 + 38 + 3) Add the following parameters to the kernel command line:: 39 + 40 + resume=<swap_file_partition> resume_offset=<swap_file_offset> 41 + 42 + where <swap_file_partition> is the partition on which the swap file is located 43 + and <swap_file_offset> is the offset of the swap header determined by the 44 + application in 2) (of course, this step may be carried out automatically 45 + by the same application that determines the swap file's header offset using the 46 + FIBMAP ioctl). 47 + 48 + OR 49 + 50 + Use a userland suspend application that will set the partition and offset 51 + with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in 52 + Documentation/power/userland-swsusp.rst (this is the only method to suspend 53 + to a swap file allowing the resume to be initiated from an initrd or initramfs 54 + image). 55 + 56 + Now, swsusp will use the swap file in the same way in which it would use a swap 57 + partition. In particular, the swap file has to be active (i.e. be present in 58 + /proc/swaps) so that it can be used for suspending. 59 + 60 + Note that if the swap file used for suspending is deleted and recreated, 61 + the location of its header need not be the same as before. Thus every time 62 + this happens the value of the "resume_offset=" kernel command line parameter 63 + has to be updated.
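Step 2) above is often done with a small script rather than a dedicated bmap tool: on filesystems where filefrag(8) from e2fsprogs reports physical extents, the first physical block of the file gives the same answer. A sketch under the assumption that the filesystem block size equals PAGE_SIZE (4096 on most x86 systems; otherwise scale by blocksize/pagesize); the helper name is illustrative:

```shell
# Parse the first physical block of a file out of `filefrag -v` output.
# The first extent line starts with "0:" and its 4th field is the
# physical offset followed by "..", e.g. "38912..". With a filesystem
# block size equal to PAGE_SIZE this is exactly the resume_offset= value.
first_physical_block() {
    awk '$1 == "0:" { sub(/\.\..*/, "", $4); print $4; exit }'
}

# Typical use (needs root; the path is illustrative):
#   filefrag -v /swapfile | first_physical_block
```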
-60
Documentation/power/swsusp-and-swap-files.txt
+140
Documentation/power/swsusp-dmcrypt.rst
··· 1 + ======================================= 2 + How to use dm-crypt and swsusp together 3 + ======================================= 4 + 5 + Author: Andreas Steinmetz <ast@domdv.de> 6 + 7 + 8 + 9 + Some prerequisites: 10 + You know how dm-crypt works. If not, visit the following web page: 11 + http://www.saout.de/misc/dm-crypt/ 12 + You have read Documentation/power/swsusp.rst and understand it. 13 + You have read Documentation/admin-guide/initrd.rst and know how an initrd works. 14 + You know how to create or how to modify an initrd. 15 + 16 + Now your system is properly set up, your disk is encrypted except for 17 + the swap device(s) and the boot partition which may contain a mini 18 + system for crypto setup and/or rescue purposes. You may even have 19 + an initrd that does your current crypto setup already. 20 + 21 + At this point you want to encrypt your swap, too. Still you want to 22 + be able to suspend using swsusp. This, however, means that you 23 + have to be able to either enter a passphrase or read 24 + the key(s) from an external device like a pcmcia flash disk 25 + or a USB stick prior to resume. So you need an initrd that sets 26 + up dm-crypt and then asks swsusp to resume from the encrypted 27 + swap device. 28 + 29 + The most important thing is that you set up dm-crypt in such 30 + a way that the swap device you suspend to/resume from always has 31 + the same major/minor within the initrd as well as 32 + within your running system. The easiest way to achieve this is 33 + to always set up this swap device first with dmsetup, so that 34 + it will always look like the following:: 35 + 36 + brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 37 + 38 + Now set up your kernel to use /dev/mapper/swap0 as the default 39 + resume partition, so your kernel .config contains:: 40 + 41 + CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" 42 + 43 + Prepare your boot loader to use the initrd you will create or 44 + modify. 
For lilo the simplest setup looks like the following 45 + lines:: 46 + 47 + image=/boot/vmlinuz 48 + initrd=/boot/initrd.gz 49 + label=linux 50 + append="root=/dev/ram0 init=/linuxrc rw" 51 + 52 + Finally, you need to create or modify your initrd. Let's assume 53 + you create an initrd that reads the required dm-crypt setup 54 + from a pcmcia flash disk card. The card is formatted with an ext2 55 + fs which resides on /dev/hde1 when the card is inserted. The 56 + card contains at least the encrypted swap setup in a file 57 + named "swapkey". /etc/fstab of your initrd contains something 58 + like the following:: 59 + 60 + /dev/hda1 /mnt ext3 ro 0 0 61 + none /proc proc defaults,noatime,nodiratime 0 0 62 + none /sys sysfs defaults,noatime,nodiratime 0 0 63 + 64 + /dev/hda1 contains an unencrypted mini system that sets up all 65 + of your crypto devices, again by reading the setup from the 66 + pcmcia flash disk. What follows now is a /linuxrc for your 67 + initrd that allows you to resume from encrypted swap and that 68 + continues boot with your mini system on /dev/hda1 if resume 69 + does not happen:: 70 + 71 + #!/bin/sh 72 + PATH=/sbin:/bin:/usr/sbin:/usr/bin 73 + mount /proc 74 + mount /sys 75 + mapped=0 76 + noresume=`grep -c noresume /proc/cmdline` 77 + if [ "$*" != "" ] 78 + then 79 + noresume=1 80 + fi 81 + dmesg -n 1 82 + /sbin/cardmgr -q 83 + for i in 1 2 3 4 5 6 7 8 9 0 84 + do 85 + if [ -f /proc/ide/hde/media ] 86 + then 87 + usleep 500000 88 + mount -t ext2 -o ro /dev/hde1 /mnt 89 + if [ -f /mnt/swapkey ] 90 + then 91 + dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 92 + fi 93 + umount /mnt 94 + break 95 + fi 96 + usleep 500000 97 + done 98 + killproc /sbin/cardmgr 99 + dmesg -n 6 100 + if [ $mapped = 1 ] 101 + then 102 + if [ $noresume != 0 ] 103 + then 104 + mkswap /dev/mapper/swap0 > /dev/null 2>&1 105 + fi 106 + echo 254:0 > /sys/power/resume 107 + dmsetup remove swap0 108 + fi 109 + umount /sys 110 + mount /mnt 111 + umount /proc 112 + 
cd /mnt 113 + pivot_root . mnt 114 + mount /proc 115 + umount -l /mnt 116 + umount /proc 117 + exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 118 + 119 + Please don't mind the weird loop above; busybox's msh doesn't know 120 + the let statement. Now, what is happening in the script? 121 + First, we have to decide whether we want to try to resume or not. 122 + We will not resume if booting with "noresume" or with any parameters 123 + for init like "single" or "emergency". 124 + 125 + Then we need to set up dm-crypt with the setup data from the 126 + pcmcia flash disk. If this succeeds, we need to reset the swap 127 + device if we don't want to resume. The line "echo 254:0 > /sys/power/resume" 128 + then attempts to resume from the first device mapper device. 129 + Note that it is important to set the device in /sys/power/resume, 130 + regardless of whether you resume or not, otherwise later suspend will fail. 131 + If resume starts, script execution terminates here. 132 + 133 + Otherwise we just remove the encrypted swap device and leave it to the 134 + mini system on /dev/hda1 to set the whole crypto up (it is up to 135 + you to modify this to your taste). 136 + 137 + What then follows is the well-known process to change the root 138 + file system and continue booting from there. I prefer to unmount 139 + the initrd before continuing to boot, but it is up to you to modify 140 + this.
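Rather than hard-coding 254:0, the MAJOR:MINOR string written to /sys/power/resume can be derived from the device node itself. A sketch (stat(1)'s %t/%T format sequences print the major and minor in hex, so the shell's printf converts them back to decimal; the function name is made up):

```shell
# Print the MAJOR:MINOR of a device node in the decimal form that
# /sys/power/resume expects. stat(1) reports %t (major) and %T (minor)
# in hexadecimal, so convert via printf's 0x prefix handling.
dev_majmin() {
    printf '%d:%d\n' "0x$(stat -c %t "$1")" "0x$(stat -c %T "$1")"
}

# e.g. in the initrd:
#   dev_majmin /dev/mapper/swap0 > /sys/power/resume
```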
-138
Documentation/power/swsusp-dmcrypt.txt
+501
Documentation/power/swsusp.rst
··· 1 + ============ 2 + Swap suspend 3 + ============ 4 + 5 + Some warnings, first. 6 + 7 + .. warning:: 8 + 9 + **BIG FAT WARNING** 10 + 11 + If you touch anything on disk between suspend and resume... 12 + ...kiss your data goodbye. 13 + 14 + If you do resume from initrd after your filesystems are mounted... 15 + ...bye bye root partition. 16 + 17 + [this is actually the same case as above] 18 + 19 + If you have unsupported ( ) devices using DMA, you may have some 20 + problems. If your disk driver does not support suspend... (IDE does), 21 + it may cause some problems, too. If you change the kernel command line 22 + between suspend and resume, it may do something wrong. If you change 23 + your hardware while the system is suspended... well, it was not a good idea; 24 + but it will probably only crash. 25 + 26 + ( ) suspend/resume support is needed to make it safe. 27 + 28 + If you have any filesystems on USB devices mounted before software suspend, 29 + they won't be accessible after resume and you may lose data, as though 30 + you had unplugged the USB devices with mounted filesystems on them; 31 + see the FAQ below for details. (This is not true for more traditional 32 + power states like "standby", which normally don't turn USB off.) 33 + 34 + Swap partition: 35 + You need to append resume=/dev/your_swap_partition to the kernel command 36 + line or specify it using /sys/power/resume. 37 + 38 + Swap file: 39 + If using a swap file you can also specify a resume offset using 40 + resume_offset=<number> on the kernel command line or specify it 41 + in /sys/power/resume_offset. 
42 + 43 + After preparing, you then suspend by:: 44 + 45 + echo shutdown > /sys/power/disk; echo disk > /sys/power/state 46 + 47 + - If you feel ACPI works pretty well on your system, you might try:: 48 + 49 + echo platform > /sys/power/disk; echo disk > /sys/power/state 50 + 51 + - If you would like to write the hibernation image to swap and then suspend 52 + to RAM (provided your platform supports it), you can try:: 53 + 54 + echo suspend > /sys/power/disk; echo disk > /sys/power/state 55 + 56 + - If you have SATA disks, you'll need recent kernels with SATA suspend 57 + support. For suspend and resume to work, make sure your disk drivers 58 + are built into the kernel -- not modules. [There's a way to make 59 + suspend/resume work with modular disk drivers; see the FAQ, but you probably 60 + should not do that.] 61 + 62 + If you want to limit the suspend image size to N bytes, do:: 63 + 64 + echo N > /sys/power/image_size 65 + 66 + before suspend (it is limited to around 2/5 of available RAM by default). 67 + 68 + - The resume process checks for the presence of the resume device; 69 + if found, it then checks the contents for the hibernation image signature. 70 + If both are found, it resumes the hibernation image. 71 + 72 + - The resume process may be triggered in two ways: 73 + 74 + 1) During lateinit: If resume=/dev/your_swap_partition is specified on 75 + the kernel command line, lateinit runs the resume process. If the 76 + resume device has not been probed yet, the resume process fails and 77 + bootup continues. 78 + 2) Manually from an initrd or initramfs: May be run from 79 + the init script by using the /sys/power/resume file. It is vital 80 + that this be done prior to remounting any filesystems (even as 81 + read-only); otherwise data may be corrupted. 
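The default cap mentioned above (around 2/5 of RAM) can also be set explicitly rather than relying on the built-in default. A sketch that computes the same bound from /proc/meminfo; the helper name is illustrative, and the actual write to /sys/power/image_size needs root:

```shell
# Compute 2/5 of a MemTotal value (given in kB, as /proc/meminfo
# reports it) as a byte count suitable for /sys/power/image_size.
image_cap_bytes() {
    echo $(( $1 * 1024 * 2 / 5 ))
}

mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
# echo "$(image_cap_bytes "$mem_kb")" > /sys/power/image_size  # as root
```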
Article about goals and implementation of Software Suspend for Linux
====================================================================

Author: Gábor Kuti
Last revised: 2003-10-20 by Pavel Machek

Idea and goals to achieve
-------------------------

Nowadays it is common for laptops to have a suspend button. It saves the
state of the machine to a filesystem or to a partition and switches to
standby mode. Later, when resuming the machine, the saved state is loaded
back into RAM and the machine can continue its work. This has two real
benefits. First, we save ourselves the time the machine takes to go down
and later boot up; energy costs are really high when running from
batteries. The other gain is that we don't have to interrupt our
programs, so processes that have been calculating something for a long
time need not be written to be interruptible.

swsusp saves the state of the machine into active swaps and then reboots
or powers down. You must explicitly specify the swap partition to resume
from with the ``resume=`` kernel option. If a signature is found, it
loads and restores the saved state. If the option ``noresume`` is
specified as a boot parameter, it skips resuming. If the option
``hibernate=nocompress`` is specified as a boot parameter, it saves the
hibernation image without compression.

While the system is suspended you should not add or remove any hardware,
write to the filesystems, etc.
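How those boot options combine can be sketched as a small parser over an example command line (the live one is in /proc/cmdline; the values below are illustrative, not your real configuration):

```shell
# Hedged sketch: classify resume-related options on a kernel command line.
cmdline='root=/dev/sda1 resume=/dev/sda2 quiet'   # example string only
case " $cmdline " in
  *" noresume "*) resume_opt="noresume" ;;        # resuming is skipped
  *" resume="*)   resume_opt=$(printf '%s\n' "$cmdline" |
                               tr ' ' '\n' | grep '^resume=') ;;
  *)              resume_opt="" ;;
esac
echo "${resume_opt:-no resume option}"
```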
Sleep states summary
====================

There are three different interfaces you can use; /proc/acpi should
work like this:

In a really perfect world::

    echo 1 > /proc/acpi/sleep # for standby
    echo 2 > /proc/acpi/sleep # for suspend to ram
    echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservation
    echo 4 > /proc/acpi/sleep # for suspend to disk
    echo 5 > /proc/acpi/sleep # for unfriendly shutdown of the system

and perhaps::

    echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios

Frequently Asked Questions
==========================

Q:
  Well, suspending a server is IMHO a really stupid thing,
  but... (Diego Zuccato):

A:
  You bought a new UPS for your server. How do you install it without
  bringing the machine down? Suspend to disk, rearrange power cables,
  resume.

  You have your server on a UPS. Power died, and the UPS is indicating
  30 seconds to failure. What do you do? Suspend to disk.


Q:
  Maybe I'm missing something, but why don't the regular I/O paths work?

A:
  We do use the regular I/O paths. However, we cannot restore the data
  to its original location as we load it. That would create an
  inconsistent kernel state which would certainly result in an oops.
  Instead, we load the image into unused memory and then atomically copy
  it back to its original location. This implies, of course, a maximum
  image size of half the amount of memory.

  There are two solutions to this:

  * require half of memory to be free during suspend. That way you can
    read "new" data onto free spots, then cli and copy

  * assume we had a special "polling" IDE driver that only uses memory
    between 0-640KB.
    That way, I'd have to make sure that 0-640KB is free
    during suspending, but otherwise it would work...

  suspend2 shares this fundamental limitation, but does not include user
  data and disk caches in "used memory", by saving them in
  advance. That means that the limitation goes away in practice.

Q:
  Does Linux support ACPI S4?

A:
  Yes. That's what echo platform > /sys/power/disk does.

Q:
  What is 'suspend2'?

A:
  suspend2 is 'Software Suspend 2', a forked implementation of
  suspend-to-disk which is available as separate patches for 2.4 and 2.6
  kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB
  highmem and preemption. It also has an extensible architecture that
  allows for arbitrary transformations on the image (compression,
  encryption) and arbitrary backends for writing the image (eg. to swap
  or an NFS share [Work In Progress]). Questions regarding suspend2
  should be sent to the mailing list available through the suspend2
  website, and not to the Linux Kernel Mailing List. We are working
  toward merging suspend2 into the mainline kernel.

Q:
  What is the freezing of tasks and why are we using it?

A:
  The freezing of tasks is a mechanism by which user space processes and
  some kernel threads are controlled during hibernation or system-wide
  suspend (on some architectures). See freezing-of-tasks.txt for details.

Q:
  What is the difference between "platform" and "shutdown"?

A:
  shutdown:
    save state in linux, then tell bios to powerdown

  platform:
    save state in linux, then tell bios to powerdown and blink
    "suspended led"

  "platform" is actually the right thing to do where supported, but
  "shutdown" is the most reliable (except on ACPI systems).
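The "platform where supported, shutdown otherwise" choice can be probed at runtime: /sys/power/disk lists the modes the kernel accepts. A hedged sketch (the fallback branch also covers kernels without hibernation support; actually writing the chosen mode back needs root):

```shell
# Hedged sketch: prefer "platform" when /sys/power/disk lists it,
# otherwise fall back to the more reliable "shutdown".
modes=$(cat /sys/power/disk 2>/dev/null || echo shutdown)
case "$modes" in
  *platform*) mode=platform ;;
  *)          mode=shutdown ;;
esac
echo "selected: $mode"    # would be written to /sys/power/disk as root
```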
Q:
  I do not understand why you have such strong objections to the idea of
  selective suspend.

A:
  Do selective suspend during runtime power management, that's okay. But
  it's useless for suspend-to-disk. (And I do not see how you could use
  it for suspend-to-ram; I hope you do not want that.)

  Let's see, so you suggest to:

  * SUSPEND all but swap device and parents
  * Snapshot
  * Write image to disk
  * SUSPEND swap device and parents
  * Powerdown

  Oh no, that does not work: if the swap device or its parents use DMA,
  you've corrupted data. You'd have to do:

  * SUSPEND all but swap device and parents
  * FREEZE swap device and parents
  * Snapshot
  * UNFREEZE swap device and parents
  * Write
  * SUSPEND swap device and parents

  Which means that you still need that FREEZE state, and you get more
  complicated code. (And I have not yet introduced details like system
  devices.)

Q:
  There don't seem to be any generally useful behavioral
  distinctions between SUSPEND and FREEZE.

A:
  Doing SUSPEND when you are asked to do FREEZE is always correct,
  but it may be unnecessarily slow. If you want your driver to stay
  simple, slowness may not matter to you. It can always be fixed later.

  For devices like disks it does matter; you do not want to spin down
  for FREEZE.

Q:
  After resuming, the system is paging heavily, leading to very bad
  interactivity.

A:
  Try running::

    cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file
    do
      test -f "$file" && cat "$file" > /dev/null
    done

  after resume. swapoff -a; swapon -a may also be useful.

Q:
  What happens to devices during swsusp? They seem to be resumed
  during system suspend?

A:
  That's correct.
  We need to resume them if we want to write the image to
  disk. The whole sequence goes like this:

  **Suspend part**

  running system, user asks for suspend-to-disk

  user processes are stopped

  suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
  with the state snapshot

  state snapshot: a copy of the whole used memory is taken with
  interrupts disabled

  resume(): devices are woken up so that we can write the image to swap

  write image to swap

  suspend(PMSG_SUSPEND): suspend devices so that we can power off

  turn the power off

  **Resume part**

  (is actually pretty similar)

  running system, user asks for suspend-to-disk

  user processes are stopped (in the common case there are none,
  but with resume-from-initrd, no one knows)

  read image from disk

  suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
  with image restoration

  image restoration: rewrite memory with image

  resume(): devices are woken up so that the system can continue

  thaw all user processes

Q:
  What is this 'Encrypt suspend image' for?

A:
  First of all: it is not a replacement for dm-crypt encrypted swap.
  It cannot protect your computer while it is suspended. Instead it
  protects against leaking sensitive data after resume from suspend.

  Think of the following: you suspend while an application is running
  that keeps sensitive data in memory. The application itself prevents
  the data from being swapped out. Suspend, however, must write these
  data to swap to be able to resume later on. Without suspend encryption
  your sensitive data are then stored in plaintext on disk.
  This means
  that after resume your sensitive data are accessible to all
  applications having direct access to the swap device which was used
  for suspend. If you don't need swap after resume, these data can
  remain on disk virtually forever. Thus it can happen that your system
  gets broken into weeks later and sensitive data which you thought were
  encrypted and protected are retrieved and stolen from the swap device.
  To prevent this situation you should use 'Encrypt suspend image'.

  During suspend a temporary key is created and this key is used to
  encrypt the data written to disk. When, during resume, the data has
  been read back into memory the temporary key is destroyed, which
  simply means that all data written to disk during suspend are then
  inaccessible so they can't be stolen later on. The only thing that
  you must then take care of is to call 'mkswap' for the swap
  partition used for suspend as early as possible during regular
  boot. This ensures that any temporary key from an oopsed suspend or
  from a failed or aborted resume is erased from the swap device.

  As a rule of thumb, use encrypted swap to protect your data while your
  system is shut down or suspended. Additionally, use the encrypted
  suspend image to prevent sensitive data from being stolen after
  resume.

Q:
  Can I suspend to a swap file?

A:
  Generally, yes, you can. However, it requires you to use the "resume="
  and "resume_offset=" kernel command line parameters, so resume from a
  swap file cannot be initiated from an initrd or initramfs image. See
  swsusp-and-swap-files.txt for details.

Q:
  Is there a maximum system RAM size that is supported by swsusp?

A:
  It should work okay with highmem.
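The early-boot mkswap step recommended in the encryption answer above can be sketched like this; to stay non-destructive, the demonstration runs mkswap on a scratch file rather than a real partition (mkswap from util-linux is assumed to be installed):

```shell
# Hedged sketch: re-initializing a swap area erases any leftover
# temporary suspend key. Demonstrated on a scratch file; on a real
# system you would run `mkswap /dev/your_swap_partition` early in boot.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=256 2>/dev/null   # 256 KiB scratch
if mkswap "$f" >/dev/null 2>&1; then
    result="swap signature written"
else
    result="mkswap unavailable"
fi
rm -f "$f"
echo "$result"
```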
Q:
  Does swsusp (to disk) use only one swap partition, or can it use
  multiple swap partitions (aggregating them into one logical space)?

A:
  Only one swap partition, sorry.

Q:
  If my application(s) cause lots of memory & swap space to be used
  (over half of the total system RAM), is it correct that it is likely
  to be useless to try to suspend to disk while that app is running?

A:
  No, it should work okay, as long as your app does not mlock()
  it. Just prepare a big enough swap partition.

Q:
  What information is useful for debugging suspend-to-disk problems?

A:
  Well, the last messages on the screen are always useful. If something
  is broken, it is usually some kernel driver, therefore trying with as
  few modules loaded as possible helps a lot. I also prefer people to
  suspend from the console, preferably without X running. Booting with
  init=/bin/bash, then swapon and starting the suspend sequence manually
  usually does the trick. Then it is a good idea to try with the latest
  vanilla kernel.

Q:
  How can distributions ship a swsusp-supporting kernel with modular
  disk drivers (especially SATA)?

A:
  Well, it can be done: load the drivers, then echo into the
  /sys/power/resume file from the initrd. Be sure not to mount
  anything, not even read-only, or you are going to lose your
  data.

Q:
  How do I make suspend more verbose?

A:
  If you want to see any non-error kernel messages on the virtual
  terminal the kernel switches to during suspend, you have to set the
  kernel console loglevel to at least 4 (KERN_WARNING), for example by
  doing::

    # save the old loglevel
    read LOGLEVEL DUMMY < /proc/sys/kernel/printk
    # set the loglevel so we see the progress bar.
    # if the level is higher than needed, we leave it alone.
    if [ $LOGLEVEL -lt 5 ]; then
        echo 5 > /proc/sys/kernel/printk
    fi

    IMG_SZ=0
    read IMG_SZ < /sys/power/image_size
    echo -n disk > /sys/power/state
    RET=$?
    #
    # the logic here is:
    # if image_size > 0 (without kernel support, IMG_SZ will be zero),
    # then try again with image_size set to zero.
    if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size
        echo 0 > /sys/power/image_size
        echo -n disk > /sys/power/state
        RET=$?
    fi

    # restore previous loglevel
    echo $LOGLEVEL > /proc/sys/kernel/printk
    exit $RET

Q:
  Is it true that if I have a mounted filesystem on a USB device and
  I suspend to disk, I can lose data unless the filesystem has been
  mounted with "sync"?

A:
  That's right ... if you disconnect that device, you may lose data.
  In fact, even with "-o sync" you can lose data if your programs have
  information in buffers they haven't written out to a disk you
  disconnect, or if you disconnect before the device finished saving
  data you wrote.

  Software suspend normally powers down USB controllers, which is
  equivalent to disconnecting all USB devices attached to your system.

  Your system might well support low-power modes for its USB
  controllers while the system is asleep, maintaining the connections,
  using true sleep modes like "suspend-to-RAM" or "standby". (Don't
  write "disk" to the /sys/power/state file; write "standby" or "mem".)
  We've not seen any hardware that can use these modes through software
  suspend, although in theory some systems might support "platform"
  modes that won't break the USB connections.

  Remember that it's always a bad idea to unplug a disk drive containing
  a mounted filesystem.
  That's true even when your system is asleep! The
  safest thing is to unmount all filesystems on removable media (such
  as USB, Firewire, CompactFlash, MMC, external SATA, or even IDE
  hotplug bays) before suspending, then remount them after resuming.

  There is a work-around for this problem. For more information, see
  Documentation/driver-api/usb/persist.rst.

Q:
  Can I suspend-to-disk using a swap partition under LVM?

A:
  Yes and no. You can suspend successfully, but the kernel will not be
  able to resume on its own. You need an initramfs that can recognize
  the resume situation, activate the logical volume containing the swap
  volume (but not touch any filesystems!), and eventually call::

    echo -n "$major:$minor" > /sys/power/resume

  where $major and $minor are the respective major and minor device
  numbers of the swap volume.

  uswsusp works with LVM, too. See http://suspend.sourceforge.net/

Q:
  I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were
  compiled with similar configuration files. Anyway, I found that
  suspend to disk (and resume) is much slower on 2.6.16 compared to
  2.6.15. Any idea why that might happen, or how I can speed it up?

A:
  This is because the size of the suspend image is now greater than
  for 2.6.15 (by saving more data we can get a more responsive system
  after resume).

  There's the /sys/power/image_size knob that controls the size of the
  image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as
  root), the 2.6.15 behavior should be restored. If it is still too
  slow, take a look at suspend.sf.net -- userland suspend is faster and
  supports LZF compression to speed it up further.
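The default cap mentioned earlier ("around 2/5 of available RAM") can be estimated from /proc/meminfo and compared with the current /sys/power/image_size value; a hedged sketch of that arithmetic:

```shell
# Hedged sketch: estimate the approximate default image_size cap
# (~2/5 of total RAM) from /proc/meminfo. MemTotal is reported in kB.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
default_cap=$((mem_kb * 1024 * 2 / 5))    # bytes, mirroring the 2/5 rule
echo "approx default image_size: $default_cap bytes"
```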
Any idea for why that might happen or how can I speed it up? 437 - 438 - A: This is because the size of the suspend image is now greater than 439 - for 2.6.15 (by saving more data we can get more responsive system 440 - after resume). 441 - 442 - There's the /sys/power/image_size knob that controls the size of the 443 - image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as 444 - root), the 2.6.15 behavior should be restored. If it is still too 445 - slow, take a look at suspend.sf.net -- userland suspend is faster and 446 - supports LZF compression to speed it up further.
Documentation/power/tricks.rst (new file, +29 lines)

================
swsusp/S3 tricks
================

Pavel Machek <pavel@ucw.cz>

If you want to trick swsusp/S3 into working, you might want to try:

* go with a minimal config; turn off drivers like USB and AGP that you
  don't really need

* turn off APIC and preempt

* use ext2. At least it has a working fsck. [If something seems to go
  wrong, force fsck when you have a chance]

* turn off modules

* use the vga text console, shut down X. [If you really want X, you
  might want to try vesafb later]

* try running as few processes as possible, preferably go to single
  user mode.

* due to video issues, swsusp should be easier to get working than
  S3. Try that first.

When you make it work, try to find out what exactly it was that broke
suspend, and preferably fix that.
Documentation/power/tricks.txt (removed, -27 lines; content converted to
Documentation/power/tricks.rst above)
Documentation/power/userland-swsusp.rst (new file, +191 lines)

=====================================================
Documentation for userland software suspend interface
=====================================================

(C) 2006 Rafael J. Wysocki <rjw@sisk.pl>

First, the warnings at the beginning of swsusp.rst still apply.

Second, you should read the FAQ in swsusp.rst _now_ if you have not
done it already.

Now, to use the userland interface for software suspend you need special
utilities that will read/write the system memory snapshot from/to the
kernel. Such utilities are available, for example, from
<http://suspend.sourceforge.net>. You may want to have a look at them if
you are going to develop your own suspend/resume utilities.

The interface consists of a character device providing the open(),
release(), read(), and write() operations as well as several ioctl()
commands defined in include/linux/suspend_ioctls.h. The major and minor
numbers of the device are, respectively, 10 and 231, and they can
be read from /sys/class/misc/snapshot/dev.

The device can be opened either for reading or for writing. If opened
for reading, it is considered to be in the suspend mode. Otherwise it is
assumed to be in the resume mode. The device cannot be opened for
simultaneous reading and writing. It is also impossible to have the
device open more than once at a time.

Even opening the device has side effects. Data structures are
allocated, and the PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains
are called.

The ioctl() commands recognized by the device are:

SNAPSHOT_FREEZE
	freeze user space processes (the current process is
	not frozen); this is required for SNAPSHOT_CREATE_IMAGE
	and SNAPSHOT_ATOMIC_RESTORE to succeed

SNAPSHOT_UNFREEZE
	thaw user space processes frozen by SNAPSHOT_FREEZE

SNAPSHOT_CREATE_IMAGE
	create a snapshot of the system memory; the
	last argument of ioctl() should be a pointer to an int variable,
	the value of which will indicate whether the call returned after
	creating the snapshot (1) or after restoring the system memory state
	from it (0) (after resume the system finds itself finishing the
	SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot
	has been created the read() operation can be used to transfer
	it out of the kernel

SNAPSHOT_ATOMIC_RESTORE
	restore the system memory state from the
	uploaded snapshot image; before calling it you should transfer
	the system memory snapshot back to the kernel using the write()
	operation; this call will not succeed if the snapshot
	image is not available to the kernel

SNAPSHOT_FREE
	free memory allocated for the snapshot image

SNAPSHOT_PREF_IMAGE_SIZE
	set the preferred maximum size of the image
	(the kernel will do its best to ensure the image size will not exceed
	this number, but if it turns out to be impossible, the kernel will
	create the smallest image possible)

SNAPSHOT_GET_IMAGE_SIZE
	return the actual size of the hibernation image

SNAPSHOT_AVAIL_SWAP_SIZE
	return the amount of available swap in bytes (the
	last argument should be a pointer to an unsigned int variable that
	will contain the result if the call is successful)

SNAPSHOT_ALLOC_SWAP_PAGE
	allocate a swap page from the resume partition
	(the last argument should be a pointer to a loff_t variable that
	will contain the swap page offset if the call is successful)

SNAPSHOT_FREE_SWAP_PAGES
	free all swap pages allocated by
	SNAPSHOT_ALLOC_SWAP_PAGE

SNAPSHOT_SET_SWAP_AREA
	set the resume partition and the offset (in <PAGE_SIZE>
	units) from the beginning of the partition at which the swap header
	is located (the last ioctl() argument should point to a struct
	resume_swap_area, as defined in include/linux/suspend_ioctls.h,
	containing the resume device specification and the offset); for swap
	partitions the offset is always 0, but it is different from zero for
	swap files (see Documentation/power/swsusp-and-swap-files.rst for
	details)

SNAPSHOT_PLATFORM_SUPPORT
	enable/disable the hibernation platform support,
	depending on the argument value (enable, if the argument is nonzero)

SNAPSHOT_POWER_OFF
	make the kernel transition the system to the hibernation
	state (eg. ACPI S4) using the platform (eg. ACPI) driver

SNAPSHOT_S2RAM
	suspend to RAM; using this call causes the kernel to
	immediately enter the suspend-to-RAM state, so this call must always
	be preceded by the SNAPSHOT_FREEZE call and it is also necessary
	to use the SNAPSHOT_UNFREEZE call after the system wakes up. This
	call is needed to implement the suspend-to-both mechanism in which
	the suspend image is first created, as though the system had been
	suspended to disk, and then the system is suspended to RAM (this
	makes it possible to resume the system from RAM if there's enough
	battery power or restore its state on the basis of the saved suspend
	image otherwise)

The device's read() operation can be used to transfer the snapshot image
from the kernel. It has the following limitations:

- you cannot read() more than one virtual memory page at a time
- read()s across page boundaries are impossible (ie. if you read() 1/2 of
  a page in the previous call, you will only be able to read()
  **at most** 1/2 of the page in the next call)

The device's write() operation is used for uploading the system memory
snapshot into the kernel. It has the same limitations as the read()
operation.

The release() operation frees all memory allocated for the snapshot
image and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if
any). Thus it is not necessary to use either SNAPSHOT_FREE or
SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also
unfreeze user space processes frozen by SNAPSHOT_FREEZE if they are
still frozen when the device is being closed).

Currently it is assumed that the userland utilities reading/writing the
snapshot image from/to the kernel will use a swap partition, called the
resume partition, or a swap file as storage space (if a swap file is
used, the resume partition is the partition that holds this file).
However, this is not really required, as they can use, for example, a
special (blank) suspend partition or a file on a partition that is
unmounted before SNAPSHOT_CREATE_IMAGE and mounted afterwards.

These utilities MUST NOT make any assumptions regarding the ordering of
data within the snapshot image. The contents of the image are entirely
owned by the kernel and its structure may be changed in future kernel
releases.

The snapshot image MUST be written to the kernel unaltered (ie. all of
the image data, metadata and header MUST be written in _exactly_ the
same amount, form and order in which they have been read). Otherwise,
the behavior of the resumed system may be totally unpredictable.

While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the
structure of the snapshot image is consistent with the information
stored in the image header. If any inconsistencies are detected,
SNAPSHOT_ATOMIC_RESTORE will not succeed. Still, this is not a
fool-proof mechanism and the userland utilities using the interface
SHOULD use additional means, such as checksums, to ensure the integrity
of the snapshot image.

The suspending and resuming utilities MUST lock themselves in memory,
preferably using mlockall(), before calling SNAPSHOT_FREEZE.

The suspending utility MUST check the value stored by
SNAPSHOT_CREATE_IMAGE in the memory location pointed to by the last
argument of ioctl() and proceed in accordance with it:

1. If the value is 1 (ie. the system memory snapshot has just been
   created and the system is ready for saving it):

   (a) The suspending utility MUST NOT close the snapshot device
       _unless_ the whole suspend procedure is to be cancelled, in
       which case, if the snapshot image has already been saved, the
       suspending utility SHOULD destroy it, preferably by zapping
       its header. If the suspend is not to be cancelled, the
       system MUST be powered off or rebooted after the snapshot
       image has been saved.
   (b) The suspending utility SHOULD NOT attempt to perform any
       file system operations (including reads) on the file systems
       that were mounted before SNAPSHOT_CREATE_IMAGE has been
       called. However, it MAY mount a file system that was not
       mounted at that time and perform some operations on it (eg.
       use it for saving the image).

2. If the value is 0 (ie. the system state has just been restored from
   the snapshot image), the suspending utility MUST close the snapshot
   device. Afterwards it will be treated as a regular userland process,
   so it need not exit.

The resuming utility SHOULD NOT attempt to mount any file systems that
could be mounted before suspend and SHOULD NOT attempt to perform any
operations involving such file systems.

For details, please refer to the source code.
Documentation/power/userland-swsusp.txt (removed, -170 lines; content
converted to Documentation/power/userland-swsusp.rst above)
Documentation/power/video.rst (new file, +213 lines)

===========================
Video issues with S3 resume
===========================

2003-2006, Pavel Machek

During S3 resume, hardware needs to be reinitialized. For most
devices, this is easy, and the kernel driver knows how to do
it. Unfortunately there's one exception: the video card. It is usually
initialized by the BIOS, and the kernel does not have enough
information to boot the video card. (The kernel usually does not even
contain a video card driver -- vesafb and vgacon are widely used.)

This is not a problem for swsusp, because during swsusp resume the BIOS
is run normally, so the video card is normally initialized. It should
not be a problem for S1 standby, because hardware should retain its
state over that.

We either have to run the video BIOS during early resume, or interpret
it using vbetool later, or maybe nothing is necessary on a particular
system because the video state is preserved. Unfortunately different
methods work on different systems, and no known method suits all of
them.

A userland application called s2ram has been developed; it contains a
long whitelist of systems, and automatically selects a working method
for a given system. It can be downloaded from CVS at
www.sf.net/projects/suspend . If you have a system that is not in the
whitelist, please try to find a working solution, and submit a
whitelist entry so that the work does not need to be repeated.

Currently, the VBE_SAVE method (6 below) works on most
systems. Unfortunately, vbetool only runs after userland is resumed,
so it makes debugging of early resume problems
hard/impossible. Methods that do not rely on userland are preferable.

Details
~~~~~~~

There are a few types of systems where video works after S3 resume:

(1) systems where video state is preserved over S3.

(2) systems where it is possible to call the video BIOS during S3
    resume. Unfortunately, it is not correct to call the video BIOS at
    that point, but it happens to work on some machines. Use
    acpi_sleep=s3_bios.

(3) systems that initialize the video card into vga text mode and
    where the BIOS works well enough to be able to set the video mode.
    Use acpi_sleep=s3_mode on these.

(4) on some systems s3_bios kicks video into text mode, and
    acpi_sleep=s3_bios,s3_mode is needed.

(5) radeon systems, where X can soft-boot your video card. You'll need
    a new enough X, and a plain text console (no vesafb or radeonfb).
    See http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more
    information. Alternatively, you should use vbetool (6) instead.

(6) other radeon systems, where vbetool is enough to bring the system
    back to life. It needs the text console to be working. Do

	vbetool vbestate save > /tmp/delme
	echo 3 > /proc/acpi/sleep
	vbetool post
	vbetool vbestate restore < /tmp/delme
	setfont <whatever>

    and your video should work.

(7) on some systems, it is possible to boot most of the kernel, and
    then POSTing the bios works. Ole Rohne has a patch to do just that
    at http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.

(8) on some systems, you can use the video_post utility and/or
    do echo 3 > /sys/power/state && /usr/sbin/video_post - which will
    initialize the display in console mode. If you are in X, you can
    switch to a virtual terminal and back to X using CTRL+ALT+F1 -
    CTRL+ALT+F7 to get the display working in graphical mode again.

Now, if you pass acpi_sleep=something, and it does not work with your
bios, you'll get a hard crash during resume. Be careful. Also it is
safest to do your experiments with a plain old VGA console. The vesafb
and radeonfb (etc) drivers have a tendency to crash the machine during
resume.

You may have a system where none of the above works. At that point you
either invent another ugly hack that works, or write a proper driver
for your video card (good luck getting docs :-(). Maybe suspending from
X (proper X, knowing your hardware, not XF68_FBcon) might have a better
chance of working.

Table of known working notebooks:

=============================== ===============================================
Model                           hack (or "how to do it")
=============================== ===============================================
Acer Aspire 1406LC              ole's late BIOS init (7), turn off DRI
Acer TM 230                     s3_bios (2)
Acer TM 242FX                   vbetool (6)
Acer TM C110                    video_post (8)
Acer TM C300                    vga=normal (only suspend on console, not
                                in X), vbetool (6) or video_post (8)
Acer TM 4052LCi                 s3_bios (2)
Acer TM 636Lci                  s3_bios,s3_mode (4)
Acer TM 650 (Radeon M7)         vga=normal plus boot-radeon (5) gets text
                                console back
Acer TM 660                     ??? [#f1]_
Acer TM 800                     vga=normal, X patches, see webpage (5)
                                or vbetool (6)
Acer TM 803                     vga=normal, X patches, see webpage (5)
                                or vbetool (6)
Acer TM 803LCi                  vga=normal, vbetool (6)
Arima W730a                     vbetool needed (6)
Asus L2400D                     s3_mode (3) [#f2]_ (S1 also works OK)
Asus L3350M (SiS 740)           (6)
Asus L3800C (Radeon M7)         s3_bios (2) (S1 also works OK)
Asus M6887Ne                    vga=normal, s3_bios (2), use radeon driver
                                instead of fglrx in x.org
Athlon64 desktop prototype      s3_bios (2)
Compal CL-50                    ??? [#f1]_
Compaq Armada E500 - P3-700     none (1) (S1 also works OK)
Compaq Evo N620c                vga=normal, s3_bios (2)
Dell 600m, ATI R250 Lf          none (1), but needs xorg-x11-6.8.1.902-1
Dell D600, ATI RV250            vga=normal and X, or try vbestate (6)
Dell D610                       vga=normal and X (possibly vbestate (6)
                                too, but not tested)
Dell Inspiron 4000              ??? [#f1]_
Dell Inspiron 500m              ??? [#f1]_
Dell Inspiron 510m              ???
Dell Inspiron 5150              vbetool needed (6)
Dell Inspiron 600m              ??? [#f1]_
Dell Inspiron 8200              ??? [#f1]_
Dell Inspiron 8500              ??? [#f1]_
Dell Inspiron 8600              ??? [#f1]_
eMachines athlon64 machines     vbetool needed (6) (someone please get
                                me model #s)
HP NC6000                       s3_bios, may not use radeonfb (2);
                                or vbetool (6)
HP NX7000                       ??? [#f1]_
HP Pavilion ZD7000              vbetool post needed, need open-source nv
                                driver for X
HP Omnibook XE3 athlon version  none (1)
HP Omnibook XE3GC               none (1), video is S3 Savage/IX-MV
HP Omnibook XE3L-GF             vbetool (6)
HP Omnibook 5150                none (1), (S1 also works OK)
IBM TP T20, model 2647-44G      none (1), video is S3 Inc. 86C270-294
                                Savage/IX-MV, vesafb gets "interesting"
                                but X works.
IBM TP A31 / Type 2652-M5G      s3_mode (3) [works ok with
                                BIOS 1.04 2002-08-23, but not at all with
                                BIOS 1.11 2004-11-05 :-(]
IBM TP R32 / Type 2658-MMG      none (1)
IBM TP R40 2722B3G              ??? [#f1]_
IBM TP R50p / Type 1832-22U     s3_bios (2)
IBM TP R51                      none (1)
IBM TP T30 236681A              ??? [#f1]_
IBM TP T40 / Type 2373-MU4      none (1)
IBM TP T40p                     none (1)
IBM TP R40p                     s3_bios (2)
IBM TP T41p                     s3_bios (2), switch to X after resume
IBM TP T42                      s3_bios (2)
IBM ThinkPad T42p (2373-GTG)    s3_bios (2)
IBM TP X20                      ??? [#f1]_
IBM TP X30                      s3_bios, s3_mode (4)
IBM TP X31 / Type 2672-XXH      none (1), use radeontool
                                (http://fdd.com/software/radeon/) to
                                turn off backlight.
IBM TP X32                      none (1), but backlight is on and video is
                                trashed after long suspend. s3_bios,
                                s3_mode (4) works too. Perhaps that gets
                                better results?
IBM Thinkpad X40 Type 2371-7JG  s3_bios,s3_mode (4)
IBM TP 600e                     none (1), but a switch to console and
                                back to X is needed
Medion MD4220                   ??? [#f1]_
Samsung P35                     vbetool needed (6)
Sharp PC-AR10 (ATI rage)        none (1), backlight does not switch off
Sony Vaio PCG-C1VRX/K           s3_bios (2)
Sony Vaio PCG-F403              ??? [#f1]_
Sony Vaio PCG-GRT995MP          none (1), works with 'nv' X driver
Sony Vaio PCG-GR7/K             none (1), but needs radeonfb, use
                                radeontool (http://fdd.com/software/radeon/)
                                to turn off backlight.
Sony Vaio PCG-N505SN            ??? [#f1]_
Sony Vaio vgn-s260              X or boot-radeon can init it (5)
Sony Vaio vgn-S580BH            vga=normal, but suspend from X. Console
                                will be blank unless you return to X.
Sony Vaio vgn-FS115B            s3_bios (2),s3_mode (4)
Toshiba Libretto L5             none (1)
Toshiba Libretto 100CT/110CT    vbetool (6)
Toshiba Portege 3020CT          s3_mode (3)
Toshiba Satellite 4030CDT       s3_mode (3) (S1 also works OK)
Toshiba Satellite 4080XCDT      s3_mode (3) (S1 also works OK)
Toshiba Satellite 4090XCDT      ??? [#f1]_
Toshiba Satellite P10-554       s3_bios,s3_mode (4) [#f3]_
Toshiba M30                     (2) xor X with nvidia driver using
                                internal AGP
Uniwill 244IIO                  ??? [#f1]_
=============================== ===============================================

Known working desktop systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=================== ============================= ========================
Mainboard           Graphics card                 hack (or "how to do it")
=================== ============================= ========================
Asus A7V8X          nVidia RIVA TNT2 model 64     s3_bios,s3_mode (4)
=================== ============================= ========================


.. [#f1] from https://wiki.ubuntu.com/HoaryPMResults, not sure
   which options to use. If you know, please tell me.

.. [#f2] To be tested with a newer kernel.

.. [#f3] Not with SMP kernel, UP only.
-185
Documentation/power/video.txt
···
-
- Video issues with S3 resume
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 2003-2006, Pavel Machek
-
- During S3 resume, hardware needs to be reinitialized. For most
- devices, this is easy, and kernel driver knows how to do
- it. Unfortunately there's one exception: video card. Those are usually
- initialized by BIOS, and kernel does not have enough information to
- boot video card. (Kernel usually does not even contain video card
- driver -- vesafb and vgacon are widely used).
-
- This is not problem for swsusp, because during swsusp resume, BIOS is
- run normally so video card is normally initialized. It should not be
- problem for S1 standby, because hardware should retain its state over
- that.
-
- We either have to run video BIOS during early resume, or interpret it
- using vbetool later, or maybe nothing is necessary on particular
- system because video state is preserved. Unfortunately different
- methods work on different systems, and no known method suits all of
- them.
-
- Userland application called s2ram has been developed; it contains long
- whitelist of systems, and automatically selects working method for a
- given system. It can be downloaded from CVS at
- www.sf.net/projects/suspend . If you get a system that is not in the
- whitelist, please try to find a working solution, and submit whitelist
- entry so that work does not need to be repeated.
-
- Currently, VBE_SAVE method (6 below) works on most
- systems. Unfortunately, vbetool only runs after userland is resumed,
- so it makes debugging of early resume problems
- hard/impossible. Methods that do not rely on userland are preferable.
-
- Details
- ~~~~~~~
-
- There are a few types of systems where video works after S3 resume:
-
- (1) systems where video state is preserved over S3.
-
- (2) systems where it is possible to call the video BIOS during S3
- resume. Unfortunately, it is not correct to call the video BIOS at
- that point, but it happens to work on some machines. Use
- acpi_sleep=s3_bios.
-
- (3) systems that initialize video card into vga text mode and where
- the BIOS works well enough to be able to set video mode. Use
- acpi_sleep=s3_mode on these.
-
- (4) on some systems s3_bios kicks video into text mode, and
- acpi_sleep=s3_bios,s3_mode is needed.
-
- (5) radeon systems, where X can soft-boot your video card. You'll need
- a new enough X, and a plain text console (no vesafb or radeonfb). See
- http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information.
- Alternatively, you should use vbetool (6) instead.
-
- (6) other radeon systems, where vbetool is enough to bring system back
- to life. It needs text console to be working. Do vbetool vbestate
- save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool
- vbestate restore < /tmp/delme; setfont <whatever>, and your video
- should work.
-
- (7) on some systems, it is possible to boot most of kernel, and then
- POSTing bios works. Ole Rohne has patch to do just that at
- http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
-
- (8) on some systems, you can use the video_post utility and or
- do echo 3 > /sys/power/state && /usr/sbin/video_post - which will
- initialize the display in console mode. If you are in X, you can switch
- to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get
- the display working in graphical mode again.
-
- Now, if you pass acpi_sleep=something, and it does not work with your
- bios, you'll get a hard crash during resume. Be careful. Also it is
- safest to do your experiments with plain old VGA console. The vesafb
- and radeonfb (etc) drivers have a tendency to crash the machine during
- resume.
-
- You may have a system where none of above works. At that point you
- either invent another ugly hack that works, or write proper driver for
- your video card (good luck getting docs :-(). Maybe suspending from X
- (proper X, knowing your hardware, not XF68_FBcon) might have better
- chance of working.
-
- Table of known working notebooks:
-
- Model                           hack (or "how to do it")
- ------------------------------------------------------------------------------
- Acer Aspire 1406LC              ole's late BIOS init (7), turn off DRI
- Acer TM 230                     s3_bios (2)
- Acer TM 242FX                   vbetool (6)
- Acer TM C110                    video_post (8)
- Acer TM C300                    vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
- Acer TM 4052LCi                 s3_bios (2)
- Acer TM 636Lci                  s3_bios,s3_mode (4)
- Acer TM 650 (Radeon M7)         vga=normal plus boot-radeon (5) gets text console back
- Acer TM 660                     ??? (*)
- Acer TM 800                     vga=normal, X patches, see webpage (5) or vbetool (6)
- Acer TM 803                     vga=normal, X patches, see webpage (5) or vbetool (6)
- Acer TM 803LCi                  vga=normal, vbetool (6)
- Arima W730a                     vbetool needed (6)
- Asus L2400D                     s3_mode (3)(***) (S1 also works OK)
- Asus L3350M (SiS 740)           (6)
- Asus L3800C (Radeon M7)         s3_bios (2) (S1 also works OK)
- Asus M6887Ne                    vga=normal, s3_bios (2), use radeon driver instead of fglrx in x.org
- Athlon64 desktop prototype      s3_bios (2)
- Compal CL-50                    ??? (*)
- Compaq Armada E500 - P3-700     none (1) (S1 also works OK)
- Compaq Evo N620c                vga=normal, s3_bios (2)
- Dell 600m, ATI R250 Lf          none (1), but needs xorg-x11-6.8.1.902-1
- Dell D600, ATI RV250            vga=normal and X, or try vbestate (6)
- Dell D610                       vga=normal and X (possibly vbestate (6) too, but not tested)
- Dell Inspiron 4000              ??? (*)
- Dell Inspiron 500m              ??? (*)
- Dell Inspiron 510m              ???
- Dell Inspiron 5150              vbetool needed (6)
- Dell Inspiron 600m              ??? (*)
- Dell Inspiron 8200              ??? (*)
- Dell Inspiron 8500              ??? (*)
- Dell Inspiron 8600              ??? (*)
- eMachines athlon64 machines     vbetool needed (6) (someone please get me model #s)
- HP NC6000                       s3_bios, may not use radeonfb (2); or vbetool (6)
- HP NX7000                       ??? (*)
- HP Pavilion ZD7000              vbetool post needed, need open-source nv driver for X
- HP Omnibook XE3 athlon version  none (1)
- HP Omnibook XE3GC               none (1), video is S3 Savage/IX-MV
- HP Omnibook XE3L-GF             vbetool (6)
- HP Omnibook 5150                none (1), (S1 also works OK)
- IBM TP T20, model 2647-44G      none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work.
- IBM TP A31 / Type 2652-M5G      s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(]
- IBM TP R32 / Type 2658-MMG      none (1)
- IBM TP R40 2722B3G              ??? (*)
- IBM TP R50p / Type 1832-22U     s3_bios (2)
- IBM TP R51                      none (1)
- IBM TP T30 236681A              ??? (*)
- IBM TP T40 / Type 2373-MU4      none (1)
- IBM TP T40p                     none (1)
- IBM TP R40p                     s3_bios (2)
- IBM TP T41p                     s3_bios (2), switch to X after resume
- IBM TP T42                      s3_bios (2)
- IBM ThinkPad T42p (2373-GTG)    s3_bios (2)
- IBM TP X20                      ??? (*)
- IBM TP X30                      s3_bios, s3_mode (4)
- IBM TP X31 / Type 2672-XXH      none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
- IBM TP X32                      none (1), but backlight is on and video is trashed after long suspend. s3_bios,s3_mode (4) works too. Perhaps that gets better results?
- IBM Thinkpad X40 Type 2371-7JG  s3_bios,s3_mode (4)
- IBM TP 600e                     none(1), but a switch to console and back to X is needed
- Medion MD4220                   ??? (*)
- Samsung P35                     vbetool needed (6)
- Sharp PC-AR10 (ATI rage)        none (1), backlight does not switch off
- Sony Vaio PCG-C1VRX/K           s3_bios (2)
- Sony Vaio PCG-F403              ??? (*)
- Sony Vaio PCG-GRT995MP          none (1), works with 'nv' X driver
- Sony Vaio PCG-GR7/K             none (1), but needs radeonfb, use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
- Sony Vaio PCG-N505SN            ??? (*)
- Sony Vaio vgn-s260              X or boot-radeon can init it (5)
- Sony Vaio vgn-S580BH            vga=normal, but suspend from X. Console will be blank unless you return to X.
- Sony Vaio vgn-FS115B            s3_bios (2),s3_mode (4)
- Toshiba Libretto L5             none (1)
- Toshiba Libretto 100CT/110CT    vbetool (6)
- Toshiba Portege 3020CT          s3_mode (3)
- Toshiba Satellite 4030CDT       s3_mode (3) (S1 also works OK)
- Toshiba Satellite 4080XCDT      s3_mode (3) (S1 also works OK)
- Toshiba Satellite 4090XCDT      ??? (*)
- Toshiba Satellite P10-554       s3_bios,s3_mode (4)(****)
- Toshiba M30                     (2) xor X with nvidia driver using internal AGP
- Uniwill 244IIO                  ??? (*)
-
- Known working desktop systems
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Mainboard           Graphics card                 hack (or "how to do it")
- ------------------------------------------------------------------------------
- Asus A7V8X          nVidia RIVA TNT2 model 64     s3_bios,s3_mode (4)
-
-
- (*) from https://wiki.ubuntu.com/HoaryPMResults, not sure
- which options to use. If you know, please tell me.
-
- (***) To be tested with a newer kernel.
-
- (****) Not with SMP kernel, UP only.
+1 -1
Documentation/process/submitting-drivers.rst
···
  implemented") error. You should also try to make sure that your
  driver uses as little power as possible when it's not doing
  anything. For the driver testing instructions see
- Documentation/power/drivers-testing.txt and for a relatively
+ Documentation/power/drivers-testing.rst and for a relatively
  complete overview of the power management issues related to
  drivers see :ref:`Documentation/driver-api/pm/devices.rst <driverapi_pm_devices>`.
  
+3 -3
Documentation/scheduler/sched-energy.rst
···
  
  The actual EM used by EAS is _not_ maintained by the scheduler, but by a
  dedicated framework. For details about this framework and what it provides,
- please refer to its documentation (see Documentation/power/energy-model.txt).
+ please refer to its documentation (see Documentation/power/energy-model.rst).
  
  
  2. Background and Terminology
···
  
  The rest of platform knowledge used by EAS is directly read from the Energy
  Model (EM) framework. The EM of a platform is composed of a power cost table
- per 'performance domain' in the system (see Documentation/power/energy-model.txt
+ per 'performance domain' in the system (see Documentation/power/energy-model.rst
  for futher details about performance domains).
  
  The scheduler manages references to the EM objects in the topology code when the
···
  EAS uses the EM of a platform to estimate the impact of scheduling decisions on
  energy. So, your platform must provide power cost tables to the EM framework in
  order to make EAS start. To do so, please refer to documentation of the
- independent EM framework in Documentation/power/energy-model.txt.
+ independent EM framework in Documentation/power/energy-model.rst.
  
  Please also note that the scheduling domains need to be re-built after the
  EM has been registered in order to start EAS.
+1 -1
Documentation/trace/coresight-cpu-debug.txt
···
  
  It is possible to disable CPU idle states by way of the PM QoS
  subsystem, more specifically by using the "/dev/cpu_dma_latency"
- interface (see Documentation/power/pm_qos_interface.txt for more
+ interface (see Documentation/power/pm_qos_interface.rst for more
  details). As specified in the PM QoS documentation the requested
  parameter will stay in effect until the file descriptor is released.
  For example:
+1 -1
Documentation/translations/zh_CN/process/submitting-drivers.rst
···
  函数定义成返回 -ENOSYS(功能未实现)错误。你还应该尝试确
  保你的驱动在什么都不干的情况下将耗电降到最低。要获得驱动
  程序测试的指导,请参阅
- Documentation/power/drivers-testing.txt。有关驱动程序电
+ Documentation/power/drivers-testing.rst。有关驱动程序电
  源管理问题相对全面的概述,请参阅
  Documentation/driver-api/pm/devices.rst。
  
+4 -4
MAINTAINERS
···
  M:	Pavel Machek <pavel@ucw.cz>
  L:	linux-pm@vger.kernel.org
  S:	Supported
- F:	Documentation/power/freezing-of-tasks.txt
+ F:	Documentation/power/freezing-of-tasks.rst
  F:	include/linux/freezer.h
  F:	kernel/freezer.c
···
  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git
  F:	drivers/opp/
  F:	include/linux/pm_opp.h
- F:	Documentation/power/opp.txt
+ F:	Documentation/power/opp.rst
  F:	Documentation/devicetree/bindings/opp/
  
  OPL4 DRIVER
···
  M:	Oliver O'Halloran <oohall@gmail.com>
  L:	linuxppc-dev@lists.ozlabs.org
  S:	Supported
- F:	Documentation/PCI/pci-error-recovery.txt
+ F:	Documentation/PCI/pci-error-recovery.rst
  F:	drivers/pci/pcie/aer.c
  F:	drivers/pci/pcie/dpc.c
  F:	drivers/pci/pcie/err.c
···
  M:	Linas Vepstas <linasvepstas@gmail.com>
  L:	linux-pci@vger.kernel.org
  S:	Supported
- F:	Documentation/PCI/pci-error-recovery.txt
+ F:	Documentation/PCI/pci-error-recovery.rst
  
  PCI MSI DRIVER FOR ALTERA MSI IP
  M:	Ley Foon Tan <lftan@altera.com>
+11 -2
arch/arm64/kernel/pci.c
···
  	struct acpi_pci_generic_root_info *ri;
  	struct pci_bus *bus, *child;
  	struct acpi_pci_root_ops *root_ops;
+ 	struct pci_host_bridge *host;
  
  	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
  	if (!ri)
···
  	if (!bus)
  		return NULL;
  
- 	pci_bus_size_bridges(bus);
- 	pci_bus_assign_resources(bus);
+ 	/* If we must preserve the resource configuration, claim now */
+ 	host = pci_find_host_bridge(bus);
+ 	if (host->preserve_config)
+ 		pci_bus_claim_resources(bus);
+ 
+ 	/*
+ 	 * Assign whatever was left unassigned. If we didn't claim above,
+ 	 * this will reassign everything.
+ 	 */
+ 	pci_assign_unassigned_root_bus_resources(bus);
  
  	list_for_each_entry(child, &bus->children, node)
  		pcie_bus_configure_settings(child);
+1 -1
arch/x86/Kconfig
···
  	  machines with more than one CPU.
  
  	  In order to use APM, you will need supporting software. For location
- 	  and more information, read <file:Documentation/power/apm-acpi.txt>
+ 	  and more information, read <file:Documentation/power/apm-acpi.rst>
  	  and the Battery Powered Linux mini-HOWTO, available from
  	  <http://www.tldp.org/docs.html#howto>.
  
+12
drivers/acpi/pci_root.c
···
  	int node = acpi_get_node(device->handle);
  	struct pci_bus *bus;
  	struct pci_host_bridge *host_bridge;
+ 	union acpi_object *obj;
  
  	info->root = root;
  	info->bridge = device;
···
  		host_bridge->native_pme = 0;
  	if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL))
  		host_bridge->native_ltr = 0;
+ 
+ 	/*
+ 	 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it
+ 	 * exists and returns 0, we must preserve any PCI resource
+ 	 * assignments made by firmware for this host bridge.
+ 	 */
+ 	obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
+ 				IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
+ 	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0)
+ 		host_bridge->preserve_config = 1;
+ 	ACPI_FREE(obj);
  
  	pci_scan_child_bus(bus);
  	pci_set_host_bridge_release(host_bridge, acpi_pci_root_release_info,
+1 -1
drivers/gpu/drm/i915/intel_runtime_pm.h
···
   * to be disabled. This shouldn't happen and we'll print some error messages in
   * case it happens.
   *
- * For more, read the Documentation/power/runtime_pm.txt.
+ * For more, read the Documentation/power/runtime_pm.rst.
   */
  struct intel_runtime_pm {
  	atomic_t wakeref_count;
+1 -1
drivers/opp/Kconfig
···
  	  OPP layer organizes the data internally using device pointers
  	  representing individual voltage domains and provides SOC
  	  implementations a ready to use framework to manage OPPs.
- 	  For more information, read <file:Documentation/power/opp.txt>
+ 	  For more information, read <file:Documentation/power/opp.rst>
+1 -1
drivers/pci/ats.c
···
   * @pdev: PCI device structure
   *
   * Returns negative value when PASID capability is not present.
- * Otherwise it returns the numer of supported PASIDs.
+ * Otherwise it returns the number of supported PASIDs.
   */
  int pci_max_pasids(struct pci_dev *pdev)
  {
+2 -2
drivers/pci/controller/Kconfig
···
  	  PCIe controller
  
  config PCIE_ALTERA
- 	bool "Altera PCIe controller"
+ 	tristate "Altera PCIe controller"
  	depends on ARM || NIOS2 || ARM64 || COMPILE_TEST
  	help
  	  Say Y here if you want to enable PCIe controller support on Altera
  	  FPGA.
  
  config PCIE_ALTERA_MSI
- 	bool "Altera PCIe MSI feature"
+ 	tristate "Altera PCIe MSI feature"
  	depends on PCIE_ALTERA
  	depends on PCI_MSI_IRQ_DOMAIN
  	help
+1 -1
drivers/pci/controller/dwc/Kconfig
···
  
  config PCI_IMX6
  	bool "Freescale i.MX6/7/8 PCIe controller"
- 	depends on SOC_IMX6Q || SOC_IMX7D || (ARM64 && ARCH_MXC) || COMPILE_TEST
+ 	depends on ARCH_MXC || COMPILE_TEST
  	depends on PCI_MSI_IRQ_DOMAIN
  	select PCIE_DW_HOST
  
+1
drivers/pci/controller/dwc/pci-dra7xx.c
···
  #include <linux/types.h>
  #include <linux/mfd/syscon.h>
  #include <linux/regmap.h>
+ #include <linux/gpio/consumer.h>
  
  #include "../../pci.h"
  #include "pcie-designware.h"
+82 -2
drivers/pci/controller/dwc/pcie-armada8k.c
···
  
  #include "pcie-designware.h"
  
+ #define ARMADA8K_PCIE_MAX_LANES PCIE_LNK_X4
+ 
  struct armada8k_pcie {
  	struct dw_pcie *pci;
  	struct clk *clk;
  	struct clk *clk_reg;
+ 	struct phy *phy[ARMADA8K_PCIE_MAX_LANES];
+ 	unsigned int phy_count;
  };
  
  #define PCIE_VENDOR_REGS_OFFSET		0x8000
···
  #define PCIE_ARUSER_REG			(PCIE_VENDOR_REGS_OFFSET + 0x5C)
  #define PCIE_AWUSER_REG			(PCIE_VENDOR_REGS_OFFSET + 0x60)
  /*
- * AR/AW Cache defauls: Normal memory, Write-Back, Read / Write
+ * AR/AW Cache defaults: Normal memory, Write-Back, Read / Write
   * allocate
   */
  #define ARCACHE_DEFAULT_VALUE		0x3511
···
  #define AX_USER_DOMAIN_SHIFT		4
  
  #define to_armada8k_pcie(x)	dev_get_drvdata((x)->dev)
+ 
+ static void armada8k_pcie_disable_phys(struct armada8k_pcie *pcie)
+ {
+ 	int i;
+ 
+ 	for (i = 0; i < ARMADA8K_PCIE_MAX_LANES; i++) {
+ 		phy_power_off(pcie->phy[i]);
+ 		phy_exit(pcie->phy[i]);
+ 	}
+ }
+ 
+ static int armada8k_pcie_enable_phys(struct armada8k_pcie *pcie)
+ {
+ 	int ret;
+ 	int i;
+ 
+ 	for (i = 0; i < ARMADA8K_PCIE_MAX_LANES; i++) {
+ 		ret = phy_init(pcie->phy[i]);
+ 		if (ret)
+ 			return ret;
+ 
+ 		ret = phy_set_mode_ext(pcie->phy[i], PHY_MODE_PCIE,
+ 				       pcie->phy_count);
+ 		if (ret) {
+ 			phy_exit(pcie->phy[i]);
+ 			return ret;
+ 		}
+ 
+ 		ret = phy_power_on(pcie->phy[i]);
+ 		if (ret) {
+ 			phy_exit(pcie->phy[i]);
+ 			return ret;
+ 		}
+ 	}
+ 
+ 	return 0;
+ }
+ 
+ static int armada8k_pcie_setup_phys(struct armada8k_pcie *pcie)
+ {
+ 	struct dw_pcie *pci = pcie->pci;
+ 	struct device *dev = pci->dev;
+ 	struct device_node *node = dev->of_node;
+ 	int ret = 0;
+ 	int i;
+ 
+ 	for (i = 0; i < ARMADA8K_PCIE_MAX_LANES; i++) {
+ 		pcie->phy[i] = devm_of_phy_get_by_index(dev, node, i);
+ 		if (IS_ERR(pcie->phy[i]) &&
+ 		    (PTR_ERR(pcie->phy[i]) == -EPROBE_DEFER))
+ 			return PTR_ERR(pcie->phy[i]);
+ 
+ 		if (IS_ERR(pcie->phy[i])) {
+ 			pcie->phy[i] = NULL;
+ 			continue;
+ 		}
+ 
+ 		pcie->phy_count++;
+ 	}
+ 
+ 	/* Old bindings miss the PHY handle, so just warn if there is no PHY */
+ 	if (!pcie->phy_count)
+ 		dev_warn(dev, "No available PHY\n");
+ 
+ 	ret = armada8k_pcie_enable_phys(pcie);
+ 	if (ret)
+ 		dev_err(dev, "Failed to initialize PHY(s) (%d)\n", ret);
+ 
+ 	return ret;
+ }
  
  static int armada8k_pcie_link_up(struct dw_pcie *pci)
  {
···
  		goto fail_clkreg;
  	}
  
+ 	ret = armada8k_pcie_setup_phys(pcie);
+ 	if (ret)
+ 		goto fail_clkreg;
+ 
  	platform_set_drvdata(pdev, pcie);
  
  	ret = armada8k_add_pcie_port(pcie, pdev);
  	if (ret)
- 		goto fail_clkreg;
+ 		goto disable_phy;
  
  	return 0;
  
+ disable_phy:
+ 	armada8k_pcie_disable_phys(pcie);
  fail_clkreg:
  	clk_disable_unprepare(pcie->clk_reg);
  fail:
+12
drivers/pci/controller/dwc/pcie-designware-host.c
···
  	dw_pcie_wr_own_conf(pp, PCIE_MSI_ADDR_HI, 4,
  			    upper_32_bits(msi_target));
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_msi_init);
  
  int dw_pcie_host_init(struct pcie_port *pp)
  {
···
  	dw_pcie_free_msi(pp);
  	return ret;
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_host_init);
+ 
+ void dw_pcie_host_deinit(struct pcie_port *pp)
+ {
+ 	pci_stop_root_bus(pp->root_bus);
+ 	pci_remove_root_bus(pp->root_bus);
+ 	if (pci_msi_enabled() && !pp->ops->msi_host_init)
+ 		dw_pcie_free_msi(pp);
+ }
+ EXPORT_SYMBOL_GPL(dw_pcie_host_deinit);
  
  static int dw_pcie_access_other_conf(struct pcie_port *pp, struct pci_bus *bus,
  				     u32 devfn, int where, int size, u32 *val,
···
  	val |= PORT_LOGIC_SPEED_CHANGE;
  	dw_pcie_wr_own_conf(pp, PCIE_LINK_WIDTH_SPEED_CONTROL, 4, val);
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_setup_rc);
+45 -16
drivers/pci/controller/dwc/pcie-designware.c
···
  
  	return PCIBIOS_SUCCESSFUL;
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_read);
  
  int dw_pcie_write(void __iomem *addr, int size, u32 val)
  {
···
  
  	return PCIBIOS_SUCCESSFUL;
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_write);
  
- u32 __dw_pcie_read_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 		       size_t size)
+ u32 dw_pcie_read_dbi(struct dw_pcie *pci, u32 reg, size_t size)
  {
  	int ret;
  	u32 val;
  
  	if (pci->ops->read_dbi)
- 		return pci->ops->read_dbi(pci, base, reg, size);
+ 		return pci->ops->read_dbi(pci, pci->dbi_base, reg, size);
  
- 	ret = dw_pcie_read(base + reg, size, &val);
+ 	ret = dw_pcie_read(pci->dbi_base + reg, size, &val);
  	if (ret)
  		dev_err(pci->dev, "Read DBI address failed\n");
  
  	return val;
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_read_dbi);
  
- void __dw_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			 size_t size, u32 val)
+ void dw_pcie_write_dbi(struct dw_pcie *pci, u32 reg, size_t size, u32 val)
  {
  	int ret;
  
  	if (pci->ops->write_dbi) {
- 		pci->ops->write_dbi(pci, base, reg, size, val);
+ 		pci->ops->write_dbi(pci, pci->dbi_base, reg, size, val);
  		return;
  	}
  
- 	ret = dw_pcie_write(base + reg, size, val);
+ 	ret = dw_pcie_write(pci->dbi_base + reg, size, val);
  	if (ret)
  		dev_err(pci->dev, "Write DBI address failed\n");
  }
+ EXPORT_SYMBOL_GPL(dw_pcie_write_dbi);
  
- u32 __dw_pcie_read_dbi2(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			size_t size)
+ u32 dw_pcie_read_dbi2(struct dw_pcie *pci, u32 reg, size_t size)
  {
  	int ret;
  	u32 val;
  
  	if (pci->ops->read_dbi2)
- 		return pci->ops->read_dbi2(pci, base, reg, size);
+ 		return pci->ops->read_dbi2(pci, pci->dbi_base2, reg, size);
  
- 	ret = dw_pcie_read(base + reg, size, &val);
+ 	ret = dw_pcie_read(pci->dbi_base2 + reg, size, &val);
  	if (ret)
  		dev_err(pci->dev, "read DBI address failed\n");
  
  	return val;
  }
  
- void __dw_pcie_write_dbi2(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			  size_t size, u32 val)
+ void dw_pcie_write_dbi2(struct dw_pcie *pci, u32 reg, size_t size, u32 val)
  {
  	int ret;
  
  	if (pci->ops->write_dbi2) {
- 		pci->ops->write_dbi2(pci, base, reg, size, val);
+ 		pci->ops->write_dbi2(pci, pci->dbi_base2, reg, size, val);
  		return;
  	}
  
- 	ret = dw_pcie_write(base + reg, size, val);
+ 	ret = dw_pcie_write(pci->dbi_base2 + reg, size, val);
  	if (ret)
  		dev_err(pci->dev, "write DBI address failed\n");
+ }
+ 
+ u32 dw_pcie_read_atu(struct dw_pcie *pci, u32 reg, size_t size)
+ {
+ 	int ret;
+ 	u32 val;
+ 
+ 	if (pci->ops->read_dbi)
+ 		return pci->ops->read_dbi(pci, pci->atu_base, reg, size);
+ 
+ 	ret = dw_pcie_read(pci->atu_base + reg, size, &val);
+ 	if (ret)
+ 		dev_err(pci->dev, "Read ATU address failed\n");
+ 
+ 	return val;
+ }
+ 
+ void dw_pcie_write_atu(struct dw_pcie *pci, u32 reg, size_t size, u32 val)
+ {
+ 	int ret;
+ 
+ 	if (pci->ops->write_dbi) {
+ 		pci->ops->write_dbi(pci, pci->atu_base, reg, size, val);
+ 		return;
+ 	}
+ 
+ 	ret = dw_pcie_write(pci->atu_base + reg, size, val);
+ 	if (ret)
+ 		dev_err(pci->dev, "Write ATU address failed\n");
  }
  
  static u32 dw_pcie_readl_ob_unroll(struct dw_pcie *pci, u32 index, u32 reg)
+21 -18
drivers/pci/controller/dwc/pcie-designware.h
···
  int dw_pcie_read(void __iomem *addr, int size, u32 *val);
  int dw_pcie_write(void __iomem *addr, int size, u32 val);
  
- u32 __dw_pcie_read_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 		       size_t size);
- void __dw_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			 size_t size, u32 val);
- u32 __dw_pcie_read_dbi2(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			size_t size);
- void __dw_pcie_write_dbi2(struct dw_pcie *pci, void __iomem *base, u32 reg,
- 			  size_t size, u32 val);
+ u32 dw_pcie_read_dbi(struct dw_pcie *pci, u32 reg, size_t size);
+ void dw_pcie_write_dbi(struct dw_pcie *pci, u32 reg, size_t size, u32 val);
+ u32 dw_pcie_read_dbi2(struct dw_pcie *pci, u32 reg, size_t size);
+ void dw_pcie_write_dbi2(struct dw_pcie *pci, u32 reg, size_t size, u32 val);
+ u32 dw_pcie_read_atu(struct dw_pcie *pci, u32 reg, size_t size);
+ void dw_pcie_write_atu(struct dw_pcie *pci, u32 reg, size_t size, u32 val);
  int dw_pcie_link_up(struct dw_pcie *pci);
  int dw_pcie_wait_for_link(struct dw_pcie *pci);
  void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index,
···
  
  static inline void dw_pcie_writel_dbi(struct dw_pcie *pci, u32 reg, u32 val)
  {
- 	__dw_pcie_write_dbi(pci, pci->dbi_base, reg, 0x4, val);
+ 	dw_pcie_write_dbi(pci, reg, 0x4, val);
  }
  
  static inline u32 dw_pcie_readl_dbi(struct dw_pcie *pci, u32 reg)
  {
- 	return __dw_pcie_read_dbi(pci, pci->dbi_base, reg, 0x4);
+ 	return dw_pcie_read_dbi(pci, reg, 0x4);
  }
  
  static inline void dw_pcie_writew_dbi(struct dw_pcie *pci, u32 reg, u16 val)
  {
- 	__dw_pcie_write_dbi(pci, pci->dbi_base, reg, 0x2, val);
+ 	dw_pcie_write_dbi(pci, reg, 0x2, val);
  }
  
  static inline u16 dw_pcie_readw_dbi(struct dw_pcie *pci, u32 reg)
  {
- 	return __dw_pcie_read_dbi(pci, pci->dbi_base, reg, 0x2);
+ 	return dw_pcie_read_dbi(pci, reg, 0x2);
  }
  
  static inline void dw_pcie_writeb_dbi(struct dw_pcie *pci, u32 reg, u8 val)
  {
- 	__dw_pcie_write_dbi(pci, pci->dbi_base, reg, 0x1, val);
+ 	dw_pcie_write_dbi(pci, reg, 0x1, val);
  }
  
  static inline u8 dw_pcie_readb_dbi(struct dw_pcie *pci, u32 reg)
  {
- 	return __dw_pcie_read_dbi(pci, pci->dbi_base, reg, 0x1);
+ 	return dw_pcie_read_dbi(pci, reg, 0x1);
  }
  
  static inline void dw_pcie_writel_dbi2(struct dw_pcie *pci, u32 reg, u32 val)
  {
- 	__dw_pcie_write_dbi2(pci, pci->dbi_base2, reg, 0x4, val);
+ 	dw_pcie_write_dbi2(pci, reg, 0x4, val);
  }
  
  static inline u32 dw_pcie_readl_dbi2(struct dw_pcie *pci, u32 reg)
  {
- 	return __dw_pcie_read_dbi2(pci, pci->dbi_base2, reg, 0x4);
+ 	return dw_pcie_read_dbi2(pci, reg, 0x4);
  }
  
  static inline void dw_pcie_writel_atu(struct dw_pcie *pci, u32 reg, u32 val)
  {
- 	__dw_pcie_write_dbi(pci, pci->atu_base, reg, 0x4, val);
+ 	dw_pcie_write_atu(pci, reg, 0x4, val);
  }
  
  static inline u32 dw_pcie_readl_atu(struct dw_pcie *pci, u32 reg)
  {
- 	return __dw_pcie_read_dbi(pci, pci->atu_base, reg, 0x4);
+ 	return dw_pcie_read_atu(pci, reg, 0x4);
  }
  
  static inline void dw_pcie_dbi_ro_wr_en(struct dw_pcie *pci)
···
  void dw_pcie_free_msi(struct pcie_port *pp);
  void dw_pcie_setup_rc(struct pcie_port *pp);
  int dw_pcie_host_init(struct pcie_port *pp);
+ void dw_pcie_host_deinit(struct pcie_port *pp);
  int dw_pcie_allocate_domains(struct pcie_port *pp);
  #else
  static inline irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
···
  static inline int dw_pcie_host_init(struct pcie_port *pp)
  {
  	return 0;
+ }
+ 
+ static inline void dw_pcie_host_deinit(struct pcie_port *pp)
+ {
  }
  
  static inline int dw_pcie_allocate_domains(struct pcie_port *pp)
+1 -1
drivers/pci/controller/dwc/pcie-kirin.c
···
  /*
   * PCIe host controller driver for Kirin Phone SoCs
   *
- * Copyright (C) 2017 Hilisicon Electronics Co., Ltd.
+ * Copyright (C) 2017 HiSilicon Electronics Co., Ltd.
   *		http://www.huawei.com
   *
   * Author: Xiaowei Song <songxiaowei@huawei.com>
+50 -57
drivers/pci/controller/dwc/pcie-qcom.c
···
 	struct regulator_bulk_data supplies[QCOM_PCIE_2_3_2_MAX_SUPPLY];
 };
 
+#define QCOM_PCIE_2_4_0_MAX_CLOCKS	4
 struct qcom_pcie_resources_2_4_0 {
-	struct clk *aux_clk;
-	struct clk *master_clk;
-	struct clk *slave_clk;
+	struct clk_bulk_data clks[QCOM_PCIE_2_4_0_MAX_CLOCKS];
+	int num_clks;
 	struct reset_control *axi_m_reset;
 	struct reset_control *axi_s_reset;
 	struct reset_control *pipe_reset;
···
 
 static void qcom_ep_reset_deassert(struct qcom_pcie *pcie)
 {
+	/* Ensure that PERST has been asserted for at least 100 ms */
+	msleep(100);
 	gpiod_set_value_cansleep(pcie->reset, 0);
 	usleep_range(PERST_DELAY_US, PERST_DELAY_US + 500);
 }
···
 	struct qcom_pcie_resources_2_4_0 *res = &pcie->res.v2_4_0;
 	struct dw_pcie *pci = pcie->pci;
 	struct device *dev = pci->dev;
+	bool is_ipq = of_device_is_compatible(dev->of_node, "qcom,pcie-ipq4019");
+	int ret;
 
-	res->aux_clk = devm_clk_get(dev, "aux");
-	if (IS_ERR(res->aux_clk))
-		return PTR_ERR(res->aux_clk);
+	res->clks[0].id = "aux";
+	res->clks[1].id = "master_bus";
+	res->clks[2].id = "slave_bus";
+	res->clks[3].id = "iface";
 
-	res->master_clk = devm_clk_get(dev, "master_bus");
-	if (IS_ERR(res->master_clk))
-		return PTR_ERR(res->master_clk);
+	/* qcom,pcie-ipq4019 is defined without "iface" */
+	res->num_clks = is_ipq ? 3 : 4;
 
-	res->slave_clk = devm_clk_get(dev, "slave_bus");
-	if (IS_ERR(res->slave_clk))
-		return PTR_ERR(res->slave_clk);
+	ret = devm_clk_bulk_get(dev, res->num_clks, res->clks);
+	if (ret < 0)
+		return ret;
 
 	res->axi_m_reset = devm_reset_control_get_exclusive(dev, "axi_m");
 	if (IS_ERR(res->axi_m_reset))
···
 	if (IS_ERR(res->axi_s_reset))
 		return PTR_ERR(res->axi_s_reset);
 
-	res->pipe_reset = devm_reset_control_get_exclusive(dev, "pipe");
-	if (IS_ERR(res->pipe_reset))
-		return PTR_ERR(res->pipe_reset);
+	if (is_ipq) {
+		/*
+		 * These resources relates to the PHY or are secure clocks, but
+		 * are controlled here for IPQ4019
+		 */
+		res->pipe_reset = devm_reset_control_get_exclusive(dev, "pipe");
+		if (IS_ERR(res->pipe_reset))
+			return PTR_ERR(res->pipe_reset);
 
-	res->axi_m_vmid_reset = devm_reset_control_get_exclusive(dev,
-								 "axi_m_vmid");
-	if (IS_ERR(res->axi_m_vmid_reset))
-		return PTR_ERR(res->axi_m_vmid_reset);
+		res->axi_m_vmid_reset = devm_reset_control_get_exclusive(dev,
+									 "axi_m_vmid");
+		if (IS_ERR(res->axi_m_vmid_reset))
+			return PTR_ERR(res->axi_m_vmid_reset);
 
-	res->axi_s_xpu_reset = devm_reset_control_get_exclusive(dev,
-								"axi_s_xpu");
-	if (IS_ERR(res->axi_s_xpu_reset))
-		return PTR_ERR(res->axi_s_xpu_reset);
+		res->axi_s_xpu_reset = devm_reset_control_get_exclusive(dev,
+									"axi_s_xpu");
+		if (IS_ERR(res->axi_s_xpu_reset))
+			return PTR_ERR(res->axi_s_xpu_reset);
 
-	res->parf_reset = devm_reset_control_get_exclusive(dev, "parf");
-	if (IS_ERR(res->parf_reset))
-		return PTR_ERR(res->parf_reset);
+		res->parf_reset = devm_reset_control_get_exclusive(dev, "parf");
+		if (IS_ERR(res->parf_reset))
+			return PTR_ERR(res->parf_reset);
 
-	res->phy_reset = devm_reset_control_get_exclusive(dev, "phy");
-	if (IS_ERR(res->phy_reset))
-		return PTR_ERR(res->phy_reset);
+		res->phy_reset = devm_reset_control_get_exclusive(dev, "phy");
+		if (IS_ERR(res->phy_reset))
+			return PTR_ERR(res->phy_reset);
+	}
 
 	res->axi_m_sticky_reset = devm_reset_control_get_exclusive(dev,
 								   "axi_m_sticky");
···
 	if (IS_ERR(res->ahb_reset))
 		return PTR_ERR(res->ahb_reset);
 
-	res->phy_ahb_reset = devm_reset_control_get_exclusive(dev, "phy_ahb");
-	if (IS_ERR(res->phy_ahb_reset))
-		return PTR_ERR(res->phy_ahb_reset);
+	if (is_ipq) {
+		res->phy_ahb_reset = devm_reset_control_get_exclusive(dev, "phy_ahb");
+		if (IS_ERR(res->phy_ahb_reset))
+			return PTR_ERR(res->phy_ahb_reset);
+	}
 
 	return 0;
 }
···
 	reset_control_assert(res->axi_m_sticky_reset);
 	reset_control_assert(res->pwr_reset);
 	reset_control_assert(res->ahb_reset);
-	clk_disable_unprepare(res->aux_clk);
-	clk_disable_unprepare(res->master_clk);
-	clk_disable_unprepare(res->slave_clk);
+	clk_bulk_disable_unprepare(res->num_clks, res->clks);
 }
 
 static int qcom_pcie_init_2_4_0(struct qcom_pcie *pcie)
···
 
 	usleep_range(10000, 12000);
 
-	ret = clk_prepare_enable(res->aux_clk);
-	if (ret) {
-		dev_err(dev, "cannot prepare/enable iface clock\n");
-		goto err_clk_aux;
-	}
-
-	ret = clk_prepare_enable(res->master_clk);
-	if (ret) {
-		dev_err(dev, "cannot prepare/enable core clock\n");
-		goto err_clk_axi_m;
-	}
-
-	ret = clk_prepare_enable(res->slave_clk);
-	if (ret) {
-		dev_err(dev, "cannot prepare/enable phy clock\n");
-		goto err_clk_axi_s;
-	}
+	ret = clk_bulk_prepare_enable(res->num_clks, res->clks);
+	if (ret)
+		goto err_clks;
 
 	/* enable PCIe clocks and resets */
 	val = readl(pcie->parf + PCIE20_PARF_PHY_CTRL);
···
 
 	return 0;
 
-err_clk_axi_s:
-	clk_disable_unprepare(res->master_clk);
-err_clk_axi_m:
-	clk_disable_unprepare(res->aux_clk);
-err_clk_aux:
+err_clks:
 	reset_control_assert(res->ahb_reset);
 err_rst_ahb:
 	reset_control_assert(res->pwr_reset);
···
 	{ .compatible = "qcom,pcie-msm8996", .data = &ops_2_3_2 },
 	{ .compatible = "qcom,pcie-ipq8074", .data = &ops_2_3_3 },
 	{ .compatible = "qcom,pcie-ipq4019", .data = &ops_2_4_0 },
+	{ .compatible = "qcom,pcie-qcs404", .data = &ops_2_4_0 },
 	{ }
 };
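The qcom conversion above replaces three devm_clk_get()/clk_prepare_enable() pairs with one array handled by the clk bulk API, which also collapses the error-unwind goto ladder into a single label: a bulk enable either enables every clock or rolls back the ones it already enabled. A hedged userspace sketch of that all-or-nothing idea, assuming toy stand-ins (`struct clk_bulk_stub`, `bulk_enable()` and the `fail_at` failure injection are illustrative, not the kernel clk API):

```c
#define MAX_CLOCKS 4

/* Toy stand-in for struct clk_bulk_data: a name plus an enabled flag. */
struct clk_bulk_stub {
	const char *id;
	int enabled;
};

/*
 * Enable clocks in order; if one fails (simulated by fail_at), unwind
 * the clocks already enabled so the caller sees no partial state.
 */
static int bulk_enable(int num, struct clk_bulk_stub *clks, int fail_at)
{
	int i;

	for (i = 0; i < num; i++)
		clks[i].enabled = 0;

	for (i = 0; i < num; i++) {
		if (i == fail_at)
			goto err;	/* e.g. clk_prepare_enable() < 0 */
		clks[i].enabled = 1;
	}
	return 0;

err:
	while (--i >= 0)
		clks[i].enabled = 0;
	return -1;
}

static int bulk_count_enabled(int num, const struct clk_bulk_stub *clks)
{
	int i, n = 0;

	for (i = 0; i < num; i++)
		n += clks[i].enabled;
	return n;
}

/* Same clock names as the driver; IPQ4019 would pass num = 3 (no "iface"). */
static struct clk_bulk_stub qcom_clks[MAX_CLOCKS] = {
	{ "aux" }, { "master_bus" }, { "slave_bus" }, { "iface" },
};
```

Passing a count smaller than the array, as the driver does for qcom,pcie-ipq4019, simply leaves the trailing entries unused.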
+1 -1
drivers/pci/controller/pci-aardvark.c
···
 
 	advk_writel(pcie, PCIE_ISR1_ALL_MASK, PCIE_ISR1_MASK_REG);
 
-	/* Unmask all MSI's */
+	/* Unmask all MSIs */
 	advk_writel(pcie, 0, PCIE_MSI_MASK_REG);
 
 	/* Enable summary interrupt for GIC SPI source */
+9 -6
drivers/pci/controller/pci-hyperv.c
···
 static void hv_eject_device_work(struct work_struct *work)
 {
 	struct pci_eject_response *ejct_pkt;
+	struct hv_pcibus_device *hbus;
 	struct hv_pci_dev *hpdev;
 	struct pci_dev *pdev;
 	unsigned long flags;
···
 	} ctxt;
 
 	hpdev = container_of(work, struct hv_pci_dev, wrk);
+	hbus = hpdev->hbus;
 
 	WARN_ON(hpdev->state != hv_pcichild_ejecting);
 
···
 	 * because hbus->pci_bus may not exist yet.
 	 */
 	wslot = wslot_to_devfn(hpdev->desc.win_slot.slot);
-	pdev = pci_get_domain_bus_and_slot(hpdev->hbus->sysdata.domain, 0,
-					   wslot);
+	pdev = pci_get_domain_bus_and_slot(hbus->sysdata.domain, 0, wslot);
 	if (pdev) {
 		pci_lock_rescan_remove();
 		pci_stop_and_remove_bus_device(pdev);
···
 		pci_unlock_rescan_remove();
 	}
 
-	spin_lock_irqsave(&hpdev->hbus->device_list_lock, flags);
+	spin_lock_irqsave(&hbus->device_list_lock, flags);
 	list_del(&hpdev->list_entry);
-	spin_unlock_irqrestore(&hpdev->hbus->device_list_lock, flags);
+	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
 
 	if (hpdev->pci_slot)
 		pci_destroy_slot(hpdev->pci_slot);
···
 	ejct_pkt = (struct pci_eject_response *)&ctxt.pkt.message;
 	ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE;
 	ejct_pkt->wslot.slot = hpdev->desc.win_slot.slot;
-	vmbus_sendpacket(hpdev->hbus->hdev->channel, ejct_pkt,
+	vmbus_sendpacket(hbus->hdev->channel, ejct_pkt,
 			 sizeof(*ejct_pkt), (unsigned long)&ctxt.pkt,
 			 VM_PKT_DATA_INBAND, 0);
···
 	/* For the two refs got in new_pcichild_device() */
 	put_pcichild(hpdev);
 	put_pcichild(hpdev);
-	put_hvpcibus(hpdev->hbus);
+	/* hpdev has been freed. Do not use it any more. */
+
+	put_hvpcibus(hbus);
 }
 
 /**
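The hyperv eject fix above works by caching hpdev->hbus in a local variable before the final put_pcichild() can free hpdev; every later access then goes through the cached pointer instead of the freed object. A hedged sketch of that cache-before-free pattern, assuming a toy single-threaded refcount model (the `_stub` types and helper bodies are illustrative, only the names mirror the driver):

```c
#include <stdlib.h>

struct hv_pcibus_stub {
	int refs;
};

struct hv_pci_dev_stub {
	struct hv_pcibus_stub *hbus;
	int refs;
};

static void put_pcichild(struct hv_pci_dev_stub *hpdev)
{
	if (--hpdev->refs == 0)
		free(hpdev);	/* after this, hpdev must not be touched */
}

static void put_hvpcibus(struct hv_pcibus_stub *hbus)
{
	hbus->refs--;
}

/* Returns the bus refcount after ejection, reached via the cached pointer. */
static int eject_device(struct hv_pci_dev_stub *hpdev)
{
	struct hv_pcibus_stub *hbus = hpdev->hbus;	/* cache before any put */

	put_pcichild(hpdev);
	put_pcichild(hpdev);	/* hpdev may be freed right here */

	put_hvpcibus(hbus);	/* safe: does not dereference hpdev */
	return hbus->refs;
}

static struct hv_pci_dev_stub *make_child(struct hv_pcibus_stub *hbus)
{
	struct hv_pci_dev_stub *hpdev = malloc(sizeof(*hpdev));

	hpdev->hbus = hbus;
	hpdev->refs = 2;	/* the two refs taken in new_pcichild_device() */
	hbus->refs++;
	return hpdev;
}
```

Had eject_device() written `put_hvpcibus(hpdev->hbus)` instead, it would read through a pointer that the second put_pcichild() may already have freed, which is exactly the use-after-free the patch removes.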
+507 -82
drivers/pci/controller/pci-tegra.c
··· 17 17 #include <linux/debugfs.h> 18 18 #include <linux/delay.h> 19 19 #include <linux/export.h> 20 + #include <linux/gpio/consumer.h> 20 21 #include <linux/interrupt.h> 21 22 #include <linux/iopoll.h> 22 23 #include <linux/irq.h> ··· 31 30 #include <linux/of_platform.h> 32 31 #include <linux/pci.h> 33 32 #include <linux/phy/phy.h> 33 + #include <linux/pinctrl/consumer.h> 34 34 #include <linux/platform_device.h> 35 35 #include <linux/reset.h> 36 36 #include <linux/sizes.h> ··· 97 95 #define AFI_MSI_EN_VEC7 0xa8 98 96 99 97 #define AFI_CONFIGURATION 0xac 100 - #define AFI_CONFIGURATION_EN_FPCI (1 << 0) 98 + #define AFI_CONFIGURATION_EN_FPCI (1 << 0) 99 + #define AFI_CONFIGURATION_CLKEN_OVERRIDE (1 << 31) 101 100 102 101 #define AFI_FPCI_ERROR_MASKS 0xb0 103 102 ··· 162 159 #define AFI_PCIE_CONFIG_SM2TMS0_XBAR_CONFIG_211 (0x1 << 20) 163 160 #define AFI_PCIE_CONFIG_SM2TMS0_XBAR_CONFIG_411 (0x2 << 20) 164 161 #define AFI_PCIE_CONFIG_SM2TMS0_XBAR_CONFIG_111 (0x2 << 20) 162 + #define AFI_PCIE_CONFIG_PCIE_CLKREQ_GPIO(x) (1 << ((x) + 29)) 163 + #define AFI_PCIE_CONFIG_PCIE_CLKREQ_GPIO_ALL (0x7 << 29) 165 164 166 165 #define AFI_FUSE 0x104 167 166 #define AFI_FUSE_PCIE_T0_GEN2_DIS (1 << 2) 168 167 169 168 #define AFI_PEX0_CTRL 0x110 170 169 #define AFI_PEX1_CTRL 0x118 171 - #define AFI_PEX2_CTRL 0x128 172 170 #define AFI_PEX_CTRL_RST (1 << 0) 173 171 #define AFI_PEX_CTRL_CLKREQ_EN (1 << 1) 174 172 #define AFI_PEX_CTRL_REFCLK_EN (1 << 3) ··· 181 177 182 178 #define AFI_PEXBIAS_CTRL_0 0x168 183 179 180 + #define RP_PRIV_XP_DL 0x00000494 181 + #define RP_PRIV_XP_DL_GEN2_UPD_FC_TSHOLD (0x1ff << 1) 182 + 183 + #define RP_RX_HDR_LIMIT 0x00000e00 184 + #define RP_RX_HDR_LIMIT_PW_MASK (0xff << 8) 185 + #define RP_RX_HDR_LIMIT_PW (0x0e << 8) 186 + 187 + #define RP_ECTL_2_R1 0x00000e84 188 + #define RP_ECTL_2_R1_RX_CTLE_1C_MASK 0xffff 189 + 190 + #define RP_ECTL_4_R1 0x00000e8c 191 + #define RP_ECTL_4_R1_RX_CDR_CTRL_1C_MASK (0xffff << 16) 192 + #define 
RP_ECTL_4_R1_RX_CDR_CTRL_1C_SHIFT 16 193 + 194 + #define RP_ECTL_5_R1 0x00000e90 195 + #define RP_ECTL_5_R1_RX_EQ_CTRL_L_1C_MASK 0xffffffff 196 + 197 + #define RP_ECTL_6_R1 0x00000e94 198 + #define RP_ECTL_6_R1_RX_EQ_CTRL_H_1C_MASK 0xffffffff 199 + 200 + #define RP_ECTL_2_R2 0x00000ea4 201 + #define RP_ECTL_2_R2_RX_CTLE_1C_MASK 0xffff 202 + 203 + #define RP_ECTL_4_R2 0x00000eac 204 + #define RP_ECTL_4_R2_RX_CDR_CTRL_1C_MASK (0xffff << 16) 205 + #define RP_ECTL_4_R2_RX_CDR_CTRL_1C_SHIFT 16 206 + 207 + #define RP_ECTL_5_R2 0x00000eb0 208 + #define RP_ECTL_5_R2_RX_EQ_CTRL_L_1C_MASK 0xffffffff 209 + 210 + #define RP_ECTL_6_R2 0x00000eb4 211 + #define RP_ECTL_6_R2_RX_EQ_CTRL_H_1C_MASK 0xffffffff 212 + 184 213 #define RP_VEND_XP 0x00000f00 185 - #define RP_VEND_XP_DL_UP (1 << 30) 214 + #define RP_VEND_XP_DL_UP (1 << 30) 215 + #define RP_VEND_XP_OPPORTUNISTIC_ACK (1 << 27) 216 + #define RP_VEND_XP_OPPORTUNISTIC_UPDATEFC (1 << 28) 217 + #define RP_VEND_XP_UPDATE_FC_THRESHOLD_MASK (0xff << 18) 218 + 219 + #define RP_VEND_CTL0 0x00000f44 220 + #define RP_VEND_CTL0_DSK_RST_PULSE_WIDTH_MASK (0xf << 12) 221 + #define RP_VEND_CTL0_DSK_RST_PULSE_WIDTH (0x9 << 12) 222 + 223 + #define RP_VEND_CTL1 0x00000f48 224 + #define RP_VEND_CTL1_ERPT (1 << 13) 225 + 226 + #define RP_VEND_XP_BIST 0x00000f4c 227 + #define RP_VEND_XP_BIST_GOTO_L1_L2_AFTER_DLLP_DONE (1 << 28) 186 228 187 229 #define RP_VEND_CTL2 0x00000fa8 188 230 #define RP_VEND_CTL2_PCA_ENABLE (1 << 7) 189 231 190 232 #define RP_PRIV_MISC 0x00000fe0 191 - #define RP_PRIV_MISC_PRSNT_MAP_EP_PRSNT (0xe << 0) 192 - #define RP_PRIV_MISC_PRSNT_MAP_EP_ABSNT (0xf << 0) 233 + #define RP_PRIV_MISC_PRSNT_MAP_EP_PRSNT (0xe << 0) 234 + #define RP_PRIV_MISC_PRSNT_MAP_EP_ABSNT (0xf << 0) 235 + #define RP_PRIV_MISC_CTLR_CLK_CLAMP_THRESHOLD_MASK (0x7f << 16) 236 + #define RP_PRIV_MISC_CTLR_CLK_CLAMP_THRESHOLD (0xf << 16) 237 + #define RP_PRIV_MISC_CTLR_CLK_CLAMP_ENABLE (1 << 23) 238 + #define RP_PRIV_MISC_TMS_CLK_CLAMP_THRESHOLD_MASK (0x7f << 
24) 239 + #define RP_PRIV_MISC_TMS_CLK_CLAMP_THRESHOLD (0xf << 24) 240 + #define RP_PRIV_MISC_TMS_CLK_CLAMP_ENABLE (1 << 31) 193 241 194 242 #define RP_LINK_CONTROL_STATUS 0x00000090 195 243 #define RP_LINK_CONTROL_STATUS_DL_LINK_ACTIVE 0x20000000 196 244 #define RP_LINK_CONTROL_STATUS_LINKSTAT_MASK 0x3fff0000 245 + 246 + #define RP_LINK_CONTROL_STATUS_2 0x000000b0 197 247 198 248 #define PADS_CTL_SEL 0x0000009c 199 249 ··· 284 226 #define PADS_REFCLK_CFG_DRVI_SHIFT 12 /* 15:12 */ 285 227 286 228 #define PME_ACK_TIMEOUT 10000 229 + #define LINK_RETRAIN_TIMEOUT 100000 /* in usec */ 287 230 288 231 struct tegra_msi { 289 232 struct msi_controller chip; ··· 308 249 unsigned int num_ports; 309 250 const struct tegra_pcie_port_soc *ports; 310 251 unsigned int msi_base_shift; 252 + unsigned long afi_pex2_ctrl; 311 253 u32 pads_pll_ctl; 312 254 u32 tx_ref_sel; 313 255 u32 pads_refclk_cfg0; 314 256 u32 pads_refclk_cfg1; 257 + u32 update_fc_threshold; 315 258 bool has_pex_clkreq_en; 316 259 bool has_pex_bias_ctrl; 317 260 bool has_intr_prsnt_sense; ··· 321 260 bool has_gen2; 322 261 bool force_pca_enable; 323 262 bool program_uphy; 263 + bool update_clamp_threshold; 264 + bool program_deskew_time; 265 + bool raw_violation_fixup; 266 + bool update_fc_timer; 267 + bool has_cache_bars; 268 + struct { 269 + struct { 270 + u32 rp_ectl_2_r1; 271 + u32 rp_ectl_4_r1; 272 + u32 rp_ectl_5_r1; 273 + u32 rp_ectl_6_r1; 274 + u32 rp_ectl_2_r2; 275 + u32 rp_ectl_4_r2; 276 + u32 rp_ectl_5_r2; 277 + u32 rp_ectl_6_r2; 278 + } regs; 279 + bool enable; 280 + } ectl; 324 281 }; 325 282 326 283 static inline struct tegra_msi *to_tegra_msi(struct msi_controller *chip) ··· 400 321 unsigned int lanes; 401 322 402 323 struct phy **phys; 324 + 325 + struct gpio_desc *reset_gpio; 403 326 }; 404 327 405 328 struct tegra_pcie_bus { ··· 521 440 522 441 static unsigned long tegra_pcie_port_get_pex_ctrl(struct tegra_pcie_port *port) 523 442 { 443 + const struct tegra_pcie_soc *soc = port->pcie->soc; 524 
444 unsigned long ret = 0; 525 445 526 446 switch (port->index) { ··· 534 452 break; 535 453 536 454 case 2: 537 - ret = AFI_PEX2_CTRL; 455 + ret = soc->afi_pex2_ctrl; 538 456 break; 539 457 } 540 458 ··· 547 465 unsigned long value; 548 466 549 467 /* pulse reset signal */ 550 - value = afi_readl(port->pcie, ctrl); 551 - value &= ~AFI_PEX_CTRL_RST; 552 - afi_writel(port->pcie, value, ctrl); 468 + if (port->reset_gpio) { 469 + gpiod_set_value(port->reset_gpio, 1); 470 + } else { 471 + value = afi_readl(port->pcie, ctrl); 472 + value &= ~AFI_PEX_CTRL_RST; 473 + afi_writel(port->pcie, value, ctrl); 474 + } 553 475 554 476 usleep_range(1000, 2000); 555 477 556 - value = afi_readl(port->pcie, ctrl); 557 - value |= AFI_PEX_CTRL_RST; 558 - afi_writel(port->pcie, value, ctrl); 478 + if (port->reset_gpio) { 479 + gpiod_set_value(port->reset_gpio, 0); 480 + } else { 481 + value = afi_readl(port->pcie, ctrl); 482 + value |= AFI_PEX_CTRL_RST; 483 + afi_writel(port->pcie, value, ctrl); 484 + } 485 + } 486 + 487 + static void tegra_pcie_enable_rp_features(struct tegra_pcie_port *port) 488 + { 489 + const struct tegra_pcie_soc *soc = port->pcie->soc; 490 + u32 value; 491 + 492 + /* Enable AER capability */ 493 + value = readl(port->base + RP_VEND_CTL1); 494 + value |= RP_VEND_CTL1_ERPT; 495 + writel(value, port->base + RP_VEND_CTL1); 496 + 497 + /* Optimal settings to enhance bandwidth */ 498 + value = readl(port->base + RP_VEND_XP); 499 + value |= RP_VEND_XP_OPPORTUNISTIC_ACK; 500 + value |= RP_VEND_XP_OPPORTUNISTIC_UPDATEFC; 501 + writel(value, port->base + RP_VEND_XP); 502 + 503 + /* 504 + * LTSSM will wait for DLLP to finish before entering L1 or L2, 505 + * to avoid truncation of PM messages which results in receiver errors 506 + */ 507 + value = readl(port->base + RP_VEND_XP_BIST); 508 + value |= RP_VEND_XP_BIST_GOTO_L1_L2_AFTER_DLLP_DONE; 509 + writel(value, port->base + RP_VEND_XP_BIST); 510 + 511 + value = readl(port->base + RP_PRIV_MISC); 512 + value |= 
RP_PRIV_MISC_CTLR_CLK_CLAMP_ENABLE; 513 + value |= RP_PRIV_MISC_TMS_CLK_CLAMP_ENABLE; 514 + 515 + if (soc->update_clamp_threshold) { 516 + value &= ~(RP_PRIV_MISC_CTLR_CLK_CLAMP_THRESHOLD_MASK | 517 + RP_PRIV_MISC_TMS_CLK_CLAMP_THRESHOLD_MASK); 518 + value |= RP_PRIV_MISC_CTLR_CLK_CLAMP_THRESHOLD | 519 + RP_PRIV_MISC_TMS_CLK_CLAMP_THRESHOLD; 520 + } 521 + 522 + writel(value, port->base + RP_PRIV_MISC); 523 + } 524 + 525 + static void tegra_pcie_program_ectl_settings(struct tegra_pcie_port *port) 526 + { 527 + const struct tegra_pcie_soc *soc = port->pcie->soc; 528 + u32 value; 529 + 530 + value = readl(port->base + RP_ECTL_2_R1); 531 + value &= ~RP_ECTL_2_R1_RX_CTLE_1C_MASK; 532 + value |= soc->ectl.regs.rp_ectl_2_r1; 533 + writel(value, port->base + RP_ECTL_2_R1); 534 + 535 + value = readl(port->base + RP_ECTL_4_R1); 536 + value &= ~RP_ECTL_4_R1_RX_CDR_CTRL_1C_MASK; 537 + value |= soc->ectl.regs.rp_ectl_4_r1 << 538 + RP_ECTL_4_R1_RX_CDR_CTRL_1C_SHIFT; 539 + writel(value, port->base + RP_ECTL_4_R1); 540 + 541 + value = readl(port->base + RP_ECTL_5_R1); 542 + value &= ~RP_ECTL_5_R1_RX_EQ_CTRL_L_1C_MASK; 543 + value |= soc->ectl.regs.rp_ectl_5_r1; 544 + writel(value, port->base + RP_ECTL_5_R1); 545 + 546 + value = readl(port->base + RP_ECTL_6_R1); 547 + value &= ~RP_ECTL_6_R1_RX_EQ_CTRL_H_1C_MASK; 548 + value |= soc->ectl.regs.rp_ectl_6_r1; 549 + writel(value, port->base + RP_ECTL_6_R1); 550 + 551 + value = readl(port->base + RP_ECTL_2_R2); 552 + value &= ~RP_ECTL_2_R2_RX_CTLE_1C_MASK; 553 + value |= soc->ectl.regs.rp_ectl_2_r2; 554 + writel(value, port->base + RP_ECTL_2_R2); 555 + 556 + value = readl(port->base + RP_ECTL_4_R2); 557 + value &= ~RP_ECTL_4_R2_RX_CDR_CTRL_1C_MASK; 558 + value |= soc->ectl.regs.rp_ectl_4_r2 << 559 + RP_ECTL_4_R2_RX_CDR_CTRL_1C_SHIFT; 560 + writel(value, port->base + RP_ECTL_4_R2); 561 + 562 + value = readl(port->base + RP_ECTL_5_R2); 563 + value &= ~RP_ECTL_5_R2_RX_EQ_CTRL_L_1C_MASK; 564 + value |= soc->ectl.regs.rp_ectl_5_r2; 565 + 
writel(value, port->base + RP_ECTL_5_R2); 566 + 567 + value = readl(port->base + RP_ECTL_6_R2); 568 + value &= ~RP_ECTL_6_R2_RX_EQ_CTRL_H_1C_MASK; 569 + value |= soc->ectl.regs.rp_ectl_6_r2; 570 + writel(value, port->base + RP_ECTL_6_R2); 571 + } 572 + 573 + static void tegra_pcie_apply_sw_fixup(struct tegra_pcie_port *port) 574 + { 575 + const struct tegra_pcie_soc *soc = port->pcie->soc; 576 + u32 value; 577 + 578 + /* 579 + * Sometimes link speed change from Gen2 to Gen1 fails due to 580 + * instability in deskew logic on lane-0. Increase the deskew 581 + * retry time to resolve this issue. 582 + */ 583 + if (soc->program_deskew_time) { 584 + value = readl(port->base + RP_VEND_CTL0); 585 + value &= ~RP_VEND_CTL0_DSK_RST_PULSE_WIDTH_MASK; 586 + value |= RP_VEND_CTL0_DSK_RST_PULSE_WIDTH; 587 + writel(value, port->base + RP_VEND_CTL0); 588 + } 589 + 590 + /* Fixup for read after write violation. */ 591 + if (soc->raw_violation_fixup) { 592 + value = readl(port->base + RP_RX_HDR_LIMIT); 593 + value &= ~RP_RX_HDR_LIMIT_PW_MASK; 594 + value |= RP_RX_HDR_LIMIT_PW; 595 + writel(value, port->base + RP_RX_HDR_LIMIT); 596 + 597 + value = readl(port->base + RP_PRIV_XP_DL); 598 + value |= RP_PRIV_XP_DL_GEN2_UPD_FC_TSHOLD; 599 + writel(value, port->base + RP_PRIV_XP_DL); 600 + 601 + value = readl(port->base + RP_VEND_XP); 602 + value &= ~RP_VEND_XP_UPDATE_FC_THRESHOLD_MASK; 603 + value |= soc->update_fc_threshold; 604 + writel(value, port->base + RP_VEND_XP); 605 + } 606 + 607 + if (soc->update_fc_timer) { 608 + value = readl(port->base + RP_VEND_XP); 609 + value &= ~RP_VEND_XP_UPDATE_FC_THRESHOLD_MASK; 610 + value |= soc->update_fc_threshold; 611 + writel(value, port->base + RP_VEND_XP); 612 + } 613 + 614 + /* 615 + * PCIe link doesn't come up with few legacy PCIe endpoints if 616 + * root port advertises both Gen-1 and Gen-2 speeds in Tegra. 
617 + * Hence, the strategy followed here is to initially advertise 618 + * only Gen-1 and after link is up, retrain link to Gen-2 speed 619 + */ 620 + value = readl(port->base + RP_LINK_CONTROL_STATUS_2); 621 + value &= ~PCI_EXP_LNKSTA_CLS; 622 + value |= PCI_EXP_LNKSTA_CLS_2_5GB; 623 + writel(value, port->base + RP_LINK_CONTROL_STATUS_2); 559 624 } 560 625 561 626 static void tegra_pcie_port_enable(struct tegra_pcie_port *port) ··· 729 500 value |= RP_VEND_CTL2_PCA_ENABLE; 730 501 writel(value, port->base + RP_VEND_CTL2); 731 502 } 503 + 504 + tegra_pcie_enable_rp_features(port); 505 + 506 + if (soc->ectl.enable) 507 + tegra_pcie_program_ectl_settings(port); 508 + 509 + tegra_pcie_apply_sw_fixup(port); 732 510 } 733 511 734 512 static void tegra_pcie_port_disable(struct tegra_pcie_port *port) ··· 757 521 758 522 value &= ~AFI_PEX_CTRL_REFCLK_EN; 759 523 afi_writel(port->pcie, value, ctrl); 524 + 525 + /* disable PCIe port and set CLKREQ# as GPIO to allow PLLE power down */ 526 + value = afi_readl(port->pcie, AFI_PCIE_CONFIG); 527 + value |= AFI_PCIE_CONFIG_PCIE_DISABLE(port->index); 528 + value |= AFI_PCIE_CONFIG_PCIE_CLKREQ_GPIO(port->index); 529 + afi_writel(port->pcie, value, AFI_PCIE_CONFIG); 760 530 } 761 531 762 532 static void tegra_pcie_port_free(struct tegra_pcie_port *port) ··· 787 545 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x0e1c, tegra_pcie_fixup_class); 788 546 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x0e1d, tegra_pcie_fixup_class); 789 547 790 - /* Tegra PCIE requires relaxed ordering */ 548 + /* Tegra20 and Tegra30 PCIE requires relaxed ordering */ 791 549 static void tegra_pcie_relax_enable(struct pci_dev *dev) 792 550 { 793 551 pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN); 794 552 } 795 - DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, tegra_pcie_relax_enable); 553 + DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0bf0, tegra_pcie_relax_enable); 554 + DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0bf1, 
tegra_pcie_relax_enable); 555 + DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0e1c, tegra_pcie_relax_enable); 556 + DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0e1d, tegra_pcie_relax_enable); 796 557 797 558 static int tegra_pcie_request_resources(struct tegra_pcie *pcie) 798 559 { ··· 880 635 * do not pollute kernel log with master abort reports since they 881 636 * happen a lot during enumeration 882 637 */ 883 - if (code == AFI_INTR_MASTER_ABORT) 638 + if (code == AFI_INTR_MASTER_ABORT || code == AFI_INTR_PE_PRSNT_SENSE) 884 639 dev_dbg(dev, "%s, signature: %08x\n", err_msg[code], signature); 885 640 else 886 641 dev_err(dev, "%s, signature: %08x\n", err_msg[code], signature); ··· 949 704 afi_writel(pcie, 0, AFI_AXI_BAR5_SZ); 950 705 afi_writel(pcie, 0, AFI_FPCI_BAR5); 951 706 952 - /* map all upstream transactions as uncached */ 953 - afi_writel(pcie, 0, AFI_CACHE_BAR0_ST); 954 - afi_writel(pcie, 0, AFI_CACHE_BAR0_SZ); 955 - afi_writel(pcie, 0, AFI_CACHE_BAR1_ST); 956 - afi_writel(pcie, 0, AFI_CACHE_BAR1_SZ); 707 + if (pcie->soc->has_cache_bars) { 708 + /* map all upstream transactions as uncached */ 709 + afi_writel(pcie, 0, AFI_CACHE_BAR0_ST); 710 + afi_writel(pcie, 0, AFI_CACHE_BAR0_SZ); 711 + afi_writel(pcie, 0, AFI_CACHE_BAR1_ST); 712 + afi_writel(pcie, 0, AFI_CACHE_BAR1_SZ); 713 + } 957 714 958 715 /* MSI translations are setup only when needed */ 959 716 afi_writel(pcie, 0, AFI_MSI_FPCI_BAR_ST); ··· 1099 852 static int tegra_pcie_phy_power_on(struct tegra_pcie *pcie) 1100 853 { 1101 854 struct device *dev = pcie->dev; 1102 - const struct tegra_pcie_soc *soc = pcie->soc; 1103 855 struct tegra_pcie_port *port; 1104 856 int err; 1105 857 ··· 1123 877 return err; 1124 878 } 1125 879 } 1126 - 1127 - /* Configure the reference clock driver */ 1128 - pads_writel(pcie, soc->pads_refclk_cfg0, PADS_REFCLK_CFG0); 1129 - 1130 - if (soc->num_ports > 2) 1131 - pads_writel(pcie, soc->pads_refclk_cfg1, PADS_REFCLK_CFG1); 1132 880 1133 881 return 0; 1134 882 } ··· 
1158 918 return 0; 1159 919 } 1160 920 1161 - static int tegra_pcie_enable_controller(struct tegra_pcie *pcie) 921 + static void tegra_pcie_enable_controller(struct tegra_pcie *pcie) 1162 922 { 1163 - struct device *dev = pcie->dev; 1164 923 const struct tegra_pcie_soc *soc = pcie->soc; 1165 924 struct tegra_pcie_port *port; 1166 925 unsigned long value; 1167 - int err; 1168 926 1169 927 /* enable PLL power down */ 1170 928 if (pcie->phy) { ··· 1180 942 value = afi_readl(pcie, AFI_PCIE_CONFIG); 1181 943 value &= ~AFI_PCIE_CONFIG_SM2TMS0_XBAR_CONFIG_MASK; 1182 944 value |= AFI_PCIE_CONFIG_PCIE_DISABLE_ALL | pcie->xbar_config; 945 + value |= AFI_PCIE_CONFIG_PCIE_CLKREQ_GPIO_ALL; 1183 946 1184 - list_for_each_entry(port, &pcie->ports, list) 947 + list_for_each_entry(port, &pcie->ports, list) { 1185 948 value &= ~AFI_PCIE_CONFIG_PCIE_DISABLE(port->index); 949 + value &= ~AFI_PCIE_CONFIG_PCIE_CLKREQ_GPIO(port->index); 950 + } 1186 951 1187 952 afi_writel(pcie, value, AFI_PCIE_CONFIG); 1188 953 ··· 1199 958 afi_writel(pcie, value, AFI_FUSE); 1200 959 } 1201 960 1202 - if (soc->program_uphy) { 1203 - err = tegra_pcie_phy_power_on(pcie); 1204 - if (err < 0) { 1205 - dev_err(dev, "failed to power on PHY(s): %d\n", err); 1206 - return err; 1207 - } 1208 - } 1209 - 1210 - /* take the PCIe interface module out of reset */ 1211 - reset_control_deassert(pcie->pcie_xrst); 1212 - 1213 - /* finally enable PCIe */ 961 + /* Disable AFI dynamic clock gating and enable PCIe */ 1214 962 value = afi_readl(pcie, AFI_CONFIGURATION); 1215 963 value |= AFI_CONFIGURATION_EN_FPCI; 964 + value |= AFI_CONFIGURATION_CLKEN_OVERRIDE; 1216 965 afi_writel(pcie, value, AFI_CONFIGURATION); 1217 966 1218 967 value = AFI_INTR_EN_INI_SLVERR | AFI_INTR_EN_INI_DECERR | ··· 1220 989 1221 990 /* disable all exceptions */ 1222 991 afi_writel(pcie, 0, AFI_FPCI_ERROR_MASKS); 1223 - 1224 - return 0; 1225 - } 1226 - 1227 - static void tegra_pcie_disable_controller(struct tegra_pcie *pcie) 1228 - { 1229 - int err; 
1230 - 1231 - reset_control_assert(pcie->pcie_xrst); 1232 - 1233 - if (pcie->soc->program_uphy) { 1234 - err = tegra_pcie_phy_power_off(pcie); 1235 - if (err < 0) 1236 - dev_err(pcie->dev, "failed to power off PHY(s): %d\n", 1237 - err); 1238 - } 1239 992 } 1240 993 1241 994 static void tegra_pcie_power_off(struct tegra_pcie *pcie) ··· 1229 1014 int err; 1230 1015 1231 1016 reset_control_assert(pcie->afi_rst); 1232 - reset_control_assert(pcie->pex_rst); 1233 1017 1234 1018 clk_disable_unprepare(pcie->pll_e); 1235 1019 if (soc->has_cml_clk) 1236 1020 clk_disable_unprepare(pcie->cml_clk); 1237 1021 clk_disable_unprepare(pcie->afi_clk); 1238 - clk_disable_unprepare(pcie->pex_clk); 1239 1022 1240 1023 if (!dev->pm_domain) 1241 1024 tegra_powergate_power_off(TEGRA_POWERGATE_PCIE); ··· 1261 1048 if (err < 0) 1262 1049 dev_err(dev, "failed to enable regulators: %d\n", err); 1263 1050 1264 - if (dev->pm_domain) { 1265 - err = clk_prepare_enable(pcie->pex_clk); 1051 + if (!dev->pm_domain) { 1052 + err = tegra_powergate_power_on(TEGRA_POWERGATE_PCIE); 1266 1053 if (err) { 1267 - dev_err(dev, "failed to enable PEX clock: %d\n", err); 1268 - return err; 1054 + dev_err(dev, "failed to power ungate: %d\n", err); 1055 + goto regulator_disable; 1269 1056 } 1270 - reset_control_deassert(pcie->pex_rst); 1271 - } else { 1272 - err = tegra_powergate_sequence_power_up(TEGRA_POWERGATE_PCIE, 1273 - pcie->pex_clk, 1274 - pcie->pex_rst); 1057 + err = tegra_powergate_remove_clamping(TEGRA_POWERGATE_PCIE); 1275 1058 if (err) { 1276 - dev_err(dev, "powerup sequence failed: %d\n", err); 1277 - return err; 1059 + dev_err(dev, "failed to remove clamp: %d\n", err); 1060 + goto powergate; 1278 1061 } 1279 1062 } 1280 - 1281 - reset_control_deassert(pcie->afi_rst); 1282 1063 1283 1064 err = clk_prepare_enable(pcie->afi_clk); 1284 1065 if (err < 0) { 1285 1066 dev_err(dev, "failed to enable AFI clock: %d\n", err); 1286 - return err; 1067 + goto powergate; 1287 1068 } 1288 1069 1289 1070 if 
(soc->has_cml_clk) { 1290 1071 err = clk_prepare_enable(pcie->cml_clk); 1291 1072 if (err < 0) { 1292 1073 dev_err(dev, "failed to enable CML clock: %d\n", err); 1293 - return err; 1074 + goto disable_afi_clk; 1294 1075 } 1295 1076 } 1296 1077 1297 1078 err = clk_prepare_enable(pcie->pll_e); 1298 1079 if (err < 0) { 1299 1080 dev_err(dev, "failed to enable PLLE clock: %d\n", err); 1300 - return err; 1081 + goto disable_cml_clk; 1301 1082 } 1302 1083 1084 + reset_control_deassert(pcie->afi_rst); 1085 + 1303 1086 return 0; 1087 + 1088 + disable_cml_clk: 1089 + if (soc->has_cml_clk) 1090 + clk_disable_unprepare(pcie->cml_clk); 1091 + disable_afi_clk: 1092 + clk_disable_unprepare(pcie->afi_clk); 1093 + powergate: 1094 + if (!dev->pm_domain) 1095 + tegra_powergate_power_off(TEGRA_POWERGATE_PCIE); 1096 + regulator_disable: 1097 + regulator_bulk_disable(pcie->num_supplies, pcie->supplies); 1098 + 1099 + return err; 1100 + } 1101 + 1102 + static void tegra_pcie_apply_pad_settings(struct tegra_pcie *pcie) 1103 + { 1104 + const struct tegra_pcie_soc *soc = pcie->soc; 1105 + 1106 + /* Configure the reference clock driver */ 1107 + pads_writel(pcie, soc->pads_refclk_cfg0, PADS_REFCLK_CFG0); 1108 + 1109 + if (soc->num_ports > 2) 1110 + pads_writel(pcie, soc->pads_refclk_cfg1, PADS_REFCLK_CFG1); 1304 1111 } 1305 1112 1306 1113 static int tegra_pcie_clocks_get(struct tegra_pcie *pcie) ··· 1880 1647 return 0; 1881 1648 } 1882 1649 1650 + static void tegra_pcie_disable_interrupts(struct tegra_pcie *pcie) 1651 + { 1652 + u32 value; 1653 + 1654 + value = afi_readl(pcie, AFI_INTR_MASK); 1655 + value &= ~AFI_INTR_MASK_INT_MASK; 1656 + afi_writel(pcie, value, AFI_INTR_MASK); 1657 + } 1658 + 1883 1659 static int tegra_pcie_get_xbar_config(struct tegra_pcie *pcie, u32 lanes, 1884 1660 u32 *xbar) 1885 1661 { ··· 2232 1990 struct tegra_pcie_port *rp; 2233 1991 unsigned int index; 2234 1992 u32 value; 1993 + char *label; 2235 1994 2236 1995 err = of_pci_get_devfn(port); 2237 1996 if (err < 
0) { ··· 2291 2048 if (IS_ERR(rp->base)) 2292 2049 return PTR_ERR(rp->base); 2293 2050 2051 + label = devm_kasprintf(dev, GFP_KERNEL, "pex-reset-%u", index); 2052 + if (!label) { 2053 + dev_err(dev, "failed to create reset GPIO label\n"); 2054 + return -ENOMEM; 2055 + } 2056 + 2057 + /* 2058 + * Returns -ENOENT if reset-gpios property is not populated 2059 + * and in this case fall back to using AFI per port register 2060 + * to toggle PERST# SFIO line. 2061 + */ 2062 + rp->reset_gpio = devm_gpiod_get_from_of_node(dev, port, 2063 + "reset-gpios", 0, 2064 + GPIOD_OUT_LOW, 2065 + label); 2066 + if (IS_ERR(rp->reset_gpio)) { 2067 + if (PTR_ERR(rp->reset_gpio) == -ENOENT) { 2068 + rp->reset_gpio = NULL; 2069 + } else { 2070 + dev_err(dev, "failed to get reset GPIO: %d\n", 2071 + err); 2072 + return PTR_ERR(rp->reset_gpio); 2073 + } 2074 + } 2075 + 2294 2076 list_add_tail(&rp->list, &pcie->ports); 2295 2077 } 2296 2078 ··· 2363 2095 } while (--timeout); 2364 2096 2365 2097 if (!timeout) { 2366 - dev_err(dev, "link %u down, retrying\n", port->index); 2098 + dev_dbg(dev, "link %u down, retrying\n", port->index); 2367 2099 goto retry; 2368 2100 } 2369 2101 ··· 2385 2117 return false; 2386 2118 } 2387 2119 2120 + static void tegra_pcie_change_link_speed(struct tegra_pcie *pcie) 2121 + { 2122 + struct device *dev = pcie->dev; 2123 + struct tegra_pcie_port *port; 2124 + ktime_t deadline; 2125 + u32 value; 2126 + 2127 + list_for_each_entry(port, &pcie->ports, list) { 2128 + /* 2129 + * "Supported Link Speeds Vector" in "Link Capabilities 2" 2130 + * is not supported by Tegra. tegra_pcie_change_link_speed() 2131 + * is called only for Tegra chips which support Gen2. 2132 + * So there no harm if supported link speed is not verified. 
2133 + */ 2134 + value = readl(port->base + RP_LINK_CONTROL_STATUS_2); 2135 + value &= ~PCI_EXP_LNKSTA_CLS; 2136 + value |= PCI_EXP_LNKSTA_CLS_5_0GB; 2137 + writel(value, port->base + RP_LINK_CONTROL_STATUS_2); 2138 + 2139 + /* 2140 + * Poll until link comes back from recovery to avoid race 2141 + * condition. 2142 + */ 2143 + deadline = ktime_add_us(ktime_get(), LINK_RETRAIN_TIMEOUT); 2144 + 2145 + while (ktime_before(ktime_get(), deadline)) { 2146 + value = readl(port->base + RP_LINK_CONTROL_STATUS); 2147 + if ((value & PCI_EXP_LNKSTA_LT) == 0) 2148 + break; 2149 + 2150 + usleep_range(2000, 3000); 2151 + } 2152 + 2153 + if (value & PCI_EXP_LNKSTA_LT) 2154 + dev_warn(dev, "PCIe port %u link is in recovery\n", 2155 + port->index); 2156 + 2157 + /* Retrain the link */ 2158 + value = readl(port->base + RP_LINK_CONTROL_STATUS); 2159 + value |= PCI_EXP_LNKCTL_RL; 2160 + writel(value, port->base + RP_LINK_CONTROL_STATUS); 2161 + 2162 + deadline = ktime_add_us(ktime_get(), LINK_RETRAIN_TIMEOUT); 2163 + 2164 + while (ktime_before(ktime_get(), deadline)) { 2165 + value = readl(port->base + RP_LINK_CONTROL_STATUS); 2166 + if ((value & PCI_EXP_LNKSTA_LT) == 0) 2167 + break; 2168 + 2169 + usleep_range(2000, 3000); 2170 + } 2171 + 2172 + if (value & PCI_EXP_LNKSTA_LT) 2173 + dev_err(dev, "failed to retrain link of port %u\n", 2174 + port->index); 2175 + } 2176 + } 2177 + 2388 2178 static void tegra_pcie_enable_ports(struct tegra_pcie *pcie) 2389 2179 { 2390 2180 struct device *dev = pcie->dev; ··· 2453 2127 port->index, port->lanes); 2454 2128 2455 2129 tegra_pcie_port_enable(port); 2130 + } 2456 2131 2132 + /* Start LTSSM from Tegra side */ 2133 + reset_control_deassert(pcie->pcie_xrst); 2134 + 2135 + list_for_each_entry_safe(port, tmp, &pcie->ports, list) { 2457 2136 if (tegra_pcie_port_check_link(port)) 2458 2137 continue; 2459 2138 ··· 2467 2136 tegra_pcie_port_disable(port); 2468 2137 tegra_pcie_port_free(port); 2469 2138 } 2139 + 2140 + if (pcie->soc->has_gen2) 2141 + 
tegra_pcie_change_link_speed(pcie); 2470 2142 } 2471 2143 2472 2144 static void tegra_pcie_disable_ports(struct tegra_pcie *pcie) 2473 2145 { 2474 2146 struct tegra_pcie_port *port, *tmp; 2147 + 2148 + reset_control_assert(pcie->pcie_xrst); 2475 2149 2476 2150 list_for_each_entry_safe(port, tmp, &pcie->ports, list) 2477 2151 tegra_pcie_port_disable(port); ··· 2491 2155 .num_ports = 2, 2492 2156 .ports = tegra20_pcie_ports, 2493 2157 .msi_base_shift = 0, 2158 + .afi_pex2_ctrl = 0x128, 2494 2159 .pads_pll_ctl = PADS_PLL_CTL_TEGRA20, 2495 2160 .tx_ref_sel = PADS_PLL_CTL_TXCLKREF_DIV10, 2496 2161 .pads_refclk_cfg0 = 0xfa5cfa5c, ··· 2502 2165 .has_gen2 = false, 2503 2166 .force_pca_enable = false, 2504 2167 .program_uphy = true, 2168 + .update_clamp_threshold = false, 2169 + .program_deskew_time = false, 2170 + .raw_violation_fixup = false, 2171 + .update_fc_timer = false, 2172 + .has_cache_bars = true, 2173 + .ectl.enable = false, 2505 2174 }; 2506 2175 2507 2176 static const struct tegra_pcie_port_soc tegra30_pcie_ports[] = { ··· 2531 2188 .has_gen2 = false, 2532 2189 .force_pca_enable = false, 2533 2190 .program_uphy = true, 2191 + .update_clamp_threshold = false, 2192 + .program_deskew_time = false, 2193 + .raw_violation_fixup = false, 2194 + .update_fc_timer = false, 2195 + .has_cache_bars = false, 2196 + .ectl.enable = false, 2534 2197 }; 2535 2198 2536 2199 static const struct tegra_pcie_soc tegra124_pcie = { ··· 2546 2197 .pads_pll_ctl = PADS_PLL_CTL_TEGRA30, 2547 2198 .tx_ref_sel = PADS_PLL_CTL_TXCLKREF_BUF_EN, 2548 2199 .pads_refclk_cfg0 = 0x44ac44ac, 2200 + /* FC threshold is bit[25:18] */ 2201 + .update_fc_threshold = 0x03fc0000, 2549 2202 .has_pex_clkreq_en = true, 2550 2203 .has_pex_bias_ctrl = true, 2551 2204 .has_intr_prsnt_sense = true, ··· 2555 2204 .has_gen2 = true, 2556 2205 .force_pca_enable = false, 2557 2206 .program_uphy = true, 2207 + .update_clamp_threshold = true, 2208 + .program_deskew_time = false, 2209 + .raw_violation_fixup = true, 2210 + 
.update_fc_timer = false, 2211 + .has_cache_bars = false, 2212 + .ectl.enable = false, 2558 2213 }; 2559 2214 2560 2215 static const struct tegra_pcie_soc tegra210_pcie = { ··· 2570 2213 .pads_pll_ctl = PADS_PLL_CTL_TEGRA30, 2571 2214 .tx_ref_sel = PADS_PLL_CTL_TXCLKREF_BUF_EN, 2572 2215 .pads_refclk_cfg0 = 0x90b890b8, 2216 + /* FC threshold is bit[25:18] */ 2217 + .update_fc_threshold = 0x01800000, 2573 2218 .has_pex_clkreq_en = true, 2574 2219 .has_pex_bias_ctrl = true, 2575 2220 .has_intr_prsnt_sense = true, ··· 2579 2220 .has_gen2 = true, 2580 2221 .force_pca_enable = true, 2581 2222 .program_uphy = true, 2223 + .update_clamp_threshold = true, 2224 + .program_deskew_time = true, 2225 + .raw_violation_fixup = false, 2226 + .update_fc_timer = true, 2227 + .has_cache_bars = false, 2228 + .ectl = { 2229 + .regs = { 2230 + .rp_ectl_2_r1 = 0x0000000f, 2231 + .rp_ectl_4_r1 = 0x00000067, 2232 + .rp_ectl_5_r1 = 0x55010000, 2233 + .rp_ectl_6_r1 = 0x00000001, 2234 + .rp_ectl_2_r2 = 0x0000008f, 2235 + .rp_ectl_4_r2 = 0x000000c7, 2236 + .rp_ectl_5_r2 = 0x55010000, 2237 + .rp_ectl_6_r2 = 0x00000001, 2238 + }, 2239 + .enable = true, 2240 + }, 2582 2241 }; 2583 2242 2584 2243 static const struct tegra_pcie_port_soc tegra186_pcie_ports[] = { ··· 2609 2232 .num_ports = 3, 2610 2233 .ports = tegra186_pcie_ports, 2611 2234 .msi_base_shift = 8, 2235 + .afi_pex2_ctrl = 0x19c, 2612 2236 .pads_pll_ctl = PADS_PLL_CTL_TEGRA30, 2613 2237 .tx_ref_sel = PADS_PLL_CTL_TXCLKREF_BUF_EN, 2614 2238 .pads_refclk_cfg0 = 0x80b880b8, ··· 2621 2243 .has_gen2 = true, 2622 2244 .force_pca_enable = false, 2623 2245 .program_uphy = false, 2246 + .update_clamp_threshold = false, 2247 + .program_deskew_time = false, 2248 + .raw_violation_fixup = false, 2249 + .update_fc_timer = false, 2250 + .has_cache_bars = false, 2251 + .ectl.enable = false, 2624 2252 }; 2625 2253 2626 2254 static const struct of_device_id tegra_pcie_of_match[] = { ··· 2869 2485 { 2870 2486 struct tegra_pcie *pcie = 
dev_get_drvdata(dev); 2871 2487 struct tegra_pcie_port *port; 2488 + int err; 2872 2489 2873 2490 list_for_each_entry(port, &pcie->ports, list) 2874 2491 tegra_pcie_pme_turnoff(port); 2875 2492 2876 2493 tegra_pcie_disable_ports(pcie); 2877 2494 2495 + /* 2496 + * AFI_INTR is unmasked in tegra_pcie_enable_controller(), mask it to 2497 + * avoid unwanted interrupts raised by AFI after pex_rst is asserted. 2498 + */ 2499 + tegra_pcie_disable_interrupts(pcie); 2500 + 2501 + if (pcie->soc->program_uphy) { 2502 + err = tegra_pcie_phy_power_off(pcie); 2503 + if (err < 0) 2504 + dev_err(dev, "failed to power off PHY(s): %d\n", err); 2505 + } 2506 + 2507 + reset_control_assert(pcie->pex_rst); 2508 + clk_disable_unprepare(pcie->pex_clk); 2509 + 2878 2510 if (IS_ENABLED(CONFIG_PCI_MSI)) 2879 2511 tegra_pcie_disable_msi(pcie); 2880 2512 2881 - tegra_pcie_disable_controller(pcie); 2513 + pinctrl_pm_select_idle_state(dev); 2882 2514 tegra_pcie_power_off(pcie); 2883 2515 2884 2516 return 0; ··· 2910 2510 dev_err(dev, "tegra pcie power on fail: %d\n", err); 2911 2511 return err; 2912 2512 } 2913 - err = tegra_pcie_enable_controller(pcie); 2914 - if (err) { 2915 - dev_err(dev, "tegra pcie controller enable fail: %d\n", err); 2513 + 2514 + err = pinctrl_pm_select_default_state(dev); 2515 + if (err < 0) { 2516 + dev_err(dev, "failed to disable PCIe IO DPD: %d\n", err); 2916 2517 goto poweroff; 2917 2518 } 2519 + 2520 + tegra_pcie_enable_controller(pcie); 2918 2521 tegra_pcie_setup_translations(pcie); 2919 2522 2920 2523 if (IS_ENABLED(CONFIG_PCI_MSI)) 2921 2524 tegra_pcie_enable_msi(pcie); 2922 2525 2526 + err = clk_prepare_enable(pcie->pex_clk); 2527 + if (err) { 2528 + dev_err(dev, "failed to enable PEX clock: %d\n", err); 2529 + goto pex_dpd_enable; 2530 + } 2531 + 2532 + reset_control_deassert(pcie->pex_rst); 2533 + 2534 + if (pcie->soc->program_uphy) { 2535 + err = tegra_pcie_phy_power_on(pcie); 2536 + if (err < 0) { 2537 + dev_err(dev, "failed to power on PHY(s): %d\n", err); 
2538 + goto disable_pex_clk; 2539 + } 2540 + } 2541 + 2542 + tegra_pcie_apply_pad_settings(pcie); 2923 2543 tegra_pcie_enable_ports(pcie); 2924 2544 2925 2545 return 0; 2926 2546 2547 + disable_pex_clk: 2548 + reset_control_assert(pcie->pex_rst); 2549 + clk_disable_unprepare(pcie->pex_clk); 2550 + pex_dpd_enable: 2551 + pinctrl_pm_select_idle_state(dev); 2927 2552 poweroff: 2928 2553 tegra_pcie_power_off(pcie); 2929 2554
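The resume path above builds a strict bring-up order (pinctrl state, controller, MSI, PEX clock, reset, PHY) and unwinds the completed steps in reverse when a later step fails, via the `disable_pex_clk`/`pex_dpd_enable`/`poweroff` labels. A minimal userspace sketch of that goto-unwind idiom, with all function names hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Trace buffer: each init/teardown step appends one letter, so the
 * order of operations can be checked after the fact. */
static char order[16];

static int  enable_clk(void)     { strcat(order, "C"); return 0; }
static void disable_clk(void)    { strcat(order, "c"); }
static int  deassert_reset(void) { strcat(order, "R"); return 0; }
static void assert_reset(void)   { strcat(order, "r"); }
static int  phy_power_on(int ok) { if (!ok) return -1; strcat(order, "P"); return 0; }

/* Mirrors the shape of the resume path: a failing step jumps to the
 * label that undoes only the steps which already succeeded, in
 * reverse order. */
static int bring_up(int phy_ok)
{
	int err;

	err = enable_clk();
	if (err < 0)
		return err;

	err = deassert_reset();
	if (err < 0)
		goto err_clk;

	err = phy_power_on(phy_ok);
	if (err < 0)
		goto err_reset;	/* unwind from the most recent step */

	return 0;

err_reset:
	assert_reset();
err_clk:
	disable_clk();
	return err;
}

/* Helper so each scenario starts from an empty trace. */
static const char *run(int phy_ok)
{
	order[0] = '\0';
	bring_up(phy_ok);
	return order;
}
```

On success the trace is `CRP`; when the PHY step fails, the ladder falls through the labels and yields `CRrc`, i.e. teardown in exactly the reverse of bring-up order.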
drivers/pci/controller/pcie-altera-msi.c (+10)
··· 10 10 #include <linux/interrupt.h> 11 11 #include <linux/irqchip/chained_irq.h> 12 12 #include <linux/init.h> 13 + #include <linux/module.h> 13 14 #include <linux/msi.h> 14 15 #include <linux/of_address.h> 15 16 #include <linux/of_irq.h> ··· 289 288 { 290 289 return platform_driver_register(&altera_msi_driver); 291 290 } 291 + 292 + static void __exit altera_msi_exit(void) 293 + { 294 + platform_driver_unregister(&altera_msi_driver); 295 + } 296 + 292 297 subsys_initcall(altera_msi_init); 298 + MODULE_DEVICE_TABLE(of, altera_msi_of_match); 299 + module_exit(altera_msi_exit); 300 + MODULE_LICENSE("GPL v2");
drivers/pci/controller/pcie-altera.c (+53, -16)
··· 10 10 #include <linux/interrupt.h> 11 11 #include <linux/irqchip/chained_irq.h> 12 12 #include <linux/init.h> 13 + #include <linux/module.h> 13 14 #include <linux/of_address.h> 14 15 #include <linux/of_device.h> 15 16 #include <linux/of_irq.h> ··· 44 43 #define S10_RP_RXCPL_STATUS 0x200C 45 44 #define S10_RP_CFG_ADDR(pcie, reg) \ 46 45 (((pcie)->hip_base) + (reg) + (1 << 20)) 46 + #define S10_RP_SECONDARY(pcie) \ 47 + readb(S10_RP_CFG_ADDR(pcie, PCI_SECONDARY_BUS)) 47 48 48 49 /* TLP configuration type 0 and 1 */ 49 50 #define TLP_FMTTYPE_CFGRD0 0x04 /* Configuration Read Type 0 */ ··· 57 54 #define TLP_WRITE_TAG 0x10 58 55 #define RP_DEVFN 0 59 56 #define TLP_REQ_ID(bus, devfn) (((bus) << 8) | (devfn)) 60 - #define TLP_CFGRD_DW0(pcie, bus) \ 61 - ((((bus == pcie->root_bus_nr) ? pcie->pcie_data->cfgrd0 \ 62 - : pcie->pcie_data->cfgrd1) << 24) | \ 63 - TLP_PAYLOAD_SIZE) 64 - #define TLP_CFGWR_DW0(pcie, bus) \ 65 - ((((bus == pcie->root_bus_nr) ? pcie->pcie_data->cfgwr0 \ 66 - : pcie->pcie_data->cfgwr1) << 24) | \ 67 - TLP_PAYLOAD_SIZE) 57 + #define TLP_CFG_DW0(pcie, cfg) \ 58 + (((cfg) << 24) | \ 59 + TLP_PAYLOAD_SIZE) 68 60 #define TLP_CFG_DW1(pcie, tag, be) \ 69 61 (((TLP_REQ_ID(pcie->root_bus_nr, RP_DEVFN)) << 16) | (tag << 8) | (be)) 70 62 #define TLP_CFG_DW2(bus, devfn, offset) \ ··· 319 321 s10_tlp_write_tx(pcie, data, RP_TX_EOP); 320 322 } 321 323 324 + static void get_tlp_header(struct altera_pcie *pcie, u8 bus, u32 devfn, 325 + int where, u8 byte_en, bool read, u32 *headers) 326 + { 327 + u8 cfg; 328 + u8 cfg0 = read ? pcie->pcie_data->cfgrd0 : pcie->pcie_data->cfgwr0; 329 + u8 cfg1 = read ? pcie->pcie_data->cfgrd1 : pcie->pcie_data->cfgwr1; 330 + u8 tag = read ? TLP_READ_TAG : TLP_WRITE_TAG; 331 + 332 + if (pcie->pcie_data->version == ALTERA_PCIE_V1) 333 + cfg = (bus == pcie->root_bus_nr) ? cfg0 : cfg1; 334 + else 335 + cfg = (bus > S10_RP_SECONDARY(pcie)) ? 
cfg0 : cfg1; 336 + 337 + headers[0] = TLP_CFG_DW0(pcie, cfg); 338 + headers[1] = TLP_CFG_DW1(pcie, tag, byte_en); 339 + headers[2] = TLP_CFG_DW2(bus, devfn, where); 340 + } 341 + 322 342 static int tlp_cfg_dword_read(struct altera_pcie *pcie, u8 bus, u32 devfn, 323 343 int where, u8 byte_en, u32 *value) 324 344 { 325 345 u32 headers[TLP_HDR_SIZE]; 326 346 327 - headers[0] = TLP_CFGRD_DW0(pcie, bus); 328 - headers[1] = TLP_CFG_DW1(pcie, TLP_READ_TAG, byte_en); 329 - headers[2] = TLP_CFG_DW2(bus, devfn, where); 347 + get_tlp_header(pcie, bus, devfn, where, byte_en, true, 348 + headers); 330 349 331 350 pcie->pcie_data->ops->tlp_write_pkt(pcie, headers, 0, false); 332 351 ··· 356 341 u32 headers[TLP_HDR_SIZE]; 357 342 int ret; 358 343 359 - headers[0] = TLP_CFGWR_DW0(pcie, bus); 360 - headers[1] = TLP_CFG_DW1(pcie, TLP_WRITE_TAG, byte_en); 361 - headers[2] = TLP_CFG_DW2(bus, devfn, where); 344 + get_tlp_header(pcie, bus, devfn, where, byte_en, false, 345 + headers); 362 346 363 347 /* check alignment to Qword */ 364 348 if ((where & 0x7) == 0) ··· 719 705 return 0; 720 706 } 721 707 708 + static void altera_pcie_irq_teardown(struct altera_pcie *pcie) 709 + { 710 + irq_set_chained_handler_and_data(pcie->irq, NULL, NULL); 711 + irq_domain_remove(pcie->irq_domain); 712 + irq_dispose_mapping(pcie->irq); 713 + } 714 + 722 715 static int altera_pcie_parse_dt(struct altera_pcie *pcie) 723 716 { 724 717 struct device *dev = &pcie->pdev->dev; ··· 819 798 820 799 pcie = pci_host_bridge_priv(bridge); 821 800 pcie->pdev = pdev; 801 + platform_set_drvdata(pdev, pcie); 822 802 823 803 match = of_match_device(altera_pcie_of_match, &pdev->dev); 824 804 if (!match) ··· 877 855 return ret; 878 856 } 879 857 858 + static int altera_pcie_remove(struct platform_device *pdev) 859 + { 860 + struct altera_pcie *pcie = platform_get_drvdata(pdev); 861 + struct pci_host_bridge *bridge = pci_host_bridge_from_priv(pcie); 862 + 863 + pci_stop_root_bus(bridge->bus); 864 + 
pci_remove_root_bus(bridge->bus); 865 + pci_free_resource_list(&pcie->resources); 866 + altera_pcie_irq_teardown(pcie); 867 + 868 + return 0; 869 + } 870 + 880 871 static struct platform_driver altera_pcie_driver = { 881 872 .probe = altera_pcie_probe, 873 + .remove = altera_pcie_remove, 882 874 .driver = { 883 875 .name = "altera-pcie", 884 876 .of_match_table = altera_pcie_of_match, 885 - .suppress_bind_attrs = true, 886 877 }, 887 878 }; 888 879 889 - builtin_platform_driver(altera_pcie_driver); 880 + MODULE_DEVICE_TABLE(of, altera_pcie_of_match); 881 + module_platform_driver(altera_pcie_driver); 882 + MODULE_LICENSE("GPL v2");
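The Altera rework above replaces the duplicated `TLP_CFGRD_DW0`/`TLP_CFGWR_DW0` macros with `get_tlp_header()`, which picks the config-TLP fmt/type by comparing the bus number against the root bus (V1) or against the root port's secondary bus read from config space (Stratix 10). A sketch of just that selection logic; only the 0x04 code appears in the excerpt, the other three values are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

enum altera_pcie_version { ALTERA_PCIE_V1, ALTERA_PCIE_V2 };

/* Config TLP fmt/type codes. 0x04 is shown in the diff above; the
 * remaining values are assumed for this sketch. */
#define TLP_FMTTYPE_CFGRD0 0x04	/* Config Read,  Type 0 */
#define TLP_FMTTYPE_CFGRD1 0x05	/* Config Read,  Type 1 (assumed) */
#define TLP_FMTTYPE_CFGWR0 0x44	/* Config Write, Type 0 (assumed) */
#define TLP_FMTTYPE_CFGWR1 0x45	/* Config Write, Type 1 (assumed) */

/* Mirrors the cfg selection in get_tlp_header(): V1 keys off the root
 * bus number, while the Stratix 10 variant compares against the root
 * port's secondary bus number (S10_RP_SECONDARY() in the real driver). */
static unsigned char tlp_cfg_type(enum altera_pcie_version version,
				  unsigned char bus,
				  unsigned char root_bus,
				  unsigned char secondary_bus,
				  bool read)
{
	unsigned char cfg0 = read ? TLP_FMTTYPE_CFGRD0 : TLP_FMTTYPE_CFGWR0;
	unsigned char cfg1 = read ? TLP_FMTTYPE_CFGRD1 : TLP_FMTTYPE_CFGWR1;

	if (version == ALTERA_PCIE_V1)
		return (bus == root_bus) ? cfg0 : cfg1;

	return (bus > secondary_bus) ? cfg0 : cfg1;
}
```

Centralizing the choice in one helper means the read and write paths can no longer drift apart, which is the point of the refactor.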
drivers/pci/controller/pcie-iproc-platform.c (+1, -1)
··· 87 87 88 88 /* 89 89 * DT nodes are not used by all platforms that use the iProc PCIe 90 - * core driver. For platforms that require explict inbound mapping 90 + * core driver. For platforms that require explicit inbound mapping 91 91 * configuration, "dma-ranges" would have been present in DT 92 92 */ 93 93 pcie->need_ib_cfg = of_property_read_bool(np, "dma-ranges");
drivers/pci/controller/pcie-iproc.c (+1, -1)
··· 163 163 * @size_unit: inbound mapping region size unit, could be SZ_1K, SZ_1M, or 164 164 * SZ_1G 165 165 * @region_sizes: list of supported inbound mapping region sizes in KB, MB, or 166 - * GB, depedning on the size unit 166 + * GB, depending on the size unit 167 167 * @nr_sizes: number of supported inbound mapping region sizes 168 168 * @nr_windows: number of supported inbound mapping windows for the region 169 169 * @imap_addr_offset: register offset between the upper and lower 32-bit
+311 -212
drivers/pci/controller/pcie-mobiveil.c
··· 31 31 * translation tables are grouped into windows, each window registers are 32 32 * grouped into blocks of 4 or 16 registers each 33 33 */ 34 - #define PAB_REG_BLOCK_SIZE 16 35 - #define PAB_EXT_REG_BLOCK_SIZE 4 34 + #define PAB_REG_BLOCK_SIZE 16 35 + #define PAB_EXT_REG_BLOCK_SIZE 4 36 36 37 - #define PAB_REG_ADDR(offset, win) (offset + (win * PAB_REG_BLOCK_SIZE)) 38 - #define PAB_EXT_REG_ADDR(offset, win) (offset + (win * PAB_EXT_REG_BLOCK_SIZE)) 37 + #define PAB_REG_ADDR(offset, win) \ 38 + (offset + (win * PAB_REG_BLOCK_SIZE)) 39 + #define PAB_EXT_REG_ADDR(offset, win) \ 40 + (offset + (win * PAB_EXT_REG_BLOCK_SIZE)) 39 41 40 - #define LTSSM_STATUS 0x0404 41 - #define LTSSM_STATUS_L0_MASK 0x3f 42 - #define LTSSM_STATUS_L0 0x2d 42 + #define LTSSM_STATUS 0x0404 43 + #define LTSSM_STATUS_L0_MASK 0x3f 44 + #define LTSSM_STATUS_L0 0x2d 43 45 44 - #define PAB_CTRL 0x0808 45 - #define AMBA_PIO_ENABLE_SHIFT 0 46 - #define PEX_PIO_ENABLE_SHIFT 1 47 - #define PAGE_SEL_SHIFT 13 48 - #define PAGE_SEL_MASK 0x3f 49 - #define PAGE_LO_MASK 0x3ff 50 - #define PAGE_SEL_EN 0xc00 51 - #define PAGE_SEL_OFFSET_SHIFT 10 46 + #define PAB_CTRL 0x0808 47 + #define AMBA_PIO_ENABLE_SHIFT 0 48 + #define PEX_PIO_ENABLE_SHIFT 1 49 + #define PAGE_SEL_SHIFT 13 50 + #define PAGE_SEL_MASK 0x3f 51 + #define PAGE_LO_MASK 0x3ff 52 + #define PAGE_SEL_OFFSET_SHIFT 10 52 53 53 - #define PAB_AXI_PIO_CTRL 0x0840 54 - #define APIO_EN_MASK 0xf 54 + #define PAB_AXI_PIO_CTRL 0x0840 55 + #define APIO_EN_MASK 0xf 55 56 56 - #define PAB_PEX_PIO_CTRL 0x08c0 57 - #define PIO_ENABLE_SHIFT 0 57 + #define PAB_PEX_PIO_CTRL 0x08c0 58 + #define PIO_ENABLE_SHIFT 0 58 59 59 60 #define PAB_INTP_AMBA_MISC_ENB 0x0b0c 60 - #define PAB_INTP_AMBA_MISC_STAT 0x0b1c 61 + #define PAB_INTP_AMBA_MISC_STAT 0x0b1c 61 62 #define PAB_INTP_INTX_MASK 0x01e0 62 63 #define PAB_INTP_MSI_MASK 0x8 63 64 64 - #define PAB_AXI_AMAP_CTRL(win) PAB_REG_ADDR(0x0ba0, win) 65 - #define WIN_ENABLE_SHIFT 0 66 - #define WIN_TYPE_SHIFT 1 65 + 
#define PAB_AXI_AMAP_CTRL(win) PAB_REG_ADDR(0x0ba0, win) 66 + #define WIN_ENABLE_SHIFT 0 67 + #define WIN_TYPE_SHIFT 1 68 + #define WIN_TYPE_MASK 0x3 69 + #define WIN_SIZE_MASK 0xfffffc00 67 70 68 71 #define PAB_EXT_AXI_AMAP_SIZE(win) PAB_EXT_REG_ADDR(0xbaf0, win) 69 72 73 + #define PAB_EXT_AXI_AMAP_AXI_WIN(win) PAB_EXT_REG_ADDR(0x80a0, win) 70 74 #define PAB_AXI_AMAP_AXI_WIN(win) PAB_REG_ADDR(0x0ba4, win) 71 75 #define AXI_WINDOW_ALIGN_MASK 3 72 76 73 77 #define PAB_AXI_AMAP_PEX_WIN_L(win) PAB_REG_ADDR(0x0ba8, win) 74 - #define PAB_BUS_SHIFT 24 75 - #define PAB_DEVICE_SHIFT 19 76 - #define PAB_FUNCTION_SHIFT 16 78 + #define PAB_BUS_SHIFT 24 79 + #define PAB_DEVICE_SHIFT 19 80 + #define PAB_FUNCTION_SHIFT 16 77 81 78 82 #define PAB_AXI_AMAP_PEX_WIN_H(win) PAB_REG_ADDR(0x0bac, win) 79 83 #define PAB_INTP_AXI_PIO_CLASS 0x474 80 84 81 - #define PAB_PEX_AMAP_CTRL(win) PAB_REG_ADDR(0x4ba0, win) 82 - #define AMAP_CTRL_EN_SHIFT 0 83 - #define AMAP_CTRL_TYPE_SHIFT 1 85 + #define PAB_PEX_AMAP_CTRL(win) PAB_REG_ADDR(0x4ba0, win) 86 + #define AMAP_CTRL_EN_SHIFT 0 87 + #define AMAP_CTRL_TYPE_SHIFT 1 88 + #define AMAP_CTRL_TYPE_MASK 3 84 89 85 90 #define PAB_EXT_PEX_AMAP_SIZEN(win) PAB_EXT_REG_ADDR(0xbef0, win) 86 91 #define PAB_PEX_AMAP_AXI_WIN(win) PAB_REG_ADDR(0x4ba4, win) ··· 93 88 #define PAB_PEX_AMAP_PEX_WIN_H(win) PAB_REG_ADDR(0x4bac, win) 94 89 95 90 /* starting offset of INTX bits in status register */ 96 - #define PAB_INTX_START 5 91 + #define PAB_INTX_START 5 97 92 98 93 /* supported number of MSI interrupts */ 99 - #define PCI_NUM_MSI 16 94 + #define PCI_NUM_MSI 16 100 95 101 96 /* MSI registers */ 102 - #define MSI_BASE_LO_OFFSET 0x04 103 - #define MSI_BASE_HI_OFFSET 0x08 104 - #define MSI_SIZE_OFFSET 0x0c 105 - #define MSI_ENABLE_OFFSET 0x14 106 - #define MSI_STATUS_OFFSET 0x18 107 - #define MSI_DATA_OFFSET 0x20 108 - #define MSI_ADDR_L_OFFSET 0x24 109 - #define MSI_ADDR_H_OFFSET 0x28 97 + #define MSI_BASE_LO_OFFSET 0x04 98 + #define MSI_BASE_HI_OFFSET 0x08 99 + 
#define MSI_SIZE_OFFSET 0x0c 100 + #define MSI_ENABLE_OFFSET 0x14 101 + #define MSI_STATUS_OFFSET 0x18 102 + #define MSI_DATA_OFFSET 0x20 103 + #define MSI_ADDR_L_OFFSET 0x24 104 + #define MSI_ADDR_H_OFFSET 0x28 110 105 111 106 /* outbound and inbound window definitions */ 112 - #define WIN_NUM_0 0 113 - #define WIN_NUM_1 1 114 - #define CFG_WINDOW_TYPE 0 115 - #define IO_WINDOW_TYPE 1 116 - #define MEM_WINDOW_TYPE 2 117 - #define IB_WIN_SIZE ((u64)256 * 1024 * 1024 * 1024) 118 - #define MAX_PIO_WINDOWS 8 107 + #define WIN_NUM_0 0 108 + #define WIN_NUM_1 1 109 + #define CFG_WINDOW_TYPE 0 110 + #define IO_WINDOW_TYPE 1 111 + #define MEM_WINDOW_TYPE 2 112 + #define IB_WIN_SIZE ((u64)256 * 1024 * 1024 * 1024) 113 + #define MAX_PIO_WINDOWS 8 119 114 120 115 /* Parameters for the waiting for link up routine */ 121 - #define LINK_WAIT_MAX_RETRIES 10 122 - #define LINK_WAIT_MIN 90000 123 - #define LINK_WAIT_MAX 100000 116 + #define LINK_WAIT_MAX_RETRIES 10 117 + #define LINK_WAIT_MIN 90000 118 + #define LINK_WAIT_MAX 100000 119 + 120 + #define PAGED_ADDR_BNDRY 0xc00 121 + #define OFFSET_TO_PAGE_ADDR(off) \ 122 + ((off & PAGE_LO_MASK) | PAGED_ADDR_BNDRY) 123 + #define OFFSET_TO_PAGE_IDX(off) \ 124 + ((off >> PAGE_SEL_OFFSET_SHIFT) & PAGE_SEL_MASK) 124 125 125 126 struct mobiveil_msi { /* MSI information */ 126 127 struct mutex lock; /* protect bitmap variable */ ··· 156 145 struct mobiveil_msi msi; 157 146 }; 158 147 159 - static inline void csr_writel(struct mobiveil_pcie *pcie, const u32 value, 160 - const u32 reg) 148 + /* 149 + * mobiveil_pcie_sel_page - routine to access paged registers 150 + * 151 + * Registers whose address is greater than PAGED_ADDR_BNDRY (0xc00) are 152 + * paged. For this scheme to work, the upper 6 bits of the offset are 153 + * written to the pg_sel field of the PAB_CTRL register, and the lower 10 154 + * bits, ORed with PAGED_ADDR_BNDRY, are used as the register offset.
155 + */ 156 + static void mobiveil_pcie_sel_page(struct mobiveil_pcie *pcie, u8 pg_idx) 161 157 { 162 - writel_relaxed(value, pcie->csr_axi_slave_base + reg); 158 + u32 val; 159 + 160 + val = readl(pcie->csr_axi_slave_base + PAB_CTRL); 161 + val &= ~(PAGE_SEL_MASK << PAGE_SEL_SHIFT); 162 + val |= (pg_idx & PAGE_SEL_MASK) << PAGE_SEL_SHIFT; 163 + 164 + writel(val, pcie->csr_axi_slave_base + PAB_CTRL); 163 165 } 164 166 165 - static inline u32 csr_readl(struct mobiveil_pcie *pcie, const u32 reg) 167 + static void *mobiveil_pcie_comp_addr(struct mobiveil_pcie *pcie, u32 off) 166 168 { 167 - return readl_relaxed(pcie->csr_axi_slave_base + reg); 169 + if (off < PAGED_ADDR_BNDRY) { 170 + /* For directly accessed registers, clear the pg_sel field */ 171 + mobiveil_pcie_sel_page(pcie, 0); 172 + return pcie->csr_axi_slave_base + off; 173 + } 174 + 175 + mobiveil_pcie_sel_page(pcie, OFFSET_TO_PAGE_IDX(off)); 176 + return pcie->csr_axi_slave_base + OFFSET_TO_PAGE_ADDR(off); 177 + } 178 + 179 + static int mobiveil_pcie_read(void __iomem *addr, int size, u32 *val) 180 + { 181 + if ((uintptr_t)addr & (size - 1)) { 182 + *val = 0; 183 + return PCIBIOS_BAD_REGISTER_NUMBER; 184 + } 185 + 186 + switch (size) { 187 + case 4: 188 + *val = readl(addr); 189 + break; 190 + case 2: 191 + *val = readw(addr); 192 + break; 193 + case 1: 194 + *val = readb(addr); 195 + break; 196 + default: 197 + *val = 0; 198 + return PCIBIOS_BAD_REGISTER_NUMBER; 199 + } 200 + 201 + return PCIBIOS_SUCCESSFUL; 202 + } 203 + 204 + static int mobiveil_pcie_write(void __iomem *addr, int size, u32 val) 205 + { 206 + if ((uintptr_t)addr & (size - 1)) 207 + return PCIBIOS_BAD_REGISTER_NUMBER; 208 + 209 + switch (size) { 210 + case 4: 211 + writel(val, addr); 212 + break; 213 + case 2: 214 + writew(val, addr); 215 + break; 216 + case 1: 217 + writeb(val, addr); 218 + break; 219 + default: 220 + return PCIBIOS_BAD_REGISTER_NUMBER; 221 + } 222 + 223 + return PCIBIOS_SUCCESSFUL; 224 + } 225 + 226 + static u32 
csr_read(struct mobiveil_pcie *pcie, u32 off, size_t size) 227 + { 228 + void *addr; 229 + u32 val; 230 + int ret; 231 + 232 + addr = mobiveil_pcie_comp_addr(pcie, off); 233 + 234 + ret = mobiveil_pcie_read(addr, size, &val); 235 + if (ret) 236 + dev_err(&pcie->pdev->dev, "read CSR address failed\n"); 237 + 238 + return val; 239 + } 240 + 241 + static void csr_write(struct mobiveil_pcie *pcie, u32 val, u32 off, size_t size) 242 + { 243 + void *addr; 244 + int ret; 245 + 246 + addr = mobiveil_pcie_comp_addr(pcie, off); 247 + 248 + ret = mobiveil_pcie_write(addr, size, val); 249 + if (ret) 250 + dev_err(&pcie->pdev->dev, "write CSR address failed\n"); 251 + } 252 + 253 + static u32 csr_readl(struct mobiveil_pcie *pcie, u32 off) 254 + { 255 + return csr_read(pcie, off, 0x4); 256 + } 257 + 258 + static void csr_writel(struct mobiveil_pcie *pcie, u32 val, u32 off) 259 + { 260 + csr_write(pcie, val, off, 0x4); 168 261 } 169 262 170 263 static bool mobiveil_pcie_link_up(struct mobiveil_pcie *pcie) ··· 289 174 * Do not read more than one device on the bus directly 290 175 * attached to RC 291 176 */ 292 - if ((bus->primary == pcie->root_bus_nr) && (devfn > 0)) 177 + if ((bus->primary == pcie->root_bus_nr) && (PCI_SLOT(devfn) > 0)) 293 178 return false; 294 179 295 180 return true; ··· 300 185 * root port or endpoint 301 186 */ 302 187 static void __iomem *mobiveil_pcie_map_bus(struct pci_bus *bus, 303 - unsigned int devfn, int where) 188 + unsigned int devfn, int where) 304 189 { 305 190 struct mobiveil_pcie *pcie = bus->sysdata; 191 + u32 value; 306 192 307 193 if (!mobiveil_pcie_valid_device(bus, devfn)) 308 194 return NULL; 309 195 310 - if (bus->number == pcie->root_bus_nr) { 311 - /* RC config access */ 196 + /* RC config access */ 197 + if (bus->number == pcie->root_bus_nr) 312 198 return pcie->csr_axi_slave_base + where; 313 - } 314 199 315 200 /* 316 201 * EP config access (in Config/APIO space) ··· 318 203 * (BDF) in PAB_AXI_AMAP_PEX_WIN_L0 Register. 
319 204 * Relies on pci_lock serialization 320 205 */ 321 - csr_writel(pcie, bus->number << PAB_BUS_SHIFT | 322 - PCI_SLOT(devfn) << PAB_DEVICE_SHIFT | 323 - PCI_FUNC(devfn) << PAB_FUNCTION_SHIFT, 324 - PAB_AXI_AMAP_PEX_WIN_L(WIN_NUM_0)); 206 + value = bus->number << PAB_BUS_SHIFT | 207 + PCI_SLOT(devfn) << PAB_DEVICE_SHIFT | 208 + PCI_FUNC(devfn) << PAB_FUNCTION_SHIFT; 209 + 210 + csr_writel(pcie, value, PAB_AXI_AMAP_PEX_WIN_L(WIN_NUM_0)); 211 + 325 212 return pcie->config_axi_slave_base + where; 326 213 } 327 214 ··· 358 241 359 242 /* Handle INTx */ 360 243 if (intr_status & PAB_INTP_INTX_MASK) { 361 - shifted_status = csr_readl(pcie, PAB_INTP_AMBA_MISC_STAT) >> 362 - PAB_INTX_START; 244 + shifted_status = csr_readl(pcie, PAB_INTP_AMBA_MISC_STAT); 245 + shifted_status &= PAB_INTP_INTX_MASK; 246 + shifted_status >>= PAB_INTX_START; 363 247 do { 364 248 for_each_set_bit(bit, &shifted_status, PCI_NUM_INTX) { 365 249 virq = irq_find_mapping(pcie->intx_domain, 366 - bit + 1); 250 + bit + 1); 367 251 if (virq) 368 252 generic_handle_irq(virq); 369 253 else 370 - dev_err_ratelimited(dev, 371 - "unexpected IRQ, INT%d\n", bit); 254 + dev_err_ratelimited(dev, "unexpected IRQ, INT%d\n", 255 + bit); 372 256 373 - /* clear interrupt */ 374 - csr_writel(pcie, 375 - shifted_status << PAB_INTX_START, 376 - PAB_INTP_AMBA_MISC_STAT); 257 + /* clear interrupt handled */ 258 + csr_writel(pcie, 1 << (PAB_INTX_START + bit), 259 + PAB_INTP_AMBA_MISC_STAT); 377 260 } 378 - } while ((shifted_status >> PAB_INTX_START) != 0); 261 + 262 + shifted_status = csr_readl(pcie, 263 + PAB_INTP_AMBA_MISC_STAT); 264 + shifted_status &= PAB_INTP_INTX_MASK; 265 + shifted_status >>= PAB_INTX_START; 266 + } while (shifted_status != 0); 379 267 } 380 268 381 269 /* read extra MSI status register */ ··· 388 266 389 267 /* handle MSI interrupts */ 390 268 while (msi_status & 1) { 391 - msi_data = readl_relaxed(pcie->apb_csr_base 392 - + MSI_DATA_OFFSET); 269 + msi_data = readl_relaxed(pcie->apb_csr_base + 
MSI_DATA_OFFSET); 393 270 394 271 /* 395 272 * MSI_STATUS_OFFSET register gets updated to zero ··· 397 276 * two dummy reads. 398 277 */ 399 278 msi_addr_lo = readl_relaxed(pcie->apb_csr_base + 400 - MSI_ADDR_L_OFFSET); 279 + MSI_ADDR_L_OFFSET); 401 280 msi_addr_hi = readl_relaxed(pcie->apb_csr_base + 402 - MSI_ADDR_H_OFFSET); 281 + MSI_ADDR_H_OFFSET); 403 282 dev_dbg(dev, "MSI registers, data: %08x, addr: %08x:%08x\n", 404 - msi_data, msi_addr_hi, msi_addr_lo); 283 + msi_data, msi_addr_hi, msi_addr_lo); 405 284 406 285 virq = irq_find_mapping(msi->dev_domain, msi_data); 407 286 if (virq) 408 287 generic_handle_irq(virq); 409 288 410 289 msi_status = readl_relaxed(pcie->apb_csr_base + 411 - MSI_STATUS_OFFSET); 290 + MSI_STATUS_OFFSET); 412 291 } 413 292 414 293 /* Clear the interrupt status */ ··· 425 304 426 305 /* map config resource */ 427 306 res = platform_get_resource_byname(pdev, IORESOURCE_MEM, 428 - "config_axi_slave"); 307 + "config_axi_slave"); 429 308 pcie->config_axi_slave_base = devm_pci_remap_cfg_resource(dev, res); 430 309 if (IS_ERR(pcie->config_axi_slave_base)) 431 310 return PTR_ERR(pcie->config_axi_slave_base); ··· 433 312 434 313 /* map csr resource */ 435 314 res = platform_get_resource_byname(pdev, IORESOURCE_MEM, 436 - "csr_axi_slave"); 315 + "csr_axi_slave"); 437 316 pcie->csr_axi_slave_base = devm_pci_remap_cfg_resource(dev, res); 438 317 if (IS_ERR(pcie->csr_axi_slave_base)) 439 318 return PTR_ERR(pcie->csr_axi_slave_base); ··· 458 337 return -ENODEV; 459 338 } 460 339 461 - irq_set_chained_handler_and_data(pcie->irq, mobiveil_pcie_isr, pcie); 462 - 463 340 return 0; 464 341 } 465 342 466 - /* 467 - * select_paged_register - routine to access paged register of root complex 468 - * 469 - * registers of RC are paged, for this scheme to work 470 - * extracted higher 6 bits of the offset will be written to pg_sel 471 - * field of PAB_CTRL register and rest of the lower 10 bits enabled with 472 - * PAGE_SEL_EN are used as offset of the 
register.
 473    -  */
 474    - static void select_paged_register(struct mobiveil_pcie *pcie, u32 offset)
 475    - {
 476    -     int pab_ctrl_dw, pg_sel;
 477    -
 478    -     /* clear pg_sel field */
 479    -     pab_ctrl_dw = csr_readl(pcie, PAB_CTRL);
 480    -     pab_ctrl_dw = (pab_ctrl_dw & ~(PAGE_SEL_MASK << PAGE_SEL_SHIFT));
 481    -
 482    -     /* set pg_sel field */
 483    -     pg_sel = (offset >> PAGE_SEL_OFFSET_SHIFT) & PAGE_SEL_MASK;
 484    -     pab_ctrl_dw |= ((pg_sel << PAGE_SEL_SHIFT));
 485    -     csr_writel(pcie, pab_ctrl_dw, PAB_CTRL);
 486    - }
 487    -
 488    - static void write_paged_register(struct mobiveil_pcie *pcie,
 489    -                                  u32 val, u32 offset)
 490    - {
 491    -     u32 off = (offset & PAGE_LO_MASK) | PAGE_SEL_EN;
 492    -
 493    -     select_paged_register(pcie, offset);
 494    -     csr_writel(pcie, val, off);
 495    - }
 496    -
 497    - static u32 read_paged_register(struct mobiveil_pcie *pcie, u32 offset)
 498    - {
 499    -     u32 off = (offset & PAGE_LO_MASK) | PAGE_SEL_EN;
 500    -
 501    -     select_paged_register(pcie, offset);
 502    -     return csr_readl(pcie, off);
 503    - }
 504    -
 505  343  static void program_ib_windows(struct mobiveil_pcie *pcie, int win_num,
 506    -                                int pci_addr, u32 type, u64 size)
      344 +                               u64 pci_addr, u32 type, u64 size)
 507  345  {
 508    -     int pio_ctrl_val;
 509    -     int amap_ctrl_dw;
      346 +    u32 value;
 510  347      u64 size64 = ~(size - 1);
 511  348
 512    -     if ((pcie->ib_wins_configured + 1) > pcie->ppio_wins) {
      349 +    if (win_num >= pcie->ppio_wins) {
 513  350          dev_err(&pcie->pdev->dev,
 514  351              "ERROR: max inbound windows reached !\n");
 515  352          return;
 516  353      }
 517  354
 518    -     pio_ctrl_val = csr_readl(pcie, PAB_PEX_PIO_CTRL);
 519    -     csr_writel(pcie,
 520    -                pio_ctrl_val | (1 << PIO_ENABLE_SHIFT), PAB_PEX_PIO_CTRL);
 521    -     amap_ctrl_dw = read_paged_register(pcie, PAB_PEX_AMAP_CTRL(win_num));
 522    -     amap_ctrl_dw = (amap_ctrl_dw | (type << AMAP_CTRL_TYPE_SHIFT));
 523    -     amap_ctrl_dw = (amap_ctrl_dw | (1 << AMAP_CTRL_EN_SHIFT));
      355 +    value = csr_readl(pcie, PAB_PEX_AMAP_CTRL(win_num));
      356 +    value &= ~(AMAP_CTRL_TYPE_MASK << AMAP_CTRL_TYPE_SHIFT | WIN_SIZE_MASK);
      357 +    value |= type << AMAP_CTRL_TYPE_SHIFT | 1 << AMAP_CTRL_EN_SHIFT |
      358 +             (lower_32_bits(size64) & WIN_SIZE_MASK);
      359 +    csr_writel(pcie, value, PAB_PEX_AMAP_CTRL(win_num));
 524  360
 525    -     write_paged_register(pcie, amap_ctrl_dw | lower_32_bits(size64),
 526    -                          PAB_PEX_AMAP_CTRL(win_num));
      361 +    csr_writel(pcie, upper_32_bits(size64),
      362 +               PAB_EXT_PEX_AMAP_SIZEN(win_num));
 527  363
 528    -     write_paged_register(pcie, upper_32_bits(size64),
 529    -                          PAB_EXT_PEX_AMAP_SIZEN(win_num));
      364 +    csr_writel(pcie, pci_addr, PAB_PEX_AMAP_AXI_WIN(win_num));
 530  365
 531    -     write_paged_register(pcie, pci_addr, PAB_PEX_AMAP_AXI_WIN(win_num));
 532    -     write_paged_register(pcie, pci_addr, PAB_PEX_AMAP_PEX_WIN_L(win_num));
 533    -     write_paged_register(pcie, 0, PAB_PEX_AMAP_PEX_WIN_H(win_num));
      366 +    csr_writel(pcie, lower_32_bits(pci_addr),
      367 +               PAB_PEX_AMAP_PEX_WIN_L(win_num));
      368 +    csr_writel(pcie, upper_32_bits(pci_addr),
      369 +               PAB_PEX_AMAP_PEX_WIN_H(win_num));
      370 +
      371 +    pcie->ib_wins_configured++;
 534  372  }
 535  373
 536  374  /*
 537  375   * routine to program the outbound windows
 538  376   */
 539  377  static void program_ob_windows(struct mobiveil_pcie *pcie, int win_num,
 540    -         u64 cpu_addr, u64 pci_addr, u32 config_io_bit, u64 size)
      378 +                               u64 cpu_addr, u64 pci_addr, u32 type, u64 size)
 541  379  {
 542    -
 543    -     u32 value, type;
      380 +    u32 value;
 544  381      u64 size64 = ~(size - 1);
 545  382
 546    -     if ((pcie->ob_wins_configured + 1) > pcie->apio_wins) {
      383 +    if (win_num >= pcie->apio_wins) {
 547  384          dev_err(&pcie->pdev->dev,
 548  385              "ERROR: max outbound windows reached !\n");
 549  386          return;
···
 511  432       * program Enable Bit to 1, Type Bit to (00) base 2, AXI Window Size Bit
 512  433       * to 4 KB in PAB_AXI_AMAP_CTRL register
 513  434       */
 514    -     type = config_io_bit;
 515  435      value = csr_readl(pcie, PAB_AXI_AMAP_CTRL(win_num));
 516    -     csr_writel(pcie, 1 << WIN_ENABLE_SHIFT | type << WIN_TYPE_SHIFT |
 517    -                lower_32_bits(size64), PAB_AXI_AMAP_CTRL(win_num));
      436 +    value &= ~(WIN_TYPE_MASK << WIN_TYPE_SHIFT | WIN_SIZE_MASK);
      437 +    value |= 1 << WIN_ENABLE_SHIFT | type << WIN_TYPE_SHIFT |
      438 +             (lower_32_bits(size64) & WIN_SIZE_MASK);
      439 +    csr_writel(pcie, value, PAB_AXI_AMAP_CTRL(win_num));
 518  440
 519    -     write_paged_register(pcie, upper_32_bits(size64),
 520    -                          PAB_EXT_AXI_AMAP_SIZE(win_num));
      441 +    csr_writel(pcie, upper_32_bits(size64), PAB_EXT_AXI_AMAP_SIZE(win_num));
 521  442
 522  443      /*
 523  444       * program AXI window base with appropriate value in
 524  445       * PAB_AXI_AMAP_AXI_WIN0 register
 525  446       */
 526    -     value = csr_readl(pcie, PAB_AXI_AMAP_AXI_WIN(win_num));
 527    -     csr_writel(pcie, cpu_addr & (~AXI_WINDOW_ALIGN_MASK),
 528    -                PAB_AXI_AMAP_AXI_WIN(win_num));
 529    -
 530    -     value = csr_readl(pcie, PAB_AXI_AMAP_PEX_WIN_H(win_num));
      447 +    csr_writel(pcie, lower_32_bits(cpu_addr) & (~AXI_WINDOW_ALIGN_MASK),
      448 +               PAB_AXI_AMAP_AXI_WIN(win_num));
      449 +    csr_writel(pcie, upper_32_bits(cpu_addr),
      450 +               PAB_EXT_AXI_AMAP_AXI_WIN(win_num));
 531  451
 532  452      csr_writel(pcie, lower_32_bits(pci_addr),
 533    -                PAB_AXI_AMAP_PEX_WIN_L(win_num));
      453 +               PAB_AXI_AMAP_PEX_WIN_L(win_num));
 534  454      csr_writel(pcie, upper_32_bits(pci_addr),
 535    -                PAB_AXI_AMAP_PEX_WIN_H(win_num));
      455 +               PAB_AXI_AMAP_PEX_WIN_H(win_num));
 536  456
 537  457      pcie->ob_wins_configured++;
 538  458  }
···
 547  469
 548  470          usleep_range(LINK_WAIT_MIN, LINK_WAIT_MAX);
 549  471      }
      472 +
 550  473      dev_err(&pcie->pdev->dev, "link never came up\n");
      474 +
 551  475      return -ETIMEDOUT;
 552  476  }
 553  477
···
 562  482      msi->msi_pages_phys = (phys_addr_t)msg_addr;
 563  483
 564  484      writel_relaxed(lower_32_bits(msg_addr),
 565    -                    pcie->apb_csr_base + MSI_BASE_LO_OFFSET);
      485 +                   pcie->apb_csr_base + MSI_BASE_LO_OFFSET);
 566  486      writel_relaxed(upper_32_bits(msg_addr),
 567    -                    pcie->apb_csr_base + MSI_BASE_HI_OFFSET);
      487 +                   pcie->apb_csr_base + MSI_BASE_HI_OFFSET);
 568  488      writel_relaxed(4096, pcie->apb_csr_base + MSI_SIZE_OFFSET);
 569  489      writel_relaxed(1, pcie->apb_csr_base + MSI_ENABLE_OFFSET);
 570  490  }
 571  491
 572  492  static int mobiveil_host_init(struct mobiveil_pcie *pcie)
 573  493  {
 574    -     u32 value, pab_ctrl, type = 0;
 575    -     int err;
 576    -     struct resource_entry *win, *tmp;
      494 +    u32 value, pab_ctrl, type;
      495 +    struct resource_entry *win;
 577  496
 578    -     err = mobiveil_bringup_link(pcie);
 579    -     if (err) {
 580    -         dev_info(&pcie->pdev->dev, "link bring-up failed\n");
 581    -         return err;
 582    -     }
      497 +    /* setup bus numbers */
      498 +    value = csr_readl(pcie, PCI_PRIMARY_BUS);
      499 +    value &= 0xff000000;
      500 +    value |= 0x00ff0100;
      501 +    csr_writel(pcie, value, PCI_PRIMARY_BUS);
 583  502
 584  503      /*
 585  504       * program Bus Master Enable Bit in Command Register in PAB Config
 586  505       * Space
 587  506       */
 588  507      value = csr_readl(pcie, PCI_COMMAND);
 589    -     csr_writel(pcie, value | PCI_COMMAND_IO | PCI_COMMAND_MEMORY |
 590    -                PCI_COMMAND_MASTER, PCI_COMMAND);
      508 +    value |= PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
      509 +    csr_writel(pcie, value, PCI_COMMAND);
 591  510
 592  511      /*
 593  512       * program PIO Enable Bit to 1 (and PEX PIO Enable to 1) in PAB_CTRL
 594  513       * register
 595  514       */
 596  515      pab_ctrl = csr_readl(pcie, PAB_CTRL);
 597    -     csr_writel(pcie, pab_ctrl | (1 << AMBA_PIO_ENABLE_SHIFT) |
 598    -                (1 << PEX_PIO_ENABLE_SHIFT), PAB_CTRL);
      516 +    pab_ctrl |= (1 << AMBA_PIO_ENABLE_SHIFT) | (1 << PEX_PIO_ENABLE_SHIFT);
      517 +    csr_writel(pcie, pab_ctrl, PAB_CTRL);
 599  518
 600  519      csr_writel(pcie, (PAB_INTP_INTX_MASK | PAB_INTP_MSI_MASK),
 601    -                PAB_INTP_AMBA_MISC_ENB);
      520 +               PAB_INTP_AMBA_MISC_ENB);
 602  521
 603  522      /*
 604  523       * program PIO Enable Bit to 1 and Config Window Enable Bit to 1 in
 605  524       * PAB_AXI_PIO_CTRL Register
 606  525       */
 607  526      value = csr_readl(pcie, PAB_AXI_PIO_CTRL);
 608    -     csr_writel(pcie, value | APIO_EN_MASK, PAB_AXI_PIO_CTRL);
      527 +    value |= APIO_EN_MASK;
      528 +    csr_writel(pcie, value, PAB_AXI_PIO_CTRL);
      529 +
      530 +    /* Enable PCIe PIO master */
      531 +    value = csr_readl(pcie, PAB_PEX_PIO_CTRL);
      532 +    value |= 1 << PIO_ENABLE_SHIFT;
      533 +    csr_writel(pcie, value, PAB_PEX_PIO_CTRL);
 609  534
 610  535      /*
 611  536       * we'll program one outbound window for config reads and
···
 620  535       */
 621  536
 622  537      /* config outbound translation window */
 623    -     program_ob_windows(pcie, pcie->ob_wins_configured,
 624    -                        pcie->ob_io_res->start, 0, CFG_WINDOW_TYPE,
 625    -                        resource_size(pcie->ob_io_res));
      538 +    program_ob_windows(pcie, WIN_NUM_0, pcie->ob_io_res->start, 0,
      539 +                       CFG_WINDOW_TYPE, resource_size(pcie->ob_io_res));
 626  540
 627  541      /* memory inbound translation window */
 628    -     program_ib_windows(pcie, WIN_NUM_1, 0, MEM_WINDOW_TYPE, IB_WIN_SIZE);
      542 +    program_ib_windows(pcie, WIN_NUM_0, 0, MEM_WINDOW_TYPE, IB_WIN_SIZE);
 629  543
 630  544      /* Get the I/O and memory ranges from DT */
 631    -     resource_list_for_each_entry_safe(win, tmp, &pcie->resources) {
 632    -         type = 0;
      545 +    resource_list_for_each_entry(win, &pcie->resources) {
 633  546          if (resource_type(win->res) == IORESOURCE_MEM)
 634  547              type = MEM_WINDOW_TYPE;
 635    -         if (resource_type(win->res) == IORESOURCE_IO)
      548 +        else if (resource_type(win->res) == IORESOURCE_IO)
 636  549              type = IO_WINDOW_TYPE;
 637    -         if (type) {
 638    -             /* configure outbound translation window */
 639    -             program_ob_windows(pcie, pcie->ob_wins_configured,
 640    -                                win->res->start, 0, type,
 641    -                                resource_size(win->res));
 642    -         }
      550 +        else
      551 +            continue;
      552 +
      553 +        /* configure outbound translation window */
      554 +        program_ob_windows(pcie, pcie->ob_wins_configured,
      555 +                           win->res->start,
      556 +                           win->res->start - win->offset,
      557 +                           type, resource_size(win->res));
 643  558      }
      559 +
      560 +    /* fixup for PCIe class register */
      561 +    value = csr_readl(pcie, PAB_INTP_AXI_PIO_CLASS);
      562 +    value &= 0xff;
      563 +    value |= (PCI_CLASS_BRIDGE_PCI << 16);
      564 +    csr_writel(pcie, value, PAB_INTP_AXI_PIO_CLASS);
 644  565
 645  566      /* setup MSI hardware registers */
 646  567      mobiveil_pcie_enable_msi(pcie);
 647  568
 648    -     return err;
      569 +    return 0;
 649  570  }
 650  571
 651  572  static void mobiveil_mask_intx_irq(struct irq_data *data)
···
 665  574      mask = 1 << ((data->hwirq + PAB_INTX_START) - 1);
 666  575      raw_spin_lock_irqsave(&pcie->intx_mask_lock, flags);
 667  576      shifted_val = csr_readl(pcie, PAB_INTP_AMBA_MISC_ENB);
 668    -     csr_writel(pcie, (shifted_val & (~mask)), PAB_INTP_AMBA_MISC_ENB);
      577 +    shifted_val &= ~mask;
      578 +    csr_writel(pcie, shifted_val, PAB_INTP_AMBA_MISC_ENB);
 669  579      raw_spin_unlock_irqrestore(&pcie->intx_mask_lock, flags);
 670  580  }
 671  581
···
 681  589      mask = 1 << ((data->hwirq + PAB_INTX_START) - 1);
 682  590      raw_spin_lock_irqsave(&pcie->intx_mask_lock, flags);
 683  591      shifted_val = csr_readl(pcie, PAB_INTP_AMBA_MISC_ENB);
 684    -     csr_writel(pcie, (shifted_val | mask), PAB_INTP_AMBA_MISC_ENB);
      592 +    shifted_val |= mask;
      593 +    csr_writel(pcie, shifted_val, PAB_INTP_AMBA_MISC_ENB);
 685  594      raw_spin_unlock_irqrestore(&pcie->intx_mask_lock, flags);
 686  595  }
 687  596
···
 696  603
 697  604  /* routine to setup the INTx related data */
 698  605  static int mobiveil_pcie_intx_map(struct irq_domain *domain, unsigned int irq,
 699    -                                   irq_hw_number_t hwirq)
      606 +                                  irq_hw_number_t hwirq)
 700  607  {
 701  608      irq_set_chip_and_handler(irq, &intx_irq_chip, handle_level_irq);
 702  609      irq_set_chip_data(irq, domain->host_data);
      610 +
 703  611      return 0;
 704  612  }
 705  613
···
 717  623
 718  624  static struct msi_domain_info mobiveil_msi_domain_info = {
 719  625      .flags  = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
 720    -                MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX),
      626 +               MSI_FLAG_PCI_MSIX),
 721  627      .chip   = &mobiveil_msi_irq_chip,
 722  628  };
 723  629
···
 735  641  }
 736  642
 737  643  static int mobiveil_msi_set_affinity(struct irq_data *irq_data,
 738    -                                      const struct cpumask *mask, bool force)
      644 +                                     const struct cpumask *mask, bool force)
 739  645  {
 740  646      return -EINVAL;
 741  647  }
···
 747  653  };
 748  654
 749  655  static int mobiveil_irq_msi_domain_alloc(struct irq_domain *domain,
 750    -     unsigned int virq, unsigned int nr_irqs, void *args)
      656 +                                         unsigned int virq,
      657 +                                         unsigned int nr_irqs, void *args)
 751  658  {
 752  659      struct mobiveil_pcie *pcie = domain->host_data;
 753  660      struct mobiveil_msi *msi = &pcie->msi;
···
 768  673      mutex_unlock(&msi->lock);
 769  674
 770  675      irq_domain_set_info(domain, virq, bit, &mobiveil_msi_bottom_irq_chip,
 771    -                         domain->host_data, handle_level_irq,
 772    -                         NULL, NULL);
      676 +                        domain->host_data, handle_level_irq, NULL, NULL);
 773  677      return 0;
 774  678  }
 775  679
 776  680  static void mobiveil_irq_msi_domain_free(struct irq_domain *domain,
 777    -         unsigned int virq, unsigned int nr_irqs)
      681 +                                         unsigned int virq,
      682 +                                         unsigned int nr_irqs)
 778  683  {
 779  684      struct irq_data *d = irq_domain_get_irq_data(domain, virq);
 780  685      struct mobiveil_pcie *pcie = irq_data_get_irq_chip_data(d);
···
 782  687
 783  688      mutex_lock(&msi->lock);
 784  689
 785    -     if (!test_bit(d->hwirq, msi->msi_irq_in_use)) {
      690 +    if (!test_bit(d->hwirq, msi->msi_irq_in_use))
 786  691          dev_err(&pcie->pdev->dev, "trying to free unused MSI#%lu\n",
 787  692              d->hwirq);
 788    -     } else {
      693 +    else
 789  694          __clear_bit(d->hwirq, msi->msi_irq_in_use);
 790    -     }
 791  695
 792  696      mutex_unlock(&msi->lock);
 793  697  }
···
 810  716      }
 811  717
 812  718      msi->msi_domain = pci_msi_create_irq_domain(fwnode,
 813    -         &mobiveil_msi_domain_info, msi->dev_domain);
      719 +                                                &mobiveil_msi_domain_info,
      720 +                                                msi->dev_domain);
 814  721      if (!msi->msi_domain) {
 815  722          dev_err(dev, "failed to create MSI domain\n");
 816  723          irq_domain_remove(msi->dev_domain);
 817  724          return -ENOMEM;
 818  725      }
      726 +
 819  727      return 0;
 820  728  }
 821  729
···
 828  732      int ret;
 829  733
 830  734      /* setup INTx */
 831    -     pcie->intx_domain = irq_domain_add_linear(node,
 832    -         PCI_NUM_INTX, &intx_domain_ops, pcie);
      735 +    pcie->intx_domain = irq_domain_add_linear(node, PCI_NUM_INTX,
      736 +                                              &intx_domain_ops, pcie);
 833  737
 834  738      if (!pcie->intx_domain) {
 835  739          dev_err(dev, "Failed to get a INTx IRQ domain\n");
 836    -         return -ENODEV;
      740 +        return -ENOMEM;
 837  741      }
 838  742
 839  743      raw_spin_lock_init(&pcie->intx_mask_lock);
···
 859  763      /* allocate the PCIe port */
 860  764      bridge = devm_pci_alloc_host_bridge(dev, sizeof(*pcie));
 861  765      if (!bridge)
 862    -         return -ENODEV;
      766 +        return -ENOMEM;
 863  767
 864  768      pcie = pci_host_bridge_priv(bridge);
 865    -     if (!pcie)
 866    -         return -ENOMEM;
 867  769
 868  770      pcie->pdev = pdev;
 869  771
···
 878  784                                          &pcie->resources, &iobase);
 879  785      if (ret) {
 880  786          dev_err(dev, "Getting bridge resources failed\n");
 881    -         return -ENOMEM;
      787 +        return ret;
 882  788      }
 883  789
 884  790      /*
···
 891  797          goto error;
 892  798      }
 893  799
 894    -     /* fixup for PCIe class register */
 895    -     csr_writel(pcie, 0x060402ab, PAB_INTP_AXI_PIO_CLASS);
 896    -
 897  800      /* initialize the IRQ domains */
 898  801      ret = mobiveil_pcie_init_irq_domain(pcie);
 899  802      if (ret) {
 900  803          dev_err(dev, "Failed creating IRQ Domain\n");
 901  804          goto error;
 902  805      }
      806 +
      807 +    irq_set_chained_handler_and_data(pcie->irq, mobiveil_pcie_isr, pcie);
 903  808
 904  809      ret = devm_request_pci_bus_resources(dev, &pcie->resources);
 905  810      if (ret)
···
 912  819      bridge->ops = &mobiveil_pcie_ops;
 913  820      bridge->map_irq = of_irq_parse_and_map_pci;
 914  821      bridge->swizzle_irq = pci_common_swizzle;
      822 +
      823 +    ret = mobiveil_bringup_link(pcie);
      824 +    if (ret) {
      825 +        dev_info(dev, "link bring-up failed\n");
      826 +        goto error;
      827 +    }
 915  828
 916  829      /* setup the kernel resources for the newly added PCIe root bus */
 917  830      ret = pci_scan_root_bus_bridge(bridge);
···
 947  848  static struct platform_driver mobiveil_pcie_driver = {
 948  849      .probe = mobiveil_pcie_probe,
 949  850      .driver = {
 950    -             .name = "mobiveil-pcie",
 951    -             .of_match_table = mobiveil_pcie_of_match,
 952    -             .suppress_bind_attrs = true,
 953    -     },
      851 +        .name = "mobiveil-pcie",
      852 +        .of_match_table = mobiveil_pcie_of_match,
      853 +        .suppress_bind_attrs = true,
      854 +    },
 954  855  };
 955  856
 956  857  builtin_platform_driver(mobiveil_pcie_driver);
+5 -6
drivers/pci/controller/pcie-xilinx-nwl.c
···
 482  482      int i;
 483  483
 484  484      mutex_lock(&msi->lock);
 485    -     bit = bitmap_find_next_zero_area(msi->bitmap, INT_PCI_MSI_NR, 0,
 486    -                                      nr_irqs, 0);
 487    -     if (bit >= INT_PCI_MSI_NR) {
      485 +    bit = bitmap_find_free_region(msi->bitmap, INT_PCI_MSI_NR,
      486 +                                  get_count_order(nr_irqs));
      487 +    if (bit < 0) {
 488  488          mutex_unlock(&msi->lock);
 489  489          return -ENOSPC;
 490  490      }
 491    -
 492    -     bitmap_set(msi->bitmap, bit, nr_irqs);
 494  492      for (i = 0; i < nr_irqs; i++) {
 495  493          irq_domain_set_info(domain, virq + i, bit + i, &nwl_irq_chip,
···
 506  508      struct nwl_msi *msi = &pcie->msi;
 507  509
 508  510      mutex_lock(&msi->lock);
 509    -     bitmap_clear(msi->bitmap, data->hwirq, nr_irqs);
      511 +    bitmap_release_region(msi->bitmap, data->hwirq,
      512 +                          get_count_order(nr_irqs));
 510  513      mutex_unlock(&msi->lock);
 511  514  }
 512  515
+1 -1
drivers/pci/controller/vmd.c
···
 627  627       * 32-bit resources.  __pci_assign_resource() enforces that
 628  628       * artificial restriction to make sure everything will fit.
 629  629       *
 630    -      * The only way we could use a 64-bit non-prefechable MEMBAR is
      630 +     * The only way we could use a 64-bit non-prefetchable MEMBAR is
 631  631       * if its address is <4GB so that we can convert it to a 32-bit
 632  632       * resource.  To be visible to the host OS, all VMD endpoints must
 633  633       * be initially configured by platform BIOS, which includes setting
+20 -15
drivers/pci/endpoint/functions/pci-epf-test.c
···
 381  381          epf_bar = &epf->bar[bar];
 382  382
 383  383          if (epf_test->reg[bar]) {
 384    -             pci_epf_free_space(epf, epf_test->reg[bar], bar);
 385  384              pci_epc_clear_bar(epc, epf->func_no, epf_bar);
      385 +            pci_epf_free_space(epf, epf_test->reg[bar], bar);
 386  386          }
 387  387      }
 388  388  }
 389  389
 390  390  static int pci_epf_test_set_bar(struct pci_epf *epf)
 391  391  {
 392    -     int bar;
      392 +    int bar, add;
 393  393      int ret;
 394  394      struct pci_epf_bar *epf_bar;
 395  395      struct pci_epc *epc = epf->epc;
···
 400  400
 401  401      epc_features = epf_test->epc_features;
 402  402
 403    -     for (bar = BAR_0; bar <= BAR_5; bar++) {
      403 +    for (bar = BAR_0; bar <= BAR_5; bar += add) {
 404  404          epf_bar = &epf->bar[bar];
      405 +        /*
      406 +         * pci_epc_set_bar() sets PCI_BASE_ADDRESS_MEM_TYPE_64
      407 +         * if the specific implementation required a 64-bit BAR,
      408 +         * even if we only requested a 32-bit BAR.
      409 +         */
      410 +        add = (epf_bar->flags & PCI_BASE_ADDRESS_MEM_TYPE_64) ? 2 : 1;
 405  411
 406  412          if (!!(epc_features->reserved_bar & (1 << bar)))
 407  413              continue;
···
 419  413              if (bar == test_reg_bar)
 420  414                  return ret;
 421  415          }
 422    -         /*
 423    -          * pci_epc_set_bar() sets PCI_BASE_ADDRESS_MEM_TYPE_64
 424    -          * if the specific implementation required a 64-bit BAR,
 425    -          * even if we only requested a 32-bit BAR.
 426    -          */
 427    -         if (epf_bar->flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
 428    -             bar++;
 429  416      }
 430  417
 431  418      return 0;
···
 430  431      struct device *dev = &epf->dev;
 431  432      struct pci_epf_bar *epf_bar;
 432  433      void *base;
 433    -     int bar;
      434 +    int bar, add;
 434  435      enum pci_barno test_reg_bar = epf_test->test_reg_bar;
 435  436      const struct pci_epc_features *epc_features;
      437 +    size_t test_reg_size;
 436  438
 437  439      epc_features = epf_test->epc_features;
 438  440
 439    -     base = pci_epf_alloc_space(epf, sizeof(struct pci_epf_test_reg),
      441 +    if (epc_features->bar_fixed_size[test_reg_bar])
      442 +        test_reg_size = bar_size[test_reg_bar];
      443 +    else
      444 +        test_reg_size = sizeof(struct pci_epf_test_reg);
      445 +
      446 +    base = pci_epf_alloc_space(epf, test_reg_size,
 440  447                                 test_reg_bar, epc_features->align);
 441  448      if (!base) {
 442  449          dev_err(dev, "Failed to allocated register space\n");
···
 450  445      }
 451  446      epf_test->reg[test_reg_bar] = base;
 452  447
 453    -     for (bar = BAR_0; bar <= BAR_5; bar++) {
      448 +    for (bar = BAR_0; bar <= BAR_5; bar += add) {
 454  449          epf_bar = &epf->bar[bar];
      450 +        add = (epf_bar->flags & PCI_BASE_ADDRESS_MEM_TYPE_64) ? 2 : 1;
      451 +
 455  452          if (bar == test_reg_bar)
 456  453              continue;
···
 466  459              dev_err(dev, "Failed to allocate space for BAR%d\n",
 467  460                  bar);
 468  461          epf_test->reg[bar] = base;
 469    -         if (epf_bar->flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
 470    -             bar++;
 471  462      }
 472  463
 473  464      return 0;
+2 -1
drivers/pci/endpoint/pci-epc-core.c
···
 519  519  {
 520  520      unsigned long flags;
 521  521
 522    -     if (!epc || IS_ERR(epc))
      522 +    if (!epc || IS_ERR(epc) || !epf)
 523  523          return;
 524  524
 525  525      spin_lock_irqsave(&epc->lock, flags);
 526  526      list_del(&epf->list);
      527 +    epf->epc = NULL;
 527  528      spin_unlock_irqrestore(&epc->lock, flags);
 528  529  }
 529  530  EXPORT_SYMBOL_GPL(pci_epc_remove_epf);
-2
drivers/pci/iov.c
···
 132  132                  &physfn->sriov->subsystem_vendor);
 133  133      pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
 134  134                  &physfn->sriov->subsystem_device);
 135    -
 136    -     physfn->sriov->cfg_size = pci_cfg_space_size(virtfn);
 137  135  }
 138  136
 139  137  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
+1 -1
drivers/pci/mmap.c
···
  73   73  #elif defined(HAVE_PCI_MMAP) /* && !ARCH_GENERIC_PCI_MMAP_RESOURCE */
  74   74
  75   75  /*
  76    -  * Legacy setup: Impement pci_mmap_resource_range() as a wrapper around
       76 + * Legacy setup: Implement pci_mmap_resource_range() as a wrapper around
  77   77   * the architecture's pci_mmap_page_range(), converting to "user visible"
  78   78   * addresses as necessary.
  79   79   */
+22 -21
drivers/pci/msi.c
··· 237 237 } 238 238 239 239 /** 240 - * pci_msi_mask_irq - Generic irq chip callback to mask PCI/MSI interrupts 240 + * pci_msi_mask_irq - Generic IRQ chip callback to mask PCI/MSI interrupts 241 241 * @data: pointer to irqdata associated to that interrupt 242 242 */ 243 243 void pci_msi_mask_irq(struct irq_data *data) ··· 247 247 EXPORT_SYMBOL_GPL(pci_msi_mask_irq); 248 248 249 249 /** 250 - * pci_msi_unmask_irq - Generic irq chip callback to unmask PCI/MSI interrupts 250 + * pci_msi_unmask_irq - Generic IRQ chip callback to unmask PCI/MSI interrupts 251 251 * @data: pointer to irqdata associated to that interrupt 252 252 */ 253 253 void pci_msi_unmask_irq(struct irq_data *data) ··· 588 588 * msi_capability_init - configure device's MSI capability structure 589 589 * @dev: pointer to the pci_dev data structure of MSI device function 590 590 * @nvec: number of interrupts to allocate 591 - * @affd: description of automatic irq affinity assignments (may be %NULL) 591 + * @affd: description of automatic IRQ affinity assignments (may be %NULL) 592 592 * 593 593 * Setup the MSI capability structure of the device with the requested 594 594 * number of interrupts. A return value of zero indicates the successful 595 - * setup of an entry with the new MSI irq. A negative return value indicates 595 + * setup of an entry with the new MSI IRQ. A negative return value indicates 596 596 * an error, and a positive return value indicates the number of interrupts 597 597 * which could have been allocated. 
598 598 */ ··· 609 609 if (!entry) 610 610 return -ENOMEM; 611 611 612 - /* All MSIs are unmasked by default, Mask them all */ 612 + /* All MSIs are unmasked by default; mask them all */ 613 613 mask = msi_mask(entry->msi_attrib.multi_cap); 614 614 msi_mask_irq(entry, mask, mask); 615 615 ··· 637 637 return ret; 638 638 } 639 639 640 - /* Set MSI enabled bits */ 640 + /* Set MSI enabled bits */ 641 641 pci_intx_for_msi(dev, 0); 642 642 pci_msi_set_enable(dev, 1); 643 643 dev->msi_enabled = 1; ··· 729 729 * @dev: pointer to the pci_dev data structure of MSI-X device function 730 730 * @entries: pointer to an array of struct msix_entry entries 731 731 * @nvec: number of @entries 732 - * @affd: Optional pointer to enable automatic affinity assignement 732 + * @affd: Optional pointer to enable automatic affinity assignment 733 733 * 734 734 * Setup the MSI-X capability structure of device function with a 735 - * single MSI-X irq. A return of zero indicates the successful setup of 736 - * requested MSI-X entries with allocated irqs or non-zero for otherwise. 735 + * single MSI-X IRQ. A return of zero indicates the successful setup of 736 + * requested MSI-X entries with allocated IRQs or non-zero for otherwise. 737 737 **/ 738 738 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries, 739 739 int nvec, struct irq_affinity *affd) ··· 789 789 out_avail: 790 790 if (ret < 0) { 791 791 /* 792 - * If we had some success, report the number of irqs 792 + * If we had some success, report the number of IRQs 793 793 * we succeeded in setting up. 794 794 */ 795 795 struct msi_desc *entry; ··· 812 812 /** 813 813 * pci_msi_supported - check whether MSI may be enabled on a device 814 814 * @dev: pointer to the pci_dev data structure of MSI device function 815 - * @nvec: how many MSIs have been requested ? 815 + * @nvec: how many MSIs have been requested? 
816 816 * 817 817 * Look at global flags, the device itself, and its parent buses 818 818 * to determine if MSI/-X are supported for the device. If MSI/-X is ··· 896 896 /* Keep cached state to be restored */ 897 897 __pci_msi_desc_mask_irq(desc, mask, ~mask); 898 898 899 - /* Restore dev->irq to its default pin-assertion irq */ 899 + /* Restore dev->irq to its default pin-assertion IRQ */ 900 900 dev->irq = desc->msi_attrib.default_irq; 901 901 pcibios_alloc_irq(dev); 902 902 } ··· 958 958 } 959 959 } 960 960 961 - /* Check whether driver already requested for MSI irq */ 961 + /* Check whether driver already requested for MSI IRQ */ 962 962 if (dev->msi_enabled) { 963 963 pci_info(dev, "can't enable MSI-X (MSI IRQ already assigned)\n"); 964 964 return -EINVAL; ··· 1026 1026 if (!pci_msi_supported(dev, minvec)) 1027 1027 return -EINVAL; 1028 1028 1029 - /* Check whether driver already requested MSI-X irqs */ 1029 + /* Check whether driver already requested MSI-X IRQs */ 1030 1030 if (dev->msix_enabled) { 1031 1031 pci_info(dev, "can't enable MSI (MSI-X already enabled)\n"); 1032 1032 return -EINVAL; ··· 1113 1113 * pci_enable_msix_range - configure device's MSI-X capability structure 1114 1114 * @dev: pointer to the pci_dev data structure of MSI-X device function 1115 1115 * @entries: pointer to an array of MSI-X entries 1116 - * @minvec: minimum number of MSI-X irqs requested 1117 - * @maxvec: maximum number of MSI-X irqs requested 1116 + * @minvec: minimum number of MSI-X IRQs requested 1117 + * @maxvec: maximum number of MSI-X IRQs requested 1118 1118 * 1119 1119 * Setup the MSI-X capability structure of device function with a maximum 1120 1120 * possible number of interrupts in the range between @minvec and @maxvec ··· 1179 1179 return msi_vecs; 1180 1180 } 1181 1181 1182 - /* use legacy irq if allowed */ 1182 + /* use legacy IRQ if allowed */ 1183 1183 if (flags & PCI_IRQ_LEGACY) { 1184 1184 if (min_vecs == 1 && dev->irq) { 1185 1185 /* ··· 1248 1248 
EXPORT_SYMBOL(pci_irq_vector); 1249 1249 1250 1250 /** 1251 - * pci_irq_get_affinity - return the affinity of a particular msi vector 1251 + * pci_irq_get_affinity - return the affinity of a particular MSI vector 1252 1252 * @dev: PCI device to operate on 1253 1253 * @nr: device-relative interrupt vector index (0-based). 1254 1254 */ ··· 1280 1280 EXPORT_SYMBOL(pci_irq_get_affinity); 1281 1281 1282 1282 /** 1283 - * pci_irq_get_node - return the numa node of a particular msi vector 1283 + * pci_irq_get_node - return the NUMA node of a particular MSI vector 1284 1284 * @pdev: PCI device to operate on 1285 1285 * @vec: device-relative interrupt vector index (0-based). 1286 1286 */ ··· 1330 1330 /** 1331 1331 * pci_msi_domain_calc_hwirq - Generate a unique ID for an MSI source 1332 1332 * @dev: Pointer to the PCI device 1333 - * @desc: Pointer to the msi descriptor 1333 + * @desc: Pointer to the MSI descriptor 1334 1334 * 1335 1335 * The ID number is only used within the irqdomain. 1336 1336 */ ··· 1348 1348 } 1349 1349 1350 1350 /** 1351 - * pci_msi_domain_check_cap - Verify that @domain supports the capabilities for @dev 1351 + * pci_msi_domain_check_cap - Verify that @domain supports the capabilities 1352 + * for @dev 1352 1353 * @domain: The interrupt domain to check 1353 1354 * @info: The domain info for verification 1354 1355 * @dev: The device to check
+12 -4
drivers/pci/p2pdma.c
···
 195  195
 196  196  /*
 197  197   * Note this function returns the parent PCI device with a
 198    -  * reference taken. It is the caller's responsibily to drop
      198 + * reference taken. It is the caller's responsibility to drop
 199  199   * the reference.
 200  200   */
 201  201  static struct pci_dev *find_parent_pci_dev(struct device *dev)
···
 355  355
 356  356      /*
 357  357       * Allow the connection if both devices are on a whitelisted root
 358    -      * complex, but add an arbitary large value to the distance.
      358 +     * complex, but add an arbitrary large value to the distance.
 359  359       */
 360  360      if (root_complex_whitelist(provider) &&
 361  361          root_complex_whitelist(client))
···
 414  414  }
 415  415
 416  416  /**
 417    -  * pci_p2pdma_distance_many - Determive the cumulative distance between
      417 + * pci_p2pdma_distance_many - Determine the cumulative distance between
 418  418   *  a p2pdma provider and the clients in use.
 419  419   * @provider: p2pdma provider to check against the client list
 420  420   * @clients: array of devices to check (NULL-terminated)
···
 443  443          return -1;
 444  444
 445  445      for (i = 0; i < num_clients; i++) {
      446 +        if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) &&
      447 +            clients[i]->dma_ops == &dma_virt_ops) {
      448 +            if (verbose)
      449 +                dev_warn(clients[i],
      450 +                     "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
      451 +            return -1;
      452 +        }
      453 +
 446  454          pci_client = find_parent_pci_dev(clients[i]);
 447  455          if (!pci_client) {
 448  456              if (verbose)
···
 729  721       * p2pdma mappings are not compatible with devices that use
 730  722       * dma_virt_ops. If the upper layers do the right thing
 731  723       * this should never happen because it will be prevented
 732    -      * by the check in pci_p2pdma_add_client()
      724 +     * by the check in pci_p2pdma_distance_many()
 733  725       */
 734  726      if (WARN_ON_ONCE(IS_ENABLED(CONFIG_DMA_VIRT_OPS) &&
 735  727               dev->dma_ops == &dma_virt_ops))
+1 -1
drivers/pci/pci-bridge-emul.c
···
 305  305  }
 306  306
 307  307  /*
 308    -  * Cleanup a pci_bridge_emul structure that was previously initilized
      308 + * Cleanup a pci_bridge_emul structure that was previously initialized
 309  309   * using pci_bridge_emul_init().
 310  310   */
 311  311  void pci_bridge_emul_cleanup(struct pci_bridge_emul *bridge)
+9 -7
drivers/pci/pci-driver.c
···
 399  399  #ifdef CONFIG_PCI_IOV
 400  400  static inline bool pci_device_can_probe(struct pci_dev *pdev)
 401  401  {
 402    -     return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe);
      402 +    return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe ||
      403 +        pdev->driver_override);
 403  404  }
 404  405  #else
 405  406  static inline bool pci_device_can_probe(struct pci_dev *pdev)
···
 415  414      struct pci_dev *pci_dev = to_pci_dev(dev);
 416  415      struct pci_driver *drv = to_pci_driver(dev->driver);
 417  416
      417 +    if (!pci_device_can_probe(pci_dev))
      418 +        return -ENODEV;
      419 +
 418  420      pci_assign_irq(pci_dev);
 419  421
 420  422      error = pcibios_alloc_irq(pci_dev);
···
 425  421          return error;
 426  422
 427  423      pci_dev_get(pci_dev);
 428    -     if (pci_device_can_probe(pci_dev)) {
 429    -         error = __pci_device_probe(drv, pci_dev);
 430    -         if (error) {
 431    -             pcibios_free_irq(pci_dev);
 432    -             pci_dev_put(pci_dev);
 433    -         }
      424 +    error = __pci_device_probe(drv, pci_dev);
      425 +    if (error) {
      426 +        pcibios_free_irq(pci_dev);
      427 +        pci_dev_put(pci_dev);
 434  428      }
 435  429
 436  430      return error;
+1 -1
drivers/pci/pci-pf-stub.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* pci-pf-stub - simple stub driver for PCI SR-IOV PF device 3 3 * 4 - * This driver is meant to act as a "whitelist" for devices that provde 4 + * This driver is meant to act as a "whitelist" for devices that provide 5 5 * SR-IOV functionality while at the same time not actually needing a 6 6 * driver of their own. 7 7 */
+4 -1
drivers/pci/pci-sysfs.c
···
 182  182          return -EINVAL;
 183  183
 184  184      switch (linkstat & PCI_EXP_LNKSTA_CLS) {
      185 +    case PCI_EXP_LNKSTA_CLS_32_0GB:
      186 +        speed = "32 GT/s";
      187 +        break;
 185  188      case PCI_EXP_LNKSTA_CLS_16_0GB:
 186  189          speed = "16 GT/s";
 187  190          break;
···
 480  477      pci_stop_and_remove_bus_device_locked(to_pci_dev(dev));
 481  478      return count;
 482  479  }
 483    - static struct device_attribute dev_remove_attr = __ATTR(remove,
      480 +static struct device_attribute dev_remove_attr = __ATTR_IGNORE_LOCKDEP(remove,
 484  481                              (S_IWUSR|S_IWGRP),
 485  482                              NULL, remove_store);
 486  483
+4 -2
drivers/pci/pci.c
···
4535 4535
4536 4536      /*
4537 4537       * Wait for Transaction Pending bit to clear.  A word-aligned test
4538    -      * is used, so we use the conrol offset rather than status and shift
     4538 +     * is used, so we use the control offset rather than status and shift
4539 4539       * the test bit to match.
4540 4540       */
4541 4541      if (!pci_wait_for_pending(dev, pos + PCI_AF_CTRL,
···
5669 5669       */
5670 5670      pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
5671 5671      if (lnkcap2) { /* PCIe r3.0-compliant */
5672    -         if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB)
     5672 +        if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_32_0GB)
     5673 +            return PCIE_SPEED_32_0GT;
     5674 +        else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB)
5673 5675              return PCIE_SPEED_16_0GT;
5674 5676          else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB)
5675 5677              return PCIE_SPEED_8_0GT;
-1
drivers/pci/pci.h
··· 298 298 u16 driver_max_VFs; /* Max num VFs driver supports */ 299 299 struct pci_dev *dev; /* Lowest numbered PF */ 300 300 struct pci_dev *self; /* This PF */ 301 - u32 cfg_size; /* VF config space size */ 302 301 u32 class; /* VF device */ 303 302 u8 hdr_type; /* VF header type */ 304 303 u16 subsystem_vendor; /* VF subsystem vendor */
+1 -1
drivers/pci/pcie/aer_inject.c
···
   2    2  /*
   3    3   * PCIe AER software error injection support.
   4    4   *
   5    -  * Debuging PCIe AER code is quite difficult because it is hard to
        5 + * Debugging PCIe AER code is quite difficult because it is hard to
   6    6   * trigger various real hardware errors. Software based error
   7    7   * injection can fake almost all kinds of errors with the help of a
   8    8   * user space helper tool aer-inject, which can be gotten from:
+13 -15
drivers/pci/probe.c
···
 668  668      PCIE_SPEED_5_0GT,       /* 2 */
 669  669      PCIE_SPEED_8_0GT,       /* 3 */
 670  670      PCIE_SPEED_16_0GT,      /* 4 */
 671    -     PCI_SPEED_UNKNOWN,      /* 5 */
      671 +    PCIE_SPEED_32_0GT,      /* 5 */
 672  672      PCI_SPEED_UNKNOWN,      /* 6 */
 673  673      PCI_SPEED_UNKNOWN,      /* 7 */
 674  674      PCI_SPEED_UNKNOWN,      /* 8 */
···
1555 1555      return PCI_CFG_SPACE_EXP_SIZE;
1556 1556  }
1557 1557
1558    - #ifdef CONFIG_PCI_IOV
1559    - static bool is_vf0(struct pci_dev *dev)
1560    - {
1561    -     if (pci_iov_virtfn_devfn(dev->physfn, 0) == dev->devfn &&
1562    -         pci_iov_virtfn_bus(dev->physfn, 0) == dev->bus->number)
1563    -         return true;
1564    -
1565    -     return false;
1566    - }
1567    - #endif
1568    -
1569 1558  int pci_cfg_space_size(struct pci_dev *dev)
1570 1559  {
1571 1560      int pos;
···
1562 1573      u16 class;
1563 1574
1564 1575  #ifdef CONFIG_PCI_IOV
1565    -     /* Read cached value for all VFs except for VF0 */
1566    -     if (dev->is_virtfn && !is_vf0(dev))
1567    -         return dev->physfn->sriov->cfg_size;
     1576 +    /*
     1577 +     * Per the SR-IOV specification (rev 1.1, sec 3.5), VFs are required to
     1578 +     * implement a PCIe capability and therefore must implement extended
     1579 +     * config space.  We can skip the NO_EXTCFG test below and the
     1580 +     * reachability/aliasing test in pci_cfg_space_size_ext() by virtue of
     1581 +     * the fact that the SR-IOV capability on the PF resides in extended
     1582 +     * config space and must be accessible and non-aliased to have enabled
     1583 +     * support for this VF.  This is a micro performance optimization for
     1584 +     * systems supporting many VFs.
     1585 +     */
     1586 +    if (dev->is_virtfn)
     1587 +        return PCI_CFG_SPACE_EXP_SIZE;
1568 1588  #endif
1569 1589
1570 1590      if (dev->bus->bus_flags & PCI_BUS_FLAGS_NO_EXTCFG)
+1 -1
drivers/pci/proc.c
···
 377  377      }
 378  378      seq_putc(m, '\t');
 379  379      if (drv)
 380    -         seq_printf(m, "%s", drv->name);
      380 +        seq_puts(m, drv->name);
 381  381      seq_putc(m, '\n');
 382  382      return 0;
 383  383  }
+90 -20
drivers/pci/quirks.c
···
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_no_msi);
 
 /*
- * GPUs with integrated HDA controller for streaming audio to attached displays
- * need a device link from the HDA controller (consumer) to the GPU (supplier)
- * so that the GPU is powered up whenever the HDA controller is accessed.
- * The GPU and HDA controller are functions 0 and 1 of the same PCI device.
- * The device link stays in place until shutdown (or removal of the PCI device
- * if it's hotplugged).  Runtime PM is allowed by default on the HDA controller
- * to prevent it from permanently keeping the GPU awake.
+ * Although not allowed by the spec, some multi-function devices have
+ * dependencies of one function (consumer) on another (supplier).  For the
+ * consumer to work in D0, the supplier must also be in D0.  Create a
+ * device link from the consumer to the supplier to enforce this
+ * dependency.  Runtime PM is allowed by default on the consumer to prevent
+ * it from permanently keeping the supplier awake.
 */
-static void quirk_gpu_hda(struct pci_dev *hda)
+static void pci_create_device_link(struct pci_dev *pdev, unsigned int consumer,
+				   unsigned int supplier, unsigned int class,
+				   unsigned int class_shift)
 {
-	struct pci_dev *gpu;
+	struct pci_dev *supplier_pdev;
 
-	if (PCI_FUNC(hda->devfn) != 1)
+	if (PCI_FUNC(pdev->devfn) != consumer)
 		return;
 
-	gpu = pci_get_domain_bus_and_slot(pci_domain_nr(hda->bus),
-					  hda->bus->number,
-					  PCI_DEVFN(PCI_SLOT(hda->devfn), 0));
-	if (!gpu || (gpu->class >> 16) != PCI_BASE_CLASS_DISPLAY) {
-		pci_dev_put(gpu);
+	supplier_pdev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+						    pdev->bus->number,
+						    PCI_DEVFN(PCI_SLOT(pdev->devfn), supplier));
+	if (!supplier_pdev || (supplier_pdev->class >> class_shift) != class) {
+		pci_dev_put(supplier_pdev);
 		return;
 	}
 
-	if (!device_link_add(&hda->dev, &gpu->dev,
-			     DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME))
-		pci_err(hda, "cannot link HDA to GPU %s\n", pci_name(gpu));
+	if (device_link_add(&pdev->dev, &supplier_pdev->dev,
+			    DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME))
+		pci_info(pdev, "D0 power state depends on %s\n",
+			 pci_name(supplier_pdev));
+	else
+		pci_err(pdev, "Cannot enforce power dependency on %s\n",
+			pci_name(supplier_pdev));
 
-	pm_runtime_allow(&hda->dev);
-	pci_dev_put(gpu);
+	pm_runtime_allow(&pdev->dev);
+	pci_dev_put(supplier_pdev);
+}
+
+/*
+ * Create device link for GPUs with integrated HDA controller for streaming
+ * audio to attached displays.
+ */
+static void quirk_gpu_hda(struct pci_dev *hda)
+{
+	pci_create_device_link(hda, 1, 0, PCI_BASE_CLASS_DISPLAY, 16);
 }
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
 			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
···
 			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
 			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
+
+/*
+ * Create device link for NVIDIA GPU with integrated USB xHCI Host
+ * controller to VGA.
+ */
+static void quirk_gpu_usb(struct pci_dev *usb)
+{
+	pci_create_device_link(usb, 2, 0, PCI_BASE_CLASS_DISPLAY, 16);
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_CLASS_SERIAL_USB, 8, quirk_gpu_usb);
+
+/*
+ * Create device link for NVIDIA GPU with integrated Type-C UCSI controller
+ * to VGA. Currently there is no class code defined for UCSI device over PCI
+ * so using UNKNOWN class for now and it will be updated when UCSI
+ * over PCI gets a class code.
+ */
+#define PCI_CLASS_SERIAL_UNKNOWN	0x0c80
+static void quirk_gpu_usb_typec_ucsi(struct pci_dev *ucsi)
+{
+	pci_create_device_link(ucsi, 3, 0, PCI_BASE_CLASS_DISPLAY, 16);
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_CLASS_SERIAL_UNKNOWN, 8,
+			      quirk_gpu_usb_typec_ucsi);
+
+/*
+ * Enable the NVIDIA GPU integrated HDA controller if the BIOS left it
+ * disabled.  https://devtalk.nvidia.com/default/topic/1024022
+ */
+static void quirk_nvidia_hda(struct pci_dev *gpu)
+{
+	u8 hdr_type;
+	u32 val;
+
+	/* There was no integrated HDA controller before MCP89 */
+	if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M)
+		return;
+
+	/* Bit 25 at offset 0x488 enables the HDA controller */
+	pci_read_config_dword(gpu, 0x488, &val);
+	if (val & BIT(25))
+		return;
+
+	pci_info(gpu, "Enabling HDA controller\n");
+	pci_write_config_dword(gpu, 0x488, val | BIT(25));
+
+	/* The GPU becomes a multi-function device when the HDA is enabled */
+	pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type);
+	gpu->multifunction = !!(hdr_type & 0x80);
+}
+DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			       PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
+DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+				     PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
 
 /*
  * Some IDT switches incorrectly flag an ACS Source Validation error on
+33 -27
drivers/pci/setup-bus.c
···
 					   enum enable_type enable_local)
 {
 	bool unassigned = false;
+	struct pci_host_bridge *host;
 
 	if (enable_local != undefined)
 		return enable_local;
+
+	host = pci_find_host_bridge(bus);
+	if (host->preserve_config)
+		return auto_disabled;
 
 	pci_walk_bus(bus, iov_resources_unassigned, &unassigned);
 	if (unassigned)
···
 			available_mmio_pref);
 
 	/*
-	 * Calculate the total amount of extra resource space we can
-	 * pass to bridges below this one.  This is basically the
-	 * extra space reduced by the minimal required space for the
-	 * non-hotplug bridges.
-	 */
-	remaining_io = available_io;
-	remaining_mmio = available_mmio;
-	remaining_mmio_pref = available_mmio_pref;
-
-	/*
 	 * Calculate how many hotplug bridges and normal bridges there
 	 * are on this bus.  We will distribute the additional available
 	 * resources between hotplug bridges.
···
 		else
 			normal_bridges++;
 	}
+
+	/*
+	 * There is only one bridge on the bus so it gets all available
+	 * resources which it can then distribute to the possible hotplug
+	 * bridges below.
+	 */
+	if (hotplug_bridges + normal_bridges == 1) {
+		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
+		if (dev->subordinate) {
+			pci_bus_distribute_available_resources(dev->subordinate,
+				add_list, available_io, available_mmio,
+				available_mmio_pref);
+		}
+		return;
+	}
+
+	if (hotplug_bridges == 0)
+		return;
+
+	/*
+	 * Calculate the total amount of extra resource space we can
+	 * pass to bridges below this one.  This is basically the
+	 * extra space reduced by the minimal required space for the
+	 * non-hotplug bridges.
+	 */
+	remaining_io = available_io;
+	remaining_mmio = available_mmio;
+	remaining_mmio_pref = available_mmio_pref;
 
 	for_each_pci_bridge(dev, bus) {
 		const struct resource *res;
···
 	}
 
 	/*
-	 * There is only one bridge on the bus so it gets all available
-	 * resources which it can then distribute to the possible hotplug
-	 * bridges below.
-	 */
-	if (hotplug_bridges + normal_bridges == 1) {
-		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
-		if (dev->subordinate) {
-			pci_bus_distribute_available_resources(dev->subordinate,
-				add_list, available_io, available_mmio,
-				available_mmio_pref);
-		}
-		return;
-	}
-
-	/*
 	 * Go over devices on this bus and distribute the remaining
 	 * resource space between hotplug bridges.
 	 */
···
 		 * Distribute available extra resources equally between
 		 * hotplug-capable downstream ports taking alignment into
 		 * account.
-		 *
-		 * Here hotplug_bridges is always != 0.
 		 */
 		align = pci_resource_alignment(bridge, io_res);
 		io = div64_ul(available_io, hotplug_bridges);
+1
drivers/pci/slot.c
···
 	"5.0 GT/s PCIe",	/* 0x15 */
 	"8.0 GT/s PCIe",	/* 0x16 */
 	"16.0 GT/s PCIe",	/* 0x17 */
+	"32.0 GT/s PCIe",	/* 0x18 */
 };
 
 static ssize_t bus_speed_read(enum pci_bus_speed speed, char *buf)
+1 -1
drivers/power/supply/power_supply_core.c
···
 
 	/* The property and field names below must correspond to elements
 	 * in enum power_supply_property. For reasoning, see
-	 * Documentation/power/power_supply_class.txt.
+	 * Documentation/power/power_supply_class.rst.
 	 */
 
 	of_property_read_u32(battery_np, "energy-full-design-microwatt-hours",
+1
drivers/soc/tegra/pmc.c
···
 
 	return tegra_powergate_set(pmc, id, true);
 }
+EXPORT_SYMBOL(tegra_powergate_power_on);
 
 /**
  * tegra_powergate_power_off() - power off partition
+1 -1
include/linux/interrupt.h
···
 *                irq line disabled until the threaded handler has been run.
 * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend.  Does not guarantee
 *                   that this interrupt will wake the system from a suspended
- *                   state.  See Documentation/power/suspend-and-interrupts.txt
+ *                   state.  See Documentation/power/suspend-and-interrupts.rst
 * IRQF_FORCE_RESUME - Force enable it on resume even if IRQF_NO_SUSPEND is set
 * IRQF_NO_THREAD - Interrupt cannot be threaded
 * IRQF_EARLY_RESUME - Resume IRQ early during syscore instead of at device
+24 -5
include/linux/mod_devicetable.h
···
 
 #define PCI_ANY_ID (~0)
 
+/**
+ * struct pci_device_id - PCI device ID structure
+ * @vendor:		Vendor ID to match (or PCI_ANY_ID)
+ * @device:		Device ID to match (or PCI_ANY_ID)
+ * @subvendor:		Subsystem vendor ID to match (or PCI_ANY_ID)
+ * @subdevice:		Subsystem device ID to match (or PCI_ANY_ID)
+ * @class:		Device class, subclass, and "interface" to match.
+ *			See Appendix D of the PCI Local Bus Spec or
+ *			include/linux/pci_ids.h for a full list of classes.
+ *			Most drivers do not need to specify class/class_mask
+ *			as vendor/device is normally sufficient.
+ * @class_mask:		Limit which sub-fields of the class field are compared.
+ *			See drivers/scsi/sym53c8xx_2/ for example of usage.
+ * @driver_data:	Data private to the driver.
+ *			Most drivers don't need to use driver_data field.
+ *			Best practice is to use driver_data as an index
+ *			into a static list of equivalent device types,
+ *			instead of using it as a pointer.
+ */
 struct pci_device_id {
 	__u32 vendor, device;		/* Vendor and device ID or PCI_ANY_ID*/
 	__u32 subvendor, subdevice;	/* Subsystem ID's or PCI_ANY_ID */
···
 	__u16		match_flags;
 
 	__u16		manf_id;
-	__u16 		card_id;
+	__u16		card_id;
 
-	__u8 		func_id;
+	__u8		func_id;
 
 	/* for real multi-function devices */
-	__u8 		function;
+	__u8		function;
 
 	/* for pseudo multi-function devices */
-	__u8 		device_no;
+	__u8		device_no;
 
-	__u32 		prod_id_hash[4];
+	__u32		prod_id_hash[4];
 
 	/* not matched against in kernelspace */
 	const char *	prod_id[4];
+4 -3
include/linux/pci-acpi.h
···
 #endif
 
 extern const guid_t pci_acpi_dsm_guid;
-#define DEVICE_LABEL_DSM	0x07
-#define RESET_DELAY_DSM		0x08
-#define FUNCTION_DELAY_DSM	0x09
+#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
+#define DEVICE_LABEL_DSM		0x07
+#define RESET_DELAY_DSM			0x08
+#define FUNCTION_DELAY_DSM		0x09
 
 #else /* CONFIG_ACPI */
 static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
+51 -2
include/linux/pci.h
···
 #define PCI_PM_BUS_WAIT	50
 
 /**
+ * typedef pci_channel_state_t
+ *
 * The pci_channel state describes connectivity between the CPU and
 * the PCI device.  If some PCI bus between here and the PCI device
 * has crashed or locked up, this info is reflected here.
···
 	PCIE_SPEED_5_0GT		= 0x15,
 	PCIE_SPEED_8_0GT		= 0x16,
 	PCIE_SPEED_16_0GT		= 0x17,
+	PCIE_SPEED_32_0GT		= 0x18,
 	PCI_SPEED_UNKNOWN		= 0xff,
 };
···
 
 	unsigned int	is_busmaster:1;		/* Is busmaster */
 	unsigned int	no_msi:1;		/* May not use MSI */
-	unsigned int	no_64bit_msi:1; 	/* May only use 32-bit MSIs */
+	unsigned int	no_64bit_msi:1;		/* May only use 32-bit MSIs */
 	unsigned int	block_cfg_access:1;	/* Config space access blocked */
 	unsigned int	broken_parity_status:1;	/* Generates false positive parity */
 	unsigned int	irq_reroute_variant:2;	/* Needs IRQ rerouting variant */
···
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
 	unsigned int	native_pme:1;		/* OS may use PCIe PME */
 	unsigned int	native_ltr:1;		/* OS may use PCIe LTR */
+	unsigned int	preserve_config:1;	/* Preserve FW resource setup */
+
 	/* Resource alignment requirements */
 	resource_size_t (*align_resource)(struct pci_dev *dev,
 			const struct resource *res,
···
 
 
 struct module;
+
+/**
+ * struct pci_driver - PCI driver structure
+ * @node:	List of driver structures.
+ * @name:	Driver name.
+ * @id_table:	Pointer to table of device IDs the driver is
+ *		interested in.  Most drivers should export this
+ *		table using MODULE_DEVICE_TABLE(pci,...).
+ * @probe:	This probing function gets called (during execution
+ *		of pci_register_driver() for already existing
+ *		devices or later if a new device gets inserted) for
+ *		all PCI devices which match the ID table and are not
+ *		"owned" by the other drivers yet.  This function gets
+ *		passed a "struct pci_dev \*" for each device whose
+ *		entry in the ID table matches the device.  The probe
+ *		function returns zero when the driver chooses to
+ *		take "ownership" of the device or an error code
+ *		(negative number) otherwise.
+ *		The probe function always gets called from process
+ *		context, so it can sleep.
+ * @remove:	The remove() function gets called whenever a device
+ *		being handled by this driver is removed (either during
+ *		deregistration of the driver or when it's manually
+ *		pulled out of a hot-pluggable slot).
+ *		The remove function always gets called from process
+ *		context, so it can sleep.
+ * @suspend:	Put device into low power state.
+ * @suspend_late: Put device into low power state.
+ * @resume_early: Wake device from low power state.
+ * @resume:	Wake device from low power state.
+ *		(Please see Documentation/power/pci.rst for descriptions
+ *		of PCI Power Management and the related functions.)
+ * @shutdown:	Hook into reboot_notifier_list (kernel/sys.c).
+ *		Intended to stop any idling DMA operations.
+ *		Useful for enabling wake-on-lan (NIC) or changing
+ *		the power state of a device before reboot.
+ *		e.g. drivers/net/e100.c.
+ * @sriov_configure: Optional driver callback to allow configuration of
+ *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @err_handler: See Documentation/PCI/pci-error-recovery.rst
+ * @groups:	Sysfs attribute groups.
+ * @driver:	Driver model structure.
+ * @dynids:	List of dynamically added device IDs.
+ */
 struct pci_driver {
 	struct list_head	node;
 	const char		*name;
···
 
 /**
 * pci_vpd_info_field_size - Extracts the information field length
- * @lrdt: Pointer to the beginning of an information field header
+ * @info_field: Pointer to the beginning of an information field header
 *
 * Returns the extracted information field length.
 */
+4 -3
include/linux/pci_ids.h
···
 
 #define PCI_VENDOR_ID_AL		0x10b9
 #define PCI_DEVICE_ID_AL_M1533		0x1533
-#define PCI_DEVICE_ID_AL_M1535 	0x1535
+#define PCI_DEVICE_ID_AL_M1535		0x1535
 #define PCI_DEVICE_ID_AL_M1541		0x1541
 #define PCI_DEVICE_ID_AL_M1563		0x1563
 #define PCI_DEVICE_ID_AL_M1621		0x1621
···
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS	0x0752
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE		0x0759
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS		0x07D8
+#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M		0x08A0
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS		0x0AA2
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA		0x0D85
···
 #define PCI_VENDOR_ID_STALLION		0x124d
 
 /* Allied Telesyn */
-#define PCI_VENDOR_ID_AT 	0x1259
+#define PCI_VENDOR_ID_AT	0x1259
 #define PCI_SUBDEVICE_ID_AT_2700FX	0x2701
 #define PCI_SUBDEVICE_ID_AT_2701FX	0x2703
···
 #define PCI_DEVICE_ID_KORENIX_JETCARDF2	0x1700
 #define PCI_DEVICE_ID_KORENIX_JETCARDF3	0x17ff
 
-#define PCI_VENDOR_ID_HUAWEI 	0x19e5
+#define PCI_VENDOR_ID_HUAWEI		0x19e5
 
 #define PCI_VENDOR_ID_NETRONOME		0x19ee
 #define PCI_DEVICE_ID_NETRONOME_NFP4000	0x4000
+1 -1
include/linux/pm.h
···
 * actions to be performed by a device driver's callbacks generally depend on
 * the platform and subsystem the device belongs to.
 *
- * Refer to Documentation/power/runtime_pm.txt for more information about the
+ * Refer to Documentation/power/runtime_pm.rst for more information about the
 * role of the @runtime_suspend(), @runtime_resume() and @runtime_idle()
 * callbacks in device runtime power management.
 */
+4
include/uapi/linux/pci_regs.h
···
 #define  PCI_EXP_LNKCAP_SLS_5_0GB 0x00000002 /* LNKCAP2 SLS Vector bit 1 */
 #define  PCI_EXP_LNKCAP_SLS_8_0GB 0x00000003 /* LNKCAP2 SLS Vector bit 2 */
 #define  PCI_EXP_LNKCAP_SLS_16_0GB 0x00000004 /* LNKCAP2 SLS Vector bit 3 */
+#define  PCI_EXP_LNKCAP_SLS_32_0GB 0x00000005 /* LNKCAP2 SLS Vector bit 4 */
 #define  PCI_EXP_LNKCAP_MLW	0x000003f0 /* Maximum Link Width */
 #define  PCI_EXP_LNKCAP_ASPMS	0x00000c00 /* ASPM Support */
 #define  PCI_EXP_LNKCAP_L0SEL	0x00007000 /* L0s Exit Latency */
···
 #define  PCI_EXP_LNKSTA_CLS_5_0GB 0x0002 /* Current Link Speed 5.0GT/s */
 #define  PCI_EXP_LNKSTA_CLS_8_0GB 0x0003 /* Current Link Speed 8.0GT/s */
 #define  PCI_EXP_LNKSTA_CLS_16_0GB 0x0004 /* Current Link Speed 16.0GT/s */
+#define  PCI_EXP_LNKSTA_CLS_32_0GB 0x0005 /* Current Link Speed 32.0GT/s */
 #define  PCI_EXP_LNKSTA_NLW	0x03f0	/* Negotiated Link Width */
 #define  PCI_EXP_LNKSTA_NLW_X1	0x0010	/* Current Link Width x1 */
 #define  PCI_EXP_LNKSTA_NLW_X2	0x0020	/* Current Link Width x2 */
···
 #define  PCI_EXP_LNKCAP2_SLS_5_0GB	0x00000004 /* Supported Speed 5GT/s */
 #define  PCI_EXP_LNKCAP2_SLS_8_0GB	0x00000008 /* Supported Speed 8GT/s */
 #define  PCI_EXP_LNKCAP2_SLS_16_0GB	0x00000010 /* Supported Speed 16GT/s */
+#define  PCI_EXP_LNKCAP2_SLS_32_0GB	0x00000020 /* Supported Speed 32GT/s */
 #define  PCI_EXP_LNKCAP2_CROSSLINK	0x00000100 /* Crosslink supported */
 #define PCI_EXP_LNKCTL2		48	/* Link Control 2 */
 #define  PCI_EXP_LNKCTL2_TLS		0x000f
···
 #define  PCI_EXP_LNKCTL2_TLS_5_0GT	0x0002 /* Supported Speed 5GT/s */
 #define  PCI_EXP_LNKCTL2_TLS_8_0GT	0x0003 /* Supported Speed 8GT/s */
 #define  PCI_EXP_LNKCTL2_TLS_16_0GT	0x0004 /* Supported Speed 16GT/s */
+#define  PCI_EXP_LNKCTL2_TLS_32_0GT	0x0005 /* Supported Speed 32GT/s */
 #define PCI_EXP_LNKSTA2		50	/* Link Status 2 */
 #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	52	/* v2 endpoints with link end here */
 #define PCI_EXP_SLTCAP2		52	/* Slot Capabilities 2 */
+3 -3
kernel/power/Kconfig
···
 	  need to run mkswap against the swap partition used for the suspend.
 
 	  It also works with swap files to a limited extent (for details see
-	  <file:Documentation/power/swsusp-and-swap-files.txt>).
+	  <file:Documentation/power/swsusp-and-swap-files.rst>).
 
 	  Right now you may boot without resuming and resume later but in the
 	  meantime you cannot use the swap partition(s)/file(s) involved in
···
 	  MOUNT any journaled filesystems mounted before the suspend or they
 	  will get corrupted in a nasty way.
 
-	  For more information take a look at <file:Documentation/power/swsusp.txt>.
+	  For more information take a look at <file:Documentation/power/swsusp.rst>.
 
 config ARCH_SAVE_PAGE_KEYS
 	bool
···
 	  notification of APM "events" (e.g. battery status change).
 
 	  In order to use APM, you will need supporting software. For location
-	  and more information, read <file:Documentation/power/apm-acpi.txt>
+	  and more information, read <file:Documentation/power/apm-acpi.rst>
 	  and the Battery Powered Linux mini-HOWTO, available from
 	  <http://www.tldp.org/docs.html#howto>.
 
+1 -1
net/wireless/Kconfig
···
 
 	  If this causes your applications to misbehave you should fix your
 	  applications instead -- they need to register their network
-	  latency requirement, see Documentation/power/pm_qos_interface.txt.
+	  latency requirement, see Documentation/power/pm_qos_interface.rst.
 
 config CFG80211_DEBUGFS
 	bool "cfg80211 DebugFS entries"
+2 -3
tools/pci/Makefile
···
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
 SCRIPTS := pcitest.sh
-ALL_SCRIPTS := $(patsubst %,$(OUTPUT)%,$(SCRIPTS))
 
 all: $(ALL_PROGRAMS)
···
 
 install: $(ALL_PROGRAMS)
 	install -d -m 755 $(DESTDIR)$(bindir);		\
-	for program in $(ALL_PROGRAMS) pcitest.sh; do	\
+	for program in $(ALL_PROGRAMS); do		\
 		install $$program $(DESTDIR)$(bindir);	\
 	done;						\
-	for script in $(ALL_SCRIPTS); do		\
+	for script in $(SCRIPTS); do			\
 		install $$script $(DESTDIR)$(bindir);	\
 	done
+4 -4
tools/pci/pcitest.c
···
 	unsigned long size;
 };
 
-static void run_test(struct pci_test *test)
+static int run_test(struct pci_test *test)
 {
-	long ret;
+	int ret = -EINVAL;
 	int fd;
 
 	fd = open(test->device, O_RDWR);
 	if (fd < 0) {
 		perror("can't open PCI Endpoint Test device");
-		return;
+		return -ENODEV;
 	}
 
 	if (test->barnum >= 0 && test->barnum <= 5) {
···
 		"\t-r			Read buffer test\n"
 		"\t-w			Write buffer test\n"
 		"\t-c			Copy buffer test\n"
-		"\t-s <size>		Size of buffer {default: 100KB}\n",
+		"\t-s <size>		Size of buffer {default: 100KB}\n"
 		"\t-h			Print this help message\n",
 		argv[0]);
 	return -EINVAL;