Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
"These add hybrid processors support to the intel_pstate driver and
make it work with more processor models when HWP is disabled, make the
intel_idle driver use special C6 idle state parameters when package
C-states are disabled, add cooling support to the tegra30 devfreq
driver, rework the TEO (timer events oriented) cpuidle governor,
extend the OPP (operating performance points) framework to use the
required-opps DT property in more cases, fix some issues and clean up
a number of assorted pieces of code.

Specifics:

- Make intel_pstate support hybrid processors using abstract
performance units in the HWP interface (Rafael Wysocki).

- Add Icelake servers and Cometlake support in no-HWP mode to
intel_pstate (Giovanni Gherdovich).

- Make cpufreq_online() error path be consistent with the CPU device
removal path in cpufreq (Rafael Wysocki).

- Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
Randy Dunlap, Shaokun Zhang).

- Make intel_idle use special idle state parameters for C6 when
package C-states are disabled (Chen Yu).

- Rework the TEO (timer events oriented) cpuidle governor to address
some theoretical shortcomings in it (Rafael Wysocki).

- Drop unneeded semicolon from the TEO governor (Wan Jiabing).

- Modify the runtime PM framework to accept unassigned suspend and
resume callback pointers (Ulf Hansson).

- Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

- Improve device performance states support in the generic power
domains (genpd) framework (Ulf Hansson).

- Fix some documentation issues in genpd (Yang Yingliang).

- Make the operating performance points (OPP) framework use the
required-opps DT property in use cases that are not related to
genpd (Hsin-Yi Wang).

- Make lazy_link_required_opp_table() use list_del_init instead of
list_del/INIT_LIST_HEAD (Yang Yingliang).

- Simplify wake IRQs handling in the core system-wide sleep support
code and clean up some coding style inconsistencies in it (Tian
Tao, Zhen Lei).

- Add cooling support to the tegra30 devfreq driver and improve its
DT bindings (Dmitry Osipenko).

- Fix some assorted issues in the devfreq core and drivers (Chanwoo
Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
PM / devfreq: passive: Fix get_target_freq when not using required-opp
cpufreq: Make cpufreq_online() call driver->offline() on errors
opp: Allow required-opps to be used for non genpd use cases
cpuidle: teo: remove unneeded semicolon in teo_select()
dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
dt-bindings: devfreq: tegra30-actmon: Convert to schema
PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
PM: runtime: Clarify documentation when callbacks are unassigned
PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
PM: runtime: Improve path in rpm_idle() when no callback
PM: hibernate: remove leading spaces before tabs
PM: sleep: remove trailing spaces and tabs
PM: domains: Drop/restore performance state votes for devices at runtime PM
PM: domains: Return early if perf state is already set for the device
PM: domains: Split code in dev_pm_genpd_set_performance_state()
cpuidle: teo: Use kerneldoc documentation in admin-guide
cpuidle: teo: Rework most recent idle duration values treatment
cpuidle: teo: Change the main idle state selection logic
cpuidle: teo: Cosmetic modification of teo_select()
cpuidle: teo: Cosmetic modifications of teo_update()
...

+779 -485
+2 -75
Documentation/admin-guide/pm/cpuidle.rst
··· 347 347 <menu-gov_>`_: it always tries to find the deepest idle state suitable for the 348 348 given conditions. However, it applies a different approach to that problem. 349 349 350 - First, it does not use sleep length correction factors, but instead it attempts 351 - to correlate the observed idle duration values with the available idle states 352 - and use that information to pick up the idle state that is most likely to 353 - "match" the upcoming CPU idle interval. Second, it does not take the tasks 354 - that were running on the given CPU in the past and are waiting on some I/O 355 - operations to complete now at all (there is no guarantee that they will run on 356 - the same CPU when they become runnable again) and the pattern detection code in 357 - it avoids taking timer wakeups into account. It also only uses idle duration 358 - values less than the current time till the closest timer (with the scheduler 359 - tick excluded) for that purpose. 360 - 361 - Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain 362 - the *sleep length*, which is the time until the closest timer event with the 363 - assumption that the scheduler tick will be stopped (that also is the upper bound 364 - on the time until the next CPU wakeup). That value is then used to preselect an 365 - idle state on the basis of three metrics maintained for each idle state provided 366 - by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``. 367 - 368 - The ``hits`` and ``misses`` metrics measure the likelihood that a given idle 369 - state will "match" the observed (post-wakeup) idle duration if it "matches" the 370 - sleep length. 
They both are subject to decay (after a CPU wakeup) every time 371 - the target residency of the idle state corresponding to them is less than or 372 - equal to the sleep length and the target residency of the next idle state is 373 - greater than the sleep length (that is, when the idle state corresponding to 374 - them "matches" the sleep length). The ``hits`` metric is increased if the 375 - former condition is satisfied and the target residency of the given idle state 376 - is less than or equal to the observed idle duration and the target residency of 377 - the next idle state is greater than the observed idle duration at the same time 378 - (that is, it is increased when the given idle state "matches" both the sleep 379 - length and the observed idle duration). In turn, the ``misses`` metric is 380 - increased when the given idle state "matches" the sleep length only and the 381 - observed idle duration is too short for its target residency. 382 - 383 - The ``early_hits`` metric measures the likelihood that a given idle state will 384 - "match" the observed (post-wakeup) idle duration if it does not "match" the 385 - sleep length. It is subject to decay on every CPU wakeup and it is increased 386 - when the idle state corresponding to it "matches" the observed (post-wakeup) 387 - idle duration and the target residency of the next idle state is less than or 388 - equal to the sleep length (i.e. the idle state "matching" the sleep length is 389 - deeper than the given one). 390 - 391 - The governor walks the list of idle states provided by the ``CPUIdle`` driver 392 - and finds the last (deepest) one with the target residency less than or equal 393 - to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle 394 - state are compared with each other and it is preselected if the ``hits`` one is 395 - greater (which means that that idle state is likely to "match" the observed idle 396 - duration after CPU wakeup). 
If the ``misses`` one is greater, the governor 397 - preselects the shallower idle state with the maximum ``early_hits`` metric 398 - (or if there are multiple shallower idle states with equal ``early_hits`` 399 - metric which also is the maximum, the shallowest of them will be preselected). 400 - [If there is a wakeup latency constraint coming from the `PM QoS framework 401 - <cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the 402 - target residency within the sleep length, the deepest idle state with the exit 403 - latency within the constraint is preselected without consulting the ``hits``, 404 - ``misses`` and ``early_hits`` metrics.] 405 - 406 - Next, the governor takes several idle duration values observed most recently 407 - into consideration and if at least a half of them are greater than or equal to 408 - the target residency of the preselected idle state, that idle state becomes the 409 - final candidate to ask for. Otherwise, the average of the most recent idle 410 - duration values below the target residency of the preselected idle state is 411 - computed and the governor walks the idle states shallower than the preselected 412 - one and finds the deepest of them with the target residency within that average. 413 - That idle state is then taken as the final candidate to ask for. 414 - 415 - Still, at this point the governor may need to refine the idle state selection if 416 - it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That 417 - generally happens if the target residency of the idle state selected so far is 418 - less than the tick period and the tick has not been stopped already (in a 419 - previous iteration of the idle loop). 
Then, like in the ``menu`` governor 420 - `case <menu-gov_>`_, the sleep length used in the previous computations may not 421 - reflect the real time until the closest timer event and if it really is greater 422 - than that time, a shallower state with a suitable target residency may need to 423 - be selected. 424 - 350 + .. kernel-doc:: drivers/cpuidle/governors/teo.c 351 + :doc: teo-description 425 352 426 353 .. _idle-states-representation: 427 354
+6
Documentation/admin-guide/pm/intel_pstate.rst
··· 365 365 inclusive) including both turbo and non-turbo P-states (see 366 366 `Turbo P-states Support`_). 367 367 368 + This attribute is present only if the value exposed by it is the same 369 + for all of the CPUs in the system. 370 + 368 371 The value of this attribute is not affected by the ``no_turbo`` 369 372 setting described `below <no_turbo_attr_>`_. 370 373 ··· 376 373 ``turbo_pct`` 377 374 Ratio of the `turbo range <turbo_>`_ size to the size of the entire 378 375 range of supported P-states, in percent. 376 + 377 + This attribute is present only if the value exposed by it is the same 378 + for all of the CPUs in the system. 379 379 380 380 This attribute is read-only. 381 381
-57
Documentation/devicetree/bindings/arm/tegra/nvidia,tegra30-actmon.txt
··· 1 - NVIDIA Tegra Activity Monitor 2 - 3 - The activity monitor block collects statistics about the behaviour of other 4 - components in the system. This information can be used to derive the rate at 5 - which the external memory needs to be clocked in order to serve all requests 6 - from the monitored clients. 7 - 8 - Required properties: 9 - - compatible: should be "nvidia,tegra<chip>-actmon" 10 - - reg: offset and length of the register set for the device 11 - - interrupts: standard interrupt property 12 - - clocks: Must contain a phandle and clock specifier pair for each entry in 13 - clock-names. See ../../clock/clock-bindings.txt for details. 14 - - clock-names: Must include the following entries: 15 - - actmon 16 - - emc 17 - - resets: Must contain an entry for each entry in reset-names. See 18 - ../../reset/reset.txt for details. 19 - - reset-names: Must include the following entries: 20 - - actmon 21 - - operating-points-v2: See ../bindings/opp/opp.txt for details. 22 - - interconnects: Should contain entries for memory clients sitting on 23 - MC->EMC memory interconnect path. 24 - - interconnect-names: Should include name of the interconnect path for each 25 - interconnect entry. Consult TRM documentation for 26 - information about available memory clients, see MEMORY 27 - CONTROLLER section. 28 - 29 - For each opp entry in 'operating-points-v2' table: 30 - - opp-supported-hw: bitfield indicating SoC speedo ID mask 31 - - opp-peak-kBps: peak bandwidth of the memory channel 32 - 33 - Example: 34 - dfs_opp_table: opp-table { 35 - compatible = "operating-points-v2"; 36 - 37 - opp@12750000 { 38 - opp-hz = /bits/ 64 <12750000>; 39 - opp-supported-hw = <0x000F>; 40 - opp-peak-kBps = <51000>; 41 - }; 42 - ... 
43 - }; 44 - 45 - actmon@6000c800 { 46 - compatible = "nvidia,tegra124-actmon"; 47 - reg = <0x0 0x6000c800 0x0 0x400>; 48 - interrupts = <GIC_SPI 45 IRQ_TYPE_LEVEL_HIGH>; 49 - clocks = <&tegra_car TEGRA124_CLK_ACTMON>, 50 - <&tegra_car TEGRA124_CLK_EMC>; 51 - clock-names = "actmon", "emc"; 52 - resets = <&tegra_car 119>; 53 - reset-names = "actmon"; 54 - operating-points-v2 = <&dfs_opp_table>; 55 - interconnects = <&mc TEGRA124_MC_MPCORER &emc>; 56 - interconnect-names = "cpu"; 57 - };
+126
Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml
··· 1 + # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) 2 + %YAML 1.2 3 + --- 4 + $id: http://devicetree.org/schemas/devfreq/nvidia,tegra30-actmon.yaml# 5 + $schema: http://devicetree.org/meta-schemas/core.yaml# 6 + 7 + title: NVIDIA Tegra30 Activity Monitor 8 + 9 + maintainers: 10 + - Dmitry Osipenko <digetx@gmail.com> 11 + - Jon Hunter <jonathanh@nvidia.com> 12 + - Thierry Reding <thierry.reding@gmail.com> 13 + 14 + description: | 15 + The activity monitor block collects statistics about the behaviour of other 16 + components in the system. This information can be used to derive the rate at 17 + which the external memory needs to be clocked in order to serve all requests 18 + from the monitored clients. 19 + 20 + properties: 21 + compatible: 22 + enum: 23 + - nvidia,tegra30-actmon 24 + - nvidia,tegra114-actmon 25 + - nvidia,tegra124-actmon 26 + - nvidia,tegra210-actmon 27 + 28 + reg: 29 + maxItems: 1 30 + 31 + clocks: 32 + maxItems: 2 33 + 34 + clock-names: 35 + items: 36 + - const: actmon 37 + - const: emc 38 + 39 + resets: 40 + maxItems: 1 41 + 42 + reset-names: 43 + items: 44 + - const: actmon 45 + 46 + interrupts: 47 + maxItems: 1 48 + 49 + interconnects: 50 + minItems: 1 51 + maxItems: 12 52 + 53 + interconnect-names: 54 + minItems: 1 55 + maxItems: 12 56 + description: 57 + Should include name of the interconnect path for each interconnect 58 + entry. Consult TRM documentation for information about available 59 + memory clients, see MEMORY CONTROLLER and ACTIVITY MONITOR sections. 60 + 61 + operating-points-v2: 62 + description: 63 + Should contain freqs and voltages and opp-supported-hw property, which 64 + is a bitfield indicating SoC speedo ID mask. 
65 + 66 + "#cooling-cells": 67 + const: 2 68 + 69 + required: 70 + - compatible 71 + - reg 72 + - clocks 73 + - clock-names 74 + - resets 75 + - reset-names 76 + - interrupts 77 + - interconnects 78 + - interconnect-names 79 + - operating-points-v2 80 + - "#cooling-cells" 81 + 82 + additionalProperties: false 83 + 84 + examples: 85 + - | 86 + #include <dt-bindings/memory/tegra30-mc.h> 87 + 88 + mc: memory-controller@7000f000 { 89 + compatible = "nvidia,tegra30-mc"; 90 + reg = <0x7000f000 0x400>; 91 + clocks = <&clk 32>; 92 + clock-names = "mc"; 93 + 94 + interrupts = <0 77 4>; 95 + 96 + #iommu-cells = <1>; 97 + #reset-cells = <1>; 98 + #interconnect-cells = <1>; 99 + }; 100 + 101 + emc: external-memory-controller@7000f400 { 102 + compatible = "nvidia,tegra30-emc"; 103 + reg = <0x7000f400 0x400>; 104 + interrupts = <0 78 4>; 105 + clocks = <&clk 57>; 106 + 107 + nvidia,memory-controller = <&mc>; 108 + operating-points-v2 = <&dvfs_opp_table>; 109 + power-domains = <&domain>; 110 + 111 + #interconnect-cells = <0>; 112 + }; 113 + 114 + actmon@6000c800 { 115 + compatible = "nvidia,tegra30-actmon"; 116 + reg = <0x6000c800 0x400>; 117 + interrupts = <0 45 4>; 118 + clocks = <&clk 119>, <&clk 57>; 119 + clock-names = "actmon", "emc"; 120 + resets = <&rst 119>; 121 + reset-names = "actmon"; 122 + operating-points-v2 = <&dvfs_opp_table>; 123 + interconnects = <&mc TEGRA30_MC_MPCORER &emc>; 124 + interconnect-names = "cpu-read"; 125 + #cooling-cells = <2>; 126 + };
+14 -1
Documentation/power/runtime_pm.rst
··· 378 378 379 379 `int pm_runtime_get_sync(struct device *dev);` 380 380 - increment the device's usage counter, run pm_runtime_resume(dev) and 381 - return its result 381 + return its result; 382 + note that it does not drop the device's usage counter on errors, so 383 + consider using pm_runtime_resume_and_get() instead of it, especially 384 + if its return value is checked by the caller, as this is likely to 385 + result in cleaner code. 382 386 383 387 `int pm_runtime_get_if_in_use(struct device *dev);` 384 388 - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the ··· 830 826 or driver about runtime power changes. Instead, the driver for the device's 831 827 parent must take responsibility for telling the device's driver when the 832 828 parent's power state changes. 829 + 830 + Note that, in some cases it may not be desirable for subsystems/drivers to call 831 + pm_runtime_no_callbacks() for their devices. This could be because a subset of 832 + the runtime PM callbacks needs to be implemented, a platform dependent PM 833 + domain could get attached to the device or that the device is power managed 834 + through a supplier device link. For these reasons and to avoid boilerplate code 835 + in subsystems/drivers, the PM core allows runtime PM callbacks to be 836 + unassigned. More precisely, if a callback pointer is NULL, the PM core will act 837 + as though there was a callback and it returned 0. 833 838 834 839 9. Autosuspend, or automatically-delayed suspends 835 840 =================================================
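The runtime_pm.rst change above documents why pm_runtime_resume_and_get() usually yields cleaner callers than pm_runtime_get_sync(): the latter leaves the usage counter bumped even when resume fails. A minimal userspace model of that counter semantics (toy names, not the kernel API; the real code lives in drivers/base/power/runtime.c):

```c
#include <assert.h>

/* Toy stand-in for a device's runtime PM state. */
struct toy_dev {
	int usage_count;
	int resume_error;	/* error the modeled resume step will return */
};

/* Models pm_runtime_get_sync(): the counter stays bumped even on error. */
static int toy_get_sync(struct toy_dev *dev)
{
	dev->usage_count++;
	return dev->resume_error;
}

/* Models pm_runtime_put(): drop one reference. */
static void toy_put(struct toy_dev *dev)
{
	dev->usage_count--;
}

/* Models pm_runtime_resume_and_get(): drops the reference on error. */
static int toy_resume_and_get(struct toy_dev *dev)
{
	int ret = toy_get_sync(dev);

	if (ret < 0)
		toy_put(dev);

	return ret;
}
```

With toy_get_sync() the caller must remember to drop the reference on the error path itself; toy_resume_and_get() keeps the counter balanced automatically, which is the "cleaner code" the documentation update refers to.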
+49 -15
drivers/base/power/domain.c
··· 379 379 return ret; 380 380 } 381 381 382 + static int genpd_set_performance_state(struct device *dev, unsigned int state) 383 + { 384 + struct generic_pm_domain *genpd = dev_to_genpd(dev); 385 + struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev); 386 + unsigned int prev_state; 387 + int ret; 388 + 389 + prev_state = gpd_data->performance_state; 390 + if (prev_state == state) 391 + return 0; 392 + 393 + gpd_data->performance_state = state; 394 + state = _genpd_reeval_performance_state(genpd, state); 395 + 396 + ret = _genpd_set_performance_state(genpd, state, 0); 397 + if (ret) 398 + gpd_data->performance_state = prev_state; 399 + 400 + return ret; 401 + } 402 + 403 + static int genpd_drop_performance_state(struct device *dev) 404 + { 405 + unsigned int prev_state = dev_gpd_data(dev)->performance_state; 406 + 407 + if (!genpd_set_performance_state(dev, 0)) 408 + return prev_state; 409 + 410 + return 0; 411 + } 412 + 413 + static void genpd_restore_performance_state(struct device *dev, 414 + unsigned int state) 415 + { 416 + if (state) 417 + genpd_set_performance_state(dev, state); 418 + } 419 + 382 420 /** 383 421 * dev_pm_genpd_set_performance_state- Set performance state of device's power 384 422 * domain. 
··· 435 397 int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state) 436 398 { 437 399 struct generic_pm_domain *genpd; 438 - struct generic_pm_domain_data *gpd_data; 439 - unsigned int prev; 440 400 int ret; 441 401 442 402 genpd = dev_to_genpd_safe(dev); ··· 446 410 return -EINVAL; 447 411 448 412 genpd_lock(genpd); 449 - 450 - gpd_data = to_gpd_data(dev->power.subsys_data->domain_data); 451 - prev = gpd_data->performance_state; 452 - gpd_data->performance_state = state; 453 - 454 - state = _genpd_reeval_performance_state(genpd, state); 455 - ret = _genpd_set_performance_state(genpd, state, 0); 456 - if (ret) 457 - gpd_data->performance_state = prev; 458 - 413 + ret = genpd_set_performance_state(dev, state); 459 414 genpd_unlock(genpd); 460 415 461 416 return ret; ··· 599 572 * RPM status of the releated device is in an intermediate state, not yet turned 600 573 * into RPM_SUSPENDED. This means genpd_power_off() must allow one device to not 601 574 * be RPM_SUSPENDED, while it tries to power off the PM domain. 575 + * @depth: nesting count for lockdep. 602 576 * 603 577 * If all of the @genpd's devices have been suspended and all of its subdomains 604 578 * have been powered down, remove power from @genpd. 
··· 860 832 { 861 833 struct generic_pm_domain *genpd; 862 834 bool (*suspend_ok)(struct device *__dev); 863 - struct gpd_timing_data *td = &dev_gpd_data(dev)->td; 835 + struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev); 836 + struct gpd_timing_data *td = &gpd_data->td; 864 837 bool runtime_pm = pm_runtime_enabled(dev); 865 838 ktime_t time_start; 866 839 s64 elapsed_ns; ··· 918 889 return 0; 919 890 920 891 genpd_lock(genpd); 892 + gpd_data->rpm_pstate = genpd_drop_performance_state(dev); 921 893 genpd_power_off(genpd, true, 0); 922 894 genpd_unlock(genpd); 923 895 ··· 936 906 static int genpd_runtime_resume(struct device *dev) 937 907 { 938 908 struct generic_pm_domain *genpd; 939 - struct gpd_timing_data *td = &dev_gpd_data(dev)->td; 909 + struct generic_pm_domain_data *gpd_data = dev_gpd_data(dev); 910 + struct gpd_timing_data *td = &gpd_data->td; 940 911 bool runtime_pm = pm_runtime_enabled(dev); 941 912 ktime_t time_start; 942 913 s64 elapsed_ns; ··· 961 930 962 931 genpd_lock(genpd); 963 932 ret = genpd_power_on(genpd, 0); 933 + if (!ret) 934 + genpd_restore_performance_state(dev, gpd_data->rpm_pstate); 964 935 genpd_unlock(genpd); 965 936 966 937 if (ret) ··· 1001 968 err_poweroff: 1002 969 if (!pm_runtime_is_irq_safe(dev) || genpd_is_irq_safe(genpd)) { 1003 970 genpd_lock(genpd); 971 + gpd_data->rpm_pstate = genpd_drop_performance_state(dev); 1004 972 genpd_power_off(genpd, true, 0); 1005 973 genpd_unlock(genpd); 1006 974 } ··· 2539 2505 2540 2506 /** 2541 2507 * of_genpd_remove_last - Remove the last PM domain registered for a provider 2542 - * @provider: Pointer to device structure associated with provider 2508 + * @np: Pointer to device node associated with provider 2543 2509 * 2544 2510 * Find the last PM domain that was added by a particular provider and 2545 2511 * remove this PM domain from the list of PM domains. The provider is
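The genpd diff above makes runtime suspend drop a device's performance-state vote and runtime resume restore it, remembering the previous vote in rpm_pstate. A toy userspace sketch of that drop/restore contract (illustrative names, not the genpd API):

```c
#include <assert.h>

/* Toy stand-in for generic_pm_domain_data's performance-state fields. */
struct toy_gpd_data {
	unsigned int performance_state;	/* current vote */
	unsigned int rpm_pstate;	/* vote saved across runtime suspend */
};

static int toy_set_performance_state(struct toy_gpd_data *d, unsigned int state)
{
	if (d->performance_state == state)
		return 0;	/* early return if already set, as in the diff */
	d->performance_state = state;
	return 0;
}

/* Models genpd_drop_performance_state(): zero the vote, return the old one. */
static unsigned int toy_drop_performance_state(struct toy_gpd_data *d)
{
	unsigned int prev = d->performance_state;

	if (!toy_set_performance_state(d, 0))
		return prev;

	return 0;
}

/* Models genpd_restore_performance_state(): reinstate a nonzero saved vote. */
static void toy_restore_performance_state(struct toy_gpd_data *d,
					  unsigned int state)
{
	if (state)
		toy_set_performance_state(d, state);
}
```

Returning the previous state from the drop helper is what lets genpd_runtime_suspend() stash it in rpm_pstate and hand it back to the restore helper on the resume path.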
+1
drivers/base/power/domain_governor.c
··· 252 252 /** 253 253 * _default_power_down_ok - Default generic PM domain power off governor routine. 254 254 * @pd: PM domain to check. 255 + * @now: current ktime. 255 256 * 256 257 * This routine must be executed under the PM domain's lock. 257 258 */
+8 -10
drivers/base/power/runtime.c
··· 345 345 static int __rpm_callback(int (*cb)(struct device *), struct device *dev) 346 346 __releases(&dev->power.lock) __acquires(&dev->power.lock) 347 347 { 348 - int retval, idx; 348 + int retval = 0, idx; 349 349 bool use_links = dev->power.links_count > 0; 350 350 351 351 if (dev->power.irq_safe) { ··· 373 373 } 374 374 } 375 375 376 - retval = cb(dev); 376 + if (cb) 377 + retval = cb(dev); 377 378 378 379 if (dev->power.irq_safe) { 379 380 spin_lock(&dev->power.lock); ··· 447 446 /* Pending requests need to be canceled. */ 448 447 dev->power.request = RPM_REQ_NONE; 449 448 450 - if (dev->power.no_callbacks) 449 + callback = RPM_GET_CALLBACK(dev, runtime_idle); 450 + 451 + /* If no callback assume success. */ 452 + if (!callback || dev->power.no_callbacks) 451 453 goto out; 452 454 453 455 /* Carry out an asynchronous or a synchronous idle notification. */ ··· 466 462 467 463 dev->power.idle_notification = true; 468 464 469 - callback = RPM_GET_CALLBACK(dev, runtime_idle); 470 - 471 - if (callback) 472 - retval = __rpm_callback(callback, dev); 465 + retval = __rpm_callback(callback, dev); 473 466 474 467 dev->power.idle_notification = false; 475 468 wake_up_all(&dev->power.wait_queue); ··· 484 483 static int rpm_callback(int (*cb)(struct device *), struct device *dev) 485 484 { 486 485 int retval; 487 - 488 - if (!cb) 489 - return -ENOSYS; 490 486 491 487 if (dev->power.memalloc_noio) { 492 488 unsigned int noio_flag;
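The runtime.c diff above changes the core's dispatch rule: a NULL ->runtime_suspend/->runtime_resume pointer is no longer -ENOSYS but is treated as a callback that ran and returned 0. A minimal model of that rule (toy code, not the kernel's __rpm_callback()):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*rpm_cb)(void *dev);

/* Models the new rule: an unassigned callback is assumed to succeed. */
static int toy_rpm_callback(rpm_cb cb, void *dev)
{
	if (!cb)
		return 0;	/* no callback: act as though it returned 0 */

	return cb(dev);
}

/* Example assigned callback that reports the device as busy. */
static int toy_failing_suspend(void *dev)
{
	(void)dev;
	return -16;	/* mirrors -EBUSY */
}
```

This is what lets subsystems leave a subset of runtime PM callbacks unimplemented without boilerplate stub functions, as the runtime_pm.rst addition in this pull explains.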
+2 -2
drivers/base/power/wakeirq.c
··· 182 182 183 183 wirq->dev = dev; 184 184 wirq->irq = irq; 185 - irq_set_status_flags(irq, IRQ_NOAUTOEN); 186 185 187 186 /* Prevent deferred spurious wakeirqs with disable_irq_nosync() */ 188 187 irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY); ··· 191 192 * so we use a threaded irq. 192 193 */ 193 194 err = request_threaded_irq(irq, NULL, handle_threaded_wake_irq, 194 - IRQF_ONESHOT, wirq->name, wirq); 195 + IRQF_ONESHOT | IRQF_NO_AUTOEN, 196 + wirq->name, wirq); 195 197 if (err) 196 198 goto err_free_name; 197 199
+10 -1
drivers/cpufreq/cpufreq.c
··· 1367 1367 goto out_free_policy; 1368 1368 } 1369 1369 1370 + /* 1371 + * The initialization has succeeded and the policy is online. 1372 + * If there is a problem with its frequency table, take it 1373 + * offline and drop it. 1374 + */ 1370 1375 ret = cpufreq_table_validate_and_sort(policy); 1371 1376 if (ret) 1372 - goto out_exit_policy; 1377 + goto out_offline_policy; 1373 1378 1374 1379 /* related_cpus should at least include policy->cpus. */ 1375 1380 cpumask_copy(policy->related_cpus, policy->cpus); ··· 1519 1514 remove_cpu_dev_symlink(policy, get_cpu_device(j)); 1520 1515 1521 1516 up_write(&policy->rwsem); 1517 + 1518 + out_offline_policy: 1519 + if (cpufreq_driver->offline) 1520 + cpufreq_driver->offline(policy); 1522 1521 1523 1522 out_exit_policy: 1524 1523 if (cpufreq_driver->exit)
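The cpufreq.c hunk above adds an out_offline_policy unwind label so that a frequency-table failure after a successful ->online()/->init() rolls back through the driver's ->offline() callback before ->exit(), matching the CPU removal path. A toy model of that goto-based unwinding (illustrative names only):

```c
#include <assert.h>
#include <stdbool.h>

/* Records which driver callbacks the modeled online path invoked. */
struct toy_driver {
	bool online_called;
	bool offline_called;
	bool exit_called;
};

static int toy_cpufreq_online(struct toy_driver *drv, bool table_valid)
{
	int ret;

	drv->online_called = true;	/* driver init/online succeeded */

	if (!table_valid) {
		ret = -22;	/* mirrors -EINVAL from table validation */
		goto out_offline;
	}

	return 0;

out_offline:
	/* New step from the diff: offline the policy before exiting it. */
	drv->offline_called = true;
out_exit:
	drv->exit_called = true;
	return ret;
	(void)&&out_exit;	/* silence unused-label note in this sketch */
}
```

The point of the change is symmetry: every state the driver entered on the way up is left again in reverse order on the error path.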
+2 -3
drivers/cpufreq/cpufreq_stats.c
··· 211 211 212 212 void cpufreq_stats_create_table(struct cpufreq_policy *policy) 213 213 { 214 - unsigned int i = 0, count = 0, ret = -ENOMEM; 214 + unsigned int i = 0, count; 215 215 struct cpufreq_stats *stats; 216 216 unsigned int alloc_size; 217 217 struct cpufreq_frequency_table *pos; ··· 253 253 stats->last_index = freq_table_get_index(stats, policy->cur); 254 254 255 255 policy->stats = stats; 256 - ret = sysfs_create_group(&policy->kobj, &stats_attr_group); 257 - if (!ret) 256 + if (!sysfs_create_group(&policy->kobj, &stats_attr_group)) 258 257 return; 259 258 260 259 /* We failed, release resources */
+234 -29
drivers/cpufreq/intel_pstate.c
··· 121 121 * @max_pstate_physical:This is physical Max P state for a processor 122 122 * This can be higher than the max_pstate which can 123 123 * be limited by platform thermal design power limits 124 - * @scaling: Scaling factor to convert frequency to cpufreq 125 - * frequency units 124 + * @perf_ctl_scaling: PERF_CTL P-state to frequency scaling factor 125 + * @scaling: Scaling factor between performance and frequency 126 126 * @turbo_pstate: Max Turbo P state possible for this platform 127 + * @min_freq: @min_pstate frequency in cpufreq units 127 128 * @max_freq: @max_pstate frequency in cpufreq units 128 129 * @turbo_freq: @turbo_pstate frequency in cpufreq units 129 130 * ··· 135 134 int min_pstate; 136 135 int max_pstate; 137 136 int max_pstate_physical; 137 + int perf_ctl_scaling; 138 138 int scaling; 139 139 int turbo_pstate; 140 + unsigned int min_freq; 140 141 unsigned int max_freq; 141 142 unsigned int turbo_freq; 142 143 }; ··· 369 366 } 370 367 } 371 368 372 - static int intel_pstate_get_cppc_guranteed(int cpu) 369 + static int intel_pstate_get_cppc_guaranteed(int cpu) 373 370 { 374 371 struct cppc_perf_caps cppc_perf; 375 372 int ret; ··· 385 382 } 386 383 387 384 #else /* CONFIG_ACPI_CPPC_LIB */ 388 - static void intel_pstate_set_itmt_prio(int cpu) 385 + static inline void intel_pstate_set_itmt_prio(int cpu) 389 386 { 390 387 } 391 388 #endif /* CONFIG_ACPI_CPPC_LIB */ ··· 470 467 471 468 acpi_processor_unregister_performance(policy->cpu); 472 469 } 470 + 471 + static bool intel_pstate_cppc_perf_valid(u32 perf, struct cppc_perf_caps *caps) 472 + { 473 + return perf && perf <= caps->highest_perf && perf >= caps->lowest_perf; 474 + } 475 + 476 + static bool intel_pstate_cppc_perf_caps(struct cpudata *cpu, 477 + struct cppc_perf_caps *caps) 478 + { 479 + if (cppc_get_perf_caps(cpu->cpu, caps)) 480 + return false; 481 + 482 + return caps->highest_perf && caps->lowest_perf <= caps->highest_perf; 483 + } 473 484 #else /* CONFIG_ACPI */ 474 485 static 
inline void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy) 475 486 { ··· 500 483 #endif /* CONFIG_ACPI */ 501 484 502 485 #ifndef CONFIG_ACPI_CPPC_LIB 503 - static int intel_pstate_get_cppc_guranteed(int cpu) 486 + static inline int intel_pstate_get_cppc_guaranteed(int cpu) 504 487 { 505 488 return -ENOTSUPP; 506 489 } 507 490 #endif /* CONFIG_ACPI_CPPC_LIB */ 491 + 492 + static void intel_pstate_hybrid_hwp_perf_ctl_parity(struct cpudata *cpu) 493 + { 494 + pr_debug("CPU%d: Using PERF_CTL scaling for HWP\n", cpu->cpu); 495 + 496 + cpu->pstate.scaling = cpu->pstate.perf_ctl_scaling; 497 + } 498 + 499 + /** 500 + * intel_pstate_hybrid_hwp_calibrate - Calibrate HWP performance levels. 501 + * @cpu: Target CPU. 502 + * 503 + * On hybrid processors, HWP may expose more performance levels than there are 504 + * P-states accessible through the PERF_CTL interface. If that happens, the 505 + * scaling factor between HWP performance levels and CPU frequency will be less 506 + * than the scaling factor between P-state values and CPU frequency. 507 + * 508 + * In that case, the scaling factor between HWP performance levels and CPU 509 + * frequency needs to be determined which can be done with the help of the 510 + * observation that certain HWP performance levels should correspond to certain 511 + * P-states, like for example the HWP highest performance should correspond 512 + * to the maximum turbo P-state of the CPU. 
513 + */ 514 + static void intel_pstate_hybrid_hwp_calibrate(struct cpudata *cpu) 515 + { 516 + int perf_ctl_max_phys = cpu->pstate.max_pstate_physical; 517 + int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling; 518 + int perf_ctl_turbo = pstate_funcs.get_turbo(); 519 + int turbo_freq = perf_ctl_turbo * perf_ctl_scaling; 520 + int perf_ctl_max = pstate_funcs.get_max(); 521 + int max_freq = perf_ctl_max * perf_ctl_scaling; 522 + int scaling = INT_MAX; 523 + int freq; 524 + 525 + pr_debug("CPU%d: perf_ctl_max_phys = %d\n", cpu->cpu, perf_ctl_max_phys); 526 + pr_debug("CPU%d: perf_ctl_max = %d\n", cpu->cpu, perf_ctl_max); 527 + pr_debug("CPU%d: perf_ctl_turbo = %d\n", cpu->cpu, perf_ctl_turbo); 528 + pr_debug("CPU%d: perf_ctl_scaling = %d\n", cpu->cpu, perf_ctl_scaling); 529 + 530 + pr_debug("CPU%d: HWP_CAP guaranteed = %d\n", cpu->cpu, cpu->pstate.max_pstate); 531 + pr_debug("CPU%d: HWP_CAP highest = %d\n", cpu->cpu, cpu->pstate.turbo_pstate); 532 + 533 + #ifdef CONFIG_ACPI 534 + if (IS_ENABLED(CONFIG_ACPI_CPPC_LIB)) { 535 + struct cppc_perf_caps caps; 536 + 537 + if (intel_pstate_cppc_perf_caps(cpu, &caps)) { 538 + if (intel_pstate_cppc_perf_valid(caps.nominal_perf, &caps)) { 539 + pr_debug("CPU%d: Using CPPC nominal\n", cpu->cpu); 540 + 541 + /* 542 + * If the CPPC nominal performance is valid, it 543 + * can be assumed to correspond to cpu_khz. 544 + */ 545 + if (caps.nominal_perf == perf_ctl_max_phys) { 546 + intel_pstate_hybrid_hwp_perf_ctl_parity(cpu); 547 + return; 548 + } 549 + scaling = DIV_ROUND_UP(cpu_khz, caps.nominal_perf); 550 + } else if (intel_pstate_cppc_perf_valid(caps.guaranteed_perf, &caps)) { 551 + pr_debug("CPU%d: Using CPPC guaranteed\n", cpu->cpu); 552 + 553 + /* 554 + * If the CPPC guaranteed performance is valid, 555 + * it can be assumed to correspond to max_freq. 
556 + */ 557 + if (caps.guaranteed_perf == perf_ctl_max) { 558 + intel_pstate_hybrid_hwp_perf_ctl_parity(cpu); 559 + return; 560 + } 561 + scaling = DIV_ROUND_UP(max_freq, caps.guaranteed_perf); 562 + } 563 + } 564 + } 565 + #endif 566 + /* 567 + * If using the CPPC data to compute the HWP-to-frequency scaling factor 568 + * doesn't work, use the HWP_CAP gauranteed perf for this purpose with 569 + * the assumption that it corresponds to max_freq. 570 + */ 571 + if (scaling > perf_ctl_scaling) { 572 + pr_debug("CPU%d: Using HWP_CAP guaranteed\n", cpu->cpu); 573 + 574 + if (cpu->pstate.max_pstate == perf_ctl_max) { 575 + intel_pstate_hybrid_hwp_perf_ctl_parity(cpu); 576 + return; 577 + } 578 + scaling = DIV_ROUND_UP(max_freq, cpu->pstate.max_pstate); 579 + if (scaling > perf_ctl_scaling) { 580 + /* 581 + * This should not happen, because it would mean that 582 + * the number of HWP perf levels was less than the 583 + * number of P-states, so use the PERF_CTL scaling in 584 + * that case. 585 + */ 586 + pr_debug("CPU%d: scaling (%d) out of range\n", cpu->cpu, 587 + scaling); 588 + 589 + intel_pstate_hybrid_hwp_perf_ctl_parity(cpu); 590 + return; 591 + } 592 + } 593 + 594 + /* 595 + * If the product of the HWP performance scaling factor obtained above 596 + * and the HWP_CAP highest performance is greater than the maximum turbo 597 + * frequency corresponding to the pstate_funcs.get_turbo() return value, 598 + * the scaling factor is too high, so recompute it so that the HWP_CAP 599 + * highest performance corresponds to the maximum turbo frequency. 
600 + */ 601 + if (turbo_freq < cpu->pstate.turbo_pstate * scaling) { 602 + pr_debug("CPU%d: scaling too high (%d)\n", cpu->cpu, scaling); 603 + 604 + cpu->pstate.turbo_freq = turbo_freq; 605 + scaling = DIV_ROUND_UP(turbo_freq, cpu->pstate.turbo_pstate); 606 + } 607 + 608 + cpu->pstate.scaling = scaling; 609 + 610 + pr_debug("CPU%d: HWP-to-frequency scaling factor: %d\n", cpu->cpu, scaling); 611 + 612 + cpu->pstate.max_freq = rounddown(cpu->pstate.max_pstate * scaling, 613 + perf_ctl_scaling); 614 + 615 + freq = perf_ctl_max_phys * perf_ctl_scaling; 616 + cpu->pstate.max_pstate_physical = DIV_ROUND_UP(freq, scaling); 617 + 618 + cpu->pstate.min_freq = cpu->pstate.min_pstate * perf_ctl_scaling; 619 + /* 620 + * Cast the min P-state value retrieved via pstate_funcs.get_min() to 621 + * the effective range of HWP performance levels. 622 + */ 623 + cpu->pstate.min_pstate = DIV_ROUND_UP(cpu->pstate.min_freq, scaling); 624 + } 508 625 509 626 static inline void update_turbo_state(void) 510 627 { ··· 946 795 947 796 static ssize_t show_base_frequency(struct cpufreq_policy *policy, char *buf) 948 797 { 949 - struct cpudata *cpu; 950 - u64 cap; 951 - int ratio; 798 + struct cpudata *cpu = all_cpu_data[policy->cpu]; 799 + int ratio, freq; 952 800 953 - ratio = intel_pstate_get_cppc_guranteed(policy->cpu); 801 + ratio = intel_pstate_get_cppc_guaranteed(policy->cpu); 954 802 if (ratio <= 0) { 803 + u64 cap; 804 + 955 805 rdmsrl_on_cpu(policy->cpu, MSR_HWP_CAPABILITIES, &cap); 956 806 ratio = HWP_GUARANTEED_PERF(cap); 957 807 } 958 808 959 - cpu = all_cpu_data[policy->cpu]; 809 + freq = ratio * cpu->pstate.scaling; 810 + if (cpu->pstate.scaling != cpu->pstate.perf_ctl_scaling) 811 + freq = rounddown(freq, cpu->pstate.perf_ctl_scaling); 960 812 961 - return sprintf(buf, "%d\n", ratio * cpu->pstate.scaling); 813 + return sprintf(buf, "%d\n", freq); 962 814 } 963 815 964 816 cpufreq_freq_attr_ro(base_frequency); ··· 985 831 986 832 static void intel_pstate_get_hwp_cap(struct 
cpudata *cpu) 987 833 { 834 + int scaling = cpu->pstate.scaling; 835 + 988 836 __intel_pstate_get_hwp_cap(cpu); 989 - cpu->pstate.max_freq = cpu->pstate.max_pstate * cpu->pstate.scaling; 990 - cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * cpu->pstate.scaling; 837 + 838 + cpu->pstate.max_freq = cpu->pstate.max_pstate * scaling; 839 + cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * scaling; 840 + if (scaling != cpu->pstate.perf_ctl_scaling) { 841 + int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling; 842 + 843 + cpu->pstate.max_freq = rounddown(cpu->pstate.max_freq, 844 + perf_ctl_scaling); 845 + cpu->pstate.turbo_freq = rounddown(cpu->pstate.turbo_freq, 846 + perf_ctl_scaling); 847 + } 991 848 } 992 849 993 850 static void intel_pstate_hwp_set(unsigned int cpu) ··· 1530 1365 static struct attribute *intel_pstate_attributes[] = { 1531 1366 &status.attr, 1532 1367 &no_turbo.attr, 1533 - &turbo_pct.attr, 1534 - &num_pstates.attr, 1535 1368 NULL 1536 1369 }; 1537 1370 ··· 1553 1390 rc = sysfs_create_group(intel_pstate_kobject, &intel_pstate_attr_group); 1554 1391 if (WARN_ON(rc)) 1555 1392 return; 1393 + 1394 + if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) { 1395 + rc = sysfs_create_file(intel_pstate_kobject, &turbo_pct.attr); 1396 + WARN_ON(rc); 1397 + 1398 + rc = sysfs_create_file(intel_pstate_kobject, &num_pstates.attr); 1399 + WARN_ON(rc); 1400 + } 1556 1401 1557 1402 /* 1558 1403 * If per cpu limits are enforced there are no global limits, so ··· 1587 1416 return; 1588 1417 1589 1418 sysfs_remove_group(intel_pstate_kobject, &intel_pstate_attr_group); 1419 + 1420 + if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) { 1421 + sysfs_remove_file(intel_pstate_kobject, &num_pstates.attr); 1422 + sysfs_remove_file(intel_pstate_kobject, &turbo_pct.attr); 1423 + } 1590 1424 1591 1425 if (!per_cpu_limits) { 1592 1426 sysfs_remove_file(intel_pstate_kobject, &max_perf_pct.attr); ··· 1889 1713 1890 1714 static void intel_pstate_get_cpu_pstates(struct cpudata *cpu) 1891 1715 { 
1716 + bool hybrid_cpu = boot_cpu_has(X86_FEATURE_HYBRID_CPU); 1717 + int perf_ctl_max_phys = pstate_funcs.get_max_physical(); 1718 + int perf_ctl_scaling = hybrid_cpu ? cpu_khz / perf_ctl_max_phys : 1719 + pstate_funcs.get_scaling(); 1720 + 1892 1721 cpu->pstate.min_pstate = pstate_funcs.get_min(); 1893 - cpu->pstate.max_pstate_physical = pstate_funcs.get_max_physical(); 1894 - cpu->pstate.scaling = pstate_funcs.get_scaling(); 1722 + cpu->pstate.max_pstate_physical = perf_ctl_max_phys; 1723 + cpu->pstate.perf_ctl_scaling = perf_ctl_scaling; 1895 1724 1896 1725 if (hwp_active && !hwp_mode_bdw) { 1897 1726 __intel_pstate_get_hwp_cap(cpu); 1727 + 1728 + if (hybrid_cpu) 1729 + intel_pstate_hybrid_hwp_calibrate(cpu); 1730 + else 1731 + cpu->pstate.scaling = perf_ctl_scaling; 1898 1732 } else { 1733 + cpu->pstate.scaling = perf_ctl_scaling; 1899 1734 cpu->pstate.max_pstate = pstate_funcs.get_max(); 1900 1735 cpu->pstate.turbo_pstate = pstate_funcs.get_turbo(); 1901 1736 } 1902 1737 1903 - cpu->pstate.max_freq = cpu->pstate.max_pstate * cpu->pstate.scaling; 1904 - cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * cpu->pstate.scaling; 1738 + if (cpu->pstate.scaling == perf_ctl_scaling) { 1739 + cpu->pstate.min_freq = cpu->pstate.min_pstate * perf_ctl_scaling; 1740 + cpu->pstate.max_freq = cpu->pstate.max_pstate * perf_ctl_scaling; 1741 + cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * perf_ctl_scaling; 1742 + } 1905 1743 1906 1744 if (pstate_funcs.get_aperf_mperf_shift) 1907 1745 cpu->aperf_mperf_shift = pstate_funcs.get_aperf_mperf_shift(); ··· 2277 2087 X86_MATCH(ATOM_GOLDMONT, core_funcs), 2278 2088 X86_MATCH(ATOM_GOLDMONT_PLUS, core_funcs), 2279 2089 X86_MATCH(SKYLAKE_X, core_funcs), 2090 + X86_MATCH(COMETLAKE, core_funcs), 2091 + X86_MATCH(ICELAKE_X, core_funcs), 2280 2092 {} 2281 2093 }; 2282 2094 MODULE_DEVICE_TABLE(x86cpu, intel_pstate_cpu_ids); ··· 2387 2195 unsigned int policy_min, 2388 2196 unsigned int policy_max) 2389 2197 { 2390 - int scaling = 
cpu->pstate.scaling; 2198 + int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling; 2391 2199 int32_t max_policy_perf, min_policy_perf; 2200 + 2201 + max_policy_perf = policy_max / perf_ctl_scaling; 2202 + if (policy_max == policy_min) { 2203 + min_policy_perf = max_policy_perf; 2204 + } else { 2205 + min_policy_perf = policy_min / perf_ctl_scaling; 2206 + min_policy_perf = clamp_t(int32_t, min_policy_perf, 2207 + 0, max_policy_perf); 2208 + } 2392 2209 2393 2210 /* 2394 2211 * HWP needs some special consideration, because HWP_REQUEST uses 2395 2212 * abstract values to represent performance rather than pure ratios. 2396 2213 */ 2397 - if (hwp_active) 2214 + if (hwp_active) { 2398 2215 intel_pstate_get_hwp_cap(cpu); 2399 2216 2400 - max_policy_perf = policy_max / scaling; 2401 - if (policy_max == policy_min) { 2402 - min_policy_perf = max_policy_perf; 2403 - } else { 2404 - min_policy_perf = policy_min / scaling; 2405 - min_policy_perf = clamp_t(int32_t, min_policy_perf, 2406 - 0, max_policy_perf); 2217 + if (cpu->pstate.scaling != perf_ctl_scaling) { 2218 + int scaling = cpu->pstate.scaling; 2219 + int freq; 2220 + 2221 + freq = max_policy_perf * perf_ctl_scaling; 2222 + max_policy_perf = DIV_ROUND_UP(freq, scaling); 2223 + freq = min_policy_perf * perf_ctl_scaling; 2224 + min_policy_perf = DIV_ROUND_UP(freq, scaling); 2225 + } 2407 2226 } 2408 2227 2409 2228 pr_debug("cpu:%d min_policy_perf:%d max_policy_perf:%d\n", ··· 2608 2405 cpu->min_perf_ratio = 0; 2609 2406 2610 2407 /* cpuinfo and default policy values */ 2611 - policy->cpuinfo.min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling; 2408 + policy->cpuinfo.min_freq = cpu->pstate.min_freq; 2612 2409 update_turbo_state(); 2613 2410 global.turbo_disabled_mf = global.turbo_disabled; 2614 2411 policy->cpuinfo.max_freq = global.turbo_disabled ? 
··· 3338 3135 } 3339 3136 3340 3137 pr_info("HWP enabled\n"); 3138 + } else if (boot_cpu_has(X86_FEATURE_HYBRID_CPU)) { 3139 + pr_warn("Problematic setup: Hybrid processor with disabled HWP\n"); 3341 3140 } 3342 3141 3343 3142 return 0;
drivers/cpufreq/loongson2_cpufreq.c  (-1)
···
 16  16   #include <linux/cpufreq.h>
 17  17   #include <linux/module.h>
 18  18   #include <linux/err.h>
 19      - #include <linux/sched.h>	/* set_cpus_allowed() */
 20  19   #include <linux/delay.h>
 21  20   #include <linux/platform_device.h>
 22  21
drivers/cpufreq/sc520_freq.c  (+1)
···
 42  42   	default:
 43  43   		pr_err("error: cpuctl register has unexpected value %02x\n",
 44  44   		       clockspeed_reg);
     45 + 		fallthrough;
 45  46   	case 0x01:
 46  47   		return 100000;
 47  48   	case 0x02:
drivers/cpufreq/sh-cpufreq.c  (-1)
···
 23  23   #include <linux/cpumask.h>
 24  24   #include <linux/cpu.h>
 25  25   #include <linux/smp.h>
 26      - #include <linux/sched.h>	/* set_cpus_allowed() */
 27  26   #include <linux/clk.h>
 28  27   #include <linux/percpu.h>
 29  28   #include <linux/sh_clk.h>
+252 -232
drivers/cpuidle/governors/teo.c
··· 2 2 /* 3 3 * Timer events oriented CPU idle governor 4 4 * 5 - * Copyright (C) 2018 Intel Corporation 5 + * Copyright (C) 2018 - 2021 Intel Corporation 6 6 * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> 7 + */ 8 + 9 + /** 10 + * DOC: teo-description 7 11 * 8 12 * The idea of this governor is based on the observation that on many systems 9 13 * timer events are two or more orders of magnitude more frequent than any 10 - * other interrupts, so they are likely to be the most significant source of CPU 14 + * other interrupts, so they are likely to be the most significant cause of CPU 11 15 * wakeups from idle states. Moreover, information about what happened in the 12 16 * (relatively recent) past can be used to estimate whether or not the deepest 13 - * idle state with target residency within the time to the closest timer is 14 - * likely to be suitable for the upcoming idle time of the CPU and, if not, then 15 - * which of the shallower idle states to choose. 17 + * idle state with target residency within the (known) time till the closest 18 + * timer event, referred to as the sleep length, is likely to be suitable for 19 + * the upcoming CPU idle period and, if not, then which of the shallower idle 20 + * states to choose instead of it. 16 21 * 17 - * Of course, non-timer wakeup sources are more important in some use cases and 18 - * they can be covered by taking a few most recent idle time intervals of the 19 - * CPU into account. However, even in that case it is not necessary to consider 20 - * idle duration values greater than the time till the closest timer, as the 21 - * patterns that they may belong to produce average values close enough to 22 - * the time till the closest timer (sleep length) anyway. 22 + * Of course, non-timer wakeup sources are more important in some use cases 23 + * which can be covered by taking a few most recent idle time intervals of the 24 + * CPU into account. 
However, even in that context it is not necessary to 25 + * consider idle duration values greater than the sleep length, because the 26 + * closest timer will ultimately wake up the CPU anyway unless it is woken up 27 + * earlier. 23 28 * 24 - * Thus this governor estimates whether or not the upcoming idle time of the CPU 25 - * is likely to be significantly shorter than the sleep length and selects an 26 - * idle state for it in accordance with that, as follows: 29 + * Thus this governor estimates whether or not the prospective idle duration of 30 + * a CPU is likely to be significantly shorter than the sleep length and selects 31 + * an idle state for it accordingly. 27 32 * 28 - * - Find an idle state on the basis of the sleep length and state statistics 29 - * collected over time: 33 + * The computations carried out by this governor are based on using bins whose 34 + * boundaries are aligned with the target residency parameter values of the CPU 35 + * idle states provided by the %CPUIdle driver in the ascending order. That is, 36 + * the first bin spans from 0 up to, but not including, the target residency of 37 + * the second idle state (idle state 1), the second bin spans from the target 38 + * residency of idle state 1 up to, but not including, the target residency of 39 + * idle state 2, the third bin spans from the target residency of idle state 2 40 + * up to, but not including, the target residency of idle state 3 and so on. 41 + * The last bin spans from the target residency of the deepest idle state 42 + * supplied by the driver to infinity. 30 43 * 31 - * o Find the deepest idle state whose target residency is less than or equal 32 - * to the sleep length. 44 + * Two metrics called "hits" and "intercepts" are associated with each bin. 45 + * They are updated every time before selecting an idle state for the given CPU 46 + * in accordance with what happened last time. 
33 47 * 34 - * o Select it if it matched both the sleep length and the observed idle 35 - * duration in the past more often than it matched the sleep length alone 36 - * (i.e. the observed idle duration was significantly shorter than the sleep 37 - * length matched by it). 48 + * The "hits" metric reflects the relative frequency of situations in which the 49 + * sleep length and the idle duration measured after CPU wakeup fall into the 50 + * same bin (that is, the CPU appears to wake up "on time" relative to the sleep 51 + * length). In turn, the "intercepts" metric reflects the relative frequency of 52 + * situations in which the measured idle duration is so much shorter than the 53 + * sleep length that the bin it falls into corresponds to an idle state 54 + * shallower than the one whose bin is fallen into by the sleep length (these 55 + * situations are referred to as "intercepts" below). 38 56 * 39 - * o Otherwise, select the shallower state with the greatest matched "early" 40 - * wakeups metric. 57 + * In addition to the metrics described above, the governor counts recent 58 + * intercepts (that is, intercepts that have occurred during the last 59 + * %NR_RECENT invocations of it for the given CPU) for each bin. 41 60 * 42 - * - If the majority of the most recent idle duration values are below the 43 - * target residency of the idle state selected so far, use those values to 44 - * compute the new expected idle duration and find an idle state matching it 45 - * (which has to be shallower than the one selected so far). 61 + * In order to select an idle state for a CPU, the governor takes the following 62 + * steps (modulo the possible latency constraint that must be taken into account 63 + * too): 64 + * 65 + * 1. 
Find the deepest CPU idle state whose target residency does not exceed 66 + * the current sleep length (the candidate idle state) and compute 3 sums as 67 + * follows: 68 + * 69 + * - The sum of the "hits" and "intercepts" metrics for the candidate state 70 + * and all of the deeper idle states (it represents the cases in which the 71 + * CPU was idle long enough to avoid being intercepted if the sleep length 72 + * had been equal to the current one). 73 + * 74 + * - The sum of the "intercepts" metrics for all of the idle states shallower 75 + * than the candidate one (it represents the cases in which the CPU was not 76 + * idle long enough to avoid being intercepted if the sleep length had been 77 + * equal to the current one). 78 + * 79 + * - The sum of the numbers of recent intercepts for all of the idle states 80 + * shallower than the candidate one. 81 + * 82 + * 2. If the second sum is greater than the first one or the third sum is 83 + * greater than %NR_RECENT / 2, the CPU is likely to wake up early, so look 84 + * for an alternative idle state to select. 85 + * 86 + * - Traverse the idle states shallower than the candidate one in the 87 + * descending order. 88 + * 89 + * - For each of them compute the sum of the "intercepts" metrics and the sum 90 + * of the numbers of recent intercepts over all of the idle states between 91 + * it and the candidate one (including the former and excluding the 92 + * latter). 93 + * 94 + * - If each of these sums that needs to be taken into account (because the 95 + * check related to it has indicated that the CPU is likely to wake up 96 + * early) is greater than a half of the corresponding sum computed in step 97 + * 1 (which means that the target residency of the state in question had 98 + * not exceeded the idle duration in over a half of the relevant cases), 99 + * select the given idle state instead of the candidate one. 100 + * 101 + * 3. By default, select the candidate state. 
46 102 */ 47 103 48 104 #include <linux/cpuidle.h> ··· 116 60 117 61 /* 118 62 * Number of the most recent idle duration values to take into consideration for 119 - * the detection of wakeup patterns. 63 + * the detection of recent early wakeup patterns. 120 64 */ 121 - #define INTERVALS 8 65 + #define NR_RECENT 9 122 66 123 67 /** 124 - * struct teo_idle_state - Idle state data used by the TEO cpuidle governor. 125 - * @early_hits: "Early" CPU wakeups "matching" this state. 126 - * @hits: "On time" CPU wakeups "matching" this state. 127 - * @misses: CPU wakeups "missing" this state. 128 - * 129 - * A CPU wakeup is "matched" by a given idle state if the idle duration measured 130 - * after the wakeup is between the target residency of that state and the target 131 - * residency of the next one (or if this is the deepest available idle state, it 132 - * "matches" a CPU wakeup when the measured idle duration is at least equal to 133 - * its target residency). 134 - * 135 - * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if 136 - * it occurs significantly earlier than the closest expected timer event (that 137 - * is, early enough to match an idle state shallower than the one matching the 138 - * time till the closest timer event). Otherwise, the wakeup is "on time", or 139 - * it is a "hit". 140 - * 141 - * A "miss" occurs when the given state doesn't match the wakeup, but it matches 142 - * the time till the closest timer event used for idle state selection. 68 + * struct teo_bin - Metrics used by the TEO cpuidle governor. 69 + * @intercepts: The "intercepts" metric. 70 + * @hits: The "hits" metric. 71 + * @recent: The number of recent "intercepts". 
143 72 */ 144 - struct teo_idle_state { 145 - unsigned int early_hits; 73 + struct teo_bin { 74 + unsigned int intercepts; 146 75 unsigned int hits; 147 - unsigned int misses; 76 + unsigned int recent; 148 77 }; 149 78 150 79 /** 151 80 * struct teo_cpu - CPU data used by the TEO cpuidle governor. 152 81 * @time_span_ns: Time between idle state selection and post-wakeup update. 153 82 * @sleep_length_ns: Time till the closest timer event (at the selection time). 154 - * @states: Idle states data corresponding to this CPU. 155 - * @interval_idx: Index of the most recent saved idle interval. 156 - * @intervals: Saved idle duration values. 83 + * @state_bins: Idle state data bins for this CPU. 84 + * @total: Grand total of the "intercepts" and "hits" mertics for all bins. 85 + * @next_recent_idx: Index of the next @recent_idx entry to update. 86 + * @recent_idx: Indices of bins corresponding to recent "intercepts". 157 87 */ 158 88 struct teo_cpu { 159 89 s64 time_span_ns; 160 90 s64 sleep_length_ns; 161 - struct teo_idle_state states[CPUIDLE_STATE_MAX]; 162 - int interval_idx; 163 - u64 intervals[INTERVALS]; 91 + struct teo_bin state_bins[CPUIDLE_STATE_MAX]; 92 + unsigned int total; 93 + int next_recent_idx; 94 + int recent_idx[NR_RECENT]; 164 95 }; 165 96 166 97 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus); 167 98 168 99 /** 169 - * teo_update - Update CPU data after wakeup. 100 + * teo_update - Update CPU metrics after wakeup. 170 101 * @drv: cpuidle driver containing state data. 171 102 * @dev: Target CPU. 
172 103 */ 173 104 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) 174 105 { 175 106 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); 176 - int i, idx_hit = 0, idx_timer = 0; 177 - unsigned int hits, misses; 107 + int i, idx_timer = 0, idx_duration = 0; 178 108 u64 measured_ns; 179 109 180 110 if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) { ··· 193 151 measured_ns /= 2; 194 152 } 195 153 154 + cpu_data->total = 0; 155 + 196 156 /* 197 - * Decay the "early hits" metric for all of the states and find the 198 - * states matching the sleep length and the measured idle duration. 157 + * Decay the "hits" and "intercepts" metrics for all of the bins and 158 + * find the bins that the sleep length and the measured idle duration 159 + * fall into. 199 160 */ 200 161 for (i = 0; i < drv->state_count; i++) { 201 - unsigned int early_hits = cpu_data->states[i].early_hits; 162 + s64 target_residency_ns = drv->states[i].target_residency_ns; 163 + struct teo_bin *bin = &cpu_data->state_bins[i]; 202 164 203 - cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT; 165 + bin->hits -= bin->hits >> DECAY_SHIFT; 166 + bin->intercepts -= bin->intercepts >> DECAY_SHIFT; 204 167 205 - if (drv->states[i].target_residency_ns <= cpu_data->sleep_length_ns) { 168 + cpu_data->total += bin->hits + bin->intercepts; 169 + 170 + if (target_residency_ns <= cpu_data->sleep_length_ns) { 206 171 idx_timer = i; 207 - if (drv->states[i].target_residency_ns <= measured_ns) 208 - idx_hit = i; 172 + if (target_residency_ns <= measured_ns) 173 + idx_duration = i; 209 174 } 210 175 } 211 176 177 + i = cpu_data->next_recent_idx++; 178 + if (cpu_data->next_recent_idx >= NR_RECENT) 179 + cpu_data->next_recent_idx = 0; 180 + 181 + if (cpu_data->recent_idx[i] >= 0) 182 + cpu_data->state_bins[cpu_data->recent_idx[i]].recent--; 183 + 212 184 /* 213 - * Update the "hits" and "misses" data for the state matching the sleep 214 - * length. 
If it matches the measured idle duration too, this is a hit, 215 - * so increase the "hits" metric for it then. Otherwise, this is a 216 - * miss, so increase the "misses" metric for it. In the latter case 217 - * also increase the "early hits" metric for the state that actually 218 - * matches the measured idle duration. 185 + * If the measured idle duration falls into the same bin as the sleep 186 + * length, this is a "hit", so update the "hits" metric for that bin. 187 + * Otherwise, update the "intercepts" metric for the bin fallen into by 188 + * the measured idle duration. 219 189 */ 220 - hits = cpu_data->states[idx_timer].hits; 221 - hits -= hits >> DECAY_SHIFT; 222 - 223 - misses = cpu_data->states[idx_timer].misses; 224 - misses -= misses >> DECAY_SHIFT; 225 - 226 - if (idx_timer == idx_hit) { 227 - hits += PULSE; 190 + if (idx_timer == idx_duration) { 191 + cpu_data->state_bins[idx_timer].hits += PULSE; 192 + cpu_data->recent_idx[i] = -1; 228 193 } else { 229 - misses += PULSE; 230 - cpu_data->states[idx_hit].early_hits += PULSE; 194 + cpu_data->state_bins[idx_duration].intercepts += PULSE; 195 + cpu_data->state_bins[idx_duration].recent++; 196 + cpu_data->recent_idx[i] = idx_duration; 231 197 } 232 198 233 - cpu_data->states[idx_timer].misses = misses; 234 - cpu_data->states[idx_timer].hits = hits; 235 - 236 - /* 237 - * Save idle duration values corresponding to non-timer wakeups for 238 - * pattern detection. 
239 - */ 240 - cpu_data->intervals[cpu_data->interval_idx++] = measured_ns; 241 - if (cpu_data->interval_idx >= INTERVALS) 242 - cpu_data->interval_idx = 0; 199 + cpu_data->total += PULSE; 243 200 } 244 201 245 202 static bool teo_time_ok(u64 interval_ns) 246 203 { 247 204 return !tick_nohz_tick_stopped() || interval_ns >= TICK_NSEC; 205 + } 206 + 207 + static s64 teo_middle_of_bin(int idx, struct cpuidle_driver *drv) 208 + { 209 + return (drv->states[idx].target_residency_ns + 210 + drv->states[idx+1].target_residency_ns) / 2; 248 211 } 249 212 250 213 /** ··· 287 240 { 288 241 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); 289 242 s64 latency_req = cpuidle_governor_latency_req(dev->cpu); 290 - int max_early_idx, prev_max_early_idx, constraint_idx, idx0, idx, i; 291 - unsigned int hits, misses, early_hits; 243 + unsigned int idx_intercept_sum = 0; 244 + unsigned int intercept_sum = 0; 245 + unsigned int idx_recent_sum = 0; 246 + unsigned int recent_sum = 0; 247 + unsigned int idx_hit_sum = 0; 248 + unsigned int hit_sum = 0; 249 + int constraint_idx = 0; 250 + int idx0 = 0, idx = -1; 251 + bool alt_intercepts, alt_recent; 292 252 ktime_t delta_tick; 293 253 s64 duration_ns; 254 + int i; 294 255 295 256 if (dev->last_state_idx >= 0) { 296 257 teo_update(drv, dev); ··· 310 255 duration_ns = tick_nohz_get_sleep_length(&delta_tick); 311 256 cpu_data->sleep_length_ns = duration_ns; 312 257 313 - hits = 0; 314 - misses = 0; 315 - early_hits = 0; 316 - max_early_idx = -1; 317 - prev_max_early_idx = -1; 318 - constraint_idx = drv->state_count; 319 - idx = -1; 320 - idx0 = idx; 258 + /* Check if there is any choice in the first place. 
*/ 259 + if (drv->state_count < 2) { 260 + idx = 0; 261 + goto end; 262 + } 263 + if (!dev->states_usage[0].disable) { 264 + idx = 0; 265 + if (drv->states[1].target_residency_ns > duration_ns) 266 + goto end; 267 + } 321 268 322 - for (i = 0; i < drv->state_count; i++) { 269 + /* 270 + * Find the deepest idle state whose target residency does not exceed 271 + * the current sleep length and the deepest idle state not deeper than 272 + * the former whose exit latency does not exceed the current latency 273 + * constraint. Compute the sums of metrics for early wakeup pattern 274 + * detection. 275 + */ 276 + for (i = 1; i < drv->state_count; i++) { 277 + struct teo_bin *prev_bin = &cpu_data->state_bins[i-1]; 323 278 struct cpuidle_state *s = &drv->states[i]; 324 279 325 - if (dev->states_usage[i].disable) { 326 - /* 327 - * Ignore disabled states with target residencies beyond 328 - * the anticipated idle duration. 329 - */ 330 - if (s->target_residency_ns > duration_ns) 331 - continue; 280 + /* 281 + * Update the sums of idle state mertics for all of the states 282 + * shallower than the current one. 283 + */ 284 + intercept_sum += prev_bin->intercepts; 285 + hit_sum += prev_bin->hits; 286 + recent_sum += prev_bin->recent; 332 287 333 - /* 334 - * This state is disabled, so the range of idle duration 335 - * values corresponding to it is covered by the current 336 - * candidate state, but still the "hits" and "misses" 337 - * metrics of the disabled state need to be used to 338 - * decide whether or not the state covering the range in 339 - * question is good enough. 
340 - */ 341 - hits = cpu_data->states[i].hits; 342 - misses = cpu_data->states[i].misses; 343 - 344 - if (early_hits >= cpu_data->states[i].early_hits || 345 - idx < 0) 346 - continue; 347 - 348 - /* 349 - * If the current candidate state has been the one with 350 - * the maximum "early hits" metric so far, the "early 351 - * hits" metric of the disabled state replaces the 352 - * current "early hits" count to avoid selecting a 353 - * deeper state with lower "early hits" metric. 354 - */ 355 - if (max_early_idx == idx) { 356 - early_hits = cpu_data->states[i].early_hits; 357 - continue; 358 - } 359 - 360 - /* 361 - * The current candidate state is closer to the disabled 362 - * one than the current maximum "early hits" state, so 363 - * replace the latter with it, but in case the maximum 364 - * "early hits" state index has not been set so far, 365 - * check if the current candidate state is not too 366 - * shallow for that role. 367 - */ 368 - if (teo_time_ok(drv->states[idx].target_residency_ns)) { 369 - prev_max_early_idx = max_early_idx; 370 - early_hits = cpu_data->states[i].early_hits; 371 - max_early_idx = idx; 372 - } 373 - 288 + if (dev->states_usage[i].disable) 374 289 continue; 375 - } 376 290 377 291 if (idx < 0) { 378 292 idx = i; /* first enabled state */ 379 - hits = cpu_data->states[i].hits; 380 - misses = cpu_data->states[i].misses; 381 293 idx0 = i; 382 294 } 383 295 384 296 if (s->target_residency_ns > duration_ns) 385 297 break; 386 298 387 - if (s->exit_latency_ns > latency_req && constraint_idx > i) 299 + idx = i; 300 + 301 + if (s->exit_latency_ns <= latency_req) 388 302 constraint_idx = i; 389 303 390 - idx = i; 391 - hits = cpu_data->states[i].hits; 392 - misses = cpu_data->states[i].misses; 393 - 394 - if (early_hits < cpu_data->states[i].early_hits && 395 - teo_time_ok(drv->states[i].target_residency_ns)) { 396 - prev_max_early_idx = max_early_idx; 397 - early_hits = cpu_data->states[i].early_hits; 398 - max_early_idx = i; 399 - } 304 + 
idx_intercept_sum = intercept_sum; 305 + idx_hit_sum = hit_sum; 306 + idx_recent_sum = recent_sum; 400 307 } 401 308 402 - /* 403 - * If the "hits" metric of the idle state matching the sleep length is 404 - * greater than its "misses" metric, that is the one to use. Otherwise, 405 - * it is more likely that one of the shallower states will match the 406 - * idle duration observed after wakeup, so take the one with the maximum 407 - * "early hits" metric, but if that cannot be determined, just use the 408 - * state selected so far. 409 - */ 410 - if (hits <= misses) { 411 - /* 412 - * The current candidate state is not suitable, so take the one 413 - * whose "early hits" metric is the maximum for the range of 414 - * shallower states. 415 - */ 416 - if (idx == max_early_idx) 417 - max_early_idx = prev_max_early_idx; 418 - 419 - if (max_early_idx >= 0) { 420 - idx = max_early_idx; 421 - duration_ns = drv->states[idx].target_residency_ns; 422 - } 423 - } 424 - 425 - /* 426 - * If there is a latency constraint, it may be necessary to use a 427 - * shallower idle state than the one selected so far. 428 - */ 429 - if (constraint_idx < idx) 430 - idx = constraint_idx; 431 - 309 + /* Avoid unnecessary overhead. */ 432 310 if (idx < 0) { 433 - idx = 0; /* No states enabled. Must use 0. */ 434 - } else if (idx > idx0) { 435 - unsigned int count = 0; 436 - u64 sum = 0; 311 + idx = 0; /* No states enabled, must use 0. 
*/ 312 + goto end; 313 + } else if (idx == idx0) { 314 + goto end; 315 + } 316 + 317 + /* 318 + * If the sum of the intercepts metric for all of the idle states 319 + * shallower than the current candidate one (idx) is greater than the 320 + * sum of the intercepts and hits metrics for the candidate state and 321 + * all of the deeper states, or the sum of the numbers of recent 322 + * intercepts over all of the states shallower than the candidate one 323 + * is greater than a half of the number of recent events taken into 324 + * account, the CPU is likely to wake up early, so find an alternative 325 + * idle state to select. 326 + */ 327 + alt_intercepts = 2 * idx_intercept_sum > cpu_data->total - idx_hit_sum; 328 + alt_recent = idx_recent_sum > NR_RECENT / 2; 329 + if (alt_recent || alt_intercepts) { 330 + s64 last_enabled_span_ns = duration_ns; 331 + int last_enabled_idx = idx; 437 332 438 333 /* 439 - * The target residencies of at least two different enabled idle 440 - * states are less than or equal to the current expected idle 441 - * duration. Try to refine the selection using the most recent 442 - * measured idle duration values. 334 + * Look for the deepest idle state whose target residency had 335 + * not exceeded the idle duration in over a half of the relevant 336 + * cases (both with respect to intercepts overall and with 337 + * respect to the recent intercepts only) in the past. 443 338 * 444 - * Count and sum the most recent idle duration values less than 445 - * the current expected idle duration value. 339 + * Take the possible latency constraint and duration limitation 340 + * present if the tick has been stopped already into account. 
446 341 */ 447 - for (i = 0; i < INTERVALS; i++) { 448 - u64 val = cpu_data->intervals[i]; 342 + intercept_sum = 0; 343 + recent_sum = 0; 449 344 450 - if (val >= duration_ns) 345 + for (i = idx - 1; i >= idx0; i--) { 346 + struct teo_bin *bin = &cpu_data->state_bins[i]; 347 + s64 span_ns; 348 + 349 + intercept_sum += bin->intercepts; 350 + recent_sum += bin->recent; 351 + 352 + if (dev->states_usage[i].disable) 451 353 continue; 452 354 453 - count++; 454 - sum += val; 455 - } 456 - 457 - /* 458 - * Give up unless the majority of the most recent idle duration 459 - * values are in the interesting range. 460 - */ 461 - if (count > INTERVALS / 2) { 462 - u64 avg_ns = div64_u64(sum, count); 463 - 464 - /* 465 - * Avoid spending too much time in an idle state that 466 - * would be too shallow. 467 - */ 468 - if (teo_time_ok(avg_ns)) { 469 - duration_ns = avg_ns; 470 - if (drv->states[idx].target_residency_ns > avg_ns) 471 - idx = teo_find_shallower_state(drv, dev, 472 - idx, avg_ns); 355 + span_ns = teo_middle_of_bin(i, drv); 356 + if (!teo_time_ok(span_ns)) { 357 + /* 358 + * The current state is too shallow, so select 359 + * the first enabled deeper state. 360 + */ 361 + duration_ns = last_enabled_span_ns; 362 + idx = last_enabled_idx; 363 + break; 473 364 } 365 + 366 + if ((!alt_recent || 2 * recent_sum > idx_recent_sum) && 367 + (!alt_intercepts || 368 + 2 * intercept_sum > idx_intercept_sum)) { 369 + idx = i; 370 + duration_ns = span_ns; 371 + break; 372 + } 373 + 374 + last_enabled_span_ns = span_ns; 375 + last_enabled_idx = i; 474 376 } 475 377 } 476 378 379 + /* 380 + * If there is a latency constraint, it may be necessary to select an 381 + * idle state shallower than the current candidate one. 382 + */ 383 + if (idx > constraint_idx) 384 + idx = constraint_idx; 385 + 386 + end: 477 387 /* 478 388 * Don't stop the tick if the selected state is a polling one or if the 479 389 * expected idle duration is shorter than the tick period length. 
···

 	memset(cpu_data, 0, sizeof(*cpu_data));

-	for (i = 0; i < INTERVALS; i++)
-		cpu_data->intervals[i] = U64_MAX;
+	for (i = 0; i < NR_RECENT; i++)
+		cpu_data->recent_idx[i] = -1;

 	return 0;
 }
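The heuristic added above can be modeled in plain C. This is an illustrative userspace sketch, not the kernel code: the function name is invented, and the window size is assumed to match the governor's NR_RECENT. A candidate state becomes suspect when wakeups "intercepted" by shallower states dominate either the whole history or the most recent events.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_RECENT 9	/* assumed size of the recent-events window */

/* Hypothetical model of the alternative-candidate check above. */
static bool should_look_shallower(unsigned int idx_intercept_sum,
				  unsigned int idx_hit_sum,
				  unsigned int total,
				  unsigned int idx_recent_sum)
{
	/* Intercepts by shallower states dominate the whole history... */
	bool alt_intercepts = 2 * idx_intercept_sum > total - idx_hit_sum;
	/* ...or they dominate the most recent events. */
	bool alt_recent = idx_recent_sum > NR_RECENT / 2;

	return alt_recent || alt_intercepts;
}
```

When either condition holds, the governor walks the shallower states looking for one whose own metrics justify the early wakeups; otherwise the original candidate stands.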
-1
drivers/devfreq/Kconfig
···
 	tristate "i.MX8M DDRC DEVFREQ Driver"
 	depends on (ARCH_MXC && HAVE_ARM_SMCCC) || \
 		   (COMPILE_TEST && HAVE_ARM_SMCCC)
-	select DEVFREQ_GOV_SIMPLE_ONDEMAND
 	select DEVFREQ_GOV_USERSPACE
 	help
 	  This adds the DEVFREQ driver for the i.MX8M DDR Controller. It allows
+1
drivers/devfreq/devfreq.c
···
 	if (devfreq->profile->timer < 0
 	    || devfreq->profile->timer >= DEVFREQ_TIMER_NUM) {
 		mutex_unlock(&devfreq->lock);
+		err = -EINVAL;
 		goto err_dev;
 	}
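The one-line addition above matters because the function would otherwise jump to its error label with whatever value err last held (typically 0 from an earlier successful step), silently reporting success. A minimal userspace sketch of the pattern, with hypothetical names:

```c
#include <assert.h>
#include <errno.h>

#define DEVFREQ_TIMER_NUM 2	/* illustrative bound, not the real value */

/* Hypothetical validation helper mirroring the fixed error path. */
static int validate_timer(int timer)
{
	int err = 0;	/* earlier steps succeeded */

	if (timer < 0 || timer >= DEVFREQ_TIMER_NUM) {
		err = -EINVAL;	/* the assignment the fix adds */
		goto err_out;
	}

	return 0;

err_out:
	/* Without the assignment above, this would return 0 on failure. */
	return err;
}
```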
+2 -1
drivers/devfreq/governor_passive.c
···
 	dev_pm_opp_put(p_opp);

 	if (IS_ERR(opp))
-		return PTR_ERR(opp);
+		goto no_required_opp;

 	*freq = dev_pm_opp_get_freq(opp);
 	dev_pm_opp_put(opp);
···
 		return 0;
 	}

+no_required_opp:
 	/*
 	 * Get the OPP table's index of decided frequency by governor
 	 * of parent device.
+5 -5
drivers/devfreq/governor_userspace.c
···
 	return 0;
 }

-static ssize_t store_freq(struct device *dev, struct device_attribute *attr,
-			  const char *buf, size_t count)
+static ssize_t set_freq_store(struct device *dev, struct device_attribute *attr,
+			      const char *buf, size_t count)
 {
 	struct devfreq *devfreq = to_devfreq(dev);
 	struct userspace_data *data;
···
 	return err;
 }

-static ssize_t show_freq(struct device *dev, struct device_attribute *attr,
-			 char *buf)
+static ssize_t set_freq_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
 {
 	struct devfreq *devfreq = to_devfreq(dev);
 	struct userspace_data *data;
···
 	return err;
 }

-static DEVICE_ATTR(set_freq, 0644, show_freq, store_freq);
+static DEVICE_ATTR_RW(set_freq);
 static struct attribute *dev_entries[] = {
 	&dev_attr_set_freq.attr,
 	NULL,
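The handler renames above are what make DEVICE_ATTR_RW() usable: the macro pastes "_show" and "_store" onto the attribute name, so the functions must be called set_freq_show()/set_freq_store(). A simplified userspace model of that token-pasting (the struct and macro here are stand-ins, not the kernel definitions):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct device_attribute. */
struct device_attribute {
	const char *name;
	int (*show)(void);
	int (*store)(void);
};

/*
 * Model of DEVICE_ATTR_RW(): builds dev_attr_<name> and wires it to
 * <name>_show()/<name>_store() by token pasting.
 */
#define DEVICE_ATTR_RW(_name) \
	struct device_attribute dev_attr_##_name = \
		{ #_name, _name##_show, _name##_store }

static int set_freq_show(void)  { return 1; }
static int set_freq_store(void) { return 2; }

static DEVICE_ATTR_RW(set_freq);
```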
-14
drivers/devfreq/imx-bus.c
···
 	return 0;
 }

-static int imx_bus_get_dev_status(struct device *dev,
-		struct devfreq_dev_status *stat)
-{
-	struct imx_bus *priv = dev_get_drvdata(dev);
-
-	stat->busy_time = 0;
-	stat->total_time = 0;
-	stat->current_frequency = clk_get_rate(priv->clk);
-
-	return 0;
-}
-
 static void imx_bus_exit(struct device *dev)
 {
 	struct imx_bus *priv = dev_get_drvdata(dev);
···
 		return ret;
 	}

-	priv->profile.polling_ms = 1000;
 	priv->profile.target = imx_bus_target;
-	priv->profile.get_dev_status = imx_bus_get_dev_status;
 	priv->profile.exit = imx_bus_exit;
 	priv->profile.get_cur_freq = imx_bus_get_cur_freq;
 	priv->profile.initial_freq = clk_get_rate(priv->clk);
+1
drivers/devfreq/tegra30-devfreq.c
···
 	.polling_ms = ACTMON_SAMPLING_PERIOD,
 	.target = tegra_devfreq_target,
 	.get_dev_status = tegra_devfreq_get_dev_status,
+	.is_cooling_device = true,
 };

 static int tegra_governor_get_target(struct devfreq *devfreq,
+33
drivers/idle/intel_idle.c
···
 		skl_cstates[6].flags |= CPUIDLE_FLAG_UNUSABLE;	/* C9-SKL */
 }

+/**
+ * skx_idle_state_table_update - Adjust the Sky Lake/Cascade Lake
+ * idle states table.
+ */
+static void __init skx_idle_state_table_update(void)
+{
+	unsigned long long msr;
+
+	rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
+
+	/*
+	 * 000b: C0/C1 (no package C-state support)
+	 * 001b: C2
+	 * 010b: C6 (non-retention)
+	 * 011b: C6 (retention)
+	 * 111b: No Package C state limits.
+	 */
+	if ((msr & 0x7) < 2) {
+		/*
+		 * Uses the CC6 + PC0 latency and 3 times of
+		 * latency for target_residency if the PC6
+		 * is disabled in BIOS. This is consistent
+		 * with how intel_idle driver uses _CST
+		 * to set the target_residency.
+		 */
+		skx_cstates[2].exit_latency = 92;
+		skx_cstates[2].target_residency = 276;
+	}
+}
+
 static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
 {
 	unsigned int mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint) + 1;
···
 		break;
 	case INTEL_FAM6_SKYLAKE:
 		sklh_idle_state_table_update();
+		break;
+	case INTEL_FAM6_SKYLAKE_X:
+		skx_idle_state_table_update();
 		break;
 	}
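The `(msr & 0x7) < 2` test above decodes the package C-state limit field of MSR_PKG_CST_CONFIG_CONTROL. A hedged userspace model of that decode (the MSR value here is just a test input, not read from hardware, and the helper name is invented):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Bits [2:0] of MSR_PKG_CST_CONFIG_CONTROL encode the package C-state
 * limit: 0 = C0/C1 (no package C-states), 1 = C2, 2/3 = C6 variants,
 * 7 = no limit. Package C6 is unreachable when the limit is below 2,
 * which is when the patch above substitutes the CC6 + PC0 parameters.
 */
static bool pkg_c6_disabled(unsigned long long msr)
{
	return (msr & 0x7) < 2;
}
```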
+10
drivers/opp/core.c
···
 	if (!required_opp_tables)
 		return 0;

+	/*
+	 * We only support genpd's OPPs in the "required-opps" for now, as we
+	 * don't know much about other use cases. Error out if the required OPP
+	 * doesn't belong to a genpd.
+	 */
+	if (unlikely(!required_opp_tables[0]->is_genpd)) {
+		dev_err(dev, "required-opps don't belong to a genpd\n");
+		return -ENOENT;
+	}
+
 	/* required-opps not fully initialized yet */
 	if (lazy_linking_pending(opp_table))
 		return -EBUSY;
+3 -24
drivers/opp/of.c
···
 		required_opp_tables[i] = _find_table_of_opp_np(required_np);
 		of_node_put(required_np);

-		if (IS_ERR(required_opp_tables[i])) {
+		if (IS_ERR(required_opp_tables[i]))
 			lazy = true;
-			continue;
-		}
-
-		/*
-		 * We only support genpd's OPPs in the "required-opps" for now,
-		 * as we don't know how much about other cases. Error out if the
-		 * required OPP doesn't belong to a genpd.
-		 */
-		if (!required_opp_tables[i]->is_genpd) {
-			dev_err(dev, "required-opp doesn't belong to genpd: %pOF\n",
-				required_np);
-			goto free_required_tables;
-		}
 	}

 	/* Let's do the linking later on */
···
 	struct dev_pm_opp *opp;
 	int i, ret;

-	/*
-	 * We only support genpd's OPPs in the "required-opps" for now,
-	 * as we don't know much about other cases.
-	 */
-	if (!new_table->is_genpd)
-		return;
-
 	mutex_lock(&opp_table_lock);

 	list_for_each_entry_safe(opp_table, temp, &lazy_opp_tables, lazy) {
···
 	/* All required opp-tables found, remove from lazy list */
 	if (!lazy) {
-		list_del(&opp_table->lazy);
-		INIT_LIST_HEAD(&opp_table->lazy);
+		list_del_init(&opp_table->lazy);

 		list_for_each_entry(opp, &opp_table->opp_list, node)
 			_required_opps_available(opp, opp_table->required_opp_count);
···
 		return ERR_PTR(-ENOMEM);

 	ret = _read_opp_key(new_opp, opp_table, np, &rate_not_available);
-	if (ret < 0) {
+	if (ret < 0 && !opp_table->is_genpd) {
 		dev_err(dev, "%s: opp key field not found\n", __func__);
 		goto free_opp;
 	}
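The list_del_init() change above is behaviorally equivalent to the open-coded list_del() plus INIT_LIST_HEAD() pair it replaces, but it leaves the node self-linked so a later deletion or emptiness check stays safe. A minimal userspace model of the primitives involved (simplified from the kernel's list.h, without the poisoning that plain list_del() performs):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the kernel's doubly linked circular list node. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* list_del_init(): unlink the entry, then re-point it at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	INIT_LIST_HEAD(entry);	/* the step plain list_del() omits */
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}
```

After list_del_init(), list_empty() on the removed node is true, so code like lazy_linking_pending() can keep testing the node without tracking whether it was already removed.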
+1
include/linux/pm_domain.h
···
 	struct notifier_block *power_nb;
 	int cpu;
 	unsigned int performance_state;
+	unsigned int rpm_pstate;
 	ktime_t next_wakeup;
 	void *data;
 };
+3
include/linux/pm_runtime.h
···
  * The possible return values of this function are the same as for
  * pm_runtime_resume() and the runtime PM usage counter of @dev remains
  * incremented in all cases, even if it returns an error code.
+ * Consider using pm_runtime_resume_and_get() instead of it, especially
+ * if its return value is checked by the caller, as this is likely to result
+ * in cleaner code.
  */
 static inline int pm_runtime_get_sync(struct device *dev)
 {
+6 -6
kernel/power/Kconfig
···
 	default ""
 	help
 	  The default resume partition is the partition that the suspend-
-	  to-disk implementation will look for a suspended disk image. 
+	  to-disk implementation will look for a suspended disk image.

-	  The partition specified here will be different for almost every user. 
+	  The partition specified here will be different for almost every user.
 	  It should be a valid swap partition (at least for now) that is turned
-	  on before suspending. 
+	  on before suspending.

 	  The partition specified can be overridden by specifying:

-		resume=/dev/<other device> 
+		resume=/dev/<other device>

-	  which will set the resume partition to the device specified. 
+	  which will set the resume partition to the device specified.

 	  Note there is currently not a way to specify which device to save the
-	  suspended image to. It will simply pick the first available swap 
+	  suspended image to. It will simply pick the first available swap
 	  device.

 config PM_SLEEP
+1 -1
kernel/power/process.c
···
 // SPDX-License-Identifier: GPL-2.0
 /*
- * drivers/power/process.c - Functions for starting/stopping processes on 
+ * drivers/power/process.c - Functions for starting/stopping processes on
  * suspend transitions.
  *
  * Originally from swsusp.
+5 -5
kernel/power/snapshot.c
···
  *
  * Memory bitmap is a structure consisting of many linked lists of
  * objects. The main list's elements are of type struct zone_bitmap
- * and each of them corresonds to one zone. For each zone bitmap
+ * and each of them corresponds to one zone. For each zone bitmap
  * object there is a list of objects of type struct bm_block that
  * represent each blocks of bitmap in which information is stored.
  *
···
 Free_second_object:
 	kfree(bm2);
 Free_first_bitmap:
- 	memory_bm_free(bm1, PG_UNSAFE_CLEAR);
+	memory_bm_free(bm1, PG_UNSAFE_CLEAR);
 Free_first_object:
 	kfree(bm1);
 	return -ENOMEM;
···
 /**
  * swsusp_free - Free pages allocated for hibernation image.
  *
- * Image pages are alocated before snapshot creation, so they need to be
+ * Image pages are allocated before snapshot creation, so they need to be
  * released after resume.
  */
 void swsusp_free(void)
···
  * (@nr_highmem_p points to the variable containing the number of highmem image
  * pages). The pages that are "safe" (ie. will not be overwritten when the
  * hibernation image is restored entirely) have the corresponding bits set in
- * @bm (it must be unitialized).
+ * @bm (it must be uninitialized).
  *
  * NOTE: This function should not be called if there are no highmem image pages.
  */
···
 /**
  * prepare_image - Make room for loading hibernation image.
- * @new_bm: Unitialized memory bitmap structure.
+ * @new_bm: Uninitialized memory bitmap structure.
  * @bm: Memory bitmap with unsafe pages marked.
  *
  * Use @bm to mark the pages that will be overwritten in the process of
+1 -1
kernel/power/swap.c
···
 };

 /**
- * Deompression function that runs in its own thread.
+ * Decompression function that runs in its own thread.
  */
 static int lzo_decompress_threadfn(void *data)
 {