Skip to content

Commit

Permalink
Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/…
Browse files Browse the repository at this point in the history
…git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add hybrid processors support to the intel_pstate driver and
  make it work with more processor models when HWP is disabled, make the
  intel_idle driver use special C6 idle state paremeters when package
  C-states are disabled, add cooling support to the tegra30 devfreq
  driver, rework the TEO (timer events oriented) cpuidle governor,
  extend the OPP (operating performance points) framework to use the
  required-opps DT property in more cases, fix some issues and clean up
  a number of assorted pieces of code.

  Specifics:

   - Make intel_pstate support hybrid processors using abstract
     performance units in the HWP interface (Rafael Wysocki).

   - Add Icelake servers and Cometlake support in no-HWP mode to
     intel_pstate (Giovanni Gherdovich).

   - Make cpufreq_online() error path be consistent with the CPU device
     removal path in cpufreq (Rafael Wysocki).

   - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
     Randy Dunlap, Shaokun Zhang).

   - Make intel_idle use special idle state parameters for C6 when
     package C-states are disabled (Chen Yu).

   - Rework the TEO (timer events oriented) cpuidle governor to address
     some theoretical shortcomings in it (Rafael Wysocki).

   - Drop unneeded semicolon from the TEO governor (Wan Jiabing).

   - Modify the runtime PM framework to accept unassigned suspend and
     resume callback pointers (Ulf Hansson).

   - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

   - Improve device performance states support in the generic power
     domains (genpd) framework (Ulf Hansson).

   - Fix some documentation issues in genpd (Yang Yingliang).

   - Make the operating performance points (OPP) framework use the
     required-opps DT property in use cases that are not related to
     genpd (Hsin-Yi Wang).

   - Make lazy_link_required_opp_table() use list_del_init instead of
     list_del/INIT_LIST_HEAD (Yang Yingliang).

   - Simplify wake IRQs handling in the core system-wide sleep support
     code and clean up some coding style inconsistencies in it (Tian
     Tao, Zhen Lei).

   - Add cooling support to the tegra30 devfreq driver and improve its
     DT bindings (Dmitry Osipenko).

   - Fix some assorted issues in the devfreq core and drivers (Chanwoo
     Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  cpufreq: Make cpufreq_online() call driver->offline() on errors
  opp: Allow required-opps to be used for non genpd use cases
  cpuidle: teo: remove unneeded semicolon in teo_select()
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  cpuidle: teo: Use kerneldoc documentation in admin-guide
  cpuidle: teo: Rework most recent idle duration values treatment
  cpuidle: teo: Change the main idle state selection logic
  cpuidle: teo: Cosmetic modification of teo_select()
  cpuidle: teo: Cosmetic modifications of teo_update()
  ...
  • Loading branch information
torvalds committed Jun 29, 2021
2 parents 1dfb0f4 + 22b65d3 commit 3563f55
Show file tree
Hide file tree
Showing 31 changed files with 775 additions and 481 deletions.
77 changes: 2 additions & 75 deletions Documentation/admin-guide/pm/cpuidle.rst
Original file line number Diff line number Diff line change
Expand Up @@ -347,81 +347,8 @@ for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.

First, it does not use sleep length correction factors, but instead it attempts
to correlate the observed idle duration values with the available idle states
and use that information to pick up the idle state that is most likely to
"match" the upcoming CPU idle interval. Second, it does not take the tasks
that were running on the given CPU in the past and are waiting on some I/O
operations to complete now at all (there is no guarantee that they will run on
the same CPU when they become runnable again) and the pattern detection code in
it avoids taking timer wakeups into account. It also only uses idle duration
values less than the current time till the closest timer (with the scheduler
tick excluded) for that purpose.

Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
the *sleep length*, which is the time until the closest timer event with the
assumption that the scheduler tick will be stopped (that also is the upper bound
on the time until the next CPU wakeup). That value is then used to preselect an
idle state on the basis of three metrics maintained for each idle state provided
by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.

The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
state will "match" the observed (post-wakeup) idle duration if it "matches" the
sleep length. They both are subject to decay (after a CPU wakeup) every time
the target residency of the idle state corresponding to them is less than or
equal to the sleep length and the target residency of the next idle state is
greater than the sleep length (that is, when the idle state corresponding to
them "matches" the sleep length). The ``hits`` metric is increased if the
former condition is satisfied and the target residency of the given idle state
is less than or equal to the observed idle duration and the target residency of
the next idle state is greater than the observed idle duration at the same time
(that is, it is increased when the given idle state "matches" both the sleep
length and the observed idle duration). In turn, the ``misses`` metric is
increased when the given idle state "matches" the sleep length only and the
observed idle duration is too short for its target residency.

The ``early_hits`` metric measures the likelihood that a given idle state will
"match" the observed (post-wakeup) idle duration if it does not "match" the
sleep length. It is subject to decay on every CPU wakeup and it is increased
when the idle state corresponding to it "matches" the observed (post-wakeup)
idle duration and the target residency of the next idle state is less than or
equal to the sleep length (i.e. the idle state "matching" the sleep length is
deeper than the given one).

The governor walks the list of idle states provided by the ``CPUIdle`` driver
and finds the last (deepest) one with the target residency less than or equal
to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
state are compared with each other and it is preselected if the ``hits`` one is
greater (which means that that idle state is likely to "match" the observed idle
duration after CPU wakeup). If the ``misses`` one is greater, the governor
preselects the shallower idle state with the maximum ``early_hits`` metric
(or if there are multiple shallower idle states with equal ``early_hits``
metric which also is the maximum, the shallowest of them will be preselected).
[If there is a wakeup latency constraint coming from the `PM QoS framework
<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
target residency within the sleep length, the deepest idle state with the exit
latency within the constraint is preselected without consulting the ``hits``,
``misses`` and ``early_hits`` metrics.]

Next, the governor takes several idle duration values observed most recently
into consideration and if at least a half of them are greater than or equal to
the target residency of the preselected idle state, that idle state becomes the
final candidate to ask for. Otherwise, the average of the most recent idle
duration values below the target residency of the preselected idle state is
computed and the governor walks the idle states shallower than the preselected
one and finds the deepest of them with the target residency within that average.
That idle state is then taken as the final candidate to ask for.

Still, at this point the governor may need to refine the idle state selection if
it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
generally happens if the target residency of the idle state selected so far is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop). Then, like in the ``menu`` governor
`case <menu-gov_>`_, the sleep length used in the previous computations may not
reflect the real time until the closest timer event and if it really is greater
than that time, a shallower state with a suitable target residency may need to
be selected.

.. kernel-doc:: drivers/cpuidle/governors/teo.c
:doc: teo-description

.. _idle-states-representation:

Expand Down
6 changes: 6 additions & 0 deletions Documentation/admin-guide/pm/intel_pstate.rst
Original file line number Diff line number Diff line change
Expand Up @@ -365,6 +365,9 @@ argument is passed to the kernel in the command line.
inclusive) including both turbo and non-turbo P-states (see
`Turbo P-states Support`_).

This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.

The value of this attribute is not affected by the ``no_turbo``
setting described `below <no_turbo_attr_>`_.

Expand All @@ -374,6 +377,9 @@ argument is passed to the kernel in the command line.
Ratio of the `turbo range <turbo_>`_ size to the size of the entire
range of supported P-states, in percent.

This attribute is present only if the value exposed by it is the same
for all of the CPUs in the system.

This attribute is read-only.

.. _no_turbo_attr:
Expand Down

This file was deleted.

126 changes: 126 additions & 0 deletions Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/devfreq/nvidia,tegra30-actmon.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#

title: NVIDIA Tegra30 Activity Monitor

maintainers:
- Dmitry Osipenko <[email protected]>
- Jon Hunter <[email protected]>
- Thierry Reding <[email protected]>

description: |
The activity monitor block collects statistics about the behaviour of other
components in the system. This information can be used to derive the rate at
which the external memory needs to be clocked in order to serve all requests
from the monitored clients.
properties:
compatible:
enum:
- nvidia,tegra30-actmon
- nvidia,tegra114-actmon
- nvidia,tegra124-actmon
- nvidia,tegra210-actmon

reg:
maxItems: 1

clocks:
maxItems: 2

clock-names:
items:
- const: actmon
- const: emc

resets:
maxItems: 1

reset-names:
items:
- const: actmon

interrupts:
maxItems: 1

interconnects:
minItems: 1
maxItems: 12

interconnect-names:
minItems: 1
maxItems: 12
description:
Should include name of the interconnect path for each interconnect
entry. Consult TRM documentation for information about available
memory clients, see MEMORY CONTROLLER and ACTIVITY MONITOR sections.

operating-points-v2:
description:
Should contain freqs and voltages and opp-supported-hw property, which
is a bitfield indicating SoC speedo ID mask.

"#cooling-cells":
const: 2

required:
- compatible
- reg
- clocks
- clock-names
- resets
- reset-names
- interrupts
- interconnects
- interconnect-names
- operating-points-v2
- "#cooling-cells"

additionalProperties: false

examples:
- |
#include <dt-bindings/memory/tegra30-mc.h>
mc: memory-controller@7000f000 {
compatible = "nvidia,tegra30-mc";
reg = <0x7000f000 0x400>;
clocks = <&clk 32>;
clock-names = "mc";
interrupts = <0 77 4>;
#iommu-cells = <1>;
#reset-cells = <1>;
#interconnect-cells = <1>;
};
emc: external-memory-controller@7000f400 {
compatible = "nvidia,tegra30-emc";
reg = <0x7000f400 0x400>;
interrupts = <0 78 4>;
clocks = <&clk 57>;
nvidia,memory-controller = <&mc>;
operating-points-v2 = <&dvfs_opp_table>;
power-domains = <&domain>;
#interconnect-cells = <0>;
};
actmon@6000c800 {
compatible = "nvidia,tegra30-actmon";
reg = <0x6000c800 0x400>;
interrupts = <0 45 4>;
clocks = <&clk 119>, <&clk 57>;
clock-names = "actmon", "emc";
resets = <&rst 119>;
reset-names = "actmon";
operating-points-v2 = <&dvfs_opp_table>;
interconnects = <&mc TEGRA30_MC_MPCORER &emc>;
interconnect-names = "cpu-read";
#cooling-cells = <2>;
};
15 changes: 14 additions & 1 deletion Documentation/power/runtime_pm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -378,7 +378,11 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:

`int pm_runtime_get_sync(struct device *dev);`
- increment the device's usage counter, run pm_runtime_resume(dev) and
return its result
return its result;
note that it does not drop the device's usage counter on errors, so
consider using pm_runtime_resume_and_get() instead of it, especially
if its return value is checked by the caller, as this is likely to
result in cleaner code.

`int pm_runtime_get_if_in_use(struct device *dev);`
- return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the
Expand Down Expand Up @@ -827,6 +831,15 @@ or driver about runtime power changes. Instead, the driver for the device's
parent must take responsibility for telling the device's driver when the
parent's power state changes.

Note that, in some cases it may not be desirable for subsystems/drivers to call
pm_runtime_no_callbacks() for their devices. This could be because a subset of
the runtime PM callbacks needs to be implemented, a platform dependent PM
domain could get attached to the device or that the device is power managed
through a supplier device link. For these reasons and to avoid boilerplate code
in subsystems/drivers, the PM core allows runtime PM callbacks to be
unassigned. More precisely, if a callback pointer is NULL, the PM core will act
as though there was a callback and it returned 0.

9. Autosuspend, or automatically-delayed suspends
=================================================

Expand Down
Loading

0 comments on commit 3563f55

Please sign in to comment.