Let the governors schedutil, conservative and ondemand to work, if possible
on efficient frequencies only.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This newly introduced flag can be applied by a governor to a CPUFreq
relation, when looking for a frequency within the policy table. The
resolution would then only walk through efficient frequencies.
Even with the flag set, the policy max limit will still be honoured. If no
efficient frequencies can be found within the limits of the policy, an
inefficient one would be returned.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some SoCs such as the sd855 have OPPs within the same policy whose cost is
higher than others with a higher frequency. Those OPPs are inefficients
and it might be interesting for a governor to not use them.
cpufreq_table_set_inefficient() allows the caller to identify a specified
frequency as being inefficient. Inefficient frequencies are only supported
on sorted tables.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull more ARM cpufreq changes for v5.15-rc1 from Viresh Kumar:
"This adds a new cpufreq driver for Mediatek, which had been going
through reviews since last one year."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
Add of_perf_domain_get_sharing_cpumask function to group cpu
to specific performance domain.
Signed-off-by: Hector.Yuan <hector.yuan@mediatek.com>
[ Viresh: create separate routine parse_perf_domain() and always set the
cpumask. ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
This isn't used anymore, get rid of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Many cpufreq drivers register with the energy model for each policy and
do exactly the same thing. Follow the footsteps of thermal-cooling, to
get it done from the cpufreq core itself.
Provide a new callback, which will be called, if present, by the cpufreq
core at the right moment (more on that in the code's comment). Also
provide a generic implementation that uses dev_pm_opp_of_register_em().
This also allows us to register with the EM at a later point of time,
compared to ->init(), from where the EM core can access cpufreq policy
directly using cpufreq_cpu_get() type of helpers and perform other work,
like marking few frequencies inefficient, this will be done separately.
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Commit e3c0623608 ("cpufreq: add cpufreq_driver_resolve_freq()")
introduced this callback, back in 2016, for drivers that provide the
->target() callback.
The kernel hasn't seen a single user of it in the past 5 years and
it is not likely to be used any time soon.
Remove it for now.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that all users of ->stop_cpu() have been migrated to using other
callbacks, drop it from the core.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Minor edits in the subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This flag is set by one of the drivers but it isn't used in the code
otherwise. Remove the unused flag and update the driver.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
During cpufreq driver's registration, if the ->init() callback for all
the CPUs fail then there is not much point in keeping the driver around
as it will only account for more of unnecessary noise, for example
cpufreq core will try to suspend/resume the driver which never got
registered properly.
The removal of such a driver is avoided if the driver carries the
CPUFREQ_STICKY flag. This was added way back [1] in 2004 and perhaps no
one should ever need it now. A lot of drivers do set this flag, probably
because they just copied it from other drivers.
This was added earlier for some platforms [2] because their cpufreq
drivers were getting registered before the CPUs were registered with
subsys framework. And hence they used to fail.
The same isn't true anymore though. The current code flow in the kernel
is:
start_kernel()
-> kernel_init()
-> kernel_init_freeable()
-> do_basic_setup()
-> driver_init()
-> cpu_dev_init()
-> subsys_system_register() //For CPUs
-> do_initcalls()
-> cpufreq_register_driver()
Clearly, the CPUs will always get registered with subsys framework
before any cpufreq driver can get probed. Remove the flag and update the
relevant drivers.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/include/linux/cpufreq.h?id=7cc9f0d9a1ab04cedc60d64fd8dcf7df224a3b4d # [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/arch/arm/mach-sa1100/cpu-sa1100.c?id=f59d3bbe35f6268d729f51be82af8325d62f20f5 # [2]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
First off, some cpufreq drivers (eg. intel_pstate) can pass hints
beyond the current target frequency to the hardware and there are no
provisions for doing that in the cpufreq framework. In particular,
today the driver has to assume that it should not allow the frequency
to fall below the one requested by the governor (or the required
capacity may not be provided) which may not be the case and which may
lead to excessive energy usage in some scenarios.
Second, the hints passed by these drivers to the hardware need not be
in terms of the frequency, so representing the utilization numbers
coming from the scheduler as frequency before passing them to those
drivers is not really useful.
Address the two points above by adding a special-purpose replacement
for the ->fast_switch callback, called ->adjust_perf, allowing the
governor to pass abstract performance level (rather than frequency)
values for the minimum (required) and target (desired) performance
along with the CPU capacity to compare them to.
Also update the schedutil governor to use the new callback instead
of ->fast_switch if present and if the utilization mertics are
frequency-invariant (that is requisite for the direct mapping
between the utilization and the CPU performance levels to be a
reasonable approximation).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Add a new field to be set when the CPUFREQ_GOV_STRICT_TARGET flag is
set for the current governor to struct cpufreq_policy, so that the
drivers needing to check CPUFREQ_GOV_STRICT_TARGET do not have to
access the governor object during every frequency transition.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Introduce a new governor flag, CPUFREQ_GOV_STRICT_TARGET, for the
governors that want the target frequency to be set exactly to the
given value without leaving any room for adjustments on the hardware
side and set this flag for the powersave and performance governors.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
A new cpufreq governor flag will be added subsequently, so replace
the bool dynamic_switching fleid in struct cpufreq_governor with a
flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for
the "dynamic switching" governors instead of it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The restore_freq field in struct cpufreq_policy is only used by
__target_index() in one place and a local variable in that function
may as well be used instead of it, so drop it and modify
__target_index() accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Add a helper function to test the flags of the cpufreq driver in use
againt a given flags mask.
In particular, this will be needed to test the
CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag in the schedutil
governor.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Generally, a cpufreq driver may need to update some internal upper
and lower frequency boundaries on policy max and min changes,
respectively, but currently this does not work if the target
frequency does not change along with the policy limit.
Namely, if the target frequency does not change along with the
policy min or max, the "target_freq == policy->cur" check in
__cpufreq_driver_target() prevents driver callbacks from being
invoked and they do not even have a chance to update the
corresponding internal boundary.
This particularly affects the "powersave" and "performance"
governors that always set the target frequency to one of the
policy limits and it never changes when the other limit is updated.
To allow cpufreq the drivers needing to update internal frequency
boundaries on policy limits changes to avoid this issue, introduce
a new driver flag, CPUFREQ_NEED_UPDATE_LIMITS, that (when set) will
neutralize the check mentioned above.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Compared to other arch_* functions, arch_set_freq_scale() has an atypical
weak definition that can be replaced by a strong architecture specific
implementation.
The more typical support for architectural functions involves defining
an empty stub in a header file if the symbol is not already defined in
architecture code. Some examples involve:
- #define arch_scale_freq_capacity topology_get_freq_scale
- #define arch_scale_freq_invariant topology_scale_freq_invariant
- #define arch_scale_cpu_capacity topology_get_cpu_scale
- #define arch_update_cpu_topology topology_update_cpu_topology
- #define arch_scale_thermal_pressure topology_get_thermal_pressure
- #define arch_set_thermal_pressure topology_set_thermal_pressure
Bring arch_set_freq_scale() in line with these functions by renaming it to
topology_set_freq_scale() in the arch topology driver, and by defining the
arch_set_freq_scale symbol to point to the new function for arm and arm64.
While there are other users of the arch_topology driver, this patch defines
arch_set_freq_scale for arm and arm64 only, due to their existing
definitions of arch_scale_freq_capacity. This is the getter function of the
frequency invariance scale factor and without a getter function, the
setter function - arch_set_freq_scale() has not purpose.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com> (BL_SWITCHER and topology parts)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The passed cpumask arguments to arch_set_freq_scale() and
arch_freq_counters_available() are only iterated over, so reflect this
in the prototype. This also allows to pass system cpumasks like
cpu_online_mask without getting a warning.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that the update of the FI scale factor is done in cpufreq core for
selected functions - target(), target_index() and fast_switch(),
we can provide feedback to the task scheduler and architecture code
on whether cpufreq supports FI.
For this purpose provide an external function to expose whether the
cpufreq drivers support FI, by using a static key.
The logic behind the enablement of cpufreq-based invariance is as
follows:
- cpufreq-based invariance is disabled by default
- cpufreq-based invariance is enabled if any of the callbacks
above is implemented while the unsupported setpolicy() is not
The cpufreq_supports_freq_invariance() function only returns whether
cpufreq is instrumented with the arch_set_freq_scale() calls that
result in support for frequency invariance. Due to the lack of knowledge
on whether the implementation of arch_set_freq_scale() actually results
in the setting of a scale factor based on cpufreq information, it is up
to the architecture code to ensure the setting and provision of the
scale factor to the scheduler.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The relation can't be invalid here, so if it turns out to be invalid,
just WARN_ON_ONCE() and return 0.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.
Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.
After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that the HWP algorithm is generally allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate in
the entire available range of P-states, while without HWP a CPU can
run in a P-state above the requested one if the latter falls into the
range of turbo P-states (referred to as the turbo range) or if the
P-states of all CPUs in one package are coordinated with each other
at the hardware level.
[Note that in principle the HWP floor may not be taken into account
by the processor if it falls into the turbo range, in which case the
processor has a license to choose any P-state, either below or above
the HWP floor, just like a non-HWP processor in the case when the
target P-state falls into the turbo range.]
With this change applied, intel_pstate in the passive mode assumes
complete control over the HWP request MSR and concurrent changes of
that MSR (eg. via the direct MSR access interface) are overridden by
it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Pull ARM cpufreq driver changes for v5.9-rc1 from Viresh Kumar:
"Here are the details:
- Adaptive voltage scaling (AVS) support and minor cleanups for
brcmstb driver (Florian Fainelli and Markus Mayer).
- A new tegra driver and cleanup for the existing one (Sumit Gupta and
Jon Hunter).
- Bandwidth level support for Qcom driver along with OPP changes (Sibi
Sankar).
- Cleanups to sti, cpufreq-dt, ap806, CPPC drivers (Viresh Kumar, Lee
Jones, Ivan Kokshaysky, Sven Auhagen, and Xin Hao).
- Make schedutil default governor for ARM (Valentin Schneider).
- Fix dependency issues for imx (Walter Lozano).
- Cleanup around cached_resolved_idx in cpufreq core (Viresh Kumar)."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: make schedutil the default for arm and arm64
cpufreq: cached_resolved_idx can not be negative
cpufreq: Add Tegra194 cpufreq driver
dt-bindings: arm: Add NVIDIA Tegra194 CPU Complex binding
cpufreq: imx: Select NVMEM_IMX_OCOTP
cpufreq: sti-cpufreq: Fix some formatting and misspelling issues
cpufreq: tegra186: Simplify probe return path
cpufreq: CPPC: Reuse caps variable in few routines
cpufreq: ap806: fix cpufreq driver needs ap cpu clk
cpufreq: cppc: Reorder code and remove apply_hisi_workaround variable
cpufreq: dt: fix oops on armada37xx
cpufreq: brcmstb-avs-cpufreq: send S2_ENTER / S2_EXIT commands to AVS
cpufreq: brcmstb-avs-cpufreq: Support polling AVS firmware
cpufreq: brcmstb-avs-cpufreq: more flexible interface for __issue_avs_command()
cpufreq: qcom: Disable fast switch when scaling DDR/L3
cpufreq: qcom: Update the bandwidth levels on frequency change
OPP: Add and export helper to set bandwidth
cpufreq: blacklist SC7180 in cpufreq-dt-platdev
cpufreq: blacklist SDM845 in cpufreq-dt-platdev
It is not possible for cached_resolved_idx to be invalid here as the
cpufreq core always sets index to a positive value.
Change its type to unsigned int and fix qcom usage a bit.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Currently, most CPUFreq governors are registered at the core_initcall
time when the given governor is the default one, and the module_init
time otherwise.
In preparation for letting users specify the default governor on the
kernel command line, change all of them to be registered at the
core_initcall unconditionally, as it is already the case for the
schedutil and performance governors. This will allow us to assume
that builtin governors have been registered before the built-in
CPUFreq drivers probe.
And since all governors have similar init/exit patterns now, introduce
two new macros, cpufreq_governor_{init,exit}(), to factorize the code.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Macro 'for_each_active_policy()' is defined internally. To avoid some
cpufreq driver needing this macro to iterate over all the policies in
'.set_boost' callback, we redefine '.set_boost' to act on only one
policy and pass the policy as an argument.
'cpufreq_boost_trigger_state()' iterates over all the policies to set
boost for the system.
This is preparation for adding SW BOOST support for CPPC.
To protect Boost enable/disable by sysfs from CPU online/offline,
add 'cpu_hotplug_lock' before calling '.set_boost' for each CPU.
Also move the lock from 'set_boost()' to 'store_cpb()' in
acpi_cpufreq.
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Delete the duplicate "to", possibly double-typed.
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add weak function to return the hardware maximum frequency of a CPU,
with the default implementation returning cpuinfo.max_freq, which is
the best information we can generically get from the cpufreq framework.
The default can be overwritten by a strong function in platforms
that want to provide an alternative implementation, with more accurate
information, obtained either from hardware or firmware.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The cpufreq_global_kobject is only used internally by cpufreq.c
after commit 2361be2366 ("cpufreq: Don't create empty
/sys/devices/system/cpu/cpufreq directory").
Make it static.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
[ rjw: Add empty line after cpufreq_global_kobject definition ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the process of modifying a cpufreq policy, the cpufreq core makes
a copy of it including all of the internals which is stored on the
CPU stack. Because struct cpufreq_policy is relatively large, this
may cause the size of the stack frame to exceed the 2 KB limit and
so the GCC complains when -Wframe-larger-than= is used.
In fact, it is not necessary to copy the entire policy structure
in order to modify it, however.
First, because cpufreq_set_policy() obtains the min and max policy
limits from frequency QoS now, it is not necessary to pass the limits
to it from the callers. The only things that need to be passed to it
from there are the new governor pointer or (if there is a built-in
governor in the driver) the "policy" value representing the governor
choice. They both can be passed as individual arguments, though, so
make cpufreq_set_policy() take them this way and rework its callers
accordingly. This avoids making copies of cpufreq policies in the
callers of cpufreq_set_policy().
Second, cpufreq_set_policy() still needs to pass the new policy
data to the ->verify() callback of the cpufreq driver whose task
is to sanitize the min and max policy limits. It still does not
need to make a full copy of struct cpufreq_policy for this purpose,
but it needs to pass a few items from it to the driver in case they
are needed (different drivers have different needs in that respect
and all of them have to be covered). For this reason, introduce
struct cpufreq_policy_data to hold copies of the members of
struct cpufreq_policy used by the existing ->verify() driver
callbacks and pass a pointer to a temporary structure of that
type to ->verify() (instead of passing a pointer to full struct
cpufreq_policy to it).
While at it, notice that intel_pstate and longrun don't really need
to verify the "policy" value in struct cpufreq_policy, so drop those
check from them to avoid copying "policy" into struct
cpufreq_policy_data (which allows it to be slightly smaller).
Also while at it fix up white space in a couple of places and make
cpufreq_set_policy() static (as it can be so).
Fixes: 3000ce3c52 ("cpufreq: Use per-policy frequency QoS")
Link: https://lore.kernel.org/linux-pm/CAMuHMdX6-jb1W8uC2_237m8ctCpsnGp=JCxqt8pCWVqNXHmkVg@mail.gmail.com
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.
Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.
If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point. Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever. In particular, a system-wide
deadlock may occur during CPU online as a result of that.
The failing scenario is as follows. CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu). It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline. The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3. When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.
To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.
Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.
Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.
Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Replace the CPU device PM QoS used for the management of min and max
frequency constraints in cpufreq (and its users) with per-policy
frequency QoS to avoid problems with cpufreq policies covering
more then one CPU.
Namely, a cpufreq driver is registered with the subsys interface
which calls cpufreq_add_dev() for each CPU, starting from CPU0, so
currently the PM QoS notifiers are added to the first CPU in the
policy (i.e. CPU0 in the majority of cases).
In turn, when the cpufreq driver is unregistered, the subsys interface
doing that calls cpufreq_remove_dev() for each CPU, starting from CPU0,
and the PM QoS notifiers are only removed when cpufreq_remove_dev() is
called for the last CPU in the policy, say CPUx, which as a rule is
not CPU0 if the policy covers more than one CPU. Then, the PM QoS
notifiers cannot be removed, because CPUx does not have them, and
they are still there in the device PM QoS notifiers list of CPU0,
which prevents new PM QoS notifiers from being registered for CPU0
on the next attempt to register the cpufreq driver.
The same issue occurs when the first CPU in the policy goes offline
before unregistering the driver.
After this change it does not matter which CPU is the policy CPU at
the driver registration time and whether or not it is online all the
time, because the frequency QoS is per policy and not per CPU.
Fixes: 67d874c3b2 ("cpufreq: Register notifiers with the PM QoS framework")
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Diagnosed-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/linux-pm/5ad2624194baa2f53acc1f1e627eb7684c577a19.1562210705.git.viresh.kumar@linaro.org/T/#md2d89e95906b8c91c15f582146173dce2e86e99f
Link: https://lore.kernel.org/linux-pm/20191017094612.6tbkwoq4harsjcqv@vireshk-i7/T/#m30d48cc23b9a80467fbaa16e30f90b3828a5a29b
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
No driver makes reference to these events now, remove them and the code
related to them.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This effectively reverts some changes made by commit f9f41e3ef9
("cpufreq: Remove policy create/remove notifiers").
We have a new use case for policy create/remove notifiers (for
allocating/freeing QoS requests per policy), so add them back.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq:
cpufreq: Make cpufreq_generic_init() return void
cpufreq: imx-cpufreq-dt: Add i.MX8MN support
cpufreq: Add QoS requests for userspace constraints
cpufreq: intel_pstate: Reuse refresh_frequency_limits()
cpufreq: Register notifiers with the PM QoS framework
PM / QoS: Add support for MIN/MAX frequency constraints
PM / QOS: Pass request type to dev_pm_qos_read_value()
PM / QOS: Rename __dev_pm_qos_read_value() and dev_pm_qos_raw_read_value()
PM / QOS: Pass request type to dev_pm_qos_{add|remove}_notifier()
It always returns 0 (success) and its return type should really be void.
Over that, many drivers have added error handling code based on its
return value, which is not required at all.
Change its return type to void and update all the callers.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This implements QoS requests to manage userspace configuration of min
and max frequency.
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: syzbot <syzbot+de771ae9390dffed7266@syzkaller.appspotmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The implementation of intel_pstate_update_max_freq() is quite similar to
refresh_frequency_limits(), lets reuse it.
Finding minimum of policy->user_policy.max and policy->cpuinfo.max_freq
in intel_pstate_update_max_freq() is redundant as cpufreq_set_policy()
will call the ->verify() callback of intel-pstate driver, which will do
this comparison anyway and so dropping it from
intel_pstate_update_max_freq() doesn't harm.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Register notifiers for min/max frequency constraints with the PM QoS
framework. The constraints are also taken into consideration in
cpufreq_set_policy().
This also relocates cpufreq_policy_put_kobj() as it is required to be
called from cpufreq_policy_alloc() now.
refresh_frequency_limits() is updated to avoid calling
cpufreq_set_policy() for inactive policies and handle_update() is
updated to have proper locking in place.
No constraints are added until now though.
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq:
cpufreq: Avoid calling cpufreq_verify_current_freq() from handle_update()
cpufreq: Consolidate cpufreq_update_current_freq() and __cpufreq_get()
cpufreq: Don't skip frequency validation for has_target() drivers
cpufreq: Use has_target() instead of !setpolicy
cpufreq: Remove redundant !setpolicy check
cpufreq: Move the IS_ENABLED(CPU_THERMAL) macro into a stub
cpufreq: s5pv210: Don't flood kernel log after cpufreq change
cpufreq: pcc-cpufreq: Fail initialization if driver cannot be registered
cpufreq: add driver for Raspberry Pi
cpufreq: Switch imx7d to imx-cpufreq-dt for speed grading
cpufreq: imx-cpufreq-dt: Remove global platform match list
cpufreq: brcmstb-avs-cpufreq: Fix types for voltage/frequency
cpufreq: brcmstb-avs-cpufreq: Fix initial command check
cpufreq: armada-37xx: Remove set but not used variable 'freq'
cpufreq: imx-cpufreq-dt: Fix no OPPs available on unfused parts
dt-bindings: imx-cpufreq-dt: Document opp-supported-hw usage
cpufreq: Add imx-cpufreq-dt driver
cpufreq_online() and cpufreq_offline() [un]register the driver as
a cooling device. This is done if the driver is flagged as a cooling
device in addition with an IS_ENABLED() check to compile out the branching
code.
Group this test in a stub function added in the cpufreq header instead
of having the IS_ENABLED() in the code.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, the notifiers are called once for each CPU of the policy->cpus
cpumask. It would be more optimal if the notifier can be called only
once and all the relevant information be provided to it. Out of the 23
drivers that register for the transition notifiers today, only 4 of them
do per-cpu updates and the callback for the rest can be called only once
for the policy without any impact.
This would also avoid multiple function calls to the notifier callbacks
and reduce multiple iterations of notifier core's code (which does
locking as well).
This patch adds pointer to the cpufreq policy to the struct
cpufreq_freqs, so the notifier callback has all the information
available to it with a single call. The five drivers which perform
per-cpu updates are updated to use the cpufreq policy. The freqs->cpu
field is redundant now and is removed.
Acked-by: David S. Miller <davem@davemloft.net> (sparc)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
While the cpuinfo.max_freq value doesn't really matter for
intel_pstate in the active mode, in the passive mode it is used by
governors as the maximum physical frequency of the CPU and the
results of governor computations generally depend on it. Also it
is made available to user space via sysfs and it should match the
current HW configuration.
For this reason, make intel_pstate update cpuinfo.max_freq for all
CPUs if it detects a global change of turbo frequency settings from
"disable" to "enable" or the other way associated with a _PPC change
notification from the platform firmware.
Note that policy_is_inactive(), cpufreq_cpu_acquire(),
cpufreq_cpu_release(), and cpufreq_set_policy() need to be made
available to it for this purpose.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
In some cases, the platform firmware disables or enables turbo
frequencies for all CPUs globally before triggering a _PPC change
notification for one of them. Obviously, that global change affects
all CPUs, not just the notified one, and it needs to be acted upon by
cpufreq.
The intel_pstate driver is able to detect such global changes of
the settings, but it also needs to update policy limits for all
CPUs if that happens, in particular if turbo frequencies are
enabled globally - to allow them to be used.
For this reason, introduce a new cpufreq driver callback to be
invoked on _PPC notifications, if present, instead of simply
calling cpufreq_update_policy() for the notified CPU and make
intel_pstate use it to trigger policy updates for all CPUs
in the system if global settings change.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The cpufreq core doesn't remove the cpufreq policy anymore on CPU
offline operation, rather that happens when the CPU device gets
unregistered from the kernel. This allows faster recovery when the CPU
comes back online. This is also very useful during system wide
suspend/resume where we offline all non-boot CPUs during suspend and
then bring them back on resume.
This commit takes the same idea a step ahead to allow drivers to do
light weight tear-down and bring-up during CPU offline and online
operations.
A new set of callbacks is introduced, online/offline(). online() gets
called when the first CPU of an inactive policy is brought up and
offline() gets called when all the CPUs of a policy are offlined.
The existing init/exit() callback get called on policy
creation/destruction. They also get called instead of online/offline()
callbacks if the online/offline() callbacks aren't provided.
This also moves around some code to get executed only for the new-policy
case going forward.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All cpufreq drivers do similar things to register as a cooling device.
Provide a cpufreq driver flag so drivers can just ask the cpufreq core
to register the cooling device on their behalf. This allows us to get
rid of duplicated code in the drivers.
In order to allow this, we add a struct thermal_cooling_device pointer
to struct cpufreq_policy so that drivers don't need to store it in a
private data structure.
Suggested-by: Stephen Boyd <swboyd@chromium.org>
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Tested-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq_global_kobject is created using kobject_create_and_add()
helper, which assigns the kobj_type as dynamic_kobj_ktype and show/store
routines are set to kobj_attr_show() and kobj_attr_store().
These routines pass struct kobj_attribute as an argument to the
show/store callbacks. But all the cpufreq files created using the
cpufreq_global_kobject expect the argument to be of type struct
attribute. Things work fine currently as no one accesses the "attr"
argument. We may not see issues even if the argument is used, as struct
kobj_attribute has struct attribute as its first element and so they
will both get same address.
But this is logically incorrect and we should rather use struct
kobj_attribute instead of struct global_attr in the cpufreq core and
drivers and the show/store callbacks should take struct kobj_attribute
as argument instead.
This bug is caught using CFI CLANG builds in android kernel which
catches mismatch in function prototypes for such callbacks.
Reported-by: Donghee Han <dh.han@samsung.com>
Reported-by: Sangkyu Kim <skwith.kim@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Minor clean-up to use BIT() and keep checkpatch happy. Clean up the
comment formatting while we're at it to make it easier to read.
Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This routine checks if the CPU running this code belongs to the policy
of the target CPU or if not, can it do remote DVFS for it remotely. But
the current name of it implies as if it is only about doing remote
updates.
Rename it to make it more relevant.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This isn't used anymore. Remove the helper and update documentation
accordingly.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
By design, cpufreq drivers are responsible for calling
cpufreq_frequency_table_cpuinfo() from their ->init()
callbacks to validate the frequency table.
However, if a cpufreq driver is buggy and fails to do so properly, it
lead to unexpected behavior of the driver or the cpufreq core at a
later point in time. It would be better if the core could
validate the frequency table during driver initialization.
To that end, introduce cpufreq_table_validate_and_sort() and make
the cpufreq core call it right after invoking the ->init() callback
of the driver and destroy the cpufreq policy if the table is invalid.
For the time being the validation of the table happens twice, once
from the driver and then from the core. The individual drivers will
be updated separately to drop table validation if they don't need it
for other reasons.
The frequency table is marked "sorted" or "unsorted" by the new helper
now instead of in cpufreq_table_validate_and_show(), as it should only
be done after validating the table (which the drivers won't do going
forward).
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject/changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pointer subtraction is slow and tedious. Therefore, replace all instances
where cpufreq_for_each_{valid_,}entry loops contained such substractions
with an iteration macro providing an index to the frequency_table entry.
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/r/20180120020237.GM13338@ZenIV.linux.org.uk
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 890da9cf09 (Revert "x86: do not use cpufreq_quick_get()
for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
on x86 can be either the nominal CPU frequency (which is constant)
or the frequency most recently requested by a scaling governor in
cpufreq, depending on the cpufreq configuration. That is somewhat
inconsistent and is different from what it was before 4.13, so in
order to restore the previous behavior, make it report the current
CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
To that end, modify the /proc/cpuinfo implementation on x86 to use
aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
registers, if available, and use their values to compute the CPU
frequency to be reported as "cpu MHz".
However, do that carefully enough to avoid accumulating delays that
lead to unacceptable access times for /proc/cpuinfo on systems with
many CPUs. Run aperfmperf_snapshot_khz() once on all CPUs
asynchronously at the /proc/cpuinfo open time, add a single delay
upfront (if necessary) at that point and simply compute the current
frequency while running show_cpuinfo() for each individual CPU.
Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
the default delay between consecutive APERF and MPERF reads to 10 ms,
which should be sufficient to get large enough numbers for the
frequency computation in all cases.
Fixes: 890da9cf09 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Frequency-invariant accounting support based on the ratio of current
frequency and maximum supported frequency is an optional feature an arch
can implement.
Since there are cpufreq drivers (e.g. cpufreq-dt) which can be build for
different arch's a default implementation of the frequency-invariance
setter function arch_set_freq_scale() is needed.
This default implementation is an empty weak function which will be
overwritten by a strong function in case the arch provides one.
The setter function passes the cpumask of related (to the frequency
change) cpus (online and offline cpus), the (new) current frequency and
the maximum supported frequency.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq-sched:
cpufreq: schedutil: Always process remote callback with slow switching
cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
cpufreq: Return 0 from ->fast_switch() on errors
cpufreq: Simplify cpufreq_can_do_remote_dvfs()
cpufreq: Process remote callbacks from any CPU if the platform permits
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: schedutil: Use unsigned int for iowait boost
cpufreq: schedutil: Make iowait boost more energy efficient
The if () in cpufreq_can_do_remote_dvfs() is superfluous, so drop
it and simply return the value of the expression under it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On many platforms, CPUs can do DVFS across cpufreq policies. i.e CPU
from policy-A can change frequency of CPUs belonging to policy-B.
This is quite common in case of ARM platforms where we don't
configure any per-cpu register.
Add a flag to identify such platforms and update
cpufreq_can_do_remote_dvfs() to allow remote callbacks if this flag is
set.
Also enable the flag for cpufreq-dt driver which is used only on ARM
platforms currently.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.
This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.
The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.
The intel_pstate driver is updated to always reject remote callbacks.
This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.
The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:
- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
next tick.
Based on initial work from Steve Muckle.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy->transition_latency field is used for multiple purposes
today and its not straight forward at all. This is how it is used:
A. Set the correct transition_latency value.
B. Set it to CPUFREQ_ETERNAL because:
1. We don't want automatic dynamic switching (with
ondemand/conservative) to happen at all.
2. We don't know the transition latency.
This patch handles the B.1. case in a more readable way. A new flag for
the cpufreq drivers is added to disallow use of cpufreq governors which
have dynamic_switching flag set.
All the current cpufreq drivers which are setting transition_latency
unconditionally to CPUFREQ_ETERNAL are updated to use it. They don't
need to set transition_latency anymore.
There shouldn't be any functional change after this patch.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is no limitation in the ondemand or conservative governors which
disallow the transition_latency to be greater than 10 ms.
The max_transition_latency field is rather used to disallow automatic
dynamic frequency switching for platforms which didn't wanted these
governors to run.
Replace max_transition_latency with a boolean (dynamic_switching) and
check for transition_latency == CPUFREQ_ETERNAL along with that. This
makes it pretty straight forward to read/understand now.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy->transition_delay_us field is used only by the schedutil
governor currently, and this field describes how fast the driver wants
the cpufreq governor to change CPUs frequency. It should rather be a
common thing across all governors, as it doesn't have any schedutil
dependency here.
Create a new helper cpufreq_policy_transition_delay_us() to get the
transition delay across all governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq core and governors aren't supposed to set a limit on how
fast we want to try changing the frequency. This is currently done for
the legacy governors with help of min_sampling_rate.
At worst, we may end up setting the sampling rate to a value lower than
the rate at which frequency can be changed and then one of the CPUs in
the policy will be only changing frequency for ever.
But that is something for the user to decide and there is no need to
have special handling for such cases in the core. Leave it for the user
to figure out.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull thermal management updates from Zhang Rui:
- Improve thermal cpu_cooling interaction with cpufreq core.
The cpu_cooling driver is designed to use CPU frequency scaling to
avoid high thermal states for a platform. But it wasn't glued really
well with cpufreq core.
For example clipped-cpus is copied from the policy structure and its
much better to use the policy->cpus (or related_cpus) fields directly
as they may have got updated. Not that things were broken before this
series, but they can be optimized a bit more.
This series tries to improve interactions between cpufreq core and
cpu_cooling driver and does some fixes/cleanups to the cpu_cooling
driver. (Viresh Kumar)
- A couple of fixes and cleanups in thermal core and imx, hisilicon,
bcm_2835, int340x thermal drivers. (Arvind Yadav, Dan Carpenter,
Sumeet Pawnikar, Srinivas Pandruvada, Willy WOLFF)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (24 commits)
thermal: bcm2835: fix an error code in probe()
thermal: hisilicon: Handle return value of clk_prepare_enable
thermal: imx: Handle return value of clk_prepare_enable
thermal: int340x: check for sensor when PTYP is missing
Thermal/int340x: Fix few typos and kernel-doc style
thermal: fix source code documentation for parameters
thermal: cpu_cooling: Replace kmalloc with kmalloc_array
thermal: cpu_cooling: Rearrange struct cpufreq_cooling_device
thermal: cpu_cooling: 'freq' can't be zero in cpufreq_state2power()
thermal: cpu_cooling: don't store cpu_dev in cpufreq_cdev
thermal: cpu_cooling: get_level() can't fail
thermal: cpu_cooling: create structure for idle time stats
thermal: cpu_cooling: merge frequency and power tables
thermal: cpu_cooling: get rid of 'allowed_cpus'
thermal: cpu_cooling: OPPs are registered for all CPUs
thermal: cpu_cooling: store cpufreq policy
cpufreq: create cpufreq_table_count_valid_entries()
thermal: cpu_cooling: use cpufreq_policy to register cooling device
thermal: cpu_cooling: get rid of a variable in cpufreq_set_cur_state()
thermal: cpu_cooling: remove cpufreq_cooling_get_level()
...
The goal of this change is to give users a uniform and meaningful
result when they read /sys/...cpufreq/scaling_cur_freq
on modern x86 hardware, as compared to what they get today.
Modern x86 processors include the hardware needed
to accurately calculate frequency over an interval --
APERF, MPERF, and the TSC.
Here we provide an x86 routine to make this calculation
on supported hardware, and use it in preference to any
driver driver-specific cpufreq_driver.get() routine.
MHz is computed like so:
MHz = base_MHz * delta_APERF / delta_MPERF
MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive invocations
of aperfmperf_khz_on_cpu(), which are expected to to
happen on-demand when users read sysfs attribute
cpufreq/scaling_cur_freq.
As with previous methods of calculating MHz,
idle time is excluded.
base_MHz above is from TSC calibration global "cpu_khz".
This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or if the Linux cpufreq sub-system is or is-not installed.
When this routine is invoked more frequently, the measurement
interval becomes shorter. However, the code limits re-computation
to 10ms intervals so that average frequency remains meaningful.
Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.
Signed-off-by: Len Brown <len.brown@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We need such a routine at two places already, lets create one.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).
That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.
Make intel_pstate set transition_delay_us to 500.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Those were added by:
commit fcd7af917a ("cpufreq: stats: handle cpufreq_unregister_driver()
and suspend/resume properly")
but aren't used anymore since:
commit 1aefc75b24 ("cpufreq: stats: Make the stats code non-modular").
Remove them. Also remove the redundant parameter to the respective
routines.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The return value of cpufreq_update_policy() is never used, so make
it void.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Allow CPUfreq statistics to be cleared by writing anything to
/sys/.../cpufreq/stats/reset.
Signed-off-by: Markus Mayer <mmayer@broadcom.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Skip invalid entries when searching the frequency. This fixes cpufreq
at least on loongson2 MIPS board.
Fixes: da0c6dc00c (cpufreq: Handle sorted frequency tables more efficiently)
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.8+ <stable@vger.kernel.org> # 4.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cpufreq governors may need to know what a particular target frequency
maps to in the driver without necessarily wanting to set the frequency.
Support this operation via a new cpufreq API,
cpufreq_driver_resolve_freq(). This API returns the lowest driver
frequency equal or greater than the target frequency
(CPUFREQ_RELATION_L), subject to any policy (min/max) or driver
limitations. The mapping is also cached in the policy so that a
subsequent fast_switch operation can avoid repeating the same lookup.
The API will call a new cpufreq driver callback, resolve_freq(), if it
has been registered by the driver. Otherwise the frequency is resolved
via cpufreq_frequency_table_target(). Rather than require ->target()
style drivers to provide a resolve_freq() callback it is left to the
caller to ensure that the driver implements this callback if necessary
to use cpufreq_driver_resolve_freq().
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq drivers aren't required to provide a sorted frequency table
today, and even the ones which provide a sorted table aren't handled
efficiently by cpufreq core.
This patch adds infrastructure to verify if the freq-table provided by
the drivers is sorted or not, and use efficient helpers if they are
sorted.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This routine can't fail unless the frequency table is invalid and
doesn't contain any valid entries.
Make it return the index and WARN() in case it is used for an invalid
table.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy already has this pointer set, use it instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Most of the callers of cpufreq_frequency_get_table() already have the
pointer to a valid 'policy' structure and they don't really need to go
through the per-cpu variable first and then a check to validate the
frequency, in order to find the freq-table for the policy.
Directly use the policy->freq_table field instead for them.
Only one user of that API is left after above changes, cpu_cooling.c and
it accesses the freq_table in a racy way as the policy can get freed in
between.
Fix it by using cpufreq_cpu_get() properly.
Since there are no more users of cpufreq_frequency_get_table() left, get
rid of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Javi Merino <javi.merino@arm.com> (cpu_cooling.c)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The modularity of cpufreq_stats is quite problematic.
First off, the usage of policy notifiers for the initialization
and cleanup in the cpufreq_stats module is inherently racy with
respect to CPU offline/online and the initialization and cleanup
of the cpufreq driver.
Second, fast frequency switching (used by the schedutil governor)
cannot be enabled if any transition notifiers are registered, so
if the cpufreq_stats module (that registers a transition notifier
for updating transition statistics) is loaded, the schedutil governor
cannot use fast frequency switching.
On the other hand, allowing cpufreq_stats to be built as a module
doesn't really add much value. Arguably, there's not much reason
for that code to be modular at all.
For the above reasons, make the cpufreq stats code non-modular,
modify the core to invoke functions provided by that code directly
and drop the notifiers from it.
Make the stats sysfs attributes appear empty if fast frequency
switching is enabled as the statistics will not be updated in that
case anyway (and returning -EBUSY from those attributes breaks
powertop).
While at it, clean up Kconfig help for the CPU_FREQ_STAT and
CPU_FREQ_STAT_DETAILS options.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The 'initialized' field in struct cpufreq_governor is only used by
the conservative governor (as a usage counter) and the way that
happens is far from straightforward and arguably incorrect.
Namely, the value of 'initialized' is checked by
cpufreq_dbs_governor_init() and cpufreq_dbs_governor_exit() and
the results of those checks are passed (as the second argument) to
the ->init() and ->exit() callbacks in struct dbs_governor. Those
callbacks are only implemented by the ondemand and conservative
governors and ondemand doesn't use their second argument at all.
In turn, the conservative governor uses it to decide whether or not
to either register or unregister a transition notifier.
That whole mechanism is not only unnecessarily convoluted, but also
racy, because the 'initialized' field of struct cpufreq_governor is
updated in cpufreq_init_governor() and cpufreq_exit_governor() under
policy->rwsem which doesn't help if one of these functions is run
twice in parallel for different policies (which isn't impossible in
principle), for example.
Instead of it, add a proper usage counter to the conservative
governor and update it from cs_init() and cs_exit() which is
guaranteed to be non-racy, as those functions are only called
under gov_dbs_data_mutex which is global.
With that in place, drop the 'initialized' field from struct
cpufreq_governor as it is not used any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Create a new helper to avoid code duplication across governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The design of the cpufreq governor API is not very straightforward,
as struct cpufreq_governor provides only one callback to be invoked
from different code paths for different purposes. The purpose it is
invoked for is determined by its second "event" argument, causing it
to act as a "callback multiplexer" of sorts.
Unfortunately, that leads to extra complexity in governors, some of
which implement the ->governor() callback as a switch statement
that simply checks the event argument and invokes a separate function
to handle that specific event.
That extra complexity can be eliminated by replacing the all-purpose
->governor() callback with a family of callbacks to carry out specific
governor operations: initialization and exit, start and stop and policy
limits updates. That also turns out to reduce the code size too, so
do it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Due to differences in the cpufreq core's handling of runtime CPU
offline and nonboot CPUs disabling during system suspend-to-RAM,
fast frequency switching gets disabled after a suspend-to-RAM and
resume cycle on all of the nonboot CPUs.
To prevent that from happening, move the invocation of
cpufreq_disable_fast_switch() from cpufreq_exit_governor() to
sugov_exit(), as the schedutil governor is the only user of fast
frequency switching today anyway.
That simply prevents cpufreq_disable_fast_switch() from being called
without invoking the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT
event (which happens during system suspend now).
Fixes: b7898fda5b (cpufreq: Support for fast frequency switching)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().
Add two new policy flags, fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move definitions of symbols related to transition latency and
sampling rate to include/linux/cpufreq.h so they can be used by
(future) goverernors located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) goverernors
located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file and to sched.h.
Redefine cpufreq_update_util() as a static inline function to avoid
function calls at its call sites in the scheduler code (as suggested
by Peter Zijlstra).
Also move the definition of struct update_util_data and declaration
of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
include/linux/sched.h.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The entire sequence of events (like INIT/START or STOP/EXIT) for which
cpufreq_governor() is called, is guaranteed to be protected by
policy->rwsem now.
The additional checks that were added earlier (as we were forced to drop
policy->rwsem before calling cpufreq_governor() for EXIT event), aren't
required anymore.
Over that, they weren't sufficient really. They just take care of
START/STOP events, but not INIT/EXIT and the state machine was never
maintained properly by them.
Kill the unnecessary checks and policy->governor_enabled field.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Earlier, when the struct freq-attr was used to represent governor
attributes, the standard cpufreq show/store sysfs attribute callbacks
were applied to the governor tunable attributes and they always acquire
the policy->rwsem lock before carrying out the operation. That could
have resulted in an ABBA deadlock if governor tunable attributes are
removed under policy->rwsem while one of them is being accessed
concurrently (if sysfs attributes removal wins the race, it will wait
for the access to complete with policy->rwsem held while the attribute
callback will block on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.
The previous commit, "cpufreq: governor: New sysfs show/store callbacks
for governor tunables", fixed the original ABBA deadlock by adding new
governor specific show/store callbacks.
We don't have to drop rwsem around invocations of governor event
CPUFREQ_GOV_POLICY_EXIT anymore, and original fix can be reverted now.
Fixes: 955ef48335 (cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.
This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.
The update_load_avg() changes were suggested by Peter Zijlstra.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
That macro uses an internal static inline function that is first
totally unnecessary and second hard to read, so simplify it and
get rid of that monster.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The preprocessor magic used for setting the default cpufreq governor
(and for using the performance governor as a fallback one for that
matter) is really nasty, so replace it with __weak functions and
overrides.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Notice that the boost_supported field in struct cpufreq_driver is
redundant, because the driver's ->set_boost callback may be left
unset if "boost" is not supported. Moreover, the only driver
populating the ->set_boost callback is acpi_cpufreq, so make it
avoid populating that callback if "boost" is not supported, rework
the core to check ->set_boost instead of boost_supported to
verify "boost" support and drop boost_supported which isn't
used any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
cpufreq_boost_supported() is not used outside of cpufreq.c, so make
it static.
While at it, refactor it as a one-liner (which it really is).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>