Commit Graph

189 Commits

Author SHA1 Message Date
Rafael J. Wysocki
fa3fa55de0 cpuidle: governors: menu: Avoid using invalid recent intervals data
Marc has reported that commit 85975daeaa ("cpuidle: menu: Avoid
discarding useful information") caused the number of wakeup interrupts
to increase on an idle system [1], which was not expected to happen
after merely allowing shallower idle states to be selected by the
governor in some cases.

However, on the system in question, all of the idle states deeper than
WFI are rejected by the driver due to a firmware issue [2].  This causes
the governor to only consider the recent interval duriation data
corresponding to attempts to enter WFI that are successful and the
recent invervals table is filled with values lower than the scheduler
tick period.  Consequently, the governor predicts an idle duration
below the scheduler tick period length and avoids stopping the tick
more often which leads to the observed symptom.

Address it by modifying the governor to update the recent intervals
table also when entering the previously selected idle state fails, so
it knows that the short idle intervals might have been the minority
had the selected idle states been actually entered every time.

Fixes: 85975daeaa ("cpuidle: menu: Avoid discarding useful information")
Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1]
Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki
2025-08-11 21:46:14 +02:00
Zhongqiu Han
d4a7882f93 cpuidle: menu: Optimize bucket assignment when next_timer_ns equals KTIME_MAX
Directly assign the last bucket value instead of calling which_bucket()
when next_timer_ns equals KTIME_MAX, the largest possible value that
always falls into the last bucket.

This avoids unnecessary calculations and enhances performance.

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Link: https://patch.msgid.link/20250405135308.1854342-1-quic_zhonhan@quicinc.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-09 19:37:30 +02:00
Atul Kumar Pant
194c396e8a cpuidle: teo: Fix typos in two comments
Fix a typo and correct a spelling mistake in comments.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Link: https://patch.msgid.link/20250405060909.2026332-1-atulpant.linux@gmail.com
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-09 19:34:23 +02:00
Rafael J. Wysocki
5c35041099 cpuidle: menu: Update documentation after get_typical_interval() changes
The documentation of the menu cpuidle governor needs to be updated
to match the code behavior after some changes made recently.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/4998484.31r3eYUQgx@rjwysocki.net
[ rjw: More specific subject, two typos fixed in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-25 12:12:46 +01:00
Rafael J. Wysocki
85975daeaa cpuidle: menu: Avoid discarding useful information
When giving up on making a high-confidence prediction,
get_typical_interval() always returns UINT_MAX which means that the
next idle interval prediction will be based entirely on the time till
the next timer.  However, the information represented by the most
recent intervals may not be completely useless in those cases.

Namely, the largest recent idle interval is an upper bound on the
recently observed idle duration, so it is reasonable to assume that
the next idle duration is unlikely to exceed it.  Moreover, this is
still true after eliminating the suspected outliers if the sample
set still under consideration is at least as large as 50% of the
maximum sample set size.

Accordingly, make get_typical_interval() return the current maximum
recent interval value in that case instead of UINT_MAX.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net
2025-02-25 12:00:43 +01:00
Rafael J. Wysocki
8de7606f0f cpuidle: menu: Eliminate outliers on both ends of the sample set
Currently, get_typical_interval() attempts to eliminate outliers at the
high end of the sample set only (probably in order to bias the prediction
toward lower values), but this it problematic because if the outliers are
present at the low end of the sample set, discarding the highest values
will not help to reduce the variance.

Since the presence of outliers at the low end of the sample set is
generally as likely as their presence at the high end of the sample
set, modify get_typical_interval() to treat samples at the largest
distances from the average (on both ends of the sample set) as outliers.

This should increase the likelihood of making a meaningful prediction
in some cases.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/2301940.iZASKD2KPV@rjwysocki.net
2025-02-25 12:00:35 +01:00
Rafael J. Wysocki
60256e458e cpuidle: menu: Tweak threshold use in get_typical_interval()
To prepare get_typical_interval() for subsequent changes, rearrange
the use of the data point threshold in it a bit and initialize that
threshold to UINT_MAX which is more consistent with its data type.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/8490144.T7Z3S40VBb@rjwysocki.net
2025-02-25 12:00:28 +01:00
Rafael J. Wysocki
13982929fb cpuidle: menu: Use one loop for average and variance computations
Use the observation that one loop is sufficient to compute the average
of an array of values and their variance to eliminate one of the loops
from get_typical_interval().

While at it, make get_typical_interval() consistently use u64 as the
64-bit unsigned integer data type and rearrange some white space and the
declarations of local variables in it (to make them follow the reverse
X-mas tree pattern).

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/3339073.aeNJFYEL58@rjwysocki.net
2025-02-25 12:00:17 +01:00
Rafael J. Wysocki
d2cd195b57 cpuidle: menu: Drop a redundant local variable
Local variable min in get_typical_interval() is updated, but never
accessed later, so drop it.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/13699686.uLZWGnKmhe@rjwysocki.net
2025-02-25 11:59:02 +01:00
Rafael J. Wysocki
16c8d7586c cpuidle: teo: Skip sleep length computation for low latency constraints
If the idle state exit latency constraint is sufficiently low, it
is better to avoid the additional latency related to calling
tick_nohz_get_sleep_length().  It is also not necessary to compute
the sleep length in that case because shallow idle state selection
will be forced then regardless of the recent wakeup history.

Accordingly, skip the sleep length computation and subsequent
checks of the exit latency constraint is low enough.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6122398.lOV4Wx5bFT@rjwysocki.net
2025-01-20 17:19:53 +01:00
Rafael J. Wysocki
65e18e6544 cpuidle: teo: Replace time_span_ns with a flag
After recent updates, the time_span_ns field in struct teo_cpu has
become an indicator on whether or not the most recent wakeup has been
"genuine" which may as well be indicated by a bool field without
calling local_clock(), so update the code accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6010475.MhkbZ0Pkbq@rjwysocki.net
2025-01-20 17:18:02 +01:00
Rafael J. Wysocki
ddcfa79646 cpuidle: teo: Simplify handling of total events count
Instead of computing the total events count from scratch every time,
decay it and add a PULSE value to it in teo_update().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/9388883.CDJkKcVGEf@rjwysocki.net
2025-01-20 17:17:56 +01:00
Rafael J. Wysocki
13ed5c4a6d cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
Commit 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
call in some cases") attempted to reduce the governor overhead in some
cases by making it avoid obtaining the sleep length (the time till the
next timer event) which may be costly.

Among other things, after the above commit, tick_nohz_get_sleep_length()
was not called any more when idle state 0 was to be returned, which
turned out to be problematic and the previous behavior in that respect
was restored by commit 4b20b07ce7 ("cpuidle: teo: Don't count non-
existent intercepts").

However, commit 6da8f9ba5a also caused the governor to avoid calling
tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling"
one (that is, it is not really an idle state, but a loop continuously
executed by the CPU) when the target residency of the idle state to be
returned was low enough, so there was no practical need to refine the
idle state selection in any way.  This change was not removed by the
other commit, so now on systems where idle state 0 is a "polling" one,
tick_nohz_get_sleep_length() is called when idle state 0 is to be
returned, but it is not called when a deeper idle state with
sufficiently low target residency is to be returned.  That is arguably
confusing and inconsistent.

Moreover, there is no specific reason why the behavior in question
should depend on whether or not idle state 0 is a "polling" one.

One way to address this would be to make the governor always call
tick_nohz_get_sleep_length() to obtain the sleep length, but that would
effectively mean reverting commit 6da8f9ba5a and restoring the latency
issue that was the reason for doing it.  This approach is thus not
particularly attractive.

To address it differently, notice that if a CPU is woken up very often,
this is not likely to be caused by timers in the first place (user space
has a default timer slack of 50 us and there are relatively few timers
with a deadline shorter than several microseconds in the kernel) and
even if it were the case, the potential benefit from using a deep idle
state would then be questionable for latency reasons.  Therefore, if the
majority of CPU wakeups occur within several microseconds, it can be
assumed that all wakeups in that range are non-timer and the sleep
length need not be determined.

Accordingly, introduce a new metric for counting wakeups with the
measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle
state selection to skip the tick_nohz_get_sleep_length() invocation if
idle state 0 has been selected or the target residency of the candidate
idle state is below RESIDENCY_THRESHOLD_NS and the value of the new
metric is at least 1/2 of the total event count.

Since the above requires the measured idle duration to be determined
every time, except for the cases when one of the safety nets has
triggered in which the wakeup is counted as a hit in the deepest
idle state idle residency range, update the handling of those cases
to avoid skipping the idle duration computation when the CPU wakeup
is "genuine".

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3851791.kQq0lBPeGt@rjwysocki.net
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Renamed a struct field ]
[ rjw: Fixed typo in the subject and one in a comment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:17:02 +01:00
Rafael J. Wysocki
d619b5cc67 cpuidle: teo: Simplify counting events used for tick management
Replace the tick_hits metric with a new tick_intercepts one that can be
used directly when deciding whether or not to stop the scheduler tick
and update the governor functional description accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/1987985.PYKUYFuaPT@rjwysocki.net
2025-01-20 17:16:53 +01:00
Rafael J. Wysocki
e24f8a55de cpuidle: teo: Clarify two code comments
Rewrite two code comments suposed to explain its behavior that are too
concise or not sufficiently clear.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/8472971.T7Z3S40VBb@rjwysocki.net
[ rjw: Fixed 2 typos in new comments ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:16:45 +01:00
Rafael J. Wysocki
b9a6af26bd cpuidle: teo: Drop local variable prev_intercept_idx
Local variable prev_intercept_idx in teo_select() is redundant because
it cannot be 0 when candidate state index is 0.

The prev_intercept_idx value is the index of the deepest enabled idle
state, so if it is 0, state 0 is the deepest enabled idle state, in
which case it must be the only enabled idle state, but then teo_select()
would have returned early before initializing prev_intercept_idx.

Thus prev_intercept_idx must be nonzero and the check of it against 0
always passes, so it can be dropped altogether.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3327997.aeNJFYEL58@rjwysocki.net
[ rjw: Fixed typo in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:16:37 +01:00
Rafael J. Wysocki
ea185406d1 cpuidle: teo: Combine candidate state index checks against 0
There are two candidate state index checks against 0 in teo_select()
that need not be separate, so combine them and update comments around
them.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/13676346.uLZWGnKmhe@rjwysocki.net
2025-01-20 17:16:30 +01:00
Rafael J. Wysocki
92ce5c07b7 cpuidle: teo: Reorder candidate state index checks
Since constraint_idx may be 0, the candidate state index may change to 0
after assigning constraint_idx to it, so first check if it is greater
than constraint_idx (and update it if so) and then check it against 0.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/1907276.tdWV9SEqCh@rjwysocki.net
2025-01-20 17:16:16 +01:00
Rafael J. Wysocki
425b753645 cpuidle: teo: Rearrange idle state lookup code
Rearrange code in the idle state lookup loop in teo_select() to make it
somewhat easier to follow and update comments around it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/4619938.LvFx2qVVIh@rjwysocki.net
2025-01-20 17:15:44 +01:00
Rafael J. Wysocki
5a597a19a2 cpuidle: teo: Update documentation after previous changes
After previous changes, the description of the teo governor in the
documentation comment does not match the code any more, so update it
as appropriate.

Fixes: 4499143980 ("cpuidle: teo: Remove recent intercepts metric")
Fixes: 2662342079 ("cpuidle: teo: Gather statistics regarding whether or not to stop the tick")
Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6120335.lOV4Wx5bFT@rjwysocki.net
[ rjw: Corrected 3 typos found by Christian ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-13 20:46:27 +01:00
Christian Loehle
38f83090f5 cpuidle: menu: Remove iowait influence
Remove CPU iowaiters influence on idle state selection.

Remove the menu notion of performance multiplier which increased with
the number of tasks that went to iowait sleep on this CPU and haven't
woken up yet.

Relying on iowait for cpuidle is problematic for a few reasons:

 1. There is no guarantee that an iowaiting task will wake up on the
    same CPU.

 2. The task being in iowait says nothing about the idle duration, we
    could be selecting shallower states for a long time.

 3. The task being in iowait doesn't always imply a performance hit
    with increased latency.

 4. If there is such a performance hit, the number of iowaiting tasks
    doesn't directly correlate.

 5. The definition of iowait altogether is vague at best, it is
    sprinkled across kernel code.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com
[ rjw: Minor edits in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-30 17:00:55 +02:00
Christian Loehle
4b20b07ce7 cpuidle: teo: Don't count non-existent intercepts
When bailing out early, teo will not query the sleep length anymore
since commit 6da8f9ba5a ("cpuidle: teo:
Skip tick_nohz_get_sleep_length() call in some cases") with an
expected sleep_length_ns value of KTIME_MAX.
This lead to state0 accumulating lots of 'intercepts' because
the actually measured sleep length was < KTIME_MAX, so query the sleep
length instead for teo to recognize if it still is in an
intercept-likely scenario without alternating between the two modes.

Fundamentally we can only do one of the two:
 1. Skip sleep_length_ns query when we think intercept is likely.
 2. Have accurate data if sleep_length_ns is actually intercepted when
we believe it is currently intercepted.

Previously teo did the former while this patch chooses the latter as
the additional time it takes to query the sleep length was found to be
negligible and the variants of option 1 (count all unknowns as misses
or count all unknown as hits) had significant regressions (as misses
had lots of too shallow idle state selections and as hits had terrible
performance in intercept-heavy workloads).

Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Link: https://patch.msgid.link/c40acf72-010f-4a8b-80e4-33f133ba266b@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-01 18:58:55 +02:00
Christian Loehle
4499143980 cpuidle: teo: Remove recent intercepts metric
The logic for recent intercepts didn't work, there is an underflow
of the 'recent' value that can be observed during boot already, which
teo usually doesn't recover from, making the entire logic pointless.
Furthermore the recent intercepts also were never reset, thus not
actually being very 'recent'.

Having underflowed 'recent' values lead to teo always acting as if
we were in a scenario were expected sleep length based on timers is
too high and it therefore unnecessarily selecting shallower states.

Experiments show that the remaining 'intercept' logic is enough to
quickly react to scenarios in which teo cannot rely on the timer
expected sleep length.

See also here:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/

Fixes: 77577558f2 ("cpuidle: teo: Rework most recent idle duration values treatment")
Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00
Christian Loehle
0a2998fa48 Revert: "cpuidle: teo: Introduce util-awareness"
This reverts commit 9ce0f7c4bc.

Util-awareness was reported to be too aggressive in selecting shallower
states. Additionally a single threshold was found to not be suitable
for reasoning about sleep length as, for all practical purposes,
almost arbitrary sleep lengths are still possible for any load value.

Fixes: 9ce0f7c4bc ("cpuidle: teo: Introduce util-awareness")
Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com
Reported-by: Qais Yousef <qyousef@layalina.io>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00
Atul Kumar Pant
63e6b02f57 cpuidle: governors: teo: Fix a typo in a comment
"terget" -> "target"

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Link: https://patch.msgid.link/20240616124025.16477-1-atulpant.linux@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-21 14:57:07 +02:00
Christian Loehle
bf18311384 cpuidle: menu: Cleanup after loadavg removal
The performance impact of loadavg was removed with commit a7fe5190c0
("cpuidle: menu: Remove get_loadavg() from the performance multiplier")
With only iowait remaining the description can be simplified, remove
also the no longer needed includes.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-07 20:55:00 +02:00
Jeff Johnson
774459238f cpuidle: ladder: fix ladder_do_selection() kernel-doc
make C=1 reports:

warning: Function parameter or struct member 'dev' not described in 'ladder_do_selection'

Document 'dev' for this function.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-05-07 13:54:58 +02:00
Parshuram Sangle
496d0a6485 cpuidle: haltpoll: do not shrink guest poll_limit_ns below grow_start
While adjusting guest halt poll limit, grow block starts at
guest_halt_poll_grow_start without taking intermediate values.
Similar behavior is expected while shrinking the value. This
avoids short interval values which are really not required.

VCPU1 trace (guest_halt_poll_shrink equals 2):

VCPU1 grow 10000
VCPU1 shrink 5000
VCPU1 shrink 2500
VCPU1 shrink 1250
VCPU1 shrink 625
VCPU1 shrink 312
VCPU1 shrink 156
VCPU1 shrink 78
VCPU1 shrink 39
VCPU1 shrink 19
VCPU1 shrink 9
VCPU1 shrink 4

Similar change is done in KVM halt poll flow with below patch:
Link: https://lore.kernel.org/kvm/20211006133021.271905-3-sashal@kernel.org/

Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com>
Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com>
Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-12 17:02:51 +01:00
Rafael J. Wysocki
78aabcb321 cpuidle: teo: Avoid unnecessary variable assignments
Notice that it is not necessary to assign tick_intercept_sum in every
iteration of the first loop over idle states in teo_select(), because
the intercept_sum value does not change after the assignment in a
given iteration of the loop, so its value after the last iteration of
the loop can be used for computing the tick_intercept_sum value
directly.

Modify the code accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-23 18:25:04 +02:00
Rafael J. Wysocki
5484e31bbb cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases
Because the cost of calling tick_nohz_get_sleep_length() may increase
in the future, reorder the code in menu_select() so it first uses the
statistics to determine the expected idle duration.  If that value is
higher than RESIDENCY_THRESHOLD_NS, tick_nohz_get_sleep_length() will
be called to obtain the time till the closest timer and refine the
idle duration prediction if necessary.

This causes the governor to always take the full overhead of
get_typical_interval() with the assumption that the cost will be
amortized by skipping the tick_nohz_get_sleep_length() call in the
cases when the predicted idle duration is relatively very small.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
2023-08-17 11:28:38 +02:00
Rafael J. Wysocki
2662342079 cpuidle: teo: Gather statistics regarding whether or not to stop the tick
Currently, if the target residency of the deepest idle state is less than
the tick period length, which is quite likely for HZ=100, and the deepest
idle state is about to be selected by the TEO idle governor, the decision
on whether or not to stop the scheduler tick is based entirely on the
time till the closest timer.  This is often insufficient, because timers
may not be in heavy use and there may be a plenty of other CPU wakeup
events between the deepest idle state's target residency and the closest
tick.

Allow the governor to count those events by making the deepest idle
state's bin effectively end at TICK_NSEC and introducing an additional
"bin" for collecting "hit" events (ie. the ones in which the measured
idle duration falls into the same bin as the time till the closest
timer) with idle duration values past TICK_NSEC.

This way the "intercepts" metric for the deepest idle state's bin
becomes nonzero in general, and so it can influence the decision on
whether or not to stop the tick possibly increasing the governor's
accuracy in that respect.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Tested-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
2023-08-09 19:58:53 +02:00
Rafael J. Wysocki
6da8f9ba5a cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases
Make teo_select() avoid calling tick_nohz_get_sleep_length() if the
candidate idle state to return is state 0 or if state 0 is a polling
one and the target residency of the current candidate one is below
a certain threshold, in which cases it may be assumed that the CPU will
be woken up immediately by a non-timer wakeup source and the timers
are not likely to matter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Tested-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
2023-08-09 19:58:46 +02:00
Rafael J. Wysocki
21d28cd2fa cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront
Because the cost of calling tick_nohz_get_sleep_length() may increase
in the future, reorder the code in teo_select() so it first uses the
statistics to pick up a candidate idle state and applies the utilization
heuristic to it and only then calls tick_nohz_get_sleep_length() to
obtain the sleep length value and refine the selection if necessary.

This change by itself does not cause tick_nohz_get_sleep_length() to
be called less often, but it prepares the code for subsequent changes
that will do so.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Tested-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
2023-08-09 19:58:05 +02:00
Rafael J. Wysocki
9a41e16f11 cpuidle: teo: Drop utilized from struct teo_cpu
Because the utilized field in struct teo_cpu is only used locally in
teo_select(), replace it with a local variable in that function.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-and-tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
2023-08-03 22:37:11 +02:00
Rafael J. Wysocki
04bae4e226 cpuidle: teo: Avoid stopping the tick unnecessarily when bailing out
When teo_select() is going to return early in some special cases, make
it avoid stopping the tick if the idle state to be returned is shallow.

In particular, never stop the tick if state 0 is to be returned.

Link: https://lore.kernel.org/linux-pm/CAJZ5v0jJxHj65r2HXBTd3wfbZtsg=_StzwO1kA5STDnaPe_dWA@mail.gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-and-tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
2023-08-03 22:37:03 +02:00
Rafael J. Wysocki
3f0b0966b3 cpuidle: teo: Update idle duration estimate when choosing shallower state
The TEO governor takes CPU utilization into account by refining idle state
selection when the utilization is above a certain threshold.  This is done by
choosing an idle state shallower than the previously selected one.

However, when doing this, the idle duration estimate needs to be
adjusted so as to prevent the scheduler tick from being stopped when the
candidate idle state is shallow, which may lead to excessive energy
usage if the CPU is not woken up quickly enough going forward.
Moreover, if the scheduler tick has been stopped already and the new
idle duration estimate is too small, the replacement candidate state
cannot be used.

Modify the relevant code to take the above observations into account.

Fixes: 9ce0f7c4bc ("cpuidle: teo: Introduce util-awareness")
Link: https://lore.kernel.org/linux-pm/CAJZ5v0jJxHj65r2HXBTd3wfbZtsg=_StzwO1kA5STDnaPe_dWA@mail.gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-and-tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
2023-08-03 22:36:09 +02:00
Kajetan Puchalski
9ce0f7c4bc cpuidle: teo: Introduce util-awareness
Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely
impact performance due to increased latency. Additionally, if the CPU
wakes up from a deeper sleep before its target residency as is often the
case, it results in a waste of energy on top of that.

At the moment, none of the available idle governors take any scheduling
information into account. They also tend to overestimate the idle
duration quite often, which causes them to select excessively deep idle
states, thus leading to increased wakeup latency and lower performance
with no power saving. For 'menu' while web browsing on Android for
instance, those types of wakeups ('too deep') account for over 24% of
all wakeups.

At the same time, on some platforms idle state 0 can be power efficient
enough to warrant wanting to prefer it over idle state 1. This is
because the power usage of the two states can be so close that
sufficient amounts of too deep state 1 sleeps can completely offset the
state 1 power saving to the point where it would've been more power
efficient to just use state 0 instead. This is, of course, for systems
where state 0 is not a polling state, such as arm-based devices.

Sleeps that happened in state 0 while they could have used state 1 ('too
shallow') only save less power than they otherwise could have. Too deep
sleeps, on the other hand, harm performance and nullify the potential
power saving from using state 1 in the first place. While taking this
into account, it is clear that on balance it is preferable for an idle
governor to have more too shallow sleeps instead of more too deep sleeps
on those kinds of platforms.

This patch specifically tunes TEO to prefer shallower idle states in
order to reduce wakeup latency and achieve better performance.

To this end, before selecting the next idle state it uses the avg_util
signal of a CPU's runqueue in order to determine to what extent the CPU
is being utilized. This util value is then compared to a threshold
defined as a percentage of the CPU's capacity (capacity >> 6 ie. ~1.5%
in the current implementation). If the util is above the threshold, the
index of the idle state selected by TEO metrics will be reduced by 1,
thus selecting a shallower state. If the util is below the threshold,
the governor defaults to the TEO metrics mechanism to try to select the
deepest available idle state based on the closest timer event and its
own correctness.

The main goal of this is to reduce latency and increase performance for
some workloads. Under some workloads it will result in an increase in
power usage (Geekbench 5) while for other workloads it will also result
in a decrease in power usage compared to TEO (PCMark Web, Jankbench,
Speedometer).

It can provide drastically decreased latency and performance benefits in
certain types of workloads that are sensitive to latency.

Example test results:

1. GB5 (better score, latency & more power usage)

| metric                                | menu           | teo               | teo-util-aware    |
| ------------------------------------- | -------------- | ----------------- | ----------------- |
| gmean score                           | 2826.5 (0.0%)  | 2764.8 (-2.18%)   | 2865 (1.36%)      |
| gmean power usage [mW]                | 2551.4 (0.0%)  | 2606.8 (2.17%)    | 2722.3 (6.7%)     |
| gmean too deep %                      | 14.99%         | 9.65%             | 4.02%             |
| gmean too shallow %                   | 2.5%           | 5.96%             | 14.59%            |
| gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |

2. Jankbench (better score, latency & less power usage)

| metric                                | menu           | teo               | teo-util-aware    |
| ------------------------------------- | -------------- | ----------------- | ----------------- |
| gmean frame duration                  | 13.9 (0.0%)    | 14.7 (6.0%)       | 12.6 (-9.0%)      |
| gmean jank percentage                 | 1.5 (0.0%)     | 2.1 (36.99%)      | 1.3 (-17.37%)     |
| gmean power usage [mW]                | 144.6 (0.0%)   | 136.9 (-5.27%)    | 121.3 (-16.08%)   |
| gmean too deep %                      | 26.00%         | 11.00%            | 2.54%             |
| gmean too shallow %                   | 4.74%          | 11.89%            | 21.93%            |
| gmean wakeup latency (RenderThread)   | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%)  |
| gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%)  | 87.65μs (-29.33%) |

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
[ rjw: Comment edits and white space adjustments ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-10 20:53:22 +01:00
Kajetan Puchalski
27f8508801 cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
Add a no_poll flag to teo_find_shallower_state() that will let the
function optionally not consider polling states.

This allows the caller to guard against the function inadvertently
resulting in TEO putting the CPU in a polling state when that
behaviour is undesirable.

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-10 20:53:22 +01:00
Eiichi Tsukata
0da11bf0ca cpuidle: haltpoll: Add trace points for guest_halt_poll_ns grow/shrink
Add trace points as are implemented in KVM host halt polling.
This helps tune guest halt polling params.

Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-14 16:01:35 +02:00
Jason Wang
14e6c70671 cpuidle: menu: Fix typo in a comment
The word `these' in a comment is repeated, so drop one.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-11-24 17:30:44 +01:00
Rafael J. Wysocki
4adae7dd10 cpuidle: teo: Rename two local variables in teo_select()
Rename two local variables in teo_select() so that their names better
reflect their purpose.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-08-03 15:18:57 +02:00
Rafael J. Wysocki
c2ec772b87 cpuidle: teo: Fix alternative idle state lookup
There are three mistakes in the loop in teo_select() that is looking
for an alternative candidate idle state.  First, it should walk all
of the idle states shallower than the current candidate one,
including all of the disabled ones, but it terminates after the first
enabled idle state.  Second, it should not terminate its last step
if idle state 0 is disabled (which is related to the first issue).
Finally, it may return the current alternative candidate idle state
prematurely if the time span criterion is not met by the idle state
under consideration at the moment.

To address the issues mentioned above, make the loop in question walk
all of the idle states shallower than the current candidate idle state
all the way down to idle state 0 and rearrange the checks in it.

Fixes: 77577558f2 ("cpuidle: teo: Rework most recent idle duration values treatment")
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-08-03 15:18:57 +02:00
Linus Torvalds
3563f55ce6 Power management updates for 5.14-rc1
- Make intel_pstate support hybrid processors using abstract
    performance units in the HWP interface (Rafael Wysocki).
 
  - Add Icelake servers and Cometlake support in no-HWP mode to
    intel_pstate (Giovanni Gherdovich).
 
  - Make cpufreq_online() error path be consistent with the CPU
    device removal path in cpufreq (Rafael Wysocki).
 
  - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
    Randy Dunlap, Shaokun Zhang).
 
  - Make intel_idle use special idle state parameters for C6 when
    package C-states are disabled (Chen Yu).
 
  - Rework the TEO (timer events oriented) cpuidle governor to address
    some theoretical shortcomings in it (Rafael Wysocki).
 
  - Drop unneeded semicolon from the TEO governor (Wan Jiabing).
 
  - Modify the runtime PM framework to accept unassigned suspend
    and resume callback pointers (Ulf Hansson).
 
  - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).
 
  - Improve device performance states support in the generic power
    domains (genpd) framework (Ulf Hansson).
 
  - Fix some documentation issues in genpd (Yang Yingliang).
 
  - Make the operating performance points (OPP) framework use the
    required-opps DT property in use cases that are not related to
    genpd (Hsin-Yi Wang).
 
  - Make lazy_link_required_opp_table() use list_del_init instead of
    list_del/INIT_LIST_HEAD (Yang Yingliang).
 
  - Simplify wake IRQs handling in the core system-wide sleep support
    code and clean up some coding style inconsistencies in it (Tian
    Tao, Zhen Lei).
 
  - Add cooling support to the tegra30 devfreq driver and improve its
    DT bindings (Dmitry Osipenko).
 
  - Fix some assorted issues in the devfreq core and drivers (Chanwoo
    Choi, Dong Aisheng, YueHaibing).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmDbaFwSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxR/gP/inFFjdwWpq3r6XD5P4XeVvum/MEjqQs
 rDUDHXgEEZmWGL9CpQ0PxhXyvc7SSqtNLakECTXxrOccfMbo0NKQtjCt8eMO1XMu
 nrAcp68hr/Rg2TaenRC3j0+oGXPzOw5/Mg5Y9subRymdI3o5HyoJNjBkU+LlkdGs
 HC7k8zWPqKQaEoFgjOpYPuXKf2bvEm2jIh4dzmtCRWXBUOxDgN0z6Kckhs59xrU+
 +CLP/W4XMDrWSYdd2zjPV2IBNsqfePFchZ+t2CMQwYycI+KJr2s8tLbAFnQXfxLz
 WRqxpZKvMUPthKsK2/vgnCQkQKhGq39NmlHdqRdJm8uivCPx1Q2uuHRYF9782M+o
 cWuO60VvtUax0RIk1prP2l6JBBU/3Hvln7uf4cBnIeh/3QZKKygIgnNI1YwaqXSq
 zP4EWY00kKNmKwRUZAkDR9ZavXHiHvtoytT44XU/NxS+YXh6nMLC34CeuDUQaqni
 JvniXQyZCIWecZhwbOpW+FmAXMBCyvqXarDM0Zw4coWoyFLN7Y8ow9C5T5EWcgQ+
 pKyGhS698HiePPJrwDOtOqzfPqxcgXEWUqwmxTeD8MRDSMlamzPnRJh+wxlrIdv9
 2c8SAOSD7xvRlQQcsOpEVcKVkjsWDy7tvK6/O2CtoBsUpZOXtKIKJhr7ixnLN3Ej
 XHX/voFVDIao
 =mgLK
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add hybrid processors support to the intel_pstate driver and
  make it work with more processor models when HWP is disabled, make the
  intel_idle driver use special C6 idle state paremeters when package
  C-states are disabled, add cooling support to the tegra30 devfreq
  driver, rework the TEO (timer events oriented) cpuidle governor,
  extend the OPP (operating performance points) framework to use the
  required-opps DT property in more cases, fix some issues and clean up
  a number of assorted pieces of code.

  Specifics:

   - Make intel_pstate support hybrid processors using abstract
     performance units in the HWP interface (Rafael Wysocki).

   - Add Icelake servers and Cometlake support in no-HWP mode to
     intel_pstate (Giovanni Gherdovich).

   - Make cpufreq_online() error path be consistent with the CPU device
     removal path in cpufreq (Rafael Wysocki).

   - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
     Randy Dunlap, Shaokun Zhang).

   - Make intel_idle use special idle state parameters for C6 when
     package C-states are disabled (Chen Yu).

   - Rework the TEO (timer events oriented) cpuidle governor to address
     some theoretical shortcomings in it (Rafael Wysocki).

   - Drop unneeded semicolon from the TEO governor (Wan Jiabing).

   - Modify the runtime PM framework to accept unassigned suspend and
     resume callback pointers (Ulf Hansson).

   - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

   - Improve device performance states support in the generic power
     domains (genpd) framework (Ulf Hansson).

   - Fix some documentation issues in genpd (Yang Yingliang).

   - Make the operating performance points (OPP) framework use the
     required-opps DT property in use cases that are not related to
     genpd (Hsin-Yi Wang).

   - Make lazy_link_required_opp_table() use list_del_init instead of
     list_del/INIT_LIST_HEAD (Yang Yingliang).

   - Simplify wake IRQs handling in the core system-wide sleep support
     code and clean up some coding style inconsistencies in it (Tian
     Tao, Zhen Lei).

   - Add cooling support to the tegra30 devfreq driver and improve its
     DT bindings (Dmitry Osipenko).

   - Fix some assorted issues in the devfreq core and drivers (Chanwoo
     Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  cpufreq: Make cpufreq_online() call driver->offline() on errors
  opp: Allow required-opps to be used for non genpd use cases
  cpuidle: teo: remove unneeded semicolon in teo_select()
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  cpuidle: teo: Use kerneldoc documentation in admin-guide
  cpuidle: teo: Rework most recent idle duration values treatment
  cpuidle: teo: Change the main idle state selection logic
  cpuidle: teo: Cosmetic modification of teo_select()
  cpuidle: teo: Cosmetic modifications of teo_update()
  ...
2021-06-29 13:36:06 -07:00
Wan Jiabing
795e0e38de cpuidle: teo: remove unneeded semicolon in teo_select()
Fix following coccicheck warning:
drivers/cpuidle/governors/teo.c:315:10-11: Unneeded semicolon

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-17 14:13:53 +02:00
Rafael J. Wysocki
154ae8bb3c cpuidle: teo: Use kerneldoc documentation in admin-guide
There are two descriptions of the TEO (Timer Events Oriented) cpuidle
governor in the kernel source tree, one in the C file containing its
code and one in cpuidle.rst which is part of admin-guide.

Instead of trying to keep them both in sync and in order to reduce
text duplication, include the governor description from the C file
directly into cpuidle.rst.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-11 18:36:45 +02:00
Rafael J. Wysocki
77577558f2 cpuidle: teo: Rework most recent idle duration values treatment
The TEO (Timer Events Oriented) cpuidle governor uses several most
recent idle duration values for a given CPU to refine the idle state
selection in case the previous long-term trends have not been
followed recently and a new trend appears to be forming.  That is
done by computing the average of the most recent idle duration
values falling below the time till the next timer event ("sleep
length"), provided that they are the majority of the most recent
idle duration values taken into account, and using it as the new
expected idle duration value.

However, idle state selection based on that value may not be optimal,
because the average does not really indicate which of the idle states
with target residencies less than or equal to it is likely to be the
best fit.

Thus, instead of computing the average, make the governor carry out
computations based on the distribution of the most recent idle
duration values among the bins corresponding to different idle
states.  Namely, if the majority of the most recent idle duration
values taken into consideration are less than the current sleep
length (which means that the CPU is likely to wake up early), find
the idle state closest to the "candidate" one "matching" the sleep
length whose target residency is less than or equal to the majority
of the most recent idle duration values that have fallen below the
current sleep length (which means that it is likely to be "shallow
enough" this time).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-11 18:36:45 +02:00
Rafael J. Wysocki
c410a9a142 cpuidle: teo: Change the main idle state selection logic
Two aspects of the current main idle state selection logic in the
TEO (Timer Events Oriented) cpuidle governor are quite questionable.

First of all, the "hits" and "misses" metrics used by it are only
updated for a given idle state if the time till the next timer event
("sleep length") is between the target residency of that state and
the target residency of the next one.  Consequently, they are likely
to become stale if the sleep length tends to fall outside that
interval which increases the likelihood of subomtimal idle state
selection.

Second, the decision on whether or not to select the idle state
"matching" the sleep length is based on the metrics collected for
that state alone, whereas in principle the metrics collected for
the other idle states should be taken into consideration when that
decision is made.  For example, if the measured idle duration is less
than the target residency of the idle state "matching" the sleep
length, then it is also less than the target residency of any deeper
idle state and that should be taken into account when considering
whether or not to select any of those states, but currently it is
not.

In order to address the above shortcomings, modify the main idle
state selection logic in the TEO governor to take the metrics
collected for all of the idle states into account when deciding
whether or not to select the one "matching" the sleep length.

Moreover, drop the "misses" metric that becomes redundant after the
above change and rename the "early_hits" metric to "intercepts" so
that its role is better reflected by its name (the idea being that
if a CPU wakes up earlier than indicated by the sleep length, then
it must be a result of a non-timer interrupt that "intercepts" the
CPU).

Also rename the states[] array in struct struct teo_cpu to
state_bins[] to avoid confusing it with the states[] array in
struct cpuidle_driver and update the documentation to match the
new code (and make it more comprehensive while at it).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-11 18:36:45 +02:00
Rafael J. Wysocki
b18e0de1cf cpuidle: teo: Cosmetic modification of teo_select()
Initialize local variables in teo_select() where they are declared.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-11 18:36:45 +02:00
Rafael J. Wysocki
f53cbdab01 cpuidle: teo: Cosmetic modifications of teo_update()
Rename a local variable in teo_update() so that its purpose is better
reflected by its name and use one more local variable in the loop
over the CPU idle states in that function to make the code somewhat
easier to read.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-11 18:36:45 +02:00
Alexey Dobriyan
8fc2858e57 sched: Make nr_iowait_cpu() return 32-bit value
Runqueue ->nr_iowait counters are 32-bit anyway.

Propagate 32-bitness into other code, but don't try too hard.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com
2021-05-12 21:34:16 +02:00