Pull timer changes from Thomas Gleixner:
"This assorted collection provides:
- A new timer based timer broadcast feature for systems which do not
provide a global accessible timer device. That allows those
systems to put CPUs into deep idle states where the per cpu timer
device stops.
- A few NOHZ_FULL related improvements to the timer wheel
- The usual updates to timer devices found in ARM SoCs
- Small improvements and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
tick: Remove code duplication in tick_handle_periodic()
tick: Fix spelling mistake in tick_handle_periodic()
x86: hpet: Use proper destructor for delayed work
workqueue: Provide destroy_delayed_work_on_stack()
clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
timer: Remove code redundancy while calling get_nohz_timer_target()
hrtimer: Rearrange comments in the order struct members are declared
timer: Use variable head instead of &work_list in __run_timers()
clocksource: exynos_mct: silence a static checker warning
arm: zynq: Add support for cpufreq
arm: zynq: Don't use arm_global_timer with cpufreq
clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
sh: Remove Kconfig entries for TMU, CMT and MTU2
ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
clocksource: armada-370-xp: Use atomic access for shared registers
clocksource: orion: Use atomic access for shared registers
clocksource: timer-keystone: Delete unnecessary variable
clocksource: timer-keystone: introduce clocksource driver for Keystone
...
Pull timer updates from Ingo Molnar:
"The main purpose is to fix a full dynticks bug related to
virtualization, where steal time accounting appears to be zero in
/proc/stat even after a few seconds of competing guests running busy
loops in a same host CPU. It's not a regression though as it was
there since the beginning.
The other commits are preparatory work to fix the bug and various
cleanups"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch: Remove stub cputime.h headers
sched: Remove needless round trip nsecs <-> tick conversion of steal time
cputime: Fix jiffies based cputime assumption on steal accounting
cputime: Bring cputime -> nsecs conversion
cputime: Default implementation of nsecs -> cputime conversion
cputime: Fix nsecs_to_cputime() return type cast
We accidentally declared pid_alive without any extern/inline connotation.
Some platforms were fine with this, some like ia64 and mips were very angry.
If the function is inline, the prototype should be inline!
on ia64:
include/linux/sched.h:1718: warning: 'pid_alive' declared inline after
being called
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup
of the PPID (real_parent's pid_t) of a process, including rcu locking, in the
arbitrary and init_pid_ns.
This provides an alternative to sys_getppid(), which is relative to the child
process' pid namespace.
(informed by ebiederman's 6c621b7e)
Cc: stable@vger.kernel.org
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
call it under same circumstances, i.e.
#ifdef CONFIG_NO_HZ_COMMON
if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
return get_nohz_timer_target();
#endif
So, it makes more sense to get all this as part of get_nohz_timer_target()
instead of duplicating code at two places. For this another parameter is
required to be passed to this routine, pinned.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The architectures that override cputime_t (s390, ppc) don't provide
any version of nsecs_to_cputime(). Indeed this cputime_t implementation
by backend only happens when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y under
which the core code doesn't make any use of nsecs_to_cputime().
At least for now.
We are going to make a broader use of it so lets provide a default
version with a per usecs granularity. It should be good enough for most
usecases.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
might_sleep() can tell us where interrupts have been disabled, but we
have no idea what disabled preemption. Add some debug infrastructure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Track depth in cgroup tree, this is useful for things like
find_matching_se() where you need to get to a common parent of two
sched entities.
Keeping the depth avoids having to calculate it on the spot, which
saves a number of possible cache-misses.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As patch "sched: Move the priority specific bits into a new header file" exposes
the priority related macros in linux/sched/prio.h, we don't have to implement
task_nice() in kernel/sched/core.c any more.
This patch implements it in linux/sched/sched.h as static inline function,
saving the kernel stack and enhancing performance a bit.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: clark.williams@gmail.com
Cc: rostedt@goodmis.org
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This changes 'do_execve()' to get the executable name as a 'struct
filename', and to free it when it is done. This is what the normal
users want, and it simplifies and streamlines their error handling.
The controlled lifetime of the executable name also fixes a
use-after-free problem with the trace_sched_process_exec tracepoint: the
lifetime of the passed-in string for kernel users was not at all
obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
the pathname allocation lifetime with the execve() having finished,
which in turn meant that the trace point that happened after
mm_release() of the old process VM ended up using already free'd memory.
To solve the kernel string lifetime issue, this simply introduces
"getname_kernel()" that works like the normal user-space getname()
function, except with the source coming from kernel memory.
As Oleg points out, this also means that we could drop the tcomm[] array
from 'struct linux_binprm', since the pathname lifetime now covers
setup_new_exec(). That would be a separate cleanup.
Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will access orders
of magnitudes more memory than the threads that do all the active work.
This resulted in the node with the garbage collector being marked the only
active node in the group.
This issue is avoided if we weigh the statistics by CPU use of each task in
the numa group, instead of by how many faults each thread has occurred.
To achieve this, we normalize the number of faults to the fraction of faults
that occurred on each node, and then multiply that fraction by the fraction
of CPU time the task has used since the last time task_numa_placement was
invoked.
This way the nodes in the active node mask will be the ones where the tasks
from the numa group are most actively running, and the influence of eg. the
garbage collector and other do-little threads is properly minimized.
On a 4 node system, using CPU use statistics calculated over a longer interval
results in about 1% fewer page migrations with two 32-warehouse specjbb runs
on a 4 node system, and about 5% fewer page migrations, as well as 1% better
throughput, with two 8-warehouse specjbb runs, as compared with the shorter
term statistics kept by the scheduler.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload
This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging
This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to get a more consistent naming scheme, making it clear
which fault statistics track memory locality, and which track
CPU locality, rename the memory fault statistics.
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually
exclusive. The patch kills ->did_exec because it has a single user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dup_mm() is used only in kernel/fork.c
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_task_state() and task_state_array[] look confusing and suboptimal, it
is not clear what it can actually report to user-space and
task_state_array[] blows .data for no reason.
1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
clear. TASK_REPORT is self-documenting but it is not clear
what ->exit_state can add.
Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
into TASK_REPORT and use it to calculate the final result.
2. With the change above it is obvious that task_state_array[]
has the unused entries just to make BUILD_BUG_ON() happy.
Change this BUILD_BUG_ON() to use TASK_REPORT rather than
TASK_STATE_MAX and shrink task_state_array[].
3. Turn the "while (state)" loop into fls(state).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Remove fs/coredump.h. It is not clear why do we need it,
it only declares __get_dumpable(), signal.c includes it
for no reason.
2. Now that get_dumpable() and __get_dumpable() are really
trivial make them inline in linux/sched.h.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alex Kelly <alex.page.kelly@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody actually needs MMF_DUMPABLE/MMF_DUMP_SECURELY, they are only used
to enforce the encoding of SUID_DUMP_* enum in mm->flags &
MMF_DUMPABLE_MASK.
Now that set_dumpable() updates both bits atomically we can kill them and
simply store the value "as is" in 2 lower bits.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alex Kelly <alex.page.kelly@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
while_each_thread() and next_thread() should die, almost every lockless
usage is wrong.
1. Unless g == current, the lockless while_each_thread() is not safe.
while_each_thread(g, t) can loop forever if g exits, next_thread()
can't reach the unhashed thread in this case. Note that this can
happen even if g is the group leader, it can exec.
2. Even if while_each_thread() itself was correct, people often use
it wrongly.
It was never safe to just take rcu_read_lock() and loop unless
you verify that pid_alive(g) == T, even the first next_thread()
can point to the already freed/reused memory.
This patch adds signal_struct->thread_head and task->thread_node to
create the normal rcu-safe list with the stable head. The new
for_each_thread(g, t) helper is always safe under rcu_read_lock() as
long as this task_struct can't go away.
Note: of course it is ugly to have both task_struct->thread_node and the
old task_struct->thread_group, we will kill it later, after we change
the users of while_each_thread() to use for_each_thread().
Perhaps we can kill it even before we convert all users, we can
reimplement next_thread(t) using the new thread_head/thread_node. But
we can't do this right now because this will lead to subtle behavioural
changes. For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group
has died. Or thread_group_empty(), currently its semantics is not clear
unless thread_group_leader(p) and we need to audit the callers before we
can change it.
So this patch adds the new interface which has to coexist with the old
one for some time, hopefully the next changes will be more or less
straightforward and the old one will go away soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
Tested-by: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: "Ma, Xindong" <xindong.ma@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With various drivers wanting to inject idle time; we get people
calling idle routines outside of the idle loop proper.
Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.
While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED; when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.
So move preempt_fold_need_resched() out of the loop where we're
guaranteed to have TIF_NEED_RESCHED set.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order of deadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need an higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of how much bandwidth is being devoted to -rt
-deadline tasks to stay below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stay below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
be thus free of oversubscribing the system up to any arbitrary level.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development, in the meanwhile, as a temporary solution,
what this commits does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and it's
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for that replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.
This is done mainly because:
- classical prio field of the plist is just an int, which might
not be enough for representing a deadline;
- manipulating such a list would become O(nr_deadline_tasks),
which might be to much, as the number of -deadline task increases.
Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
- among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
one with the higher (lower, actually!) prio wins;
- among a -priority and a -deadline task, the latter always wins;
- among two -deadline tasks, the one with the earliest deadline
wins.
Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it possible to specify a period (different or equal than
deadline) for -deadline tasks. Relative deadlines (D_i) are used on
task arrivals to generate new scheduling (absolute) deadlines as "d =
t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
= d + P_i" when the budget is zero.
This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once
activated (short deadlines).
Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.
Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept into CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks is
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
results in him occasionally seeing time going backwards - which
crashes the scheduler ...
Most of our time accounting can actually handle that except the most
common one; the tick time update of sched_fair.
There is a further problem with that code; previously we assumed that
because we get a tick every TICK_NSEC our time delta could never
exceed 32bits and math was simpler.
However, ever since Frederic managed to get NO_HZ_FULL merged; this is
no longer the case since now a task can run for a long time indeed
without getting a tick. It only takes about ~4.2 seconds to overflow
our u32 in nanoseconds.
This means we not only need to better deal with time going backwards;
but also means we need to be able to deal with large deltas.
This patch reworks the entire code and uses mul_u64_u32_shr() as
proposed by Andy a long while ago.
We express our virtual time scale factor in a u32 multiplier and shift
right and the 32bit mul_u64_u32_shr() implementation reduces to a
single 32x32->64 multiply if the time delta is still short (common
case).
For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While hunting a preemption issue with Alexander, Ben noticed that the
currently generic PREEMPT_NEED_RESCHED stuff is horribly broken for
load-store architectures.
We currently rely on the IPI to fold TIF_NEED_RESCHED into
PREEMPT_NEED_RESCHED, but when this IPI lands while we already have
a load for the preempt-count but before the store, the store will erase
the PREEMPT_NEED_RESCHED change.
The current preempt-count only works on load-store archs because
interrupts are assumed to be completely balanced wrt their preempt_count
fiddling; the previous preempt_count load will match the preempt_count
state after the interrupt and therefore nothing gets lost.
This patch removes the PREEMPT_NEED_RESCHED usage from generic code and
pushes it into x86 arch code; the generic code goes back to relying on
TIF_NEED_RESCHED.
Boot tested on x86_64 and compile tested on ppc64.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-and-Tested-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131128132641.GP10022@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Various smaller fixlets, all over the place"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Fix generation of device-drivers
sched: Expose preempt_schedule_irq()
sched: Fix a trivial typo in comments
sched: Remove unused variable in 'struct sched_domain'
sched: Avoid NULL dereference on sd_busy
sched: Check sched_domain before computing group power
MAINTAINERS: Update file patterns in the lockdep and scheduler entries
task_is_dead() has no users since commit 43e13cc107
("cred: remove task_is_dead() from __task_cred() validation"), and
nobody except exit.c should rely on ->exit_state (we still have
the users which should be changed).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131113143614.GA10547@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull irq cleanups from Ingo Molnar:
"This is a multi-arch cleanup series from Thomas Gleixner, which we
kept to near the end of the merge window, to not interfere with
architecture updates.
This series (motivated by the -rt kernel) unifies more aspects of IRQ
handling and generalizes PREEMPT_ACTIVE"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Make PREEMPT_ACTIVE generic
sparc: Use preempt_schedule_irq
ia64: Use preempt_schedule_irq
m32r: Use preempt_schedule_irq
hardirq: Make hardirq bits generic
m68k: Simplify low level interrupt handling code
genirq: Prevent spurious detection for unconditionally polled interrupts
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a
few bugfixes. ARM got transparent huge page support, improved
overcommit, and support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes
some nVidia cards on Windows, that fail to start without these
patches and the corresponding userspace changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
NTmT/cfmYp2BRkiCfCiS
=rWNf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
The get_dumpable() return value is not boolean. Most users of the
function actually want to be testing for non-SUID_DUMP_USER(1) rather than
SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a
protected state. Almost all places did this correctly, excepting the two
places fixed in this patch.
Wrong logic:
if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
or
if (dumpable == 0) { /* be protective */ }
or
if (!dumpable) { /* be protective */ }
Correct logic:
if (dumpable != SUID_DUMP_USER) { /* be protective */ }
or
if (dumpable != 1) { /* be protective */ }
Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
user was able to ptrace attach to processes that had dropped privileges to
that user. (This may have been partially mitigated if Yama was enabled.)
The macros have been moved into the file that declares get/set_dumpable(),
which means things like the ia64 code can see them too.
CVE-2013-2929
Reported-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fpu_counter in task_struct was used only by sh/x86. Both of these now
carry it in ARCH specific thread_struct, hence this can now be removed
from generic task_struct, shrinking it slightly for other arches.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul Mundt <paul.mundt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In certain occasions it is possible for a hung task detector
positive to be false: continuation from a paused VM, for example.
Add a method to reset detection, similar as is done
with other kernel watchdogs.
Acked-by: Don Zickus <dzickus@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Resolve cherry-picking conflicts:
Conflicts:
mm/huge_memory.c
mm/memory.c
mm/mprotect.c
See this upstream merge commit for more details:
52469b4fcd Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead. But there are many more and it's impractical to annotate
them all.
First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well. This simplifies the code quite a bit for
added bonus.
Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts. If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.
Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.
This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.
This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks to near the memory.
Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.
Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adjust numa_scan_period in task_numa_placement, depending on how much
useful work the numa code can do. The more local faults there are in a
given scan window the longer the period (and hence the slower the scan rate)
during the next window. If there are excessive shared faults then the scan
period will decrease with the amount of scaling depending on whether the
ratio of shared/private faults. If the preferred node changes then the
scan rate is reset to recheck if the task is properly placed.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Due to the way the pid is truncated, and tasks are moved between
CPUs by the scheduler, it is possible for the current task_numa_fault
to group together tasks that do not actually share memory together.
This patch adds a few easy sanity checks to task_numa_fault, joining
tasks together if they share the same tsk->mm, or if the fault was on
a page with an elevated mapcount, in a shared VMA.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is possible for a task in a numa group to call exec, and
have the new (unrelated) executable inherit the numa group
association from its former self.
This has the potential to break numa grouping, and is trivial
to fix.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-51-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch uses the fraction of faults on a particular node for both task
and group, to figure out the best node to place a task. If the task and
group statistics disagree on what the preferred node should be then a full
rescan will select the node with the best combined weight.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-50-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
And here's a little something to make sure not the whole world ends up
in a single group.
As while we don't migrate shared executable pages, we do scan/fault on
them. And since everybody links to libc, everybody ends up in the same
group.
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently tasks that partition their data can still "false-share"
pages presenting a problem for optimal NUMA placement.
This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.
I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.
The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a preferred node is selected for a tasks there is an attempt to migrate
the task to a CPU there. This may fail in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch favours moving tasks towards NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the nodes CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a PTE scans which is controlled by the numa_balancing_settle_count
sysctl. Once the settle_count number of scans has complete the schedule
is free to place the task on an alternative node if the load is imbalanced.
[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artifically high
count due to very recent faulting activity and may create a bouncing effect.
This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch tracks what nodes numa hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.
In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is is partially a function of the tasks memory
size. This patch alters the semantic of the min and max tunables to be
about tuning the length time it takes to complete a scan of a tasks occupied
virtual address space. Conceptually this is a lot easier to understand. There
is a "sanity" check to ensure the scan rate is never extremely fast based on
the amount of virtual memory that should be scanned in a second. The default
of 2.5G seems arbitrary but it is to have the maximum scan rate after the
patch roughly match the maximum scan rate before the patch was applied.
On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
to numa_scan_period means that the rate scanning slows depends on HZ which
is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yuanhan reported a serious throughput regression in his pigz
benchmark. Using the ftrace patch I found that several idle
paths need more TLC before we can switch the generic
need_resched() over to preempt_need_resched.
The preemption paths benefit most from preempt_need_resched and
do indeed use it; all other need_resched() users don't really
care that much so reverting need_resched() back to
tif_need_resched() is the simple and safe solution.
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: lkp@linux.intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20130927153003.GF15690@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When using per-cpu preempt_count variables we need to save/restore the
preempt_count on context switch (into per task storage; for instance
the old thread_info::preempt_count variable) because of
PREEMPT_ACTIVE.
However, this means that on fork() the preempt_count value of the last
context switch gets copied and if we had a PREEMPT_ACTIVE switch right
before cloning a child task the child task will now too have
PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
count.
Therefore we need to make init_task_preempt_count() unconditional;
this resets whatever preempt_count we inherited from our parent
process.
Doing so for !per-cpu implementations is harmless.
For !PREEMPT_COUNT kernels we need to be careful not to start life
with an increased preempt_count.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rewrite the preempt_count macros in order to extract the 3 basic
preempt_count value modifiers:
__preempt_count_add()
__preempt_count_sub()
and the new:
__preempt_count_dec_and_test()
And since we're at it anyway, replace the unconventional
$op_preempt_count names with the more conventional preempt_count_$op.
Since these basic operators are equivalent to the previous _notrace()
variants, do away with the _notrace() versions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.
Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation, however we very much want to avoid making
preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:
- resched_task() when setting TIF_NEED_RESCHED on the current task
- scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
remote task it follows it up with a reschedule IPI
and we can modify the cpu local preempt_count from
there.
- cpu_idle_loop() for when resched_task() found tsk_is_polling().
We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.
Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previuos decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.
The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).
Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).
Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: lenb@kernel.org
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain by approximately 1% per second. Also
decay the rq's max_idle_balance_cost value.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less then the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.
By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved. The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.
For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex. The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:
OOM invoking task:
mem_cgroup_handle_oom+0x241/0x3b0
mem_cgroup_cache_charge+0xbe/0xe0
add_to_page_cache_locked+0x4c/0x140
add_to_page_cache_lru+0x22/0x50
grab_cache_page_write_begin+0x8b/0xe0
ext3_write_begin+0x88/0x270
generic_file_buffered_write+0x116/0x290
__generic_file_aio_write+0x27c/0x480
generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
do_sync_write+0xea/0x130
vfs_write+0xf3/0x1f0
sys_write+0x51/0x90
system_call_fastpath+0x18/0x1d
OOM kill victim:
do_truncate+0x58/0xa0 # takes i_mutex
do_last+0x250/0xa30
path_openat+0xd7/0x440
do_filp_open+0x49/0xa0
do_sys_open+0x106/0x240
sys_open+0x20/0x30
system_call_fastpath+0x18/0x1d
The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.
A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit. But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks. For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.
This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:
1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.
2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
Debugged by Michal Hocko.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.
Enable the memcg OOM killer only for user faults, where it's really the
only option available.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_struct->pid/tgid should go away.
1. Change same_thread_group() to use task->signal for comparison.
2. Change has_group_leader_pid(task) to compare task_pid(task) with
signal->leader_pid.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull KVM updates from Gleb Natapov:
"The highlights of the release are nested EPT and pv-ticketlocks
support (hypervisor part, guest part, which is most of the code, goes
through tip tree). Apart of that there are many fixes for all arches"
Fix up semantic conflicts as discussed in the pull request thread..
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
ARM: KVM: Add newlines to panic strings
ARM: KVM: Work around older compiler bug
ARM: KVM: Simplify tracepoint text
ARM: KVM: Fix kvm_set_pte assignment
ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
ARM: KVM: vgic: fix GICD_ICFGRn access
ARM: KVM: vgic: simplify vgic_get_target_reg
KVM: MMU: remove unused parameter
KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
KVM: PPC: Book3S: Fix compile error in XICS emulation
KVM: PPC: Book3S PR: return appropriate error when allocation fails
arch: powerpc: kvm: add signed type cast for comparation
KVM: x86: add comments where MMIO does not return to the emulator
KVM: vmx: count exits to userspace during invalid guest emulation
KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
kvm: optimize away THP checks in kvm_is_mmio_pfn()
...
This reverts commit df54d6fa54.
The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.
In particular, the MATLAB Java runtime seems to be unhappy. See
https://bugzilla.kernel.org/show_bug.cgi?id=60774
So we'll want to apply the random offset only once, and Radu has a patch
for that. Revert this older commit in order to apply the other one.
Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJSCDSjAAoJEHm+PkMAQRiGDXMIAI7Loae0Oqb1eoeJkvjyZsBS
OJDeeEcn+k58VbxVHyRdc7hGo4yI4tUZm172SpnOaM8sZ/ehPU7zBrwJK2lzX334
/jAM3uvVPfxA2nu0I4paNpkED/NQ8NRRsYE1iTE8dzHXOH6dA3mgp5qfco50rQvx
rvseXpME4KIAJEq4jnyFZF5+nuHiPueM9JftPmSSmJJ3/KY9kY1LESovyWd7ttg1
jYSVPFal9J0E+tl2UQY5g9H16GqhhjYn+39Iei6Q5P4bL4ZubQgTRQTN9nyDc06Z
ezQtGoqZ8kEz/2SyRlkda6PzjSEhgXlc8mCL5J7AW+dMhTHHx2IrosjiCA80kG8=
=c0rK
-----END PGP SIGNATURE-----
Merge tag 'v3.11-rc5' into perf/core
Merge Linux 3.11-rc5, to sync up with the latest upstream fixes since -rc1.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge a bunch of fixes from Andrew Morton.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/proc/task_mmu.c: fix buffer overflow in add_page_map()
arch: *: Kconfig: add "kernel/Kconfig.freezer" to "arch/*/Kconfig"
ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id()
x86 get_unmapped_area(): use proper mmap base for bottom-up direction
ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page
ocfs2: Revert 40bd62e to avoid regression in extended allocation
drivers/rtc/rtc-stmp3xxx.c: provide timeout for potentially endless loop polling a HW bit
hugetlb: fix lockdep splat caused by pmd sharing
aoe: adjust ref of head for compound page tails
microblaze: fix clone syscall
mm: save soft-dirty bits on file pages
mm: save soft-dirty bits on swapped pages
memcg: don't initialize kmem-cache destroying work for root caches
When the stack is set to unlimited, the bottomup direction is used for
mmap-ings but the mmap_base is not used and thus effectively renders
ASLR for mmapings along with PIE useless.
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar:
"Docbook fixes that make 99% of the diffstat, plus a oneliner fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Ensure update_cfs_shares() is called for parents of continuously-running tasks
sched: Fix some kernel-doc warnings
Calling freeze_processes sets a global flag that will cause any
process that calls try_to_freeze to enter the refrigerator. It
skips sending a signal to the current task, but if the current
task ever hits try_to_freeze, all threads will be frozen and the
system will deadlock.
Set a new flag, PF_SUSPEND_TASK, on the task that calls
freeze_processes. The flag notifies the freezer that the thread
is involved in suspend and should not be frozen. Also add a
WARN_ON in thaw_processes if the caller does not have the
PF_SUSPEND_TASK flag set to catch if a different task calls
thaw_processes than the one that called freeze_processes, leaving
a task with PF_SUSPEND_TASK permanently set on it.
Threads that spawn off a task with PF_SUSPEND_TASK set (which
swsusp does) will also have PF_SUSPEND_TASK set, preventing them
from freezing while they are helping with suspend, but they need
to be dead by the time suspend is triggered, otherwise they may
run when userspace is expected to be frozen. Add a WARN_ON in
thaw_processes if more than one thread has the PF_SUSPEND_TASK
flag set.
Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The wake-affine scheduler feature is currently always trying to pull
the wakee close to the waker. In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.
Testing shows it can benefit hackbench up to 15%.
However, the feature is somewhat blind, from which some workloads
such as pgbench suffer. It's also time-consuming algorithmically.
Testing shows it can damage pgbench up to 50% - far more than the
benefit it brings in the best case.
So wake-affine should be smarter and it should realize when to
stop its thankless effort at trying to find a suitable CPU to wake on.
This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.
So a high 'wakee_flips' value means the task has more than one
wakee, and the bigger the number, the higher the wakeup frequency.
Now when making the decision on whether to pull or not, pay attention to
the wakee with a high 'wakee_flips', pulling such a task may benefit
the wakee. Also imply that the waker will face cruel competition later,
it could be very cruel or very fast depends on the story behind
'wakee_flips', waker therefore suffers.
Furthermore, if waker also has a high 'wakee_flips', that implies that
multiple tasks rely on it, then waker's higher latency will damage all
of them, so pulling wakee seems to be a bad deal.
Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
higher and higher, the cost of pulling seems to be worse and worse.
The patch therefore helps the wake-affine feature to stop its pulling
work when:
wakee->wakee_flips > factor &&
waker->wakee_flips > (factor * wakee->wakee_flips)
The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node will lead to more pulling since the trial becomes more
severe.
After applying the patch, pgbench shows up to 40% improvements and no regressions.
Tested with 12 cpu x86 server and tip 3.10.0-rc7.
The percentages in the final column highlight the areas with the biggest wins,
all other areas improved as well:
pgbench base smart
| db_size | clients | tps | | tps |
+---------+---------+-------+ +-------+
| 22 MB | 1 | 10598 | | 10796 |
| 22 MB | 2 | 21257 | | 21336 |
| 22 MB | 4 | 41386 | | 41622 |
| 22 MB | 8 | 51253 | | 57932 |
| 22 MB | 12 | 48570 | | 54000 |
| 22 MB | 16 | 46748 | | 55982 | +19.75%
| 22 MB | 24 | 44346 | | 55847 | +25.93%
| 22 MB | 32 | 43460 | | 54614 | +25.66%
| 7484 MB | 1 | 8951 | | 9193 |
| 7484 MB | 2 | 19233 | | 19240 |
| 7484 MB | 4 | 37239 | | 37302 |
| 7484 MB | 8 | 46087 | | 50018 |
| 7484 MB | 12 | 42054 | | 48763 |
| 7484 MB | 16 | 40765 | | 51633 | +26.66%
| 7484 MB | 24 | 37651 | | 52377 | +39.11%
| 7484 MB | 32 | 37056 | | 51108 | +37.92%
| 15 GB | 1 | 8845 | | 9104 |
| 15 GB | 2 | 19094 | | 19162 |
| 15 GB | 4 | 36979 | | 36983 |
| 15 GB | 8 | 46087 | | 49977 |
| 15 GB | 12 | 41901 | | 48591 |
| 15 GB | 16 | 40147 | | 50651 | +26.16%
| 15 GB | 24 | 37250 | | 52365 | +40.58%
| 15 GB | 32 | 36470 | | 50015 | +37.14%
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linux as a guest on KVM hypervisor, the only user of the pvclock
vsyscall interface, does not require notification on task migration
because:
1. cpu ID number maps 1:1 to per-CPU pvclock time info.
2. per-CPU pvclock time info is updated if the
underlying CPU changes.
3. that version is increased whenever underlying CPU
changes.
Which is sufficient to guarantee nanoseconds counter
is calculated properly.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
When building the htmldocs (in verbose mode), scripts/kernel-doc
reports the follwing type of warnings:
Warning(kernel/sched/core.c:936): No description found for return value of 'task_curr'
...
Fix those by:
- adding the missing descriptions
- using "Return" sections for the descriptions
Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373654747-2389-1-git-send-email-yacine.belkadi.1@gmail.com
[ While at it, fix the cpupri_set() explanation. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
"This is a re-do of the net-next pull request for the current merge
window. The only difference from the one I made the other day is that
this has Eliezer's interface renames and the timeout handling changes
made based upon your feedback, as well as a few bug fixes that have
trickeled in.
Highlights:
1) Low latency device polling, eliminating the cost of interrupt
handling and context switches. Allows direct polling of a network
device from socket operations, such as recvmsg() and poll().
Currently ixgbe, mlx4, and bnx2x support this feature.
Full high level description, performance numbers, and design in
commit 0a4db187a9 ("Merge branch 'll_poll'")
From Eliezer Tamir.
2) With the routing cache removed, ip_check_mc_rcu() gets exercised
more than ever before in the case where we have lots of multicast
addresses. Use a hash table instead of a simple linked list, from
Eric Dumazet.
3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
Marek Puzyniak, Michal Kazior, and Sujith Manoharan.
4) Support reporting the TUN device persist flag to userspace, from
Pavel Emelyanov.
5) Allow controlling network device VF link state using netlink, from
Rony Efraim.
6) Support GRE tunneling in openvswitch, from Pravin B Shelar.
7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
Daniel Borkmann and Eric Dumazet.
8) Allow controlling of TCP quickack behavior on a per-route basis,
from Cong Wang.
9) Several bug fixes and improvements to vxlan from Stephen
Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
support receiving on multiple UDP ports.
10) Major cleanups, particular in the area of debugging and cookie
lifetime handline, to the SCTP protocol code. From Daniel
Borkmann.
11) Allow packets to cross network namespaces when traversing tunnel
devices. From Nicolas Dichtel.
12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
manner akin to how we monitor real network traffic via ptype_all.
From Daniel Borkmann.
13) Several bug fixes and improvements for the new alx device driver,
from Johannes Berg.
14) Fix scalability issues in the netem packet scheduler's time queue,
by using an rbtree. From Eric Dumazet.
15) Several bug fixes in TCP loss recovery handling, from Yuchung
Cheng.
16) Add support for GSO segmentation of MPLS packets, from Simon
Horman.
17) Make network notifiers have a real data type for the opaque
pointer that's passed into them. Use this to properly handle
network device flag changes in arp_netdev_event(). From Jiri
Pirko and Timo Teräs.
18) Convert several drivers over to module_pci_driver(), from Peter
Huewe.
19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
O(1) calculation instead. From Eric Dumazet.
20) Support setting of explicit tunnel peer addresses in ipv6, just
like ipv4. From Nicolas Dichtel.
21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.
22) Prevent a single high rate flow from overruning an individual cpu
during RX packet processing via selective flow shedding. From
Willem de Bruijn.
23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
Dumazet.
24) Don't just drop GSO packets which are above the TBF scheduler's
burst limit, chop them up so they are in-bounds instead. Also
from Eric Dumazet.
25) VLAN offloads are missed when configured on top of a bridge, fix
from Vlad Yasevich.
26) Support IPV6 in ping sockets. From Lorenzo Colitti.
27) Receive flow steering targets should be updated at poll() time
too, from David Majnemer.
28) Fix several corner case regressions in PMTU/redirect handling due
to the routing cache removal, from Timo Teräs.
29) We have to be mindful of ipv4 mapped ipv6 sockets in
upd_v6_push_pending_frames(). From Hannes Frederic Sowa.
30) Fix L2TP sequence number handling bugs, from James Chapman."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
drivers/net: caif: fix wrong rtnl_is_locked() usage
drivers/net: enic: release rtnl_lock on error-path
vhost-net: fix use-after-free in vhost_net_flush
net: mv643xx_eth: do not use port number as platform device id
net: sctp: confirm route during forward progress
virtio_net: fix race in RX VQ processing
virtio: support unlocked queue poll
net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
Documentation: Fix references to defunct linux-net@vger.kernel.org
net/fs: change busy poll time accounting
net: rename low latency sockets functions to busy poll
bridge: fix some kernel warning in multicast timer
sfc: Fix memory leak when discarding scattered packets
sit: fix tunnel update via netlink
dt:net:stmmac: Add dt specific phy reset callback support.
dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
dt:net:stmmac: Allocate platform data only if its NULL.
net:stmmac: fix memleak in the open method
ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
net: ipv6: fix wrong ping_v6_sendmsg return value
...
This reverts commit bf26c01849 ("Prepare to fix racy accesses on task
breakpoints").
The patch was fine but we can no longer race with SIGKILL after commit
9899d11f65 ("ptrace: ensure arch_ptrace/ptrace_request can never race
with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
->ptrace_bps[] can't go away.
Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers,
we can kill them and remove task->ptrace_bp_refcnt.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.
And rename it to set_special_pids(), another helper with this name has
gone away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.
Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted
patch(introduced in 9ee474f), but also need to revert.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is intended for use in loops which read data protected by RCU and may
have a large number of iterations. Such an example is dumping the list of
connections known to IPVS: ip_vs_conn_array() and ip_vs_conn_seq_next().
The benefits are for CONFIG_PREEMPT_RCU=y where we save CPU cycles
by moving rcu_read_lock and rcu_read_unlock out of large loops
but still allowing the current task to be preempted after every
loop iteration for the CONFIG_PREEMPT_RCU=n case.
The call to cond_resched() is not needed when CONFIG_PREEMPT_RCU=y.
Thanks to Paul E. McKenney for explaining this and for the
final version that checks the context with CONFIG_DEBUG_ATOMIC_SLEEP=y
for all possible configurations.
The function can be empty in the CONFIG_PREEMPT_RCU case,
rcu_read_lock and rcu_read_unlock are not needed in this case
because the task can be preempted on indication from scheduler.
Thanks to Peter Zijlstra for catching this and for his help
in trying a solution that changes __might_sleep.
Initial cond_resched_rcu_lock() function suggested by Eric Dumazet.
Tested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull audit changes from Eric Paris:
"Al used to send pull requests every couple of years but he told me to
just start pushing them to you directly.
Our touching outside of core audit code is pretty straight forward. A
couple of interface changes which hit net/. A simple argument bug
calling audit functions in namei.c and the removal of some assembly
branch prediction code on ppc"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: fix message spacing printing auid
Revert "audit: move kaudit thread start from auditd registration to kaudit init"
audit: vfs: fix audit_inode call in O_CREAT case of do_last
audit: Make testing for a valid loginuid explicit.
audit: fix event coverage of AUDIT_ANOM_LINK
audit: use spin_lock in audit_receive_msg to process tty logging
audit: do not needlessly take a lock in tty_audit_exit
audit: do not needlessly take a spinlock in copy_signal
audit: add an option to control logging of passwords with pam_tty_audit
audit: use spin_lock_irqsave/restore in audit tty code
helper for some session id stuff
audit: use a consistent audit helper to log lsm information
audit: push loginuid and sessionid processing down
audit: stop pushing loginid, uid, sessionid as arguments
audit: remove the old depricated kernel interface
audit: make validity checking generic
audit: allow checking the type of audit message in the user filter
audit: fix build break when AUDIT_DEBUG == 2
audit: remove duplicate export of audit_enabled
Audit: do not print error when LSMs disabled
...
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
Pull 'full dynticks' support from Ingo Molnar:
"This tree from Frederic Weisbecker adds a new, (exciting! :-) core
kernel feature to the timer and scheduler subsystems: 'full dynticks',
or CONFIG_NO_HZ_FULL=y.
This feature extends the nohz variable-size timer tick feature from
idle to busy CPUs (running at most one task) as well, potentially
reducing the number of timer interrupts significantly.
This feature got motivated by real-time folks and the -rt tree, but
the general utility and motivation of full-dynticks runs wider than
that:
- HPC workloads get faster: CPUs running a single task should be able
to utilize a maximum amount of CPU power. A periodic timer tick at
HZ=1000 can cause a constant overhead of up to 1.0%. This feature
removes that overhead - and speeds up the system by 0.5%-1.0% on
typical distro configs even on modern systems.
- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.
- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.
The cost of the feature is mainly related to increased timer
reprogramming overhead when a CPU switches its tick period, and thus
slightly longer to-idle and from-idle latency.
Configuration-wise a third mode of operation is added to the existing
two NOHZ kconfig modes:
- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
as a config option. This is the traditional Linux periodic tick
design: there's a HZ tick going on all the time, regardless of
whether a CPU is idle or not.
- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.
- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
tick when a CPU is idle, also slows the tick down to 1 Hz (one
timer interrupt per second) when only a single task is running on a
CPU.
The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the
user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
default.
This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support
and non-periodic cputime support in particular is upstream already.
This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:
- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.
- it has a number of fresh commits that came in after the merge
window. The overwhelming majority of commits are from before the
merge window, but still some aspects of the tree are fresh and so I
marked it RFC.
- it's a pretty wide-reaching feature with lots of effects - and
while the components have been in testing for some time, the full
combination is still not very widely used. That it's default-off
should reduce its regression abilities and obviously there are no
known regressions with CONFIG_NO_HZ_FULL=y enabled either.
- the feature is not completely idempotent: there is no 100%
equivalent replacement for a periodic scheduler/timer tick. In
particular there's ongoing work to map out and reduce its effects
on scheduler load-balancing and statistics. This should not impact
correctness though, there are no known regressions related to this
feature at this point.
- it's a pretty ambitious feature that with time will likely be
enabled by most Linux distros, and we'd like you to make input on
its design/implementation, if you dislike some aspect we missed.
Without flaming us to crisp! :-)
Future plans:
- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go
for the 0 Hz target though.
- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only
as frequently as the sched_latency constraints require us to do -
once every 4-40 msecs, depending on nr_running.
I am personally leaning towards biting the bullet and doing this in
v3.10, like the -rt tree this effort has been going on for too long -
but the final word is up to you as usual.
More technical details can be found in Documentation/timers/NO_HZ.txt"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
sched: Keep at least 1 tick per second for active dynticks tasks
rcu: Fix full dynticks' dependency on wide RCU nocb mode
nohz: Protect smp_processor_id() in tick_nohz_task_switch()
nohz_full: Add documentation.
cputime_nsecs: use math64.h for nsec resolution conversion helpers
nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
nohz: Reduce overhead under high-freq idling patterns
nohz: Remove full dynticks' superfluous dependency on RCU tree
nohz: Fix unavailable tick_stop tracepoint in dynticks idle
nohz: Add basic tracing
nohz: Select wide RCU nocb for full dynticks
nohz: Disable the tick when irq resume in full dynticks CPU
nohz: Re-evaluate the tick for the new task after a context switch
nohz: Prepare to stop the tick on irq exit
nohz: Implement full dynticks kick
nohz: Re-evaluate the tick from the scheduler IPI
sched: New helper to prevent from stopping the tick in full dynticks
sched: Kick full dynticks CPU that have more than one task enqueued.
perf: New helper to prevent full dynticks CPUs from stopping tick
perf: Kick full dynticks CPU if events rotation is needed
...
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.
In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.
This makes sure that we keep the progression of various scheduler
accounting and background maintainance even with a very low granularity.
Examples include cpu load, sched average, CFS entity vruntime,
avenrun and events such as load balancing, amongst other details
handled in sched_class::task_tick().
This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.
Merge a common upstream merge point that has these
updates.
Conflicts:
include/linux/perf_event.h
kernel/rcutree.h
kernel/rcutree_plugin.h
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>