mirror_ubuntu-kernels/include/linux/sched
Adrian Reber 49cb2fc42c fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15 23:49:22 +01:00
..
autogroup.h
clock.h
coredump.h
cpufreq.h
cputime.h posix-cpu-timers: Move state tracking to struct posix_cputimers 2019-08-28 11:50:42 +02:00
deadline.h cpusets: Rebuild root domain deadline accounting information 2019-07-25 15:55:01 +02:00
debug.h
hotplug.h
idle.h
init.h
isolation.h KVM: LAPIC: Inject timer interrupt via posted interrupt 2019-07-20 09:00:40 +02:00
jobctl.h cgroup: cgroup v2 freezer 2019-04-19 11:26:48 -07:00
loadavg.h
mm.h sched/membarrier: Fix p->mm->membarrier_state racy load 2019-09-25 17:42:30 +02:00
nohz.h sched/fair: Remove the rq->cpu_load[] update code 2019-06-03 11:49:38 +02:00
numa_balancing.h sched/fair: Don't free p->numa_faults with concurrent readers 2019-07-25 15:37:04 +02:00
prio.h
rt.h
signal.h posix-cpu-timers: Move state tracking to struct posix_cputimers 2019-08-28 11:50:42 +02:00
smt.h
stat.h
sysctl.h sched/uclamp: Add system default clamps 2019-06-24 19:23:45 +02:00
task_stack.h
task.h fork: extend clone3() to support setting a PID 2019-11-15 23:49:22 +01:00
topology.h sched/topology: Add partition_sched_domains_locked() 2019-07-25 15:51:57 +02:00
types.h posix-cpu-timers: Provide array based access to expiry cache 2019-08-28 11:50:35 +02:00
user.h keys: Move the user and user-session keyrings to the user_namespace 2019-06-26 21:02:32 +01:00
wake_q.h locking/rwsem: Always release wait_lock before waking up tasks 2019-06-17 12:28:00 +02:00
xacct.h