On set{g,u}id() the kernel does:
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
smp_wmb();
}
which means we need to re-enable the deat signal after the set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Since we are now dumpable we can open /proc/<child-pid>/ns/cgroup so let's
avoid the overhead of sending around fds.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When set set{u,g}id() the kernel will make us undumpable. This is unnecessary
since we can guarantee that whatever is running inside the child process at
this point this is fully trusted by the parent. Making us dumpable let's users
use debuggers on the child process before the exec as well and also allows us
to open /proc/<child-pid> files in lieu of the child.
Note, that we only need to perform the prctl(PR_SET_DUMPABLE, ...) if our
effective uid on the host is not 0. If our effective uid on the host is 0 then
we will keep all capabilities in the child user namespace across set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This is to avoid bad surprises caused by older glibc's pid cache (up to 2.25)
when using clone().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Because of older glibc's pid cache (up to 2.25) whenever clone() is called the
child must must retrieve it's own pid via lxc_raw_getpid().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Receive fd for LSM security module before we set{g,u}id(). The reason is that
on set{g,u}id() the kernel will a) make us undumpable and b) we will change our
effective uid. This means our effective uid will be different from the
effective uid of the process that created us which means that this processs no
longer has capabilities in our namespace including CAP_SYS_PTRACE. This means
we will not be able to read and /proc/<pid> files for the process anymore when
/proc is mounted with hidepid={1,2}. So let's get the lsm label fd before the
set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This let's us simplify the whole file a lot and makes things way clearer. It
also let's us avoid the infamous pid cache.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
At this point, macros such DEBUG or ERROR does not take effect because
this code is called from cgroup_ops_init(cgroup.c), which runs with
__attribute__((constructor)), before any log level is set form any tool
like lxc-start, so these messages are lost.
For now on, use the same LXC_DEBUG_CGFSNG environment variable to
control these messages.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Excerpt from dnsmasq(8):
By default, the DHCP server will attempt to ensure that an address in not
in use before allocating it to a host. It does this by sending an ICMP echo
request (aka "ping") to the address in question. If it gets a reply, then the
address must already be in use, and another is tried. This flag disables this check.
This is useful if one expects all the containers to get an IP address
from the LXC authoritative DHCP server and wants to speed up the process
of getting a lease.
Signed-off-by: Jonathan Calmels <jcalmels@nvidia.com>
- Merge dhclient-start and dhclient-stop into a single hook.
- Wait for a lease before returning from the hook.
- Generate a logfile when LXC log level is either DEBUG or TRACE.
- Rely on namespace file descriptors for the stop hook.
- Use settings from /<sysconf>/lxc/dhclient.conf if available.
- Attempt to cleanup if dhclient fails to shutdown properly.
Signed-off-by: Jonathan Calmels <jcalmels@nvidia.com>
This hook requires the nvidia-container-cli tool provided by libnvidia-container:
https://github.com/nvidia/libnvidia-container
For containers that do not have CUDA_VERSION or NVIDIA_VISIBLE_DEVICES
set in the environment, the hook will be a no-op.
To enable in the configuration file:
lxc.hook.mount = /usr/local/share/lxc/hooks/nvidia
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
This is based on raw_clone in systemd but adapted to our needs. The main reason
is that we need an implementation of fork()/clone() that does guarantee us that
no pthread_atfork() handlers are run. While clone() in glibc currently doesn't
run pthread_atfork() handlers we should be fine but there's no guarantee that
this won't be the case in the future. So let's do the syscall directly - or as
direct as we can. An additional nice feature is that we get fork() behavior,
i.e. lxc_raw_clone() returns 0 in the child and the child pid in the parent.
Our implementation tries to make sure that we cover all cases according to
kernel sources. Note that we are not interested in any arguments that could be
passed after the stack.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>