This fixes a race in liblxc logging which can lead to deadlocks. The reproducer
for this issue before this is to simply compile with --enable-tests and then
run:
lxc-test-concurrent -j 20 -m create,start,stop,destroy -D
which should deadlock.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
So far, we opened a file descriptor refering to proc on the host inside the
host namespace and handed that fd to the attached process in
attach_child_main(). This was done to ensure that LSM labels were correctly
setup. However, by exploiting a potential kernel bug, ptrace could be used to
prevent the file descriptor from being closed which in turn could be used by an
unprivileged container to gain access to the host namespace. Aside from this
needing an upstream kernel fix, we should make sure that we don't pass the fd
for proc itself to the attached process. However, we cannot completely prevent
this, as the attached process needs to be able to change its apparmor profile
by writing to /proc/self/attr/exec or /proc/self/attr/current. To minimize the
attack surface, we only send the fd for /proc/self/attr/exec or
/proc/self/attr/current to the attached process. To do this we introduce a
little more IPC between the child and parent:
* IPC mechanism: (X is receiver)
* initial process intermediate attached
* X <--- send pid of
* attached proc,
* then exit
* send 0 ------------------------------------> X
* [do initialization]
* X <------------------------------------ send 1
* [add to cgroup, ...]
* send 2 ------------------------------------> X
* [set LXC_ATTACH_NO_NEW_PRIVS]
* X <------------------------------------ send 3
* [open LSM label fd]
* send 4 ------------------------------------> X
* [set LSM label]
* close socket close socket
* run program
The attached child tells the parent when it is ready to have its LSM labels set
up. The parent then opens an approriate fd for the child PID to
/proc/<pid>/attr/exec or /proc/<pid>/attr/current and sends it via SCM_RIGHTS
to the child. The child can then set its LSM laben. Both sides then close the
socket fds and the child execs the requested process.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
They do not behave correctly on some architectures, so let's remove them for
now and come up with better ones later.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The identifiers for namespaces used with lxc-unshare and lxc-attach as given on
the manpage do not align with the standard identifiers. This affects network,
mount, and uts namespaces. The standard identifiers are: "mnt", "uts", and
"net" whereas lxc-unshare and lxc-attach use "MOUNT", "UTSNAME", and "NETWORK".
I'm weary to hack this into namespace.{c.h} by e.g. adding additional members
to the ns_info struct or to special case this in lxc_fill_namespace_flags().
Internally, we should only accept standard identifiers to ensure that we are
always correctly aligned with the kernel. So let's use some cheap memmove()s to
replace them by their standard identifiers in lxc-unshare and lxc-attach.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This function safely parses an unsigned integer. On success it returns 0 and
stores the unsigned integer in @converted. On error it returns a negative
errno.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
If the file "/sys/devices/system/cpu/isolated" doesn't exist, we can't just
simply bail. We still need to check whether we need to copy the parents cpu
settings.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- add more logging
- only write to cpuset.cpus if we really have to
- simplify cleanup on error and success
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Move the user namespace at the first position in the array so that we always
attach to it first when iterating over the struct and using setns() to switch
namespaces. This especially affects lxc_attach(): Suppose you cloned a new user
namespace and mount namespace as an unprivileged user on the host and want to
setns() to the mount namespace. This requires you to attach to the user
namespace first otherwise the kernel will fail this check:
if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
!ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;
in
linux/fs/namespace.c:mntns_install().
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Using custom structs in attach.c risks getting out of sync with the commonly
used ns_info[LXC_NS_MAX] struct and thus attaching to wrong namespaces. Switch
to using ns_info[LXC_NS_MAX].
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- simply check /proc/self/ns
- improve SYSERROR() report
- use #define to prevent gcc & clang to use a VLA
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Improve log and comments in a bunch of places to make it easier for us on bug
reports.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>