Before we used tmpfile() to write out mount entries for the container. This
requires a writeable /tmp file system which can be a problem for systems where
this filesystem is not present. This commit switches from tmpfile() to using
the memfd_create() syscall. It allows us to create an anonymous tmpfs file (And
is somewhat similar to mmap().) which is automatically deleted as soon as any
references to it are dropped. In case we detect that syscall is not
implemented, we fallback to using tmpfile().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
localtime_r() can lead to deadlocks because it calls __tzset() and
__tzconvert() internally. The deadlock stems from an interaction between these
functions and the functions in monitor.c and commands.{c,h}. The latter
functions will write to the log independent of the container thread that is
currently running. Since the monitor fork()ed it seems to duplicate the mutex
states of the time functions mentioned above causing the deadlock.
As a short termm fix, I suggest to simply disable receiving the time when
monitor.c or command.{c,h} functions are called. This should be ok, since the
[lxc monitor] will only emit a few messages and thread-safety is currently more
important than beautiful logs. The rest of the log stays the same as it was
before.
Here is an example output from logs where I printed the pid and tid of the
process that is currently writing to the log:
lxc 20161125170200.619 INFO lxc_start: 18695-18695: - start.c:lxc_check_inherited:243 - Closed inherited fd: 23.
lxc 20161125170200.640 DEBUG lxc_start: 18677-18677: - start.c:__lxc_start:1334 - Not dropping CAP_SYS_BOOT or watching utmp.
lxc 20161125170200.640 INFO lxc_cgroup: 18677-18677: - cgroups/cgroup.c:cgroup_init:68 - cgroup driver cgroupfs-ng initing for lxc-test-concurrent-0
----------> lxc 20150427012246.000 INFO lxc_monitor: 13017-18622: - monitor.c:lxc_monitor_sock_name:178 - using monitor sock name lxc/ad055575fe28ddd5//var/lib/lxc
lxc 20161125170200.662 DEBUG lxc_cgfsng: 18677-18677: - cgroups/cgfsng.c:filter_and_set_cpus:478 - No isolated cpus detected.
lxc 20161125170200.662 DEBUG lxc_cgfsng: 18677-18677: - cgroups/cgfsng.c:handle_cpuset_hierarchy:648 - "cgroup.clone_children" was already set to "1".
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This fixes a race in liblxc logging which can lead to deadlocks. The reproducer
for this issue before this is to simply compile with --enable-tests and then
run:
lxc-test-concurrent -j 20 -m create,start,stop,destroy -D
which should deadlock.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
So far, we opened a file descriptor refering to proc on the host inside the
host namespace and handed that fd to the attached process in
attach_child_main(). This was done to ensure that LSM labels were correctly
setup. However, by exploiting a potential kernel bug, ptrace could be used to
prevent the file descriptor from being closed which in turn could be used by an
unprivileged container to gain access to the host namespace. Aside from this
needing an upstream kernel fix, we should make sure that we don't pass the fd
for proc itself to the attached process. However, we cannot completely prevent
this, as the attached process needs to be able to change its apparmor profile
by writing to /proc/self/attr/exec or /proc/self/attr/current. To minimize the
attack surface, we only send the fd for /proc/self/attr/exec or
/proc/self/attr/current to the attached process. To do this we introduce a
little more IPC between the child and parent:
* IPC mechanism: (X is receiver)
* initial process intermediate attached
* X <--- send pid of
* attached proc,
* then exit
* send 0 ------------------------------------> X
* [do initialization]
* X <------------------------------------ send 1
* [add to cgroup, ...]
* send 2 ------------------------------------> X
* [set LXC_ATTACH_NO_NEW_PRIVS]
* X <------------------------------------ send 3
* [open LSM label fd]
* send 4 ------------------------------------> X
* [set LSM label]
* close socket close socket
* run program
The attached child tells the parent when it is ready to have its LSM labels set
up. The parent then opens an approriate fd for the child PID to
/proc/<pid>/attr/exec or /proc/<pid>/attr/current and sends it via SCM_RIGHTS
to the child. The child can then set its LSM laben. Both sides then close the
socket fds and the child execs the requested process.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
They do not behave correctly on some architectures, so let's remove them for
now and come up with better ones later.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The identifiers for namespaces used with lxc-unshare and lxc-attach as given on
the manpage do not align with the standard identifiers. This affects network,
mount, and uts namespaces. The standard identifiers are: "mnt", "uts", and
"net" whereas lxc-unshare and lxc-attach use "MOUNT", "UTSNAME", and "NETWORK".
I'm weary to hack this into namespace.{c.h} by e.g. adding additional members
to the ns_info struct or to special case this in lxc_fill_namespace_flags().
Internally, we should only accept standard identifiers to ensure that we are
always correctly aligned with the kernel. So let's use some cheap memmove()s to
replace them by their standard identifiers in lxc-unshare and lxc-attach.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>