This is the default thread stack size for glibc, so it is reasonable to match
that when we clone().
Mostly this is a science experiment suggested by brauner, and who doesn't
love science?
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Handle offline cpus in v1 hierarchy.
In addition to isolated cpus we also need to account for offline cpus when our
ancestor cgroup is the root cgroup and we have not been initialized yet.
Closes #2953.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
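A minimal sketch of the accounting described above, assuming the CPU sets are available as kernel-style list strings such as "0-3,6" (the format used by files like /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/offline). The helper names parse_cpulist() and usable_cpus() are hypothetical and are not the LXC implementation; the sketch only handles CPUs 0 to 63.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a kernel-style CPU list such as "0-3,6"
 * into a bitmask. Only CPUs 0..63 are represented in this sketch. */
static uint64_t parse_cpulist(const char *list)
{
	uint64_t mask = 0;
	const char *p = list;

	assert(list != NULL);

	while (*p != '\0' && *p != '\n') {
		char *end;
		unsigned long first = strtoul(p, &end, 10);
		unsigned long last = first;

		if (end == p)
			break; /* not a number: stop parsing */

		if (*end == '-') {
			p = end + 1;
			last = strtoul(p, &end, 10);
			if (end == p)
				break; /* malformed range: stop parsing */
		}

		for (unsigned long cpu = first; cpu <= last && cpu < 64; cpu++)
			mask |= UINT64_C(1) << cpu;

		p = (*end == ',') ? end + 1 : end;
	}

	return mask;
}

/* Hypothetical helper: CPUs that are possible but neither isolated
 * nor offline. */
static uint64_t usable_cpus(const char *possible, const char *isolated,
			    const char *offline)
{
	uint64_t removed = parse_cpulist(isolated) | parse_cpulist(offline);

	return parse_cpulist(possible) & ~removed;
}
```

With possible CPUs "0-7", isolated CPUs "2-3" and offline CPU "6", the remaining set is CPUs 0, 1, 4, 5 and 7.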
Let lxc_attach() reuse the already initialized container.
Closes https://github.com/lxc/lxd/issues/5755.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Update lxc_restore_phys_nics_to_netns() to move phys netdevs back to the monitor's network namespace rather than the previously hardcoded PID 1 net ns.
This fixes cases where LXC is started inside a net ns different from that of PID 1: on container shutdown, physical devices were moved back to a net ns other than the one the container was started from.
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
We have a do_clone(), which just calls a void f(void *) that it gets
passed. We build up a struct consisting of two args that are just the
actual arg and actual function. Let's just have the syscall do this for us.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
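As an illustration of the point above, the glibc clone() wrapper already accepts a function and its argument, so no intermediate struct is needed. This is a minimal sketch, not the LXC code: run_in_child(), child_fn() and SKETCH_STACK_SIZE are hypothetical names, and the stack is heap-allocated purely for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical stack size for this sketch. */
#define SKETCH_STACK_SIZE (1024 * 1024)

/* Example child function: its return value becomes the exit status. */
static int child_fn(void *arg)
{
	return *(int *)arg;
}

/* Run fn(arg) in a cloned child and return its exit status, or -1. */
static int run_in_child(int (*fn)(void *), void *arg)
{
	int status;
	pid_t pid;
	char *stack;

	assert(fn != NULL);

	stack = malloc(SKETCH_STACK_SIZE);
	if (!stack)
		return -1;

	/* The stack grows down on most architectures, so pass its top.
	 * fn and arg are handed to clone() directly. */
	pid = clone(fn, stack + SKETCH_STACK_SIZE, SIGCHLD, arg);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	if (waitpid(pid, &status, 0) != pid) {
		free(stack);
		return -1;
	}

	free(stack);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_child(child_fn, &value) with value set to 7 returns 7 once the child has exited.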
There are two problems with this code:
1. The math is wrong. We allocate a char *foo[__LXC_STACK_SIZE]; which
means it's really sizeof(char *) * __LXC_STACK_SIZE, instead of just
__LXC_STACK_SIZE.
2. We can't actually allocate it on our stack. When we use CLONE_VM (which
we do in the shared ns case) that means that the new thread is just
running one page lower on the stack, but anything that allocates a page
on the stack may clobber data. This is a pretty short race window since
we just do the shared ns stuff and then do a clone without CLONE_VM.
However, it does point out an interesting possible privilege escalation if
things aren't configured correctly: do_share_ns() sets up namespaces while
it shares the address space of the task that spawned it; once it enters the
pid ns of the thing it's sharing with, the thing it's sharing with can
ptrace it and write stuff into the host's address space. Since the function
that does the clone() is lxc_spawn(), it has a struct cgroup_ops* on the
stack, which itself has function pointers called later in the function, so
it's possible to allocate shellcode in the address space of the host and
run it fairly easily.
ASLR doesn't mitigate this since we know exactly the stack offsets; however
this patch has the kernel allocate a new stack, which will help. Of course,
the attacker could just check /proc/pid/maps to find the location of the
stack, but they'd still have to guess where to write stuff in.
The thing that does prevent this is the default configuration of apparmor.
Since the apparmor profile is set in the second clone, and apparmor
prevents ptracing things under a different profile, attackers confined by
apparmor can't do this. However, if users are using a custom configuration
with shared namespaces, care must be taken to avoid this race.
Shared namespaces aren't widely used now, so perhaps this isn't a problem,
but with the advent of crio-lxc for k8s, this functionality will be used
more.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
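The size mismatch from the first problem above can be checked with sizeof alone. In this sketch SKETCH_STACK_SIZE is a stand-in for __LXC_STACK_SIZE, and its 8 MiB value is an assumption made for illustration; nothing is actually allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for __LXC_STACK_SIZE; the 8 MiB value is an assumption. */
#define SKETCH_STACK_SIZE (8 * 1024 * 1024)

/* Bytes occupied by an array of SKETCH_STACK_SIZE pointers. */
static size_t pointer_array_bytes(void)
{
	return sizeof(char *[SKETCH_STACK_SIZE]);
}

/* Bytes occupied by an array of SKETCH_STACK_SIZE chars. */
static size_t byte_array_bytes(void)
{
	return sizeof(char[SKETCH_STACK_SIZE]);
}

/* The pointer array is sizeof(char *) times larger than intended. */
static void check_stack_math(void)
{
	assert(byte_array_bytes() == SKETCH_STACK_SIZE);
	assert(pointer_array_bytes() == sizeof(char *) * byte_array_bytes());
}
```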
GLIBC supports %m to avoid calling strerror(). Using it saves some code space.
==> This check will define HAVE_M_FORMAT to be used wherever possible (e.g. log.h)
Signed-off-by: Rachid Koucha <rachid.koucha@gmail.com>
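A rough sketch of how code might use such a macro. The helper format_errno() is hypothetical, and defining HAVE_M_FORMAT from __GLIBC__ here is an assumption made only so the example is self-contained; in a real build the configure check would provide it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Assumption for this sketch: treat glibc as supporting %m when the
 * build system has not already defined HAVE_M_FORMAT. */
#if defined(__GLIBC__) && !defined(HAVE_M_FORMAT)
#define HAVE_M_FORMAT 1
#endif

/* Hypothetical helper: format "<what>: <error string>" into buf. */
static int format_errno(char *buf, size_t len, const char *what, int err)
{
	assert(buf != NULL && what != NULL);

#ifdef HAVE_M_FORMAT
	/* %m expands to the message for the current errno value, so no
	 * strerror() call or extra argument is needed. */
	errno = err;
	return snprintf(buf, len, "%s: %m", what);
#else
	return snprintf(buf, len, "%s: %s", what, strerror(err));
#endif
}
```

Both branches are meant to produce the same text; the %m branch simply avoids passing strerror() as an argument at each call site.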
Returning -1 in a function with return type bool is the same as
returning true. Change to return false to indicate error properly.
Detected with cppcheck.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
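The conversion behind this fix can be shown in a few lines. The function names broken_check() and fixed_check() are hypothetical and only illustrate the pattern; they are not the functions touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy pattern: -1 is nonzero, so it converts to true. */
static bool broken_check(int value)
{
	if (value < 0)
		return -1; /* intended as an error, but reads as success */

	return true;
}

/* Fixed pattern: report the error as false. */
static bool fixed_check(int value)
{
	if (value < 0)
		return false;

	return true;
}

/* The caller cannot see the error in the broken variant. */
static void check_bool_returns(void)
{
	assert(broken_check(-5) == true);
	assert(fixed_check(-5) == false);
}
```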
Returning -1 in a function with return type bool is the same as
returning true. Change to return false to indicate error properly.
Detected with cppcheck.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Since _exit() will terminate, the return statement is dead code. Also,
returning -1 from a function with bool as return type is confusing.
Detected with cppcheck.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
CRIU has only 4 levels of verbosity (errors, warnings, info, debug).
Thus, using `-v4` is more appropriate.
https://criu.org/Logging
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
lxc-ls without root privileges on privileged containers should not display
information. In lxc_container_new(), ongoing_create()'s result is not checked
for all possible returned values. Hence, an unprivileged user can send command
messages to the container's monitor. For example:
$ lxc-ls -P /.../tests -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
ctr - 0 - - - false
$ sudo lxc-ls -P /.../tests -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
ctr RUNNING 0 - 10.0.3.51 - false
After this change:
$ lxc-ls -P /.../tests -f <-------- No more display without root privileges
$ sudo lxc-ls -P /.../tests -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
ctr RUNNING 0 - 10.0.3.37 - false
$
Signed-off-by: Rachid Koucha <rachid.koucha@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
. Add the "--bbpath" option to pass an alternate busybox pathname instead of the one found in ${PATH}.
. Take this opportunity to add some formatting to the usage display.
. Since an attempt is made to pick rootfs from the config file and set it to ${path}/rootfs, making it mandatory is unnecessary.
Signed-off-by: Rachid Koucha <rachid.koucha@gmail.com>
Some error messages were not redirected to stderr.
Moreover, do "exit 0" instead of "exit 1" when the "help" option is passed.
Signed-off-by: Rachid Koucha <rachid.koucha@gmail.com>
Use CLONE_PIDFD when possible.
Note that the clone() syscall ignores unknown flags, which is usually a design
mistake. However, for us this bug is a feature since we can just pass the flag
along and see whether the kernel has given us a pidfd.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
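The probing idea can be sketched as follows: pass CLONE_PIDFD unconditionally, initialize the output to -1, and check afterwards whether the kernel filled it in. This is a minimal sketch, not the LXC code; spawn_with_optional_pidfd(), pidfd_child() and SKETCH_STACK_SIZE are hypothetical names, and the retry without the flag is a defensive assumption of this example rather than something the text above requires.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from linux/sched.h */
#endif

/* Hypothetical stack size for this sketch. */
#define SKETCH_STACK_SIZE (1024 * 1024)

/* Trivial child: exit immediately with status 0. */
static int pidfd_child(void *arg)
{
	(void)arg;
	return 0;
}

/* Returns the child's pid (or -1). *pidfd_out is a pidfd if the kernel
 * provided one, otherwise -1. */
static pid_t spawn_with_optional_pidfd(int *pidfd_out)
{
	char *stack;
	pid_t pid;
	int pidfd = -1;

	assert(pidfd_out != NULL);

	stack = malloc(SKETCH_STACK_SIZE);
	if (!stack)
		return -1;

	/* With the legacy clone() interface the pidfd is stored where
	 * the parent_tid argument points. */
	pid = clone(pidfd_child, stack + SKETCH_STACK_SIZE,
		    CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
	if (pid < 0) {
		/* Defensive fallback (assumption of this sketch): retry
		 * without the flag if the first attempt is rejected. */
		pidfd = -1;
		pid = clone(pidfd_child, stack + SKETCH_STACK_SIZE,
			    SIGCHLD, NULL);
	}

	/* No CLONE_VM: the child runs on its own copy of the stack. */
	free(stack);

	*pidfd_out = pidfd;
	return pid;
}
```

If the kernel ignored the flag, the caller sees -1 in pidfd_out and carries on with the plain pid; otherwise it owns a file descriptor that should eventually be closed.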
The phys devices will now have their original MTUs recorded at start and restored at shutdown.
This protects the original phys device from keeping any container-level MTU customisation once it is restored to the host.
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>