Backing stores supported by qemu-nbd can be attached to an nbd block
device using qemu-nbd. This user-space process (pair) stays around for
the duration of the device attachment. Obviously we want it to go away
when the container shuts down, but not before the filesystems have been
cleanly unmounted.
The device attachment is done from the task which will become the
container monitor before the container setup+init task is spawned.
That task starts in a new pid namespace to ensure that the qemu-nbd
process will be killed if need be. It sets its parent death signal
to SIGHUP and, on receiving SIGHUP, attempts a clean detach of the
nbd device, then exits. This should ensure that the device is
detached if the container monitor crashes or exits.
It may be worth adding a delay before the qemu-nbd is detached, but
my brief tests haven't seen any data corruption.
Only the parts required for running a nbd-backed container are
implemented here. Create, destroy, and clone are not. The first
use of this that I imagine is for people to use downloaded nbd-backed
images (like ubuntu cloud images, or anything previously used with
qemu). I imagine people will want to create/clone/destroy out of
band using qemu-img, but if I'm wrong about that we can implement
the rest later.
Because attach_block_device() is done before the bdev is initialized,
and bdev_init needs to know the nbd index so that it can mount the
filesystem, we now need to pass the lxc_conf.
file_exists() is moved to utils.c so we can use it from bdev.c
The nbd attach/detach should lay the groundwork for trivial implementation
of qed and raw images.
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
If the user specifies cgroup or cgroup-full without a specifier (:ro,
:rw or :mixed), this changes the behavior. Previously, these were
simple aliases for the :mixed variants; now they depend on whether the
container also has CAP_SYS_ADMIN; if it does they resolve to the :rw
variants, if it doesn't to the :mixed variants (as before).
If a container has CAP_SYS_ADMIN privileges, any filesystem can be
remounted read-write from within, so initially mounting the cgroup
filesystems partially read-only as a default creates a false sense of
security. It is better to default to full read-write mounts to show the
administrator what keeping CAP_SYS_ADMIN entails.
If an administrator really wants both CAP_SYS_ADMIN and the :mixed
variant of cgroup or cgroup-full automatic mounts, they can still
specify that explicitly; this commit just changes the default without
specifier.
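The new default can be summarized with a small sketch. The enum and the
function name resolve_cgroup_automount() are hypothetical and only model the
rule described above; they are not the identifiers used in the code.

```c
#include <stdbool.h>

enum cgroup_automount {
	CGROUP_MIXED,
	CGROUP_RO,
	CGROUP_RW,
	CGROUP_UNSPECIFIED, /* "cgroup" or "cgroup-full" without a specifier */
};

/*
 * An explicit specifier always wins. Without one, the result depends on
 * whether the container keeps CAP_SYS_ADMIN.
 */
enum cgroup_automount
resolve_cgroup_automount(enum cgroup_automount requested,
			 bool keeps_cap_sys_admin)
{
	if (requested != CGROUP_UNSPECIFIED)
		return requested;
	return keeps_cap_sys_admin ? CGROUP_RW : CGROUP_MIXED;
}
```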
Signed-off-by: Christian Seiler <christian@iwakd.de>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Currently, setup_caps and dropcaps_except both use the same parsing
logic for parsing capabilities (try to identify by name, but allow
numerical specification). Since this is a common routine, separate it
out to improve maintainability and reusability.
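A minimal sketch of such a shared routine, assuming a name-first lookup with
a numeric fallback. parse_cap(), the table excerpt, and SKETCH_CAP_LAST are
hypothetical; the numeric values shown are those from <linux/capability.h>.

```c
#include <errno.h>
#include <stdlib.h>
#include <strings.h>

#define SKETCH_CAP_LAST 63 /* assumed upper bound for this sketch */

struct cap_entry {
	const char *name;
	int value;
};

/* Small excerpt only; a real table would list every capability. */
static const struct cap_entry cap_table[] = {
	{ "chown", 0 },
	{ "kill", 5 },
	{ "net_admin", 12 },
	{ "sys_admin", 21 },
};

/* Returns the capability number, or -1 if spec is neither a known
 * name nor a valid number. */
int parse_cap(const char *spec)
{
	size_t i;
	char *end;
	long n;

	/* First try to identify the capability by name. */
	for (i = 0; i < sizeof(cap_table) / sizeof(cap_table[0]); i++)
		if (strcasecmp(spec, cap_table[i].name) == 0)
			return cap_table[i].value;

	/* Fall back to a numerical specification. */
	errno = 0;
	n = strtol(spec, &end, 10);
	if (errno != 0 || end == spec || *end != '\0')
		return -1;
	if (n < 0 || n > SKETCH_CAP_LAST)
		return -1;
	return (int)n;
}
```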
Signed-off-by: Christian Seiler <christian@iwakd.de>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Check for symlinks before attempting create.
When attempting to create the compulsory symlinks in /dev,
check for the existence of the link using stat first before
blindly attempting to create the link.
This works around an apparent quirk in the kernel VFS on read-only
file systems where the returned error code might be EEXIST or EROFS
depending on previous access to the /dev directory and its entries.
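The check-then-create pattern looks roughly like the sketch below.
ensure_symlink() is a hypothetical name, and the sketch uses lstat() rather
than stat() so that an existing but dangling link also counts as present;
that choice is an assumption, not necessarily what the patch does.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create linkpath -> target unless something already exists at linkpath.
 * Returns 0 if the link exists afterwards (or already did), -1 on error.
 */
int ensure_symlink(const char *target, const char *linkpath)
{
	struct stat sb;

	/* Already there: do not call symlink() at all, so its error code
	 * on a read-only file system never matters. */
	if (lstat(linkpath, &sb) == 0)
		return 0;
	if (errno != ENOENT)
		return -1;

	return symlink(target, linkpath) == 0 ? 0 : -1;
}
```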
Reported-by: William Dauchy <william@gandi.net>
Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com>
Tested-by: William Dauchy <william@gandi.net>
If you 'ip netns add x1', this creates /run/netns and /run/netns/x1
as shared mounts. When a container starts, it umounts these after
pivot_root, and the umount is propagated to the host.
Worse, doing mount("", "/", NULL, MS_SLAVE|MS_REC, NULL) does not
suffice to change those, even after binding /proc/mounts onto
/etc/mtab.
So, I give up. Do this manually, walking over /proc/self/mountinfo
and changing the mount propagation on everything marked as shared.
With this patch, lxc-start no longer unmounts /run/netns/* on the
host.
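To make the mountinfo walk concrete, here is a sketch of the per-line test,
assuming the field layout documented for /proc/<pid>/mountinfo (six fixed
fields, then optional fields, then a "-" separator).
mountinfo_line_is_shared() is a hypothetical name, not the function added by
this patch.

```c
#include <string.h>

/*
 * Return 1 if a mountinfo line carries a "shared:N" optional field,
 * else 0. Only the fields before the "-" separator are considered, so
 * a mount source that happens to be called "shared:x" is not matched.
 */
int mountinfo_line_is_shared(const char *line)
{
	const char *p = line;
	int i;

	/* Skip the six fixed leading fields. */
	for (i = 0; i < 6; i++) {
		p = strchr(p, ' ');
		if (!p)
			return 0;
		p++;
	}

	/* Walk the optional fields until the "-" separator. */
	while (*p) {
		size_t len = strcspn(p, " \n");

		if (len == 1 && p[0] == '-')
			return 0;
		if (len > 7 && strncmp(p, "shared:", 7) == 0)
			return 1;
		p += len;
		if (*p)
			p++;
	}
	return 0;
}
```

A caller would then change the propagation of each matching mount point,
for example with mount(NULL, path, NULL, MS_SLAVE, NULL).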
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
This expands on c597baa8f9 and 2c6f3fc932.
Also move the block using detect_ramfs_rootfs() from setup_rootfs() to
lxc_setup()
Signed-off-by: Florian Klink <flokli@flokli.de>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This prevents things like bridges from being destroyed by the kernel.
My hope is that just doing this will be enough to also ensure that
the device will be available to be renamed immediately, so that
we don't need to do a retry loop.
Tested with a dummy device: renaming dummy0 to dummy5 in the container,
then shutting down the container, returns dummy0 to the host.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If the user maps container root to his host uid, chown_mapped_rootid
tries to make the same mapping twice and gets -EINVAL.
Reported-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Only do the funky chroot_into_slave if / is in fact the rootfs.
Rootfs is a special blacklisted case for pivot_root.
If / is not rootfs but is shared, just mount / rslave. We're
already in our own namespace.
This appears to solve the extra /proc/$$/mounts entries in
containers and the host directories in lxc-attach which have
been plaguing at least Fedora and Arch.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This change makes it possible to create unprivileged containers as root.
They will be stored in the usual system wide location, use the usual
system wide cache but will be running using a uid/gid map.
This also updates lxc_usernsexec to use the same function as the rest of
LXC, centralizing all the userns switch in a single function.
That function now detects the presence of newuidmap and newgidmap on the
system. If they are present, they will be used for containers created as
either user or root. If they're not and the user isn't root, an error is
shown. If they're not and the user is root, LXC will directly set the
uid_map and gid_map values.
All that should allow for a consistent experience as well as supporting
distributions that don't yet ship newuidmap/newgidmap.
To make things simpler in the future, a helper function "on_path" is
also introduced and used to detect the presence of newuidmap and
newgidmap.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
A container with "lxc.network.type=phys" halted with an error on reboot.
Error message:
*** glibc detected *** lxc-start: realloc(): invalid pointer: 0x0948eed0 ***
The sequence is:
1) conf->saved_nics = realloc(NULL) on container start, in
start.c:save_phys_nics()
2) free(conf->saved_nics) after the container stops, in
conf.c:lxc_rename_phys_nics_on_shutdown()
3) conf->saved_nics = realloc(conf->saved_nics) on container restart, in
start.c:save_phys_nics() -> realloc() error
The free(conf->saved_nics) in lxc_rename_phys_nics_on_shutdown() is
unnecessary; it will be called later in lxc_clear_saved_nics().
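The underlying invariant is that a pointer later handed to realloc() must be
either NULL or still valid. The sketch below uses hypothetical types and
function names (nic_store, save_nic, clear_saved_nics) to show a single
cleanup path that frees the array and resets the pointer, so a restart can
safely grow it again.

```c
#include <stdlib.h>

struct saved_nic {
	int ifindex;
};

struct nic_store {
	struct saved_nic *saved_nics;
	int num_savednics;
};

/* Append one entry; realloc(NULL, ...) behaves like malloc(). */
int save_nic(struct nic_store *s, int ifindex)
{
	struct saved_nic *tmp;

	tmp = realloc(s->saved_nics,
		      (size_t)(s->num_savednics + 1) * sizeof(*tmp));
	if (!tmp)
		return -1;
	s->saved_nics = tmp;
	s->saved_nics[s->num_savednics++].ifindex = ifindex;
	return 0;
}

/* The only place that frees the array; it also resets the pointer so
 * no stale value is ever passed to realloc() afterwards. */
void clear_saved_nics(struct nic_store *s)
{
	free(s->saved_nics);
	s->saved_nics = NULL;
	s->num_savednics = 0;
}
```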
Signed-off-by: Vitaly Lavrov <vel21ripn@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
(this expands on Dwight's recent patch, commit c597baa8f9)
After unshare(CLONE_NEWNS) and before doing any mounting, always
check whether rootfs is shared. Otherwise template runs or clone
scripts can bleed mount activity to the host.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If we are unprivileged and have asked for a veth device, then create
a pipe over which to pass the veth names.
Network-related todos:
1. set mtu on the container side of veth device
2. set mtu in lxc-user-nic. Note that this probably requires an
update to the /etc/lxc/lxc-usernet file :(
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
After commit 4e4ca16158 we are checking for 'optional' in mntopts
after we forcibly remove it.
Cache whether we had it before removing it.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Otherwise mount may return -EINVAL if the in-kernel superblock parser
objects (as is the case with ext4).
Changelog v2:
also drop 'optional'
specifically drop create=dir, not create=*
fix order of arguments for memmove
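For illustration, stripping an exact option such as "optional" or
"create=dir" from a comma-separated option string could look like the sketch
below. drop_mount_option() is a hypothetical name; note the memmove()
argument order (destination first, then source).

```c
#include <string.h>

/*
 * Remove every exact occurrence of opt from the comma-separated list in
 * opts, in place. "create=dir" does not match "create=file".
 */
void drop_mount_option(char *opts, const char *opt)
{
	size_t optlen = strlen(opt);
	size_t n;
	char *p = opts;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (len == optlen && strncmp(p, opt, optlen) == 0) {
			char *next = p + len;

			/* Also swallow the comma that follows, if any. */
			if (*next == ',')
				next++;
			/* Overlapping copy; +1 moves the terminating NUL. */
			memmove(p, next, strlen(next) + 1);
			continue; /* re-examine the text now at p */
		}
		p += len;
		if (*p == ',')
			p++;
	}

	/* Removing the last option can leave a trailing comma behind. */
	n = strlen(opts);
	if (n > 0 && opts[n - 1] == ',')
		opts[n - 1] = '\0';
}
```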
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
lxc-user-nic now returns the names of the interfaces and
unpriv_assign_nic function parses that information to fill
missing netdev->veth_attr.pair and netdev->name.
With this patch, get_running_config_item provides correct
information:
>>> import lxc; c = lxc.Container("rubik"); c.get_running_config_item("lxc.network.0.name"); c.get_running_config_item("lxc.network.0.veth.pair");
'eth0'
'veth9MT2L4'
>>>
and lxc-info shows network stats:
lxc-info -n rubik
Name: rubik
State: RUNNING
PID: 23061
IP: 10.0.3.233
CPU use: 3.86 seconds
BlkIO use: 88.00 KiB
Memory use: 6.53 MiB
KMem use: 0 bytes
Link: veth9MT2L4
TX bytes: 3.45 KiB
RX bytes: 8.83 KiB
Total bytes: 12.29 KiB
Signed-off-by: S.Çağlar Onur <caglar@10ur.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
The kernel's Documentation/devices.txt says that these symlinks should
exist in /dev (they are listed in the "Compulsory" section). I'm not
currently adding nfsd and X0R since they are only required for iBCS, but
they can be easily added to the array later if need be.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Michael H. Warfield <mhw@WittsEnd.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
We used to do chdir(path), chroot(path). That's correct but not properly
handled by Coverity, so do chroot(path), chdir("/") instead, as that's the
recommended way.
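The recommended ordering, as a minimal sketch (enter_chroot() is a
hypothetical wrapper name):

```c
#include <unistd.h>

/*
 * Change the root directory first, then move the working directory
 * inside the new root. Returns 0 on success, -1 on failure.
 */
int enter_chroot(const char *path)
{
	if (chroot(path) < 0)
		return -1;
	if (chdir("/") < 0)
		return -1;
	return 0;
}
```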
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This is pretty much copy/paste from overlayfs.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
The previous check for access to rootfs->path failed in the case of
overlayfs or loop backing stores. Instead just check early on for
access to lxcpath.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Also make sure to chown the new rootfs path to the container owner.
This is how we make sure that the container root is allowed to write
under delta0.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
With this patch, if an unprivileged user has a $HOME with mode 700 or
750 and does
lxc-start -n c1
he'll see an error like:
lxc_container: Permission denied - could not access /home/serge. Please grant it 'x' access, or add an ACL for the container root.
(This addresses bug pad.lv/1277466)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
- refactor cgroup into two backends, the classic cgfs driver and the new
cgmanager. Instead of lxc_handler knowing about the internals of each,
have it just store an opaque pointer to a struct that is private to
each backend.
- rename a couple of cgroup functions for consistency: those that are
considered an API (i.e. exported by lxc.h) begin with lxc_ and those that
are not are just cgroup_*
- made as many backend routines static as possible, only cg*_ops_init is
exported
- made a nrtasks op which is needed by the utmp code for monitoring
container shutdown, currently only implemented for the cgfs backend
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
lxc.id_map bug when writing directly to /proc/pid/[ug]id_map
There's some code in src/lxc/conf.c that sets up the UID/GID mapping. It
can use the external newuidmap/newgidmap tools, or it can write to
/proc/pid/[ug]id_map directly. The latter case is broken: lines are written
without a newline (\n) at the end. This patch fixes that. Note that
I did not check if the newuidmap/newgidmap case still works. It should,
but I wasn't able to test it.
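A sketch of formatting one mapping line with the trailing newline included.
format_id_map_line() is a hypothetical helper, not the code in
src/lxc/conf.c; it only shows the shape of a line as written to
/proc/pid/[ug]id_map.

```c
#include <stdio.h>

/*
 * Write "nsid hostid range\n" into buf. Returns the number of bytes
 * written (excluding the NUL), or -1 if buf is too small.
 */
int format_id_map_line(char *buf, size_t len, unsigned long nsid,
		       unsigned long hostid, unsigned long range)
{
	int n = snprintf(buf, len, "%lu %lu %lu\n", nsid, hostid, range);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```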
Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
That way templates can fix group ownership alongside uid ownership.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This introduces a new lxc.rootfs.options which lets you pass new
mountflags/mountdata when mounting the root filesystem.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
If it (or any variation thereof) is in the container configuration,
then mount /sys/fs/cgroup/cgmanager.lower (if it exists) or
/sys/fs/cgroup/cgmanager into the container so it can run a
cgproxy.
Also make sure to clear our groups when we start or attach to a
container. Else with unprivileged containers we end up with
lots of nogroups listed in /proc/1/status.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This fixes various compile errors when building with musl libc. For
example:
In file included from start.c:66:0:
monitor.h:38:12: error: 'NAME_MAX' undeclared here (not in a function)
char name[NAME_MAX+1];
^
start.c: In function 'setup_signal_fd':
start.c:202:2: error: implicit declaration of function 'sigfillset' [-Werror=implicit-function-declaration]
if (sigfillset(&mask) ||
^
...
In file included from freezer.c:36:0:
monitor.h:39:12: error: 'NAME_MAX' undeclared here (not in a function)
char name[NAME_MAX+1];
^
...
In file included from cgroup.c:45:0:
conf.h:87:13: error: 'IFNAMSIZ' undeclared here (not in a function)
char veth1[IFNAMSIZ]; /* needed for deconf */
^
cgroup.c: In function 'find_cgroup_subsystems':
cgroup.c:230:3: error: implicit declaration of function 'strdup' [-Werror=implicit-function-declaration]
(*kernel_subsystems)[kernel_subsystems_count] = strdup(line);
^
...
In file included from conf.c:65:0:
conf.h:87:13: error: 'IFNAMSIZ' undeclared here (not in a function)
char veth1[IFNAMSIZ]; /* needed for deconf */
^
In file included from conf.c:66:0:
conf.c: In function 'run_buffer':
log.h:263:9: error: implicit declaration of function 'strsignal' [-Werror=implicit-function-declaration]
struct lxc_log_locinfo locinfo = LXC_LOG_LOCINFO_INIT; \
^
...
af_unix.c: In function 'lxc_abstract_unix_send_credential':
af_unix.c:208:9: error: variable 'cred' has initializer but incomplete type
struct ucred cred = {
^
af_unix.c:209:3: error: unknown field 'pid' specified in initializer
.pid = getpid(),
^
af_unix.c:209:3: error: excess elements in struct initializer [-Werror]
af_unix.c:209:3: error: (near initialization for 'cred') [-Werror]
af_unix.c:210:3: error: unknown field 'uid' specified in initializer
.uid = getuid(),
^
af_unix.c:210:3: error: excess elements in struct initializer [-Werror]
af_unix.c:210:3: error: (near initialization for 'cred') [-Werror]
af_unix.c:211:3: error: unknown field 'gid' specified in initializer
.gid = getgid(),
^
and more...
Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc_map_ids can call system(3), which returns > 0 when the spawned
process fails. No path should return > 0 when it meant success, so
check that the lxc_map_ids() return value is != 0 rather than
just < 0.
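The difference between the two checks can be seen with a small sketch.
run_helper() and helper_failed() are hypothetical names standing in for a
function that shells out and for the caller-side test.

```c
#include <stdlib.h>

/*
 * Stand-in for a function that may shell out. Like system(3), it
 * returns -1 if no child could be created and otherwise the wait
 * status, which is positive when the command exits non-zero.
 */
int run_helper(const char *cmd)
{
	return system(cmd);
}

/* Caller-side check: anything other than 0 is a failure. A "< 0"
 * test would miss the positive wait status of a failed command. */
int helper_failed(int ret)
{
	return ret != 0;
}
```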
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
The geteuid() addition is being made the first element of the lxc_list,
but the first element is just a head whose entry is ignored. Therefore
userns_exec_1() was starting its tasks without the caller's uid mapped
into the namespace.
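The pitfall is easy to reproduce with any circular list that uses a sentinel
head. The sketch below is a hypothetical miniature (struct node, list_init,
list_add_tail, list_count), not the lxc_list implementation itself.

```c
#include <stddef.h>

struct node {
	void *elem;
	struct node *next, *prev;
};

void list_init(struct node *head)
{
	head->elem = NULL;
	head->next = head->prev = head;
}

void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Iteration starts at head->next and stops on reaching head again,
 * so whatever is stored in head->elem is never visited. */
int list_count(const struct node *head)
{
	const struct node *it;
	int n = 0;

	for (it = head->next; it != head; it = it->next)
		n++;
	return n;
}
```

Storing an entry in the head itself therefore has no effect; it has to be
linked in as a separate node.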
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>