This allows us to use a bunch of macros in our static build for init.lxc.static
without having to link against all of utils.{c,h}.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Specifier %lli was insufficient for the type uint64_t, all values
between 2^63-1 and 2^64-1 were silently converted to 2^63-1.
We can't use %llu since it doesn't handle hexadecimal. Instead, we
parse the values as strings and then use strtoull(3).
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
Sigh, this is going to be fun. Essentially, dynamic memory allocation through
malloc() and friends is unsafe when fork()ing in threads. The locking state
that glibc maintains internally might get messed up when the process that
fork()ed calls malloc or calls functions that malloc() internally. Functions
that internally malloc() include fopen(). One solution here is to use open() +
mmap() instead of fopen() + getline().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Sometimes we want to know whether we are privileged wrt our
namespaces, and sometimes we want to know whether we are priv
wrt init_user_ns.
Signed-off-by: Serge Hallyn <shallyn@cisco.com>
In particular, if we are already in a user namespace we are unprivileged,
and doing things like moving the physical nics back to the host netns won't
work. Let's do the same thing LXD does if euid == 0: inspect
/proc/self/uid_map and see what that says.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
This is to avoid bad surprises caused by older glibc's pid cache (up to 2.25)
when using clone().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This was made necessary by changes to the overlay driver.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Move duplicated implementatin of sethostname from conf.c and
lxc_unshare.c to utils.h
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
lxc_string_split_quoted() splits a string on spaces, but keeps
groups in single or double qoutes together. In other words,
generally what we'd want for argv behavior.
Switch lxc-execute to use this for lxc.execute.cmd.
Switch lxc-oci template to put the lxc.execute.cmd inside single
quotes, because parse_line() will eat those. If we don't do that,
then if we have lxc.execute.cmd = /bin/echo "hello, world", then the
last double quote will disappear.
Signed-off-by: Serge Hallyn <shallyn@cisco.com>
lxc_unstack_mountpoint() tries to clear all mountpoints from a given path.
It return the number of successful umounts on success and -errno on error.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Use the loop device helpers I wrote for LXD in LXC as well. They should be more
efficient.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The new{g,u}idmap binaries where a source of trouble for users when they lacked
sufficient privileges. This commit adds code to check for sufficient privilege.
It checks whether new{g,u}idmap is root owned and has the setuid bit set and if
it doesn't it checks whether new{g,u}idmap is root owned and has CAP_SETUID in
its CAP_PERMITTED and CAP_EFFECTIVE set.
Closes#296.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This commit adds lxc_switch_uid_gid() which allows to switch the uid and gid of
a process via setuid() and setgid() and lxc_setgroups() which allows to set
groups via setgroups(). The main advantage is that they nicely log the switches
they perform.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This macro can be used to set or allocate a string buffer that can hold any
64bit representable number.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This function safely parses an unsigned integer. On success it returns 0 and
stores the unsigned integer in @converted. On error it returns a negative
errno.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
lxc_append_string() appends strings without separator. This is mostly useful
for reading in whole files line-by-line.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
This is required by systemd to cleanly shutdown. Other init systems should not
have SIGRTMIN+3 in the blocked signals set.
Signed-off-by: Christian Brauner <cbrauner@suse.de>
In order to do this we make use of the MAP_FIXED flag of mmap(). MAP_FIXED
should be safe to use when it replaces an already existing mapping. To this
end, we establish an anonymous mapping that is one byte larger than the
underlying file. The pages handed to us are zero filled. Now we establish a
fixed-address mapping starting at the address we received from our anonymous
mapping and replace all bytes excluding the additional \0-byte with the file.
This allows us to use normal string-handling function. The idea implemented
here is similar to how shared libraries are mapped.
Signed-off-by: Christian Brauner <christian.brauner@mailbox.org>
This makes simplifying assumptions: all usable cgroups must be
mounted under /sys/fs/cgroup/controller or /sys/fs/cgroup/contr1,contr2.
Currently this will only work with cgroup namespaces, because
lxc.mount.auto = cgroup is not implemented. So cgfsng_ops_init()
returns NULL if cgroup namespaces are not enabled.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
We'll probably want to make this configurable with a
lxc.cgroupns = [1|0], but for now just always do it.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
Changelog 20160104: only try to unshare if /proc/self/ns/cgroup exists.
When a container starts up, lxc sets up the container's inital fstree
by doing a bunch of mounting, guided by the container configuration
file. The container config is owned by the admin or user on the host,
so we do not try to guard against bad entries. However, since the
mount target is in the container, it's possible that the container admin
could divert the mount with symbolic links. This could bypass proper
container startup (i.e. confinement of a root-owned container by the
restrictive apparmor policy, by diverting the required write to
/proc/self/attr/current), or bypass the (path-based) apparmor policy
by diverting, say, /proc to /mnt in the container.
To prevent this,
1. do not allow mounts to paths containing symbolic links
2. do not allow bind mounts from relative paths containing symbolic
links.
Details:
Define safe_mount which ensures that the container has not inserted any
symbolic links into any mount targets for mounts to be done during
container setup.
The host's mount path may contain symbolic links. As it is under the
control of the administrator, that's ok. So safe_mount begins the check
for symbolic links after the rootfs->mount, by opening that directory.
It opens each directory along the path using openat() relative to the
parent directory using O_NOFOLLOW. When the target is reached, it
mounts onto /proc/self/fd/<targetfd>.
Use safe_mount() in mount_entry(), when mounting container proc,
and when needed. In particular, safe_mount() need not be used in
any case where:
1. the mount is done in the container's namespace
2. the mount is for the container's rootfs
3. the mount is relative to a tmpfs or proc/sysfs which we have
just safe_mount()ed ourselves
Since we were using proc/net as a temporary placeholder for /proc/sys/net
during container startup, and proc/net is a symbolic link, use proc/tty
instead.
Update the lxc.container.conf manpage with details about the new
restrictions.
Finally, add a testcase to test some symbolic link possibilities.
Reported-by: Roman Fiedler
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
In various places throughout the code, we want to "nullify" the std fds,
opening them to /dev/null or zero or so. Instead, let's unify this code and do
it in such a way that Coverity (probably) won't complain.
v2: use /dev/null for stdin as well
v3: add a comment about use of C's short circuiting
v4: axe comment, check errors on dup2, s/quiet/need_null_stdfds
Reported-by: Coverity
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
To set lsm labels, a namespace-local proc mount is needed.
If a container does not have a lxc.mount.auto = proc set, then
tasks in the container do not have a correct /proc mount until
init feels like doing the mount. At startup we handlie this
by mounting a temporary /proc if needed. We weren't doing this
at attach, though, so that
lxc-start -n $container
lxc-wait -t 5 -s RUNNING -n $container
lxc-attach -n $container -- uname -a
could in a racy way fail with something like
lxc-attach: lsm/apparmor.c: apparmor_process_label_set: 183 No such file or directory - failed to change apparmor profile to lxc-container-default
Thanks to Chris Townsend for finding this bug at
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1452451
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Doing this requires some btrfs functions from bdev to be used in
utils.c Because utils.h is imported by lxc_init.c, I had to create
a new initutils.[ch] which are used by both lxc_init.c and utils.c
We could instead put the btrfs functions into utils.c, which would
be a shorter patch, but it really doesn't belong there. So I went
the other way figuring there may be more such cases coming up of
fns in utils.c needing code from bdev.c which can't go into lxc_init.
Currently, if we detect a btrfs subvolume we just remove it. The
st_dev on that dir is different, so we cannot detect if this is
bound in from another fs easily. If we care, we should check
whether this is a mountpoint, this patch doesn't do that.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Instead of having a parent process that's called whatever the caller of the
library is called, we instead set it to "[lxc monitor] <lxcpath> <container>"
Closes#180
v2: check for null in tok for loop, only truncate environment when necessary
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
If you have 'lxc.include = /some/dir' and /some/dir is a directory, then any
'*.conf" files under /some/dir will be read.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Function of enter_to_ns() is useful but currently is static for
lxccontainer.c.
This patch split it into two parts named as switch_to_newuser()
and switch_to_newnet() into utils.c.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This patch adds support for checkpointing and restoring containers via CRIU.
It adds two api calls, ->checkpoint and ->restore, which are wrappers around
the CRIU CLI. CRIU has an RPC API, but reasons for preferring exec() are
discussed in [1].
To checkpoint, users specify a directory to dump the container metadata (CRIU
dump files, plus some additional information about veth pairs and which
bridges they are attached to) into this directory. On restore, this
information is read out of the directory, a CRIU command line is constructed,
and CRIU is exec()d. CRIU uses the lxc-restore-net callback (which in turn
inspects the image directory with the NIC data) to properly restore the
network.
This will only work with the current git master of CRIU; anything as of
a152c843 should work. There is a known bug where containers which have been
restored cannot be checkpointed [2].
[1]: http://lists.openvz.org/pipermail/criu/2014-July/015117.html
[2]: http://lists.openvz.org/pipermail/criu/2014-August/015876.html
v2: fixed some problems with the s/int/bool return code form api function
v3: added a testcase, fixed up the man page synopsis
v4: fix a small typo in lxc-test-checkpoint-restore
v5: remove a reference to the old CRIU_PATH, and a bad error about the same
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Originally we kept snapshots under /var/lib/lxcsnaps. If a
separate btrfs is mounted at /var/lib/lxc, then we can't
make btrfs snapshots under /var/lib/lxcsnaps.
This patch moves the default directory to /var/lib/lxc/c/snaps.
If /var/lib/lxcsnaps already exists, then we continue to use that.
add c->destroy_with_snapshots() and c->snapshot_destroy_all()
API methods. c->snashot_destroy_all() can be triggered from
lxc-snapshot using '-d ALL'. There is no command to call
c->destroy_with_snapshots(c) as of yet.
lxclock: use ".$lxcname" for container lock files
that way we can use /run/lock/lxc/$lxcpath/$lxcname/snaps as a
directory when locking snapshots without having to worry about
/run/lock//lxc/$lxcpath/$lxcname being a file.
destroy: split off a container_destroy
container_destroy() doesn't check for snapshots, so snapshot_rename can
use it. api_destroy() now does check for snapshots (previously it only
checked for fs - i.e. overlayfs/aufs - snapshots).
Add destroy to the manpage, as it was previously undocumented.
Update snapshot testcase accordingly.
[ rebased in the face of commits 840f05df and 7e36f87e. ]
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: S.Çağlar Onur <caglar@10ur.org>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Move choose_init into utils.c so we can re-use it. Make it and on_path
accept an optional rootfs argument to prepend to the paths when checking
whether the file exists.
Also add lxc.init.static to .gitignore
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
backing stores supported by qemu-nbd can be attached to a nbd block
device using qemu-nbd. This user-space process (pair) stays around for
the duration of the device attachment. Obviously we want it to go away
when the container shuts down, but not before the filesystems have been
cleanly unmounted.
The device attachment is done from the task which will become the
container monitor before the container setup+init task is spawned.
That task starts in a new pid namespace to ensure that the qemu-nbd
process will be killed if need be. It sets its parent death signal
to sighup, and, on receiving sighup, attempts to do a clean
qemu-device detach, then exits. This should ensure that the
device is detached if the qemu monitor crashes or exits.
It may be worth adding a delay before the qemu-nbd is detached, but
my brief tests haven't seen any data corruption.
Only the parts required for running a nbd-backed container are
implemented here. Create, destroy, and clone are not. The first
use of this that I imagine is for people to use downloaded nbd-backed
images (like ubuntu cloud images, or anything previously used with
qemu). I imagine people will want to create/clone/destroy out of
band using qemu-img, but if I'm wrong about that we can implement
the rest later.
Because attach_block_device() is done before the bdev is initialized,
and bdev_init needs to know the nbd index so that it can mount the
filesystem, we now need to pass the lxc_conf.
file_exists() is moved to utils.c so we can use it from bdev.c
The nbd attach/detach should lay the groundwork for trivial implementation
of qed and raw images.
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Only do the funky chroot_into_slave if / is in fact the rootfs.
Rootfs is a special blacklisted case for pivot_root.
If / is not rootfs but is shared, just mount / rslave. We're
already in our own namespace.
This appears to solve the extra /proc/$$/mount entries in
containers and the host directories in lxc-attach which have
been plagueing at least fedora and arch.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This change makes it possible to create unprivileged containers as root.
They will be stored in the usual system wide location, use the usual
system wide cache but will be running using a uid/gid map.
This also updates lxc_usernsexec to use the same function as the rest of
LXC, centralizing all the userns switch in a single function.
That function now detects the presence of newuidmap and newgidmap on the
system, if they are present, they will be used for containers created as
either user or root. If they're not and the user isn't root, an error is
shown. If they're not and the user is root, LXC will directly set the
uid_map and gid_map values.
All that should allow for a consistent experience as well as supporting
distributions that don't yet ship newuidmap/newgidmap.
To make things simpler in the future, an helper function "on_path" is
also introduced and used to detect the presence of newuidmap and
newgidmap.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
(this expands on Dwight's recent patch, commit c597baa8f9)
After unshare(CLONE_NEWNS) and before doing any mounting, always
check whether rootfs is shared. Otherwise template runs or clone
scripts can bleed mount activity to the host.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Since we're no longer always returning a getenv result or some defined
string, the callers should cleanup the variable after use.
As a result, change from const char* to char*, add the needed free()
everywhere and use strdup() on strings coming from getenv.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>