backing stores supported by qemu-nbd can be attached to a nbd block
device using qemu-nbd. This user-space process (pair) stays around for
the duration of the device attachment. Obviously we want it to go away
when the container shuts down, but not before the filesystems have been
cleanly unmounted.
The device attachment is done from the task which will become the
container monitor before the container setup+init task is spawned.
That task starts in a new pid namespace to ensure that the qemu-nbd
process will be killed if need be. It sets its parent death signal
to sighup, and, on receiving sighup, attempts to do a clean
qemu-device detach, then exits. This should ensure that the
device is detached if the qemu monitor crashes or exits.
It may be worth adding a delay before the qemu-nbd is detached, but
my brief tests haven't seen any data corruption.
Only the parts required for running a nbd-backed container are
implemented here. Create, destroy, and clone are not. The first
use of this that I imagine is for people to use downloaded nbd-backed
images (like ubuntu cloud images, or anything previously used with
qemu). I imagine people will want to create/clone/destroy out of
band using qemu-img, but if I'm wrong about that we can implement
the rest later.
Because attach_block_device() is done before the bdev is initialized,
and bdev_init needs to know the nbd index so that it can mount the
filesystem, we now need to pass the lxc_conf.
file_exists() is moved to utils.c so we can use it from bdev.c
The nbd attach/detach should lay the groundwork for trivial implementation
of qed and raw images.
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
This is a fix to commit 5f2ea8cfcb.
Sorry, not sure how I missed this in testing the original patch.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
when using btrfs backend lxc-create first creates rootfs in /usr/lib/lxc/rootfs
directory before moving it to /var/lib/lxc or other directory supplied by the
command line. Archlinux template relied in $rootfs_path which made containers
created with btrfs backend have lxc.rootfs set to /usr/lib/lxc/rootfs. By using
$path instead of $rootfs_path we make sure that lxc.rootfs is always correct.
Signed-off-by: Edvinas Klovas <edvinas@pnd.io>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Don't spawn a getty on /dev/console when running under libvirt-lxc
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
On older cgmanager the support was broken. So rather than
fail container starts altogether, just keep the old lxc behavior
in this case by not using name= subsystems.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
archlinux is using systemd and systemd's configuration does not have any
services setup to handle sigpwr hook which is sent by lxc-stop command. By
enabling sigpwr service we make sure that lxc-stop will work.
Signed-off-by: Edvinas Klovas <edvinas@pnd.io>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If an unprivileged user does 'lxc-start -n u1' in one
login session, followed by 'lxc-attach -n u1' in another
session, the attach will fail if the sessions are in different
cgroups. The same is true of lxc-cgroup commands.
Address this by using the GetPidCgroupAbs and MovePidAbs
which work with the containers' cgroup path relative to
the cgproxy.
Since GetPidCgroupAbs is new to api version 3 in cgmanager,
use the old method if we are on an older cgmanager.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: "S.Çağlar Onur" <caglar@10ur.org>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Read /proc/self/cgroup instead of /proc/cgroups, so as to catch
named subsystems. Otherwise the contaienrs will not be fully
moved into the container cgroups.
Also free line which was being leaked.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Do this by calling the bdev->destroy() hook from a user namespace
configured as the container's.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
btrfs subvolume ioctls are usable by unprivileged users, so allow
unprivileged containers to reside on btrfs.
This patch does not yet enable destroy.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If the user specifies cgroup or cgroup-full without a specifier (:ro,
:rw or :mixed), this changes the behavior. Previously, these were
simple aliases for the :mixed variants; now they depend on whether the
container also has CAP_SYS_ADMIN; if it does they resolve to the :rw
variants, if it doesn't to the :mixed variants (as before).
If a container has CAP_SYS_ADMIN privileges, any filesystem can be
remounted read-write from within, so initially mounting the cgroup
filesystems partially read-only as a default creates a false sense of
security. It is better to default to full read-write mounts to show the
administrator what keeping CAP_SYS_ADMIN entails.
If an administrator really wants both CAP_SYS_ADMIN and the :mixed
variant of cgroup or cgroup-full automatic mounts, they can still
specify that explicitly; this commit just changes the default without
specifier.
Signed-off-by: Christian Seiler <christian@iwakd.de>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Currently, setup_caps and dropcaps_except both use the same parsing
logic for parsing capabilities (try to identify by name, but allow
numerical specification). Since this is a common routine, separate it
out to improve maintainability and reuseability.
Signed-off-by: Christian Seiler <christian@iwakd.de>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Ubuntu containers have had trouble with automatic cgroup mounting that
was not read-write (i.e. lxc.mount.auto = cgroup{,-full}:{ro,mixed}) in
containers without CAP_SYS_ADMIN. Ubuntu's mountall program reads
/lib/init/fstab, which contains an entry for /sys/fs/cgroup. Since
there is no ro option specified for that filesystem, mountall will try
to remount it readwrite if it is already mounted. Without
CAP_SYS_ADMIN, that fails and mountall will interrupt boot and wait for
user input on whether to proceed anyway or to manually fix it,
effectively hanging container bootup.
This patch makes sure that /sys/fs/cgroup is always a readwrite tmpfs,
but that the actual cgroup hierarchy paths (/sys/fs/cgroup/$subsystem)
are readonly if :ro or :mixed is used. This still has the desired
effect within the container (no cgroup escalation possible and programs
get errors if they try to do so anyway), while keeping Ubuntu
containers happy.
Signed-off-by: Christian Seiler <christian@iwakd.de>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Set a base class for the network object and set the encoding in the
header. Neither of those changes are required for python3 but they do
make it easier for anyone trying to make a python2 binding.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Corrected a small oversight when locking related code was moved from
src/lxc/utils.c to src/lxc/lxclock.c.
Signed-off-by: Stephen M Bennett <stephen_m_bennett@hotmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
When using --nesting, we exec ourselves in the container context, if we
somehow need to dynamically-load modules from there, things break. So
make sure we pre-load everything we may need.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This makes sure we only query lxc.group once and then reuse that list
for filtering, listing groups and autostart.
When a container is auto-started only as part of a group, autostart will
now show by-group instead of yes.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
/sys/fs/cgroup is just a size-limited tmpfs, and making it ro does
nothing to affect our ability alter mount settings of its subdirs.
OTOH making it ro can upset mountall in the container which tries
to remount it rw, which may be refused.
So just don't do it.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Christian Seiler <christian@iwakd.de>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
There wasn't a good reason for that limit, we can simply make the code
slightly slower when --groups is passed and still have the expected
output even without --fancy.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
This introduces a new -g/--group argument to filter containers based on
their groups.
This supports the rather obvious: --group blah
Which will only list containers that are in group blah.
It may also be passed multiple times: --group blah --group bleh
Which will list containers that are in either (or both) blah or bleh.
And it also takes: --group blah,bleh --group doh
Which will list containers that are either in BOTH blah and bleh or in doh.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Michael H. Warfield <mhw@WittsEnd.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc-init got moved into SBINDIR/init.lxc recently.
This broke sshd template because path wasn't updated there.
This patch should fix this issue.
Signed-off-by: Nikolay Martynov <mar.kolya@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
- Some scriptlets expect fstab to exist so create it before doing the
yum install
- Set the rootfs selinux label same as the hosts or else the PREIN script
from initscripts will fail when running groupadd utmp, which prevents
creation of OL4.x containers on hosts > OL6.x.
- Move creation of devices into a separate function
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
/proc/sys/kernel/sem* and /proc/sys/kernel/msg* are ipc sysctls
which are properly namespaced. Allow writes to them from
containers.
Reported-by: Dan Kegel <dank@kegel.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
According to Serge, we no longer need to keep cgmanager connection open.
As long as my tests go it seems to be working fine.
Signed-off-by: S.Çağlar Onur <caglar@10ur.org>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
When outputing the lxc.arch setting, use i686 instead of x86 since the
later is not a valid input to setarch, nor will the kernel output
UTS_MACHINE as x86. The kernel sets utsname.machine to i[3456]86, which
all map to PER_LINUX32.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
This change accepts all the same strings for lxc.arch that setarch(8) does.
Note that we continue to parse plain x86 as PER_LINUX32 so as not to break
existing lxc configuration files.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Failures were being ignored, leading up to an eventual segfault.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This only converts punctuation marks from FULLWIDTH COMMA/FULL STOP to
IDEOGRAPHIC COMMA/FULL STOP in Japanese man pages. The contents of man
pages do not change at all.
Signed-off-by: KATOH Yasufumi <karma@jazz.email.ne.jp>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
I inadvertently introduced this with commit 8bf1e61e.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Check for symlinks before attempting create.
When attempting to create the compulsory symlinks in /dev,
check for the existence of the link using stat first before
blindly attempting to create the link.
This works around an apparent quirk in the kernel VFS on read-only
file systems where the returned error code might be EEXIST or EROFS
depending on previous access to the /dev directory and its entries.
Reported-by: William Dauchy <william@gandi.net>
Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com>
Tested-by: William Dauchy <william@gandi.net>
Originally we kept snapshots under /var/lib/lxcsnaps. If a
separate btrfs is mounted at /var/lib/lxc, then we can't
make btrfs snapshots under /var/lib/lxcsnaps.
This patch moves the default directory to /var/lib/lxc/lxcsnaps.
If /var/lib/lxcsnaps already exists, then use that. Don't allow
any container to be used with the name 'lxcsnaps'.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
If you 'ip netns add x1', this creates /run/netns and /run/netns/x1
as shared mounts. When a container starts, it umounts these after
pivot_root, and the umount is propagated to the host.
Worse, doing mount("", "/", NULL, MS_SLAVE|MS_REC, NULL) does not
suffice to change those, even after binding /proc/mounts onto
/etc/mtab.
So, I give up. Do this manually, walking over /proc/self/mountinfo
and changing the mount propagation on everything marked as shared.
With this patch, lxc-start no longer unmounts /run/netns/* on the
host.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
In the body of the manpage, replace a few errant 'fssize's with the
more appropriate word.
Reported-by: MegaBrutal <megabrutal@megabrutal.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>