After looking through some logs, it is a little cleaner to do it as
below, instead of what I originally posted.
Tycho
In order for LXC to be the parent of the restored process, CRIU needs to
restore init as its sibling, not as its child. This was previously accomplished
essentially via luck :). CRIU now has a --restore-sibling option which forces
this behavior that LXC expects. See more discussion in this thread:
http://lists.openvz.org/pipermail/criu/2014-September/thread.html#16330
v2: don't pass --restore-sibling to dump. This is mostly cosmetic, but will
look less confusing in the logs if people ever look at them.
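A rough sketch of the v2 change, for illustration only: the helper name
build_criu_argv and the use of --images-dir are assumptions made for this
example, not LXC's actual code. The point is just that --restore-sibling
is appended for "restore" and left out for "dump".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not LXC's code): fill argv for "criu <op>".
 * Returns the number of arguments written, not counting the NULL
 * terminator, or -1 if argv is too small. */
static int build_criu_argv(const char *op, const char *images_dir,
			   const char **argv, size_t argv_len)
{
	size_t n = 0;
	int is_restore;

	assert(op && images_dir && argv);
	is_restore = strcmp(op, "restore") == 0;

	/* criu, op, --images-dir, dir, [--restore-sibling], NULL */
	if (argv_len < (size_t)(is_restore ? 6 : 5))
		return -1;

	argv[n++] = "criu";
	argv[n++] = op;
	argv[n++] = "--images-dir";
	argv[n++] = images_dir;
	if (is_restore)
		argv[n++] = "--restore-sibling"; /* only meaningful on restore */
	argv[n] = NULL;
	return (int)n;
}
```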
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
We can also narrow the scope of this, since we only need it in the process that
is actually going to use it.
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
If we just return here, we end up with two processes executing the caller's
code, which is not good.
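A minimal sketch of the pattern, not the patched function itself: the
forked child must leave through _exit() so that only the parent returns
to the caller. run_in_child and the two callbacks are names made up for
this example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Example callbacks standing in for the work done in the child. */
static int child_ok(void *arg)   { (void)arg; return 0; }
static int child_fail(void *arg) { (void)arg; return -1; }

/* Run fn(arg) in a forked child. Returns the child's exit status
 * (0 or 1) to the parent, or -1 on error. */
static int run_in_child(int (*fn)(void *), void *arg)
{
	int status;
	pid_t pid;

	assert(fn);
	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: never "return" here, or two processes would
		 * continue executing the caller's code. */
		_exit(fn(arg) == 0 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```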
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This option is required when migrating containers across hosts; it is used to
restore inotify via file paths instead of file handles, which aren't preserved
across hosts.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
And add a testcase.
The code to update hwaddrs in a clone was walking through the container
configuration and re-printing all network entries. However network
entries from an include file which should not be printed out were being
added to the unexpanded config. With this patch, at clone we simply
update the hwaddr in-place in the unexpanded configuration file, making
sure to make the same update to the expanded network configuration.
The code to update lxc.hook statements had the same problem.
We also update it in-place in the unexpanded configuration, though
we mirror the logic we use when updating the expanded configuration.
(Perhaps that should be changed, to simplify future updates)
This code isn't particularly easy to review, so testcases are added
to make sure that (1) extra lxc.network entries are not added (or
removed), even if they are present in an included file, (2) lxc.hook
entries are not added, (3) hwaddr entries are updated, and (4)
the lxc.hook entries are properly updated (only when they should be).
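To illustrate the in-place idea (a sketch under assumptions, not the code
in this patch): because the old and new hwaddr have the same length, the
value can be overwritten inside the unexpanded text without re-printing
any network entry. update_hwaddr_inplace is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Overwrite old_mac with new_mac on every lxc.network.hwaddr line of
 * the NUL-terminated config text. Both addresses must have the same
 * length. Returns the number of lines updated, or -1 on bad input. */
static int update_hwaddr_inplace(char *conf, const char *old_mac,
				 const char *new_mac)
{
	static const char key[] = "lxc.network.hwaddr";
	size_t len;
	int updated = 0;
	char *line = conf;

	assert(conf && old_mac && new_mac);
	len = strlen(old_mac);
	if (len == 0 || strlen(new_mac) != len)
		return -1;

	while (line && *line) {
		char *eol = strchr(line, '\n');

		if (strncmp(line, key, sizeof(key) - 1) == 0) {
			char *hit = strstr(line, old_mac);

			/* Only touch a match that lies on this line. */
			if (hit && (!eol || hit < eol)) {
				memcpy(hit, new_mac, len);
				updated++;
			}
		}
		line = eol ? eol + 1 : NULL;
	}
	return updated;
}
```

Lines such as lxc.include are never rewritten, so nothing from an included
file can leak into the text.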
Reported-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
With the new hashed command socket names (e85898415c), it's possible to
see something like the following:
[caglar@qop:~/go/src/github.com/lxc/go-lxc(master)] cat /proc/net/unix | grep lxc
0000000000000000: 00000002 00000000 00010000 0001 01 53465 @lxc/d086e835c86f4b8d/command
[...]
list_active_containers reads /proc/net/unix to find all running
containers but this new format no longer includes the container name or
its lxcpath.
This patch introduces two new commands (LXC_CMD_GET_NAME and
LXC_CMD_GET_LXCPATH) and starts to use those in list_active_containers
call.
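As a sketch of the first half of that lookup (an illustration, not the
code in this patch): recognizing a hashed command socket in a line of
/proc/net/unix and pulling out the hash. The name and lxcpath would then
have to be requested over that socket with the new commands, which is not
shown here. lxc_command_hash is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* If line (one row of /proc/net/unix) names an abstract socket of the
 * form "@lxc/<hash>/command", copy <hash> into out and return 1.
 * Otherwise return 0. */
static int lxc_command_hash(const char *line, char *out, size_t out_len)
{
	static const char prefix[] = "@lxc/";
	static const char suffix[] = "/command";
	const char *start, *end;
	size_t len;

	assert(line && out && out_len > 0);

	start = strstr(line, prefix);
	if (!start)
		return 0;
	start += sizeof(prefix) - 1;

	end = strstr(start, suffix);
	if (!end || end == start)
		return 0;
	/* The suffix must end the path (a trailing newline is allowed). */
	if (end[sizeof(suffix) - 1] != '\0' && end[sizeof(suffix) - 1] != '\n')
		return 0;

	len = (size_t)(end - start);
	if (len >= out_len)
		return 0;
	memcpy(out, start, len);
	out[len] = '\0';
	return 1;
}
```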
changes since v1:
- added sanity check proposed by Serge
Signed-off-by: S.Çağlar Onur <caglar@10ur.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This patch adds support for checkpointing and restoring containers via CRIU.
It adds two api calls, ->checkpoint and ->restore, which are wrappers around
the CRIU CLI. CRIU has an RPC API, but reasons for preferring exec() are
discussed in [1].
To checkpoint, users specify a directory to dump the container metadata (CRIU
dump files, plus some additional information about veth pairs and which
bridges they are attached to) into this directory. On restore, this
information is read out of the directory, a CRIU command line is constructed,
and CRIU is exec()d. CRIU uses the lxc-restore-net callback (which in turn
inspects the image directory with the NIC data) to properly restore the
network.
This will only work with the current git master of CRIU; anything as of
a152c843 should work. There is a known bug where containers which have been
restored cannot be checkpointed [2].
[1]: http://lists.openvz.org/pipermail/criu/2014-July/015117.html
[2]: http://lists.openvz.org/pipermail/criu/2014-August/015876.html
v2: fixed some problems with the s/int/bool return code from api function
v3: added a testcase, fixed up the man page synopsis
v4: fix a small typo in lxc-test-checkpoint-restore
v5: remove a reference to the old CRIU_PATH, and a bad error about the same
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This prevents u2 from going into /home/u1/.local/share/lxc/u1/rootfs
and running setuid-root applications to get write access to u1's
container rootfs.
v2: set umask to 002 for the mkdir. Otherwise if umask happens to be,
say, 022, then user does not have write permissions under the container
dir and creation of $containerdir/partial file will fail.
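The v2 note can be sketched as follows. This is an illustration with a
made-up helper name (mkdir_with_umask) and the mode left as a parameter;
it only shows the umask being pinned for the duration of the mkdir.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create path with the given mode while the umask is temporarily set
 * to mask, then restore the caller's umask. Returns mkdir()'s result. */
static int mkdir_with_umask(const char *path, mode_t mode, mode_t mask)
{
	mode_t old;
	int ret;

	assert(path);
	old = umask(mask);	/* umask() cannot fail */
	ret = mkdir(path, mode);
	umask(old);
	return ret;
}
```

With a mask of 002 the group write bit survives even if the caller's own
umask is 022.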
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
When we read a lxc.network.hwaddr line, if it contained any 'x's then
those get quietly filled in at config_network_hwaddr. If that happens
then we want to save the autogenerated hwaddr in the unexpanded config
so that when we write it to disk, it is saved.
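For readers unfamiliar with the 'x' templates, a small sketch of the
fill-in step (an illustration only; fill_hwaddr_template is a hypothetical
name, and rand() is used here purely to keep the example short):

```c
#include <assert.h>
#include <stdlib.h>

/* Replace every 'x' or 'X' in hwaddr with a random lowercase hex
 * digit, in place. Returns the number of characters replaced. */
static int fill_hwaddr_template(char *hwaddr)
{
	static const char hex[] = "0123456789abcdef";
	int replaced = 0;
	char *p;

	assert(hwaddr);
	for (p = hwaddr; *p; p++) {
		if (*p == 'x' || *p == 'X') {
			*p = hex[rand() % 16];
			replaced++;
		}
	}
	return replaced;
}
```

A non-zero return value is the case described above: the address was
generated, so it must also be saved in the unexpanded config.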
This patch dumbly re-generates the network configuration in the
unexp configuration every time we load a config file, just as we do
after every clone.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This commit broke the testsuite for unprivileged containers as the
container directory is now 0750 with the owner being the container root
and the group being the user's group, meaning that the parent user can
only enter the directory, not create entries in there.
This reverts commit c86da6a3ac.
This prevents u2 from going into /home/u1/.local/share/lxc/u1/rootfs
and running setuid-root applications to get write access to u1's
container rootfs.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Actually, get rid of the temporary variables, and set newname
and lxcpath to usable values if they were NULL.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Especially when using the Python API, the child process inherits
the file descriptors of the script.
Signed-off-by: Vincent Giersch <vincent.giersch@ovh.net>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
They don't work right now, so until we fix that, don't allow it.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Originally, we only kept a struct lxc_conf representing the current
container configuration. This was insufficient because lxc.include's
were expanded, so a clone or a snapshot would contain the expanded
include file contents, rather than the original "lxc.include". If
the host's include files are updated, clones and snapshots would not
inherit those updates.
To address this, we originally added a lxc_unexp_conf, which mirrored
the lxc_conf, except that lxc.include was not expanded.
This has its own shortcomings, however. In particular, if a lxc.include
has a lxc.cgroup setting, and you use the api to say:
c.clear_config_item("lxc.cgroup")
this is not representable in the lxc_unexp_conf. (The original problem,
which was pointed out to me by stgraber, was slightly different, but
unlike this problem it was not unsolvable).
This patch changes the unexpanded configuration to be a textual
representation of the configuration. This allows us to *order* the
configuration commands, which is what was not possible using the
struct lxc_conf *lxc_unexp_conf.
The write_config() now becomes a simple fwrite. However, lxc_clone
is slightly complicated in parts, the worst of which is the need to
rewrite the network configuration if we are changing the macaddrs.
With this patch, lxc-clone and clear_config_item do the right thing.
lxc-test-saveconfig and lxc-test-clonetest both pass.
There is room for improvement - multiple calls to
c.append_config_item("lxc.network.link", "lxcbr0")
will result in multiple such lines in the configuration file. In that
particular case it is harmless. There may be cases where it is not.
Overall, this should be a huge improvement in terms of correctness.
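To make the textual approach concrete, here is a minimal sketch (not the
code in this patch) of clearing a key from a text buffer: every line that
starts with the key is dropped, and everything else, including
lxc.include lines, keeps its position. clear_config_lines is a
hypothetical name, and leading whitespace is not handled.

```c
#include <assert.h>
#include <string.h>

/* Remove, in place, every line of the NUL-terminated text buffer conf
 * that starts with key. Returns the number of lines removed. */
static int clear_config_lines(char *conf, const char *key)
{
	size_t klen;
	char *src, *dst;
	int removed = 0;

	assert(conf && key);
	klen = strlen(key);
	src = dst = conf;

	while (*src) {
		char *eol = strchr(src, '\n');
		size_t len = eol ? (size_t)(eol - src) + 1 : strlen(src);

		if (klen > 0 && strncmp(src, key, klen) == 0) {
			removed++;		/* drop this line */
		} else {
			memmove(dst, src, len);	/* keep it, compacting */
			dst += len;
		}
		src += len;
	}
	*dst = '\0';
	return removed;
}
```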
Changelog: Aug 1: updated to current lxc git head. All lxc-test* and
python api test passed.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This gives me:
ubuntu@c-t1:~$ lxc-create -t download -n u1
lxc_container: No mapping for container root
lxc_container: Error chowning /home/ubuntu/.local/share/lxc/u1/rootfs to container root
lxc_container: You must either run as root, or define uid mappings
lxc_container: To pass uid mappings to lxc-create, you could create
lxc_container: ~/.config/lxc/default.conf:
lxc_container: lxc.include = /etc/lxc/default.conf
lxc_container: lxc.id_map = u 0 100000 65536
lxc_container: lxc.id_map = g 0 100000 65536
lxc_container: Error creating backing store type (none) for u1
lxc_container: Error creating container u1
when I create a container without having an id mapping defined.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Originally we kept snapshots under /var/lib/lxcsnaps. If a
separate btrfs is mounted at /var/lib/lxc, then we can't
make btrfs snapshots under /var/lib/lxcsnaps.
This patch moves the default directory to /var/lib/lxc/c/snaps.
If /var/lib/lxcsnaps already exists, then we continue to use that.
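The fallback rule above can be sketched like this. It is an illustration
under assumptions: snapshot_base_dir is a made-up helper, and the paths
are passed in rather than hard-coded.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Choose where snapshots live: the legacy directory if it already
 * exists, otherwise "<lxcpath>/<name>/snaps". Returns 0 on success,
 * -1 if out is too small. */
static int snapshot_base_dir(const char *legacy_dir, const char *lxcpath,
			     const char *name, char *out, size_t out_len)
{
	struct stat st;
	int n;

	assert(legacy_dir && lxcpath && name && out);
	if (stat(legacy_dir, &st) == 0 && S_ISDIR(st.st_mode))
		n = snprintf(out, out_len, "%s", legacy_dir);
	else
		n = snprintf(out, out_len, "%s/%s/snaps", lxcpath, name);

	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```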
add c->destroy_with_snapshots() and c->snapshot_destroy_all()
API methods. c->snapshot_destroy_all() can be triggered from
lxc-snapshot using '-d ALL'. There is no command to call
c->destroy_with_snapshots(c) as of yet.
lxclock: use ".$lxcname" for container lock files
that way we can use /run/lock/lxc/$lxcpath/$lxcname/snaps as a
directory when locking snapshots without having to worry about
/run/lock/lxc/$lxcpath/$lxcname being a file.
destroy: split off a container_destroy
container_destroy() doesn't check for snapshots, so snapshot_rename can
use it. api_destroy() now does check for snapshots (previously it only
checked for fs - i.e. overlayfs/aufs - snapshots).
Add destroy to the manpage, as it was previously undocumented.
Update snapshot testcase accordingly.
[ rebased in the face of commits 840f05df and 7e36f87e. ]
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: S.Çağlar Onur <caglar@10ur.org>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Currently when a container's configuration file has lxc.includes,
any future write_config() will expand the lxc.includes. This
affects container clones (and snapshots) as well as users of the
API who make an update and then c.save_config().
To fix this, separately track the expanded and unexpanded lxc_conf. The
unexpanded conf does not contain values read from lxc.includes. The
expanded conf does. Lxc functions mainly need the expanded conf to
figure out how to configure the container. The unexpanded conf is used
at write_config().
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
before using it, like the other snapshot api methods do.
This will need to go into stable-1.0 as well.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
backing stores supported by qemu-nbd can be attached to a nbd block
device using qemu-nbd. This user-space process (pair) stays around for
the duration of the device attachment. Obviously we want it to go away
when the container shuts down, but not before the filesystems have been
cleanly unmounted.
The device attachment is done from the task which will become the
container monitor before the container setup+init task is spawned.
That task starts in a new pid namespace to ensure that the qemu-nbd
process will be killed if need be. It sets its parent death signal
to sighup, and, on receiving sighup, attempts to do a clean
qemu-device detach, then exits. This should ensure that the
device is detached if the qemu monitor crashes or exits.
It may be worth adding a delay before the qemu-nbd is detached, but
my brief tests haven't seen any data corruption.
Only the parts required for running a nbd-backed container are
implemented here. Create, destroy, and clone are not. The first
use of this that I imagine is for people to use downloaded nbd-backed
images (like ubuntu cloud images, or anything previously used with
qemu). I imagine people will want to create/clone/destroy out of
band using qemu-img, but if I'm wrong about that we can implement
the rest later.
Because attach_block_device() is done before the bdev is initialized,
and bdev_init needs to know the nbd index so that it can mount the
filesystem, we now need to pass the lxc_conf.
file_exists() is moved to utils.c so we can use it from bdev.c
The nbd attach/detach should lay the groundwork for trivial implementation
of qed and raw images.
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Do this by calling the bdev->destroy() hook from a user namespace
configured as the container's.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
btrfs subvolume ioctls are usable by unprivileged users, so allow
unprivileged containers to reside on btrfs.
This patch does not yet enable destroy.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Originally we kept snapshots under /var/lib/lxcsnaps. If a
separate btrfs is mounted at /var/lib/lxc, then we can't
make btrfs snapshots under /var/lib/lxcsnaps.
This patch moves the default directory to /var/lib/lxc/lxcsnaps.
If /var/lib/lxcsnaps already exists, then use that. Don't allow
any container to be used with the name 'lxcsnaps'.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
And add a testcase to catch regressions.
Without this patch, restoring a snapshot of an overlayfs based
container fails, because we do not pass in LXC_CLONE_SNAPSHOT,
and overlayfs does not support clone without snapshot.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
1. remove the cgm_dbus_disconnected handler. We're using a proxy
anyway, and not keeping it around.
2. comment most of the cgm functions to describe when they are called, to
ease locking review
3. the cgmanager mutex is now held for the duration of a connection, from
cgm_dbus_connect to cgm_dbus_disconnect.
3b. so remove the mutex lock/unlock from functions which are called during
container startup with the cgmanager connection already up
4. remove the cgroup_restart(). It's no longer needed since we don't
daemonize while we have the cgmanager socket open.
5. report errors and return early if cgm_dbus_connect() fails
6. don't keep the cgm connection open after cgm_ops_init. I'm a bit torn
on this one as it means that things like lxc-start will always connect
twice. But if we do this there is no good answer, given threaded API
users, on when to drop that initial connection.
7. cgm_unfreeze and nrtasks: grab the dbus connection, as we'll never
have it at that point. (technically i doubt anyone will use
cgmanager and utmp helper on the same host :)
8. lxc_spawn: make sure we only disconnect cgroups if they were already
connected.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If clone is called from the api, the container object in memory
retains the bad fs. The line is wrong, being a leftover from a
previous attempt before copy_storage was moved earlier.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Otherwise an interrupted clone can lead to the original rootfs
being deleted.
There is a period during lxcapi_clone during which we have written down
a temporary configuration file on disk, for the new container, using the
old rootfs. Interruption of clone doesn't allow us to do the cleanup we
do in error paths, so a subsequent lxc-destroy removes the old rootfs.
Fix this by doing the copy_storage as early as possible, and not
writing down the rootfs when we write down the temporary configuration
file.
(note - I tested this by putting a series of
'if (strcmp(newname, "u%d") == 0) exit(1)' inline to trigger
interruption between most blocks. If someone has a good idea
for a generic way to regression-test this henceforth that'd be
great)
See https://bugs.launchpad.net/lxc/+bug/1285850
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This change makes it possible to create unprivileged containers as root.
They will be stored in the usual system wide location, use the usual
system wide cache but will be running using a uid/gid map.
This also updates lxc_usernsexec to use the same function as the rest of
LXC, centralizing all the userns switch in a single function.
That function now detects the presence of newuidmap and newgidmap on the
system, if they are present, they will be used for containers created as
either user or root. If they're not and the user isn't root, an error is
shown. If they're not and the user is root, LXC will directly set the
uid_map and gid_map values.
All that should allow for a consistent experience as well as supporting
distributions that don't yet ship newuidmap/newgidmap.
To make things simpler in the future, a helper function "on_path" is
also introduced and used to detect the presence of newuidmap and
newgidmap.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
(this expands on Dwight's recent patch, commit c597baa8f9)
After unshare(CLONE_NEWNS) and before doing any mounting, always
check whether rootfs is shared. Otherwise template runs or clone
scripts can bleed mount activity to the host.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Systems based on systemd mount the root shared by default. We don't want
mounts done during creation by templates nor those done internally by
bdev during rsync based clones to propagate to the root mntns.
The create case already had the right check, but the mount call was
missing "/", so it was failing.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
We used to do chdir(path), chroot(path). That's correct but not properly
handled by coverity, so do chroot(path), chdir("/") instead as that's the
recommended way.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This is pretty much copy/paste from overlayfs.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Instead force a copy clone. Else if the user makes a change
to the original container, the snapshot will be affected.
The user should first create a snapshot clone, then use
and snapshot that clone while leaving the original container
untouched.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
The goal is to avoid an absolute symlink in the guest redirecting
us to the host's /dev. Thanks to the libvirt team for considering
that possibility!
We want to work on kernels which do not support setns, so we simply
chroot into the container before doing any rm/mknod. If /dev/vda5
is a symlink to /XXX, or /dev is a symlink to /etc, this is now
correctly resolved locally in the chroot.
We would have preferred to use realpath() to check that the resolved
path is not changed, but realpath across /proc/pid/root does not
work as expected.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
That way templates can fix group ownership alongside uid ownership.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
With this change, shutdown() will no longer call stop() after the
timeout, instead it'll just return false and it's up to the caller to
then call stop() if appropriate.
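What a caller now has to do can be sketched as follows. struct container
below is a stand-in with the same method shapes as the shutdown() and
stop() calls mentioned above, so that the example builds without liblxc;
the fake_* functions and shutdown_or_stop are made up for the
demonstration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the two container methods used here. */
struct container {
	bool (*shutdown)(struct container *c, int timeout);
	bool (*stop)(struct container *c);
	bool stopped;
};

/* Demonstration stubs: a shutdown that always times out, and a stop
 * that always succeeds. */
static bool fake_shutdown_timeout(struct container *c, int timeout)
{
	(void)c;
	(void)timeout;
	return false;
}

static bool fake_stop(struct container *c)
{
	c->stopped = true;
	return true;
}

/* Try a clean shutdown; if it does not complete within the timeout,
 * fall back to a hard stop. shutdown() itself no longer does this. */
static bool shutdown_or_stop(struct container *c, int timeout)
{
	assert(c && c->shutdown && c->stop);
	if (c->shutdown(c, timeout))
		return true;
	return c->stop(c);
}
```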
This also updates the bindings, tests and other scripts.
lxc-stop is then updated to do proper option checking and use shutdown,
stop or reboot as appropriate.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
The timeout argument should be handled as follows:
-1 => Wait forever
0 => Don't wait
> 0 => Wait for timeout seconds
Without this patch, the 0 case is mapped to -1.
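The three cases map naturally onto poll(), as in this sketch. It is an
illustration of the convention, not the patched code; wait_readable is a
hypothetical helper, and overflow for very large timeouts and EINTR are
ignored for brevity.

```c
#include <poll.h>

/* Wait until fd has input (or hangup/error) pending. timeout follows
 * the convention above: -1 waits forever, 0 does not wait, > 0 waits
 * that many seconds. Returns 1 if an event is pending, 0 on timeout,
 * -1 on error. */
static int wait_readable(int fd, int timeout)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ms, ret;

	if (timeout < 0)
		ms = -1;		/* poll(): negative means block */
	else
		ms = timeout * 1000;	/* 0 stays 0: return at once */

	ret = poll(&pfd, 1, ms);
	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

The important detail is that 0 is passed through as 0 and is not folded
into the "wait forever" case.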
Signed-off-by: Robert Vogelgesang <vogel@users.sourceforge.net>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>