If the user specified the lxc.net.[i].veth.pair attribute to request that the
host side of a veth pair be given a specific name, let's log it at the trace
level. Otherwise, if the user did not specify lxc.net.[i].veth.pair,
veth_attr.veth1 will contain the name of the host side veth device.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When lxc-user-nic is called with the "delete" subcommand we need to make sure
that we are actually privileged over the network namespace for which we are
supposed to delete devices on the host. To this end we require that the path to
the affected network namespace is passed. We then setns() to the network
namespace and drop privilege to the caller's real user id. Then we try to
delete the loopback interface, which is not possible. If we are privileged over the network
namespace this operation will fail with ENOTSUP. If we are not privileged over
the network namespace we will get EPERM.
This is the first part of the commit. As of now nothing guarantees that the
caller does not just give us a random path to a network namespace it is
privileged over.
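The errno check described above can be sketched as follows. This is only an
illustration: the enum and classify_lo_delete_errno() are hypothetical names,
and the actual deletion request (done after setns() and after dropping to the
real user id) is left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result codes; not taken from lxc-user-nic. */
enum netns_priv {
	NETNS_PRIVILEGED,   /* refused because loopback cannot be removed */
	NETNS_UNPRIVILEGED, /* refused for lack of privilege */
	NETNS_UNKNOWN,      /* any other errno, to be treated as a failure */
};

/*
 * Classify the errno left behind by the failed attempt to delete the
 * loopback interface in the target network namespace.
 */
static enum netns_priv classify_lo_delete_errno(int err)
{
	/* The deletion is expected to fail, so 0 is not a valid input. */
	assert(err != 0);

	/* ENOTSUP and EOPNOTSUPP may share one value; checking both is harmless. */
	if (err == ENOTSUP || err == EOPNOTSUPP)
		return NETNS_PRIVILEGED;
	if (err == EPERM)
		return NETNS_UNPRIVILEGED;
	return NETNS_UNKNOWN;
}
```

A caller would only go on to delete host devices when the result is
NETNS_PRIVILEGED.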
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This is the cause of the unnecessary extraneous slashes when creating cgroups.
Our lxc.system.conf page also clearly shows "lxc/%n" as example, not "/lxc%n".
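For illustration, "%n" in the pattern stands for the container name. A minimal
sketch of that expansion, assuming only the first "%n" is replaced;
expand_cgroup_pattern() is a hypothetical helper and not a liblxc function.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Expand the first "%n" in a cgroup pattern to the container name.
 * Returns a malloc()ed string the caller must free(), or NULL on error.
 */
static char *expand_cgroup_pattern(const char *pattern, const char *name)
{
	const char *tok, *suffix;
	size_t prefix_len, len;
	char *out;

	assert(pattern && name);

	tok = strstr(pattern, "%n");
	if (tok) {
		prefix_len = (size_t)(tok - pattern);
		suffix = tok + 2;
	} else {
		/* No placeholder: the result is a plain copy of the pattern. */
		prefix_len = strlen(pattern);
		suffix = "";
		name = "";
	}

	len = prefix_len + strlen(name) + strlen(suffix) + 1;
	out = malloc(len);
	if (!out)
		return NULL;

	snprintf(out, len, "%.*s%s%s", (int)prefix_len, pattern, name, suffix);
	return out;
}
```

With the pattern "lxc/%n" and the container name "c1" this yields "lxc/c1",
without a leading slash.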
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This reverts commit 92c590ae1e.
Problem:
Commit 92c590ae1e introduced the following
behavior:
> cgfsng: try to delete parent cgroups
>
> Say we have
>
> lxc.uts.name = c1
> lxc.cgroup.dir = lxd/a/b/c
>
> the path for the container's cgroup would be
>
> lxd/a/b/c/c1
>
> When the container is shutdown we should not just try to delete "c1" we
> should also try to delete "c", "b", "a", and "lxd". This is to ensure
> that we don't leave empty cgroups around thereby increasing the chance
> that we run into trouble with cgroup limits. The algorithm for this isn't
> too costly since we can simply stop walking upwards at the first rmdir()
> failure.
The algorithm employs recursive_destroy() which opens each directory
specified in lxc.cgroup.dir and tries to delete each directory within that
directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
contains the cgroup "c1" for container "c1". Assume that "c1" calls
recursive_destroy() to clean up its cgroups. It will first delete "c1" and
anything underneath it. This is perfectly fine since anything underneath
that cgroup is under its control. The new algorithm will then tell it to
"recurse upwards". So recursive_destroy() will try to delete
"/sys/fs/cgroup/memory/lxd/a/b/c" next. Now assume that a second container "c2"
has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
cgroup_create(). This will create the *empty* cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that, after "c2" has been
created, container "c1"'s call to recursive_destroy() reaches
"/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter its
created cgroup. This will fail since c1 deleted the cgroup "c2". (As a
side note: this is one of the few race conditions that are actually
easy to describe.)
Possible Solution:
Instead of calling recursive_destroy() on all cgroups specified in
lxc.cgroup.dir we only call recursive_destroy() on the container's own
cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
avoid the race described above. My argument is as follows. Assume that the
container c1 has created the cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1" for
itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
called on the cgroup "c1" which will delete any empty cgroup directories
underneath "c1" and finally "c1" itself. This is fine since everything
under "c1" is container c1's sole property. Now container c1 will call
unlinkat() on the parent cgroup "/sys/fs/cgroup/memory/lxd/a/b/c":
- Assume that in the meantime container c2 has created the cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
This will stop c1 from recursing upwards. So c2's cgroup_enter() call
will find all its cgroups intact and well. unlinkat() will come with the
appropriate in-kernel locking which will stop it from racing with
mkdir().
- There's still a subtle race left. c2 might be calling an implementation
of mkdir -p to try and create e.g. the cgroup
"/sys/fs/cgroup/memory/lxd/a/b". Let's assume "b" exists. Then c2 will
receive EEXIST on "b" and move on to create "c". Let's further assume c1
has already deleted "c". c1 will now be able to delete
"/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.
The latter subtle race makes me rethink this approach. For now we'll just leave
empty cgroups behind since I don't want to start locking stuff.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This moves all of the network handling code into network.{c,h}. This makes what
is going on much clearer. Also it's easier to find relevant code if it is all
in one place.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Older instances of liblxc allowed networks to be specified like this:
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxdbr0
lxc.network.name = eth0
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxdbr0
lxc.network.name = eth1
Each occurrence of "lxc.network.type" indicated the definition of a new
network. This syntax is not allowed in newer liblxc instances. Instead, each
network must carry an index. So in new liblxc these two networks would be
translated to:
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = lxdbr0
lxc.net.0.name = eth0
lxc.net.1.type = veth
lxc.net.1.flags = up
lxc.net.1.link = lxdbr0
lxc.net.1.name = eth1
The update script did not handle this case correctly. It should now.
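The indexing rule can be sketched like this. The sketch is in C purely for
illustration and is not the update script itself; translate_network_line() is
a hypothetical helper that assumes one key per line with no leading
whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite one legacy "lxc.network.<key>" line as "lxc.net.<idx>.<key>".
 * *idx must start at -1; every "lxc.network.type" key opens a new network
 * and bumps the index. Lines with other prefixes are copied unchanged.
 * Returns 0 on success, -1 if out is too small or if a legacy key shows up
 * before any "lxc.network.type" line.
 */
static int translate_network_line(const char *line, int *idx,
				  char *out, size_t outlen)
{
	static const char legacy[] = "lxc.network.";
	const char *rest;
	int n;

	assert(line && idx && out && outlen > 0);

	if (strncmp(line, legacy, sizeof(legacy) - 1) != 0) {
		n = snprintf(out, outlen, "%s", line);
		return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
	}

	rest = line + sizeof(legacy) - 1;

	/* "type" followed by '=', space or tab starts a new network. */
	if (strncmp(rest, "type", 4) == 0 &&
	    (rest[4] == '=' || rest[4] == ' ' || rest[4] == '\t'))
		(*idx)++;

	if (*idx < 0)
		return -1;

	n = snprintf(out, outlen, "lxc.net.%d.%s", *idx, rest);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Feeding the eight legacy lines above through this function in order, with the
index starting at -1, gives the lxc.net.0.* and lxc.net.1.* keys.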
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We use the ifindex as an indicator that liblxc created the network so let's
record it for the unprivileged case as well.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
- lxc-user-nic gains the subcommands {create,delete}
- dup2() STDERR_FILENO as well so that we can show helpful messages in our logs
on failure
- initialize output buffer so that we don't print garbage
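A rough sketch of the last two points, assuming a fork()ed helper whose output
is read back through a pipe. run_and_capture() and failing_child() are
hypothetical and only stand in for the real call to lxc-user-nic.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Example child: complain on stderr and report failure. */
static int failing_child(void)
{
	static const char msg[] = "example failure\n";

	return write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0 ? 2 : 1;
}

/*
 * Run fn() in a child whose stdout and stderr both feed one pipe and copy
 * what it wrote into buf. buf is zeroed first so it holds a valid (possibly
 * empty) string even if the child prints nothing.
 * Returns the child's exit status, or -1 on error.
 */
static int run_and_capture(int (*fn)(void), char *buf, size_t buflen)
{
	int pipefd[2], status;
	size_t used = 0;
	ssize_t n;
	pid_t pid;

	assert(fn && buf && buflen > 0);
	memset(buf, 0, buflen);

	if (pipe(pipefd) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	if (pid == 0) {
		close(pipefd[0]);
		/* Redirect both streams so error messages reach the parent. */
		if (dup2(pipefd[1], STDOUT_FILENO) < 0 ||
		    dup2(pipefd[1], STDERR_FILENO) < 0)
			_exit(127);
		close(pipefd[1]);
		_exit(fn());
	}

	close(pipefd[1]);
	/* Never write the last byte so buf stays NUL-terminated. */
	while (used < buflen - 1 &&
	       (n = read(pipefd[0], buf + used, buflen - 1 - used)) > 0)
		used += (size_t)n;
	close(pipefd[0]);

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

On failure the parent can then log the contents of buf, which holds the
child's error message instead of uninitialized memory.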
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
get_new_nicname() calls lxc_mkifname() which allocates memory and returns it to
the caller. The way get_new_nicname() and get_nic_if_avail() were implemented
they hid that fact by returning a boolean. That doesn't make sense. Let's
rather have them return a pointer to the allocated nic name which the caller
needs to free.
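The ownership convention can be sketched as below. new_nicname() and its
naming scheme are hypothetical; the real code obtains the name from
lxc_mkifname().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build an interface name and hand ownership to the caller.
 * Returns a malloc()ed string the caller must free(), or NULL on failure,
 * so the allocation is visible in the signature instead of being hidden
 * behind a boolean.
 */
static char *new_nicname(const char *prefix, unsigned int id)
{
	char *name;
	int len;

	assert(prefix);

	/* First pass only measures the formatted length. */
	len = snprintf(NULL, 0, "%s%u", prefix, id);
	if (len < 0)
		return NULL;

	name = malloc((size_t)len + 1);
	if (!name)
		return NULL;

	snprintf(name, (size_t)len + 1, "%s%u", prefix, id);
	return name;
}
```

A NULL return replaces the old "false" result, and every non-NULL result must
be passed to free() once the caller is done with it.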
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Say we have
lxc.uts.name = c1
lxc.cgroup.dir = lxd/a/b/c
the path for the container's cgroup would be
lxd/a/b/c/c1
When the container is shut down we should not just try to delete "c1" we should
also try to delete "c", "b", "a", and "lxd". This is to ensure that we don't
leave empty cgroups around thereby increasing the chance that we run into
trouble with cgroup limits. The algorithm for this isn't too costly since we
can simply stop walking upwards at the first rmdir() failure.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Say we have
lxc.uts.name = c1
lxc.cgroup.dir = lxd
the actual path should be
lxd/c1
Right now it would just be
lxd
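The intended path construction, as a minimal sketch. container_cgroup_path()
is a hypothetical helper, and trimming trailing slashes from the directory is
an assumption made here, not something this commit states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<dir>/<name>" to out, where dir is the value of lxc.cgroup.dir and
 * name is the value of lxc.uts.name. An empty dir yields just the name.
 * Returns 0 on success, -1 if out is too small.
 */
static int container_cgroup_path(const char *dir, const char *name,
				 char *out, size_t outlen)
{
	size_t dirlen;
	int n;

	assert(dir && name && out && outlen > 0);

	/* Ignore trailing slashes so "lxd/" and "lxd" give the same result. */
	dirlen = strlen(dir);
	while (dirlen > 0 && dir[dirlen - 1] == '/')
		dirlen--;

	if (dirlen == 0)
		n = snprintf(out, outlen, "%s", name);
	else
		n = snprintf(out, outlen, "%.*s/%s", (int)dirlen, dir, name);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For dir "lxd" and name "c1" the result is "lxd/c1" rather than "lxd".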
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
So far, when creating veth devices attached to openvswitch bridges we used to
fork() off a child process on container startup. This process was kept around
until the container shut down. I have no good explanation why we did it that
way but it's certainly not necessary. Instead, let's fork() off the process on
container shutdown to delete the veth.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
"lxc.cgroup.dir" can be used to set the name of the directory the container's
cgroup will be created in. For example, setting
lxc.uts.name = c1
lxc.cgroup.dir = lxd
would make liblxc create the cgroup
lxd/c1
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Surfaced while building lxc-2.0.8 on e2k architecture with lcc,
looks like its -Wall is more pedantic than gcc's:
lcc: "conf.c", line 1514: error: unrecognized character escape sequence
[-Werror]
DEBUG("created directory for console and tty devices at \%s\"", path);
^
in expansion of macro "DEBUG" at line 1514
While at it, also fix a leading whitespace error (a one-byte change).
Signed-off-by: Michael Shigorin <mike@altlinux.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>