Commit Graph

6355 Commits

Author SHA1 Message Date
Christian Brauner
da0f9977a1
conf: do not log uninitialized memory
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-05 13:46:53 +02:00
Christian Brauner
2187efd310
conf: fix tty creation
We allocate pty {master,slave} file descriptors in the child's namespaces after
we have set up devpts. After we have sent the pty file descriptors to the parent
and set up the pty file descriptors under /dev/tty* and before we exec the init
binary we need to delete these file descriptors in the child. However, one of
my commits made the deletion occur before setting up the file descriptors under
/dev/tty*. This caused failures when trying to attach to the container's ttys
since they weren't actually configured although the file descriptors were
available in the in-memory configuration of the parent.
This commit reworks setting up tty such that deletion occurs after all setup
has been performed. The commit is actually minimal but needs to also move all
the functions into one place since they will now be called from
"lxc_create_ttys()".

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-05 12:41:30 +02:00
Christian Brauner
73363c6134
conf: non-functional changes
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-05 12:19:28 +02:00
Stéphane Graber
8a0c503344 Merge pull request #1785 from brauner/2017-09-05/record_idmap_in_log
conf: record idmap that gets written
2017-09-04 20:18:25 -04:00
Christian Brauner
54fbbeb573
conf: record idmap that gets written
This will serve us well in the future!

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-05 02:00:29 +02:00
Stéphane Graber
27b129202f Merge pull request #1784 from brauner/2017-09-05/document_handler_fields
start: document all handler fields
2017-09-04 18:45:32 -04:00
Christian Brauner
35a02107f2
start: document all handler fields
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-05 00:30:33 +02:00
Stéphane Graber
a753a7a931 Merge pull request #1783 from brauner/2017-09-04/criu_version
criu: add cmp_version()
2017-09-04 15:52:44 -04:00
Federico Briata
74ad36079c
criu: add cmp_version()
We cannot use strcmp(). Otherwise we incorrectly report e.g. that criu 2.12.1
is less than 2.8.
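
A minimal sketch of why a numeric comparison is needed, assuming dotted versions made of decimal components. The helper name version_cmp_sketch() is hypothetical and is not the cmp_version() added by this commit:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare two dotted version strings component by
 * component as integers. Returns <0, 0 or >0 like strcmp(). Missing
 * trailing components count as 0, so "2.8" equals "2.8.0". */
static int version_cmp_sketch(const char *a, const char *b)
{
	assert(a && b);

	while (*a || *b) {
		char *end_a, *end_b;
		long na = strtol(a, &end_a, 10);
		long nb = strtol(b, &end_b, 10);

		if (na != nb)
			return na < nb ? -1 : 1;

		/* Neither side consumed a digit: malformed input, stop here. */
		if (end_a == a && end_b == b)
			break;

		/* Skip the '.' separator, if any, and look at the next part. */
		a = (*end_a == '.') ? end_a + 1 : end_a;
		b = (*end_b == '.') ? end_b + 1 : end_b;
	}

	return 0;
}
```

With this helper "2.12.1" compares greater than "2.8" because 12 > 8, whereas strcmp() stops at the characters '1' and '8' and reports the opposite.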

Signed-off-by: Federico Briata <federico-pietro.briata@cnhind.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-04 21:30:18 +02:00
Stéphane Graber
56c2d600a8 Merge pull request #1771 from brauner/2017-08-30/remove_executable_bit_from_console.c
console: non-functional change
2017-09-04 12:54:59 -04:00
Stéphane Graber
b41431e87d Merge pull request #1782 from brauner/2017-09-04/fix_tty_sending
conf: don't send ttys when none are configured
2017-09-04 11:54:23 -04:00
Christian Brauner
7729f8e519
start: don't let data_sock users close the fd
It is bad style to close an fd inside a function which didn't create it. Let's
rather close it transparently in start.c.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-04 14:48:37 +02:00
Christian Brauner
1f9bbd230c
conf: don't send ttys when none are configured
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-04 14:35:02 +02:00
Serge Hallyn
36259ede73 Merge pull request #1773 from brauner/2017-08-31/ensure_lxc_user_nic_tests_privilege_over_netns
network: improvements + bugfixes
2017-09-03 21:17:43 -05:00
Christian Brauner
b9f522c5b4
start: switch from SOCK_DGRAM to SOCK_STREAM
Writes of at most PIPE_BUF bytes are atomic. POSIX guarantees that PIPE_BUF is
at least 512 and Linux guarantees 4096. Nothing we send around goes over this
limit.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-04 01:27:30 +02:00
Christian Brauner
672c1e5821
conf: send ttys in batches of 2
I thought we could send all ttys at once but this limits the number of ttys
users can use because of iovec_len restrictions. So let's send them in batches
of 2.
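
A rough sketch of batched file descriptor passing, assuming the descriptors travel as SCM_RIGHTS ancillary data over an AF_UNIX socket. The names BATCH, send_fd_batch(), send_fds_batched() and recv_fd_batch() are made up for illustration and are not liblxc functions:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BATCH 2 /* descriptors per message, as in the commit title */

/* Control buffer large enough for BATCH descriptors, suitably aligned. */
union fd_ctrl {
	struct cmsghdr align;
	char buf[CMSG_SPACE(BATCH * sizeof(int))];
};

/* Send n (1..BATCH) descriptors plus one dummy data byte in one message. */
static int send_fd_batch(int sock, const int *fds, size_t n)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(n >= 1 && n <= BATCH);
	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = CMSG_SPACE(n * sizeof(int));

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(n * sizeof(int));
	memcpy(CMSG_DATA(cmsg), fds, n * sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Walk the whole array and send it BATCH descriptors at a time. */
static int send_fds_batched(int sock, const int *fds, size_t total)
{
	for (size_t i = 0; i < total; i += BATCH) {
		size_t n = total - i < BATCH ? total - i : BATCH;

		if (send_fd_batch(sock, fds + i, n) < 0)
			return -1;
	}

	return 0;
}

/* Receive one batch; returns the number of descriptors stored in fds. */
static int recv_fd_batch(int sock, int fds[BATCH])
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	size_t n;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
	memcpy(fds, CMSG_DATA(cmsg), n * sizeof(int));
	return (int)n;
}
```

Keeping each message small means the control buffer has a fixed, known size on both ends, and the total number of ttys is only bounded by how many batches are sent.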

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-04 01:27:04 +02:00
Christian Brauner
cb5659e1cd
lxc-user-nic: simplify
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 20:49:54 +02:00
Christian Brauner
966e9f1fc8
network: remove allocation from lxc_mkifname()
lxc_mkifname() really doesn't need to allocate any memory.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 20:39:59 +02:00
Christian Brauner
2958919632
network: fix grammar
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 17:55:28 +02:00
Christian Brauner
a1ae535a4f
network: use send()/recv()
Also move all functions to network.{c,h}.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 17:55:28 +02:00
Christian Brauner
d0fbc7bab7
handler: root -> am_root
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 17:55:28 +02:00
Christian Brauner
6d8c277969
lxc-user-nic: bugfixes
Since find_line() was changed earlier, count_entries() started counting lines
wrong. It would report that the maximum had been reached before you actually
reached your allotted maximum.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 17:55:28 +02:00
Christian Brauner
d75c14e262
utils: add lxc_nic_exists()
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-03 16:35:48 +02:00
Christian Brauner
3231134523
lxc-user-nic: keep lines from other {users,links}
Assume the db contained the following entries:

    chb veth lxcbr0 veth1
    chb veth lxcbr0 veth2
    chb veth lxdbr0 veth3
    chb veth lxdbr0 veth2
    didi veth lxcbr0 veth4

And you request

    cull_entries("chb", "veth", "lxdbr0", "veth3");

lxc-user-nic would wipe any entries that did not match irrespective of whether
they existed or not. Let's fix that.
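
One way to read the intended behavior, sketched under the assumption that every db line has the form "user type link nic": only lines whose user, type and link all match the request are candidates for culling, everything else is kept as is. The helpers entry_matches() and count_foreign_entries() are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does a "user type link nic" line belong to the
 * requested user, type and link? Only such lines may be culled. */
static bool entry_matches(const char *line, const char *user,
			  const char *type, const char *link)
{
	char u[64], t[64], l[64], n[64];

	assert(line && user && type && link);
	if (sscanf(line, "%63s %63s %63s %63s", u, t, l, n) != 4)
		return false; /* malformed line: leave it alone */

	return !strcmp(u, user) && !strcmp(t, type) && !strcmp(l, link);
}

/* Count the entries that a cull for (user, type, link) must not touch. */
static size_t count_foreign_entries(const char *const *db, size_t len,
				    const char *user, const char *type,
				    const char *link)
{
	size_t kept = 0;

	for (size_t i = 0; i < len; i++)
		if (!entry_matches(db[i], user, type, link))
			kept++;

	return kept;
}
```

For the five entries above and a request for ("chb", "veth", "lxdbr0"), the two lxcbr0 lines of chb and the line of didi fall outside the request and stay in the db.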

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-02 19:44:10 +02:00
Christian Brauner
a92028b27f
lxc-user-nic: fix adding database entries
The code before inserted \0-bytes after every new line which made the db
basically unusable.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-02 02:26:28 +02:00
Christian Brauner
7ab1ba029b
network: remove netpipe
We use data_sock for all things we need to send around between parent and child
now. It doesn't make sense to have so many different pipes and sockets if one
will do just fine.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 23:51:44 +02:00
Dimitri John Ledkov
db3c8336ac Check that there is a netplan binary, rather than just a config directory.
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 21:49:03 +02:00
Dimitri John Ledkov
649f249cd0 templates/ubuntu: support netplan in newer releases by default
If netplan is present in the container, configure default networking
with netplan instead of ifupdown. Also, do not install ifupdown when
bootstrapping the minbase variant, unless using currently supported
non-netplan releases (trusty, xenial, zesty).

Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
2017-09-01 21:49:00 +02:00
Christian Brauner
8843fde445
network: use correct network device name
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 21:36:52 +02:00
Christian Brauner
b809f23228
network: stop recording saved physical net devices
liblxc will now correctly log any network device names and ifindices in their
respective network namespaces. So there's no need to record physical network
devices any more. This spares us heap allocations and memory we need to have
lying around until the container is shut down.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 21:36:50 +02:00
Christian Brauner
790255cf8e
network: retrieve correct names and ifindices
On privileged network creation we only retrieved the names and ifindices of
network devices in the host's network namespace. This meant that the monitor
process was acting on possibly incorrect information. With this commit we have
the child send back the correct device names and ifindices in the container's
network namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 16:08:31 +02:00
Christian Brauner
c6012571f3
start: non-functional changes
This renames the socketpair() variable "ttysock" to "data_sock" since we will
use it to send arbitrary data around, not just ttys anymore.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 13:04:00 +02:00
Christian Brauner
535e88591d
network: non-functional changes
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 12:54:43 +02:00
Christian Brauner
de4855a8bc network: use static memory for net device names
All network device names must have a length < IFNAMSIZ. So let's spare the useless
heap allocations and use static memory.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-09-01 12:32:56 +02:00
Christian Brauner
99573f4aea
lxc-user-nic: initialize vars to silence gcc-7
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 23:13:44 +02:00
Christian Brauner
8424b4e14b
lxc-user-nic: free memory and check for error
- check for error on ifindex retrieval
- free allocated memory

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 23:09:38 +02:00
Christian Brauner
d0b915aab9
start: non-functional changes
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 23:09:38 +02:00
Christian Brauner
8da62485e8
network: retrieve the host's veth device ifindex
- Retrieve the host's veth device ifindex in the host's network namespace.
- Add a note why we retrieve the container's veth device ifindex in the host's
  network namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 23:09:37 +02:00
Serge Hallyn
94a182afcc Merge pull request #1772 from brauner/2017-08-31/ensure_lxc_user_nic_tests_privilege_over_netns
lxc-user-nic: test privilege over netns on delete
2017-08-31 12:15:22 -05:00
Christian Brauner
74c6e2b015
network: rework network creation
- On unprivileged veth network creation have lxc-user-nic send the names of the
  veth devices and their respective ifindices. The advantage of retrieving this
  information from lxc-user-nic is that we spare ourselves sending around more
  stuff via the netpipe in start.c. Also, lxc-user-nic operates in both
  namespaces (the container's namespace and the host's namespace) via setns and
  so is guaranteed to retrieve the correct ifindex via if_nametoindex() which is
  a network namespace aware ioctl() call. While I'm pretty sure the ifindices
  for veth devices are identical across network namespaces I'm wary of relying
  on this. We need the ifindices to guarantee safe deletion of unprivileged
  network devices via lxc-user-nic later on since we use them to identify the
  network devices in their corresponding network namespaces.
- Move the network device logging from the child to the parent. The child does
  not have all of the information about the network devices available, only the
  few bits it actually needs to know. The monitor process is the only process
  that needs all this information.
- The network creation code for privileged and unprivileged networks was
  previously mangled into one single function but at the same time some of the
  privileged code had additional functions that were called in other places in
  start.c. Let's divide and conquer and split out the privileged and
  unprivileged network creation into completely separate functions. This makes
  what's happening way more clear. This will also have no performance impact
  since either you are privileged and only execute the privileged network
  creation functions or you are unprivileged and only execute the unprivileged
  network creation functions.
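
A small sketch of the if_nametoindex() lookup mentioned in the first point. The call resolves the name in whatever network namespace the calling process is currently in, so a helper would setns() first (not shown here). The wrapper name lookup_ifindex() is hypothetical:

```c
#include <assert.h>
#include <net/if.h>

/* Hypothetical wrapper: resolve name in the caller's current network
 * namespace. Returns 0 and stores the index, or -1 if no such device. */
static int lookup_ifindex(const char *name, unsigned int *ifindex)
{
	unsigned int idx;

	assert(name && ifindex);
	idx = if_nametoindex(name); /* 0 means failure, errno is set */
	if (idx == 0)
		return -1;

	*ifindex = idx;
	return 0;
}
```

Calling the wrapper once in each namespace yields the index as seen from that namespace, which avoids relying on the indices being identical on both sides.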

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 16:43:37 +02:00
Christian Brauner
d952b351d2
network: log ifindex for host side veth device
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 15:30:08 +02:00
Christian Brauner
085bb443cc
network: document all fields in struct lxc_netdev
This is menial work but I'll thank myself later... a lot.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 15:30:08 +02:00
Christian Brauner
4239e9c3bb
network: add ifindex field for host veth device
We should not just record the ifindex for the container's veth device but also
for the host's veth device. This is useful when {configuring,deconfiguring}
veth devices and becomes crucial when calling our lxc-user-nic setuid helper
where we rely on the ifindex to make decisions about whether we are licensed to
perform certain operations on the veth device in question.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 15:30:08 +02:00
Christian Brauner
8ce727fcd9
network: log veth_attr.pair and veth_attr.veth1
If the user specified lxc.net.[i].veth.pair attribute to request that the host
side of a veth pair be given a specific name let's log it at the trace level.
Otherwise, if the user did not specify lxc.net.[i].veth.pair, veth_attr.veth1
will contain the name of the host side veth device.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 15:30:07 +02:00
Christian Brauner
1bd8d726a7 lxc-user-nic: test privilege over netns on delete
When lxc-user-nic is called with the "delete" subcommand we need to make sure
that we are actually privileged over the network namespace for which we are
supposed to delete devices on the host. To this end we require that the path to the
affected network namespace is passed. We then setns() to the network namespace
and drop privilege to the caller's real user id. Then we try to delete the
loopback interface which is not possible. If we are privileged over the network
namespace this operation will fail with ENOTSUP. If we are not privileged over
the network namespace we will get EPERM.
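
The decision described above boils down to interpreting the errno of the failed loopback deletion. A sketch of only that step, with the netlink request itself left out; the enum and function names are hypothetical:

```c
#include <assert.h>
#include <errno.h>

enum netns_priv {
	NETNS_PRIVILEGED,   /* delete was refused as unsupported */
	NETNS_UNPRIVILEGED, /* delete was refused for lack of permission */
	NETNS_UNKNOWN,      /* anything else: do not draw a conclusion */
};

/* Hypothetical helper: err is the positive errno value observed after the
 * attempt to delete the loopback interface failed. */
static enum netns_priv classify_loopback_delete(int err)
{
	assert(err >= 0);

	switch (err) {
	case ENOTSUP:
		/* The kernel got past the permission check and then refused
		 * because loopback cannot be deleted. */
		return NETNS_PRIVILEGED;
	case EPERM:
		return NETNS_UNPRIVILEGED;
	default:
		return NETNS_UNKNOWN;
	}
}
```

Treating every other error as unknown keeps the helper on the safe side: the delete subcommand would only proceed on the one result that proves privilege.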

This is the first part of the commit. As of now nothing guarantees that the
caller does not just give us a random path to a network namespace it is
privileged over.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-31 15:30:06 +02:00
Christian Brauner
3a12c64d94
configure: remove slash from cgroup pattern
This is the cause of the unnecessary extraneous slashes when creating cgroups.
Our lxc.system.conf page also clearly shows "lxc/%n" as example, not "/lxc%n".

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-30 16:45:45 +02:00
Christian Brauner
06caa01200
console: non-functional change
Remove executable bit.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-30 16:38:06 +02:00
Stéphane Graber
70a498157a Merge pull request #1769 from brauner/2017-08-30/improve_empty_cgroup_deletion
Revert "cgfsng: try to delete parent cgroups"
2017-08-30 10:35:06 -04:00
Christian Brauner
cf7faeb345
confile: remove unnecessary cleanup code
set_config_string_item() already free()s before setting the new value.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-30 12:26:42 +02:00
Christian Brauner
308a6c946d
Revert "cgfsng: try to delete parent cgroups"
This reverts commit 92c590ae1e.

Problem:

    Commit 92c590ae1e introduced the following
    behavior:

    > cgfsng: try to delete parent cgroups
    >
    > Say we have
    >
    >     lxc.uts.name = c1
    >     lxc.cgroup.dir = lxd/a/b/c
    >
    > the path for the container's cgroup would be
    >
    >     lxd/a/b/c/c1
    >
    > When the container is shut down we should not just try to delete "c1", we
    > should also try to delete "c", "b", "a", and "lxd". This is to ensure
    > that we don't leave empty cgroups around thereby increasing the chance
    > that we run into trouble with cgroup limits. The algorithm for this isn't
    > too costly since we can simply stop walking upwards at the first rmdir()
    > failure.

    The algorithm employs recursive_destroy() which opens each directory
    specified in lxc.cgroup.dir and tries to delete each directory within that
    directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
    contains the cgroup "c1" for container "c1". Assume that "c1" calls
    recursive_destroy() to clean up its cgroups. It will first delete "c1" and
    anything underneath it. This is perfectly fine since anything underneath
    that cgroup is under its control. The new algorithm will then tell it to
    "recurse upwards". So recursive_destroy() will try to delete
    "/sys/fs/cgroup/memory/lxd/a/b/c" next. Now assume that a second container "c2"
    has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
    cgroup_create(). This will create the *empty* cgroup
    "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that after having created
    "c2" container "c1"'s call to recursive_destroy() reaches
    "/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
    cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter its
    created cgroup. This will fail since c1 deleted the cgroup "c2". (As a
    side note: This is one of the few race conditions that are actually
    easy to describe.)

Possible Solution:

    Instead of calling recursive_destroy() on all cgroups specified in
    lxc.cgroup.dir we only call recursive_destroy() on the container's own
    cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
    upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
    avoid the race described above. My argument is as follows. Assume that the
    container c1 has created the cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1" for
    itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
    called on the cgroup "c1" which will delete any empty cgroup directories
    underneath "c1" and finally "c1" itself. This is fine since everything
    under "c1" is container c1's sole property. Now container c1 will call
    unlinkat() on "/sys/fs/cgroup/memory/lxd/a/b/c":
    - Assume that in the meantime container c2 has created the cgroup
      "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
      This will stop c1 from recursing upwards. So c2's cgroup_enter() call
      will find all its cgroups intact and well. unlinkat() will come with the
      appropriate in-kernel locking which will stop it from racing with
      mkdir().
    - There's still a subtle race left. c2 might be calling an implementation
      of mkdir -p to try and create e.g. the cgroup
      "/sys/fs/cgroup/memory/lxd/a/b". Let's assume "b" exists then c2 will
      receive EEXIST on "b" and move on to create "c". Let's further assume c1
      has already deleted "c". c1 will now be able to delete
      "/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.

The latter subtle race makes me rethink this approach. For now we'll just leave
empty cgroups behind since I don't want to start locking stuff.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2017-08-30 12:26:10 +02:00