bpf_devices_cgroup_supported() tries to load a simple BPF program to
test if BPF works. This is problematic because the function used to load
the program - bpf_program_load_kernel() - emits an error to the log if
BPF is not enabled in the kernel although device controller is not
requested in the configuration. Users could interpret that as a problem.
Make bpf_devices_cgroup_supported() check if the BPF syscall is available
before calling bpf_program_load_kernel(). We can do it by passing a NULL
pointer instead of the syscall argument as the kernel returns either
ENOSYS, when the syscall is not implemented or EFAULT, when it is
implemented.
Signed-off-by: Petr Malat <oss@malat.biz>
If a device file is opened and there isn't the underlying device,
the open call fails with ENXIO, but the path can be opened with
O_PATH, which is enough for mounting over the device file.
Generalize this idea and use O_PATH for all cases when the file
is there. One still must check for both ENXIO and EEXIST as it's
unspecified what error is reported if multiple error conditions
occur at the same time.
Signed-off-by: Petr Malat <oss@malat.biz>
With the changes introduced in:
b7b1e3a34c
the hierarchy-struct did not have the path_lim set anymore, which is
needed by setup_limits_legacy (->cg_legacy_set_data->lxc_write_openat)
to actually access the cgroup directory.
The issue can be reproduced with a container config having
```
lxc.cgroup.devices.deny = a
```
(or any lxc.cgroup.devices entry) set on a system booted with
systemd.unified_cgroup_hierarchy=0.
This affects all privileged containers on PVE (due to the default
devices.deny entry).
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
This is not a fatal error and the fallback codepath is equally safe.
When we use TIOCGPTPEER we're using a stashed fd to the container's
devpts mount's ptmx device and allocating a new fd non-path based
through this ioctl. If this ioctl can't be used we're falling back to
allocating a pts device from the host's devpts mount's ptmx device which
is path-based but is not under control of the container and so that's
safe. The difference is just that the first method gets you a nice
native terminal with all the pleasantries of having tty and friends
working whereas the latter method does not.
Fixes: #3625
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Use as much as possible from each command `--help` for completion.
Some options require a long list of completions that should be dumped by
some command option. These are not added here yet.
Examples of those are: `lxc-info --config`, `lxc-execute --define` and
`lxc-start --define`.
Signed-off-by: Edenis Freindorfer Azevedo <edenisfa@gmail.com>
By default, there is no out-of-the-box bash completion for lxc tools.
This is due to dynamic loading of completions, that requires the
completion filename to be the same as the command (e.g. `lxc-start`
expects a completion filename `lxc-start`). But all commands are in file
`lxc`, which is not read.
Signed-off-by: Edenis Freindorfer Azevedo <edenisfa@gmail.com>
If an include directive ends with a trailing slash, we now
always assume it is a directory and do not treat the
non-existence as an error.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Old versions of Docker emulate a cgroup namespace by bind-mounting the
container's cgroup over the corresponding controller:
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,net_cls,net_prio
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,cpu,cpuacct
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,memory
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,devices
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,hugetlb
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,perf_event
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,cpuset
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,blkio
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,pids
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,freezer
New versions of LXC always stash a file descriptor for the root of the
cgroup mount at /sys/fs/cgroup and then resolve the current cgroup
parsed from /proc/{1,self}/cgroup relative to that file descriptor. This
doesn't work when the caller's cgroup is mouned over the controllers.
Older versions of LXC simply counted such layouts as having no cgroups
available for delegation at all and moved on provided no cgroup limits
were requested. But mainline LXC would fail such layouts. While I would
argue that failing such layouts is the semantically clean approach we
shouldn't regress users so make mainline LXC treat such cgroup layouts
as having no cgroups available for delegation.
Fixes: #3890
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Some tools require /sys/devices/virtual/net to be read-write. At the
same time we want all other parts of /sys to be read-only. To do this we
created a layout where we hade a read-only instance of sysfs mounted on
top of a read-write instance of sysfs:
`-/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
`-/sys sysfs sysfs ro,nosuid,nodev,noexec,relatime
|-/sys/devices/virtual/net sysfs sysfs rw,relatime
| `-/sys/devices/virtual/net sysfs[/devices/virtual/net] sysfs rw,nosuid,nodev,noexec,relatime
This causes issues for systemd services that create a separate mount
namespace as they get confused to what mount options need to be
respected.
Simplify our mounting logic so we end up with a single read-only mount
of sysfs on /sys and a read-write bind-mount of /sys/devices/virtual/net:
├─/sys sysfs sysfs ro,nosuid,nodev,noexec,relatime
│ ├─/sys/devices/virtual/net sysfs[/devices/virtual/net] sysfs rw,nosuid,nodev,noexec,relatime
Link: systemd/systemd#20032
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>