Old versions of Docker emulate a cgroup namespace by bind-mounting the
container's cgroup over the corresponding controller:
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,net_cls,net_prio
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,cpu,cpuacct
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,memory
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,devices
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,hugetlb
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,perf_event
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,cpuset
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,blkio
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,pids
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7d4424e6_bb13_42f4_a47a_45a4828bf54d.slice/docker-d0b3604b67ac7930dd34ba3a796627e3e4717d12309e90a4afe3f38b6816ac98.scope /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,freezer
New versions of LXC always stash a file descriptor for the root of the
cgroup mount at /sys/fs/cgroup and then resolve the current cgroup
parsed from /proc/{1,self}/cgroup relative to that file descriptor. This
doesn't work when the caller's cgroup is mouned over the controllers.
Older versions of LXC simply counted such layouts as having no cgroups
available for delegation at all and moved on provided no cgroup limits
were requested. But mainline LXC would fail such layouts. While I would
argue that failing such layouts is the semantically clean approach we
shouldn't regress users so make mainline LXC treat such cgroup layouts
as having no cgroups available for delegation.
Fixes: #3890
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Some tools require /sys/devices/virtual/net to be read-write. At the
same time we want all other parts of /sys to be read-only. To do this we
created a layout where we hade a read-only instance of sysfs mounted on
top of a read-write instance of sysfs:
`-/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
`-/sys sysfs sysfs ro,nosuid,nodev,noexec,relatime
|-/sys/devices/virtual/net sysfs sysfs rw,relatime
| `-/sys/devices/virtual/net sysfs[/devices/virtual/net] sysfs rw,nosuid,nodev,noexec,relatime
This causes issues for systemd services that create a separate mount
namespace as they get confused to what mount options need to be
respected.
Simplify our mounting logic so we end up with a single read-only mount
of sysfs on /sys and a read-write bind-mount of /sys/devices/virtual/net:
├─/sys sysfs sysfs ro,nosuid,nodev,noexec,relatime
│ ├─/sys/devices/virtual/net sysfs[/devices/virtual/net] sysfs rw,nosuid,nodev,noexec,relatime
Link: systemd/systemd#20032
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
lxc_container_init() creates the container payload process as it's child
so lxc_container_init() itself never really exits and thus the parent
isn't notified about the child exec'ing since the sync file descriptor
is never closed. Make sure it's closed to notify the parent about the
child's exec.
In addition we're currently leaking all file descriptors associated with
the handler into the stub init. Make sure that all file descriptors
other than stderr are closed.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Instead of having a statically linked init that we put on the host fs
somewhere via packaging, have to either bind mount in or detect fexecve()
functionality, let's just call it as a library function. This way we don't
have to do any of that.
This also fixes up a bunch of conditions from:
if (quiet)
fprintf(stderr, "log message");
to
if (!quiet)
fprintf(stderr, "log message");
:)
and it drops all the code for fexecve() detection and bind mounting our
init in, since we no longer need any of that.
A couple other thoughts:
* I left the lxc-init binary in since we ship it, so someone could be using
it outside of the internal uses.
* There are lots of unused arguments to lxc-init (including presumably
--quiet, since nobody noticed the above); those may be part of the API
though and so we don't want to drop them.
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
and the item is copied (strdup()) to the array.
Thus, when an item is removed from an array, memory allocated for that item
should be freed, successive items should be left-shifted and the array
realloc()ed again (size-1).
Additional changes:
- If strdup() fails in add_to_array(), then an array should be
realloc()ed again to original size.
- Initialize an array in list_all_containers().
Signed-off-by: Tomasz Blaszczak <tomasz.blaszczak@consult.red>
When an item is added to an array, then the array is realloc()ed (to size+1),
and the item is copied (strdup()) to the array.
Thus, when an item is removed from an array, allocated memory pointed by
the item (not the item itself) should be freed, successive items should
be left-shifted and the array realloc()ed again (size-1).
Additional changes:
- Initialize an array in list_all_containers().
Signed-off-by: Tomasz Blaszczak <tomasz.blaszczak@consult.red>
and the item is copied (strdup()) to the array.
Thus, when an item is removed from an array, memory allocated for that item
should be freed, successive items should be left-shifted and the array
realloc()ed again (size-1).
Additional changes:
- If strdup() fails in add_to_array(), then an array should be
realloc()ed again to original size.
- Initialize an array in list_all_containers().
Signed-off-by: Tomasz Blaszczak <tomasz.blaszczak@consult.red>
The LISTEN_FDS environment variable defines the number of
file descriptors that should be inherited by the container,
in addition to stdio.
The LISTEN_FDS environment variable is defined in the OCI spec
and used to support socket activation.
Refs #3845
Signed-off-by: Ruben Jenster <r.jenster@drachenfels.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The function lxc_string_split_quoted and lxc_string_split_and_trim use
realloc to reduce the memory. But the result may be NULL, the the
returned memory will be uninitialized
Signed-off-by: LiFeng <lifeng68@huawei.com>
This indicates that LXC supports idmapping the rootfs and
idmapped lxc.mount.entry entries.
Link: https://github.com/lxc/lxd/issues/8870
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Closes#3093Closes#3602
Add support for nftables firewall rules if `nft` command line
interface is available in the system
Signed-off-by: Pablo Correa Gómez <ablocorrea@hotmail.com>
Some users report that compilation fails because of reports that this
variable can be used uninitialized. Initialize it to silence the
compiler.
Fixes: https://github.com/lxc/lxc/issues/3850
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>