LXC has supported the bpf device controlller for a while now. A bpf device
program can be attached to the container's cgroup if this is a pure cgroup2
host.
The format for specifying device rules for the cgroup2 bpf device controller is
the same as for the legacy cgroup device controller; only the configuration key
prefix has to change. Specifically, device rules for the legacy cgroup device
controller are specified by via lxc.cgroup.devices.{allow,deny} whereas for the
cgroup2 bpf device controller lxc.cgroup2.devices.{allow,deny} must be used.
The following semantics apply:
1. The device rule "lxc.cgroup2.devices.deny = a" will cause LXC to instruct
the kernel to block access to all devices by default. To grant access to
devices "allow device rules" must be added via the
"lxc.cgroup2.devices.allow" key. This is referred to as a "allowlist" device
program.
2. The device rule "lxc.cgroup2.devices.allow = a" will cause LXC to instruct
the kernel to allow access to all devices by default. To deny access to
devices "deny device rules" must be added via "lxc.cgroup2.devices.deny"
key. This is referred to as a "denylist" device program.
3. Specifying a rule as explained in 1. or 2. will cause all previous rules to
be cleared, i.e. the device list will be reset.
For example the set of rules:
lxc.cgroup2.devices.deny = a
lxc.cgroup2.devices.allow = c *:* m
lxc.cgroup2.devices.allow = b *:* m
lxc.cgroup2.devices.allow = c 1:3 rwm
implements a "allowlist" device program, i.e. the kernel will block access to
all devices not specifically allowed in this list. This particular program
states that all character and block devices might be created but only /dev/null
might be read or written.
If we to switch to the set of rules to:
lxc.cgroup2.devices.allow = a
lxc.cgroup2.devices.deny = c *:* m
lxc.cgroup2.devices.deny = b *:* m
lxc.cgroup2.devices.deny = c 1:3 rwm
then LXC would instruct the kernel to implement a "denylist", i.e. the kernel
will allow access to all devices not specifically denied in this list. This
particular program states that no character devices or block devices might be
created and that /dev/null is not allow allowed to be read, written, or
created.
Consider the same program but followed by a rule as explained in 1. or 2.:
lxc.cgroup2.devices.allow = a
lxc.cgroup2.devices.deny = c *:* m
lxc.cgroup2.devices.deny = b *:* m
lxc.cgroup2.devices.deny = c 1:3 rwm
lxc.cgroup2.devices.allow = a
The last line will cause LXC to reset the device list without changing the type
of device program.
lxc.cgroup2.devices.allow = a
lxc.cgroup2.devices.deny = c *:* m
lxc.cgroup2.devices.deny = b *:* m
lxc.cgroup2.devices.deny = c 1:3 rwm
lxc.cgroup2.devices.deny = a
The last line will cause LXC to reset the device list and switch from a
"allowlist" program to a "denylist" program.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
It turns out that since Linux 5.1 there are now per-LSM subdirectories
for major LSMs, which users are recommended to use over the "legacy"
top-level /proc/$pid/attr/... files[1]:
> Process attributes associated with “major” security modules should be
> accessed and maintained using the special files in /proc/.../attr. A
> security module may maintain a module specific subdirectory there,
> named after the module. /proc/.../attr/smack is provided by the Smack
> security module and contains all its special files. The files directly
> in /proc/.../attr remain as legacy interfaces for modules that provide
> subdirectories.
AppArmor has had such a directory since Linux 5.8[2], and it turns out
that with certain CONFIG_LSM configurations you can end up with AppArmor
files not being accessible from the legacy interface. Arch Linux
recently added BPF as one of the enabled LSM in their configuration, and
this broke runc[3] and LXC.
The solution is to first try to use /proc/$pid/attr/apparmor/current and
fall back to /proc/$pid/attr/current if the former is not available.
[1]: https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html
[2]: Linux 5.8 ; commit 6413f852ce08 ("apparmor: add proc subdir to attrs")
[3]: https://github.com/opencontainers/runc/issues/2801
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Rather than open-coding file reading and retry semantics and
implementing the path generation logic separately to
apparmor_process_label_fd_get, refactor the logic so that it looks
closer to the pidfd version.
This will make it easier to implement the two-step handling for
/proc/self/attr/apparmor/current and makes this code slightly less
confusing.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>