Long story behind this. Many years ago, Stéphane Graber
discovered an issue with apparmor mount rules.
Since
7f2b13275d
commit ("apparmor: Update mount states handling") it was prohibited
to change mount propagation flags, just because adding rules which
allow mount propagation user inside the container gets an ability
to mount everything [1].
Now with modern systemd versions this problem become more critical than
before. For instance, ArchLinux containers fail to start without
nesting apparmor profile enabled (because nesting profile effectively
just allow all mounts). Of course, that's a security issue.
We've also enabled sharing on the container rootfs:
https://github.com/lxc/lxc/pull/4229
Now for many workloads it's needed to change propagation flag to
private (see https://github.com/canonical/craft-parts/pull/400).
Issue:
$ lxc-start -F archlinux-test
systemd 253-1-arch running in system mode (+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified)
Detected virtualization lxc.
Detected architecture x86-64.
Welcome to Arch Linux!
bpf-lsm: BPF LSM hook not enabled in the kernel, BPF LSM not supported
Failed to remount root directory as MS_SLAVE: Permission denied
(sd-gens) failed with exit status 1.
[!!!!!!] Failed to start up manager.
Exiting PID 1...
Workaround (unsafe):
$ lxc-start -s lxc.apparmor.allow_nesting=1 -s lxc.apparmor.profile=generated -F arch-test
John Johansen (Apparmor maintainer) and LXD team worked on fix [2].
It was merged to stable AppArmor 3.0 and 3.1 branches already.
There is no stable AppArmor version tag for that, but I think it will
be in the AppArmor version 3.0.10.
See also:
[1] https://bugs.launchpad.net/apparmor/+bug/1597017
[2] https://gitlab.com/apparmor/apparmor/-/merge_requests/333Fixes: #4280
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
RW bind mounts need to be restricted for some paths in
order to avoid MAC restriction bypasses, but read-only bind
mounts shouldn't have that problem.
Additionally, combinations of 'nosuid', 'nodev' and
'noexec' flags shouldn't be a problem either and are
required with newer systemd versions, so let's allow those
as long as they're combined with 'ro,remount,bind'.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Properly list all of the states and the right apparmor stanza for them,
then comment them all as actually enabling this would currently let the
user bypass apparmor entirely.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Bind-mounts aren't harmful in containers, so long as they're not used to
bypass MAC policies.
This change allows bind-mounting of any path which isn't a dangerous
filesystem that's otherwise blocked by apparmor.
This also allows switching paths {r}shared or {r}private.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Prevent privileged containers from messing with the host's pci devices
directly. Refuse access under /proc/bus, and drop cap_sys_rawio. Some
containers may need to re-enable cap_sys_rawio (i.e. if they run an
X server).
It may be desirable to break some of this stuff into files which can be
separately included (or not included), but this patch isn't the right
place for that.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Unprivileged containers cannot read it anyway, but also prevent root
owned containers from doing so. Sadly upstart's mountall won't run
if we try to prevent it from being mounted at all.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
This reverts commit 833bf9c2b2.
This change wasn't actually safe and is now superseded by the cgns profile.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Some systems need to be able to bind-mount /run to /var/run
and /run/lock to /var/run/lock. (Tested with opensuse 13.1
containers migrated from openvz.)
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Just like we block access to mem and kmem, there's no good reason for
the container to have access to kcore.
Reported-by: Marc Schaefer
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Restrict signal and ptrace for processes running under the container
profile. Rules based on AppArmor base abstraction. Add unix rules for
processes running under the container profile.
Signed-off-by: Jamie Strandboge <jamie@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
This uses the generate-apparmor-rules.py script I sent out some time
ago to auto-generate apparmor rules based on a higher level set of
block/allow rules.
Add apparmor policy testcase to make sure that some of the paths we
expect to be denied (and allowed) write access to are in fact in
effect in the final policy.
With this policy, libvirt in a container is able to start its
default network, which previously it could not.
v2: address feedback from stgraber
put lxc-generate-aa-rules.py into EXTRA_DIST
add lxc-test-apparmor, container-base and container-rules to .gitignore
take lxc-test-apparmor out of EXTRA_DIST
make lxc-generate-aa-rules.py pep8-compliant
don't automatically generate apparmor rules
This is only bc we can't be guaranteed that python3 will be
available.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>