mirror of
				https://github.com/qemu/qemu.git
				synced 2025-10-31 04:06:46 +00:00 
			
		
		
		
		
	
			
		
			
				
	
	
		
| QEMU virtio-fs shared file system daemon
 | |
| ========================================
 | |
| 
 | |
| Synopsis
 | |
| --------
 | |
| 
 | |
| **virtiofsd** [*OPTIONS*]
 | |
| 
 | |
| Description
 | |
| -----------
 | |
| 
 | |
| Share a host directory tree with a guest through a virtio-fs device.  This
 | |
| program is a vhost-user backend that implements the virtio-fs device.  Each
 | |
| virtio-fs device instance requires its own virtiofsd process.
 | |
| 
 | |
| This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
 | |
| but should work with any virtual machine monitor (VMM) that supports
 | |
| vhost-user.  See the Examples section below.
 | |
| 
 | |
| This program must be run as the root user.  The program drops privileges where
 | |
| possible during startup although it must be able to create and access files
 | |
| with any uid/gid:
 | |
| 
 | |
| * The ability to invoke syscalls is limited using seccomp(2).
 | |
| * Linux capabilities(7) are dropped.
 | |
| 
 | |
| In "namespace" sandbox mode the program switches into a new file system
 | |
| namespace and invokes pivot_root(2) to make the shared directory tree its root.
 | |
| New pid and net namespaces are also created to isolate the process.
 | |
| 
 | |
| In "chroot" sandbox mode the program invokes chroot(2) to make the shared
 | |
| directory tree its root. This mode is intended for container environments where
 | |
| the container runtime has already set up the namespaces and the program does
 | |
| not have permission to create namespaces itself.
 | |
| 
 | |
| Both sandbox modes prevent "file system escapes" due to symlinks and other file
 | |
| system objects that might lead to files outside the shared directory.
 | |
| 
 | |
| Options
 | |
| -------
 | |
| 
 | |
| .. program:: virtiofsd
 | |
| 
 | |
| .. option:: -h, --help
 | |
| 
 | |
|   Print help.
 | |
| 
 | |
| .. option:: -V, --version
 | |
| 
 | |
|   Print version.
 | |
| 
 | |
| .. option:: -d
 | |
| 
 | |
|   Enable debug output.
 | |
| 
 | |
| .. option:: --syslog
 | |
| 
 | |
|   Print log messages to syslog instead of stderr.
 | |
| 
 | |
| .. option:: -o OPTION
 | |
| 
 | |
|   * debug -
 | |
|     Enable debug output.
 | |
| 
 | |
|   * flock|no_flock -
 | |
|     Enable/disable flock.  The default is ``no_flock``.
 | |
| 
 | |
|   * modcaps=CAPLIST -
 | |
|     Modify the list of capabilities allowed; CAPLIST is a colon-separated
 | |
|     list of capabilities, each preceded by either + or -, e.g.
 | |
|     ``+sys_admin:-chown``.
 | |
| 
 | |
|   * log_level=LEVEL -
 | |
|     Print only log messages matching LEVEL or more severe.  LEVEL is one of
 | |
|     ``err``, ``warn``, ``info``, or ``debug``.  The default is ``info``.
 | |
| 
 | |
|   * posix_lock|no_posix_lock -
 | |
|     Enable/disable remote POSIX locks.  The default is ``no_posix_lock``.
 | |
| 
 | |
|   * readdirplus|no_readdirplus -
 | |
|     Enable/disable readdirplus.  The default is ``readdirplus``.
 | |
| 
 | |
|   * sandbox=namespace|chroot -
 | |
|     Sandbox mode:
 | |
|     - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
 | |
|     the shared directory.
 | |
|     - chroot: chroot(2) into shared directory (use in containers).
 | |
|     The default is "namespace".
 | |
| 
 | |
|   * source=PATH -
 | |
|     Share host directory tree located at PATH.  This option is required.
 | |
| 
 | |
|   * timeout=TIMEOUT -
 | |
|     I/O timeout in seconds.  The default depends on the ``cache=`` option.
 | |
| 
 | |
|   * writeback|no_writeback -
 | |
|     Enable/disable writeback cache. The cache allows the FUSE client to buffer
 | |
|     and merge write requests.  The default is ``no_writeback``.
 | |
| 
 | |
|   * xattr|no_xattr -
 | |
|     Enable/disable extended attributes (xattr) on files and directories.  The
 | |
|     default is ``no_xattr``.
 | |
| 
 | |
|   * posix_acl|no_posix_acl -
 | |
|     Enable/disable POSIX ACL support.  POSIX ACLs are disabled by default.
 | |
| 
 | |
|   * security_label|no_security_label -
 | |
|     Enable/disable security label support.  Security labels are disabled by
 | |
|     default.  When enabled, the client can send a MAC label for a file during
 | |
|     file creation.  Typically this is expected to be an SELinux security
 | |
|     label.  The server will try to set that label on the newly created file
 | |
|     atomically wherever possible.
 | |
| 
 | |
|   * killpriv_v2|no_killpriv_v2 -
 | |
|     Enable/disable ``FUSE_HANDLE_KILLPRIV_V2`` support. KILLPRIV_V2 is enabled
 | |
|     by default as long as the client supports it.  Enabling this option helps
 | |
|     with performance in the write path.
 | |
| 
 | |
| .. option:: --socket-path=PATH
 | |
| 
 | |
|   Listen on vhost-user UNIX domain socket at PATH.
 | |
| 
 | |
| .. option:: --socket-group=GROUP
 | |
| 
 | |
|   Set the vhost-user UNIX domain socket gid to GROUP.
 | |
| 
 | |
| .. option:: --fd=FDNUM
 | |
| 
 | |
|   Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
 | |
|   The file descriptor must already be listening for connections.
 | |
| 
 | |
| .. option:: --thread-pool-size=NUM
 | |
| 
 | |
|   Restrict the number of worker threads per request queue to NUM.  The default
 | |
|   is 0.
 | |
| 
 | |
| .. option:: --cache=none|auto|always
 | |
| 
 | |
|   Select the desired trade-off between coherency and performance.  ``none``
 | |
|   forbids the FUSE client from caching to achieve best coherency at the cost of
 | |
|   performance.  ``auto`` acts similarly to NFS with a 1 second metadata cache
 | |
|   timeout.  ``always`` sets a long cache lifetime at the expense of coherency.
 | |
|   The default is ``auto``.
 | |
| 
 | |
| Extended attribute (xattr) mapping
 | |
| ----------------------------------
 | |
| 
 | |
| By default the names of xattrs used by the client are passed through to the
 | |
| server file system.  This can be a problem when those xattr names are also
 | |
| used by something on the server (e.g. SELinux client/server confusion), or
 | |
| when ``virtiofsd`` is running in a container with restricted privileges where
 | |
| it cannot access some attributes.
 | |
| 
 | |
| Mapping syntax
 | |
| ~~~~~~~~~~~~~~
 | |
| 
 | |
| A mapping of xattr names can be made using ``-o xattrmap=mapping`` where the
 | |
| ``mapping`` string consists of a series of rules.
 | |
| 
 | |
| The first matching rule terminates the mapping.
 | |
| The set of rules must include a terminating rule to match any remaining attributes
 | |
| at the end.
 | |
| 
 | |
| Each rule consists of a number of fields separated with a separator that is the
 | |
| first non-white space character in the rule.  This separator must then be used
 | |
| for the whole rule.
 | |
| White space may be added before and after each rule.
 | |
| 
 | |
| Using ':' as the separator, a rule is of the form:
 | |
| 
 | |
| ``:type:scope:key:prepend:``
 | |
| 
 | |
| **scope** is:
 | |
| 
 | |
| - 'client' - match 'key' against an xattr name from the client for
 | |
|   setxattr/getxattr/removexattr
 | |
| - 'server' - match 'prepend' against an xattr name from the server
 | |
|   for listxattr
 | |
| - 'all' - can be used to make a single rule where both the server
 | |
|   and client matches are triggered.
 | |
| 
 | |
| **type** is one of:
 | |
| 
 | |
| - 'prefix' - is designed to prepend and strip a prefix;  the modified
 | |
|   attributes then being passed on to the client/server.
 | |
| 
 | |
| - 'ok' - Causes the rule set to be terminated when a match is found
 | |
|   while allowing matching xattr's through unchanged.
 | |
|   It is intended both as a way of explicitly terminating
 | |
|   the list of rules, and to allow some xattr's to skip following rules.
 | |
| 
 | |
| - 'bad' - If a client tries to use a name matching 'key' it's
 | |
|   denied using EPERM; when the server passes an attribute
 | |
|   name matching 'prepend' it's hidden.  In many ways its use is very similar to
 | |
|   'ok' as either an explicit terminator or for special handling of certain
 | |
|   patterns.
 | |
| 
 | |
| - 'unsupported' - If a client tries to use a name matching 'key' it's
 | |
|   denied using ENOTSUP; when the server passes an attribute
 | |
|   name matching 'prepend' it's hidden.  In many ways its use is very similar to
 | |
|   'ok' as either an explicit terminator or for special handling of certain
 | |
|   patterns.
 | |
| 
 | |
| **key** is a string tested as a prefix on an attribute name originating
 | |
| on the client.  It may be empty, in which case a 'client' rule
 | |
| will always match on client names.
 | |
| 
 | |
| **prepend** is a string tested as a prefix on an attribute name originating
 | |
| on the server, and used as a new prefix.  It may be empty
 | |
| in which case a 'server' rule will always match on all names from
 | |
| the server.
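
The rule syntax described above can be illustrated with a small parser. The following Python sketch is not virtiofsd's implementation: the names ``Rule`` and ``parse_xattrmap`` are hypothetical, only the four-field rule form is handled, and the **type** and **scope** values are not validated. It only shows how the first non-white-space character of each rule acts as the separator for that rule.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    kind: str      # the 'type' field: prefix, ok, bad or unsupported
    scope: str     # client, server or all
    key: str       # prefix tested against names from the client
    prepend: str   # prefix tested against names from the server


def parse_xattrmap(mapping: str) -> List[Rule]:
    """Split a mapping string into four-field rules (illustrative only)."""
    rules = []
    pos = 0
    while True:
        # White space may appear before and after each rule.
        while pos < len(mapping) and mapping[pos].isspace():
            pos += 1
        if pos == len(mapping):
            break
        # The first non-white-space character is this rule's separator.
        sep = mapping[pos]
        pos += 1
        fields = []
        for _ in range(4):
            end = mapping.find(sep, pos)
            if end == -1:
                raise ValueError("rule is missing a terminating separator")
            fields.append(mapping[pos:end])
            pos = end + 1
        rules.append(Rule(*fields))
    return rules


for rule in parse_xattrmap(":prefix:all::user.virtiofs.::bad:all:::"):
    print(rule)
```

Because the separator is read again at the start of every rule, one mapping string could mix rules written with ':' and rules written with '/'.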
 | |
| 
 | |
| e.g.:
 | |
| 
 | |
|   ``:prefix:client:trusted.:user.virtiofs.:``
 | |
| 
 | |
|   will match 'trusted.' attributes in client calls and prefix them before
 | |
|   passing them to the server.
 | |
| 
 | |
|   ``:prefix:server::user.virtiofs.:``
 | |
| 
 | |
|   will strip 'user.virtiofs.' from all server replies.
 | |
| 
 | |
|   ``:prefix:all:trusted.:user.virtiofs.:``
 | |
| 
 | |
|   combines the previous two cases into a single rule.
 | |
| 
 | |
|   ``:ok:client:user.::``
 | |
| 
 | |
|   will allow get/set xattr for 'user.' xattr's and ignore
 | |
|   following rules.
 | |
| 
 | |
|   ``:ok:server::security.:``
 | |
| 
 | |
|   will pass 'security.' xattr's in listxattr from the server
 | |
|   and ignore following rules.
 | |
| 
 | |
|   ``:ok:all:::``
 | |
| 
 | |
|   will terminate the rule search passing any remaining attributes
 | |
|   in both directions.
 | |
| 
 | |
|   ``:bad:server::security.:``
 | |
| 
 | |
|   would hide 'security.' xattr's in listxattr from the server.
 | |
| 
 | |
| A simpler 'map' type provides a shorter syntax for the common case:
 | |
| 
 | |
| ``:map:key:prepend:``
 | |
| 
 | |
| The 'map' type adds a number of separate rules to add **prepend** as a prefix
 | |
| to the matched **key** (or all attributes if **key** is empty).
 | |
| There may be at most one 'map' rule and it must be the last rule in the set.
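
As a rough illustration of the rules a 'map' rule stands for, the Python sketch below reproduces the two equivalences stated in mapping examples (1) and (2) of this document. The function name ``expand_map`` and the tuple layout are hypothetical, and the expansion is inferred from those two examples only, so it should not be read as the daemon's actual code.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)


def expand_map(key: str, prepend: str) -> List[Rule]:
    """Expand ':map:key:prepend:' into explicit rules (inferred from the examples)."""
    # First rule: add/strip the prefix for matching names in both directions.
    rules: List[Rule] = [("prefix", "all", key, prepend)]
    if key == "":
        # Everything is prefixed, so hide any non-prefixed names the host set.
        rules.append(("bad", "all", "", ""))
    else:
        # Hide unprefixed attributes on the host that match 'key'.
        rules.append(("bad", "server", "", key))
        # Stop the client from using the remapping target directly.
        rules.append(("bad", "client", prepend, ""))
        # Let all remaining attributes through.
        rules.append(("ok", "all", "", ""))
    return rules


for rule in expand_map("trusted.", "user.virtiofs."):
    print(rule)
```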
 | |
| 
 | |
| Note: When the 'security.capability' xattr is remapped, the daemon has to do
 | |
| extra work to remove it during many operations, which the host kernel normally
 | |
| does itself.
 | |
| 
 | |
| Security considerations
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| Operating systems typically partition the xattr namespace using
 | |
| well-defined name prefixes. Each partition may have different
 | |
| access controls applied. For example, on Linux there are multiple
 | |
| partitions:
 | |
| 
 | |
|  * ``system.*`` - access varies depending on attribute & filesystem
 | |
|  * ``security.*`` - only processes with CAP_SYS_ADMIN
 | |
|  * ``trusted.*`` - only processes with CAP_SYS_ADMIN
 | |
|  * ``user.*`` - any process granted by file permissions / ownership
 | |
| 
 | |
| Other operating systems, such as FreeBSD, have different name prefixes
 | |
| and access control rules.
 | |
| 
 | |
| When remapping attributes on the host, it is important to
 | |
| ensure that the remapping does not allow a guest user to
 | |
| evade the guest access control rules.
 | |
| 
 | |
| Consider if ``trusted.*`` from the guest was remapped to
 | |
| ``user.virtiofs.trusted.*`` in the host. An unprivileged
 | |
| user in a Linux guest has the ability to write to xattrs
 | |
| under ``user.*``. Thus the user can evade the access
 | |
| control restriction on ``trusted.*`` by instead writing
 | |
| to ``user.virtiofs.trusted.*``.
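
The bypass described above can be reproduced with a toy model of the client-side mapping. This Python sketch is an illustration under stated assumptions, not virtiofsd's code: ``client_to_server`` is a hypothetical helper that applies the first matching 'client' or 'all' rule, only the 'prefix', 'ok' and 'bad' types are modelled, and a 'prefix' rule is assumed to put **prepend** in front of the full client name, as the ``trusted.*`` to ``user.virtiofs.trusted.*`` remapping in the text implies.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)

# Insecure: remaps 'trusted.' but lets every other name through unchanged.
INSECURE: List[Rule] = [
    ("prefix", "all", "trusted.", "user.virtiofs."),
    ("ok", "all", "", ""),
]

# The same mapping with direct access to the remapping target blocked.
GUARDED: List[Rule] = [
    ("prefix", "all", "trusted.", "user.virtiofs."),
    ("bad", "client", "user.virtiofs.", ""),
    ("ok", "all", "", ""),
]


def client_to_server(rules: List[Rule], name: str) -> str:
    """Return the xattr name used on the host for a client request."""
    for kind, scope, key, prepend in rules:
        if scope not in ("client", "all") or not name.startswith(key):
            continue
        if kind == "prefix":
            return prepend + name
        if kind == "ok":
            return name
        if kind == "bad":
            raise PermissionError(f"{name}: denied by a 'bad' rule")
        raise ValueError(f"rule type {kind!r} is not modelled in this sketch")
    # Assumption: a name that no rule matches is refused.
    raise PermissionError(f"{name}: no rule matched")


privileged = client_to_server(INSECURE, "trusted.foo")
unprivileged = client_to_server(INSECURE, "user.virtiofs.trusted.foo")
print(privileged == unprivileged)  # both requests name the same host xattr
```

With the ``GUARDED`` rule set the second call raises ``PermissionError`` instead, so the unprivileged name can no longer reach the remapped attribute.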
 | |
| 
 | |
| As noted above, the partitions used and the access controls
 | |
| applied will vary across guest OSes, so it is not wise to
 | |
| try to predict what the guest OS will use.
 | |
| 
 | |
| The simplest way to avoid an insecure configuration is
 | |
| to remap all xattrs at once, to a given fixed prefix.
 | |
| This is shown in example (1) below.
 | |
| 
 | |
| If selectively mapping only a subset of xattr prefixes,
 | |
| then rules must be added to explicitly block direct
 | |
| access to the target of the remapping. This is shown
 | |
| in example (2) below.
 | |
| 
 | |
| Mapping examples
 | |
| ~~~~~~~~~~~~~~~~
 | |
| 
 | |
| 1) Prefix all attributes with 'user.virtiofs.'
 | |
| 
 | |
| ::
 | |
| 
 | |
|  -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"
 | |
| 
 | |
| 
 | |
| This uses two rules, using : as the field separator;
 | |
| the first rule prefixes and strips 'user.virtiofs.',
 | |
| the second rule hides any non-prefixed attributes that
 | |
| the host set.
 | |
| 
 | |
| This is equivalent to the 'map' rule:
 | |
| 
 | |
| ::
 | |
| 
 | |
|  -o xattrmap=":map::user.virtiofs.:"
 | |
| 
 | |
| 2) Prefix 'trusted.' attributes, allow others through
 | |
| 
 | |
| ::
 | |
| 
 | |
|    "/prefix/all/trusted./user.virtiofs./
 | |
|     /bad/server//trusted./
 | |
|     /bad/client/user.virtiofs.//
 | |
|     /ok/all///"
 | |
| 
 | |
| 
 | |
| Here there are four rules, using / as the field
 | |
| separator, and also demonstrating that new lines can
 | |
| be included between rules.
 | |
| The first rule is the prefixing of 'trusted.' and
 | |
| stripping of 'user.virtiofs.'.
 | |
| The second rule hides unprefixed 'trusted.' attributes
 | |
| on the host.
 | |
| The third rule stops a guest from explicitly setting
 | |
| the 'user.virtiofs.' path directly to prevent access
 | |
| control bypass on the target of the earlier prefix
 | |
| remapping.
 | |
| Finally, the fourth rule lets all remaining attributes
 | |
| through.
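
The effect of these four rules in both directions can be traced with a small model. The Python sketch below is a hedged illustration rather than the daemon's logic: ``map_client`` and ``map_server`` are hypothetical names, the matching follows the field descriptions in the mapping syntax section, and the behaviour when no rule matches (refuse or hide) is an assumption.

```python
import errno
from typing import List, Optional, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)

# The four rules of example (2), already split into fields.
EXAMPLE_2: List[Rule] = [
    ("prefix", "all", "trusted.", "user.virtiofs."),
    ("bad", "server", "", "trusted."),
    ("bad", "client", "user.virtiofs.", ""),
    ("ok", "all", "", ""),
]


def map_client(rules: List[Rule], name: str) -> str:
    """Name passed to the host for setxattr/getxattr/removexattr."""
    for kind, scope, key, prepend in rules:
        if scope in ("client", "all") and name.startswith(key):
            if kind == "bad":
                raise OSError(errno.EPERM, "denied by xattrmap", name)
            if kind == "unsupported":
                raise OSError(errno.ENOTSUP, "denied by xattrmap", name)
            if kind == "ok":
                return name
            if kind == "prefix":
                return prepend + name
    # Assumption: refuse names that no rule matches.
    raise OSError(errno.EPERM, "no matching xattrmap rule", name)


def map_server(rules: List[Rule], name: str) -> Optional[str]:
    """Name shown to the client in listxattr, or None if it is hidden."""
    for kind, scope, key, prepend in rules:
        if scope in ("server", "all") and name.startswith(prepend):
            if kind in ("bad", "unsupported"):
                return None
            if kind == "ok":
                return name
            if kind == "prefix":
                return name[len(prepend):]
    # Assumption: hide names that no rule matches.
    return None


print(map_client(EXAMPLE_2, "trusted.foo"))
print(map_server(EXAMPLE_2, "user.virtiofs.trusted.foo"))
print(map_server(EXAMPLE_2, "trusted.bar"))
```

In this model a guest write to ``trusted.foo`` is stored as ``user.virtiofs.trusted.foo`` and listed back as ``trusted.foo``, while a ``trusted.bar`` attribute set directly on the host stays hidden from the guest.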
 | |
| 
 | |
| This is equivalent to the 'map' rule:
 | |
| 
 | |
| ::
 | |
| 
 | |
|  -o xattrmap="/map/trusted./user.virtiofs./"
 | |
| 
 | |
| 3) Hide 'security.' attributes, and allow everything else
 | |
| 
 | |
| ::
 | |
| 
 | |
|     "/bad/all/security./security./
 | |
|      /ok/all///"
 | |
| 
 | |
| The first rule combines what could be separate client and server
 | |
| rules into a single 'all' rule, matching 'security.' in either
 | |
| client arguments or lists returned from the host.  This stops
 | |
| the client seeing any 'security.' attributes on the server and
 | |
| stops it setting any.
 | |
| 
 | |
| SELinux support
 | |
| ---------------
 | |
| SELinux support can be enabled by running virtiofsd with the option
 | |
| ``-o security_label``.  By default this tries to save the guest's security
 | |
| context in the ``security.selinux`` xattr on the host, which might fail if
 | |
| the host's SELinux policy does not permit virtiofsd to do this operation.
 | |
| 
 | |
| Hence, it is preferred to remap the guest's ``security.selinux`` xattr to,
 | |
| say, ``trusted.virtiofs.security.selinux`` on the host:
 | |
| 
 | |
| ``-o xattrmap=:map:security.selinux:trusted.virtiofs.:``
 | |
| 
 | |
| This ensures that the guest's and the host's SELinux xattrs on the same file
 | |
| remain separate and do not interfere with each other.  It also allows the
 | |
| host and the guest to implement their own separate SELinux policies.
 | |
| 
 | |
| Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so this
 | |
| capability needs to be added to the daemon:
 | |
| 
 | |
| ``-o modcaps=+sys_admin``
 | |
| 
 | |
| Giving the daemon CAP_SYS_ADMIN increases the risk to the system: virtiofsd
 | |
| becomes more powerful and, if it gets compromised, it can do a lot of damage
 | |
| to the host system.  Keep this trade-off in mind when making a decision.
 | |
| 
 | |
| Examples
 | |
| --------
 | |
| 
 | |
| Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
 | |
| ``/var/run/vm001-vhost-fs.sock``:
 | |
| 
 | |
| .. parsed-literal::
 | |
| 
 | |
|   host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
 | |
|   host# |qemu_system| \\
 | |
|         -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
 | |
|         -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
 | |
|         -object memory-backend-memfd,id=mem,size=4G,share=on \\
 | |
|         -numa node,memdev=mem \\
 | |
|         ...
 | |
|   guest# mount -t virtiofs myfs /mnt
 |